Lab2_Key_SP23
Lab 2 (Spring 2023)
Prof Cat.
2023-02-10

Problem 1

Download the “interruption” data from Canvas, and import this data into R. This dataset has two variables that describe the number of interruptions a person counted before we defined what an interruption was (int1) and after we came up with a shared definition (int2). Report the mean, range, and standard deviation for both variables. Then, use R to calculate the standard deviation of each variable “by hand” (watch videos from Lecture 3 readings to see how to do this). Finally, explain how the mean, range, and standard deviation changed after we operationalized what an interruption is (e.g., compare these statistics for int1 and int2).

First I’ll load and check the dataset.

int <- read.csv("~/Dropbox/Teaching Datasets/interruptions_SP23.csv") # loading data
head(int) # checking to make sure the data loaded correctly.
##   int1 int2
## 1   25   25
## 2   30   50
## 3   29   12
## 4   29   12
## 5   24    9
## 6   25   17

Next, graphing the variables.

par(mfrow = c(1, 2)) # using the par function to split the graphics window
hist(int$int1)
hist(int$int2)
[Figure: side-by-side histograms titled "Histogram of int$int1" and "Histogram of int$int2"; x-axes int$int1 and int$int2, y-axes Frequency]

library(psych) # describe() comes from the 'psych' package
describe(int) # using the describe function; from the 'psych' package. you can use other methods to do t…
##      vars   n  mean    sd median trimmed  mad min max range skew kurtosis   se
## int1    1 197 21.94 10.52     20   21.25 7.41   1  57    56 0.77     0.99 0.75
## int2    2 197 15.02  6.06     15   14.64 4.45   1  50    49 1.60     7.20 0.43

Okay, now I’ll calculate the SD by hand:

resI1 <- int$int1 - mean(int$int1, na.rm = TRUE) # residuals for int1
resI2 <- int$int2 - mean(int$int2, na.rm = TRUE) # residuals for int2
SSE1 <- sum(resI1^2, na.rm = TRUE) # SSE for int1; na.rm = TRUE keeps missing values from turning the sum into NA
SSE2 <- sum(resI2^2, na.rm = TRUE) # SSE for int2
nI1 <- length(na.omit(int$int1)) # sample size for int1, omitting missing data
nI2 <- length(na.omit(int$int2)) # sample size for int2, omitting missing data
sqrt(SSE1 / (nI1 - 1)) # sd for int1...same thing R got using the describe function (10.52)
sqrt(SSE2 / (nI2 - 1)) # sd for int2...same thing R got using the describe function (6.06)
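The hand calculation above can be checked against R's built-in sd() on a small vector. This is only a sketch: the values in x are made up for illustration (they are not the interruption data), and the names x, res, sse, n, sd_by_hand, and sse_with_na are hypothetical. It also shows why a single missing value is enough to make sum() return NA unless na.rm = TRUE is set.

```r
# Toy vector with one missing value; these numbers are invented for illustration.
x <- c(25, 30, 29, NA, 24, 25)

res <- x - mean(x, na.rm = TRUE)   # residuals; the NA entry stays NA
sse <- sum(res^2, na.rm = TRUE)    # sum of squared errors, skipping the NA
n   <- length(na.omit(x))          # sample size without the missing value
sd_by_hand <- sqrt(sse / (n - 1))  # divide by n - 1, then take the square root

# Without na.rm = TRUE, one NA makes the whole sum NA.
sse_with_na <- sum(res^2)

sd_by_hand
sd(x, na.rm = TRUE)                # built-in version for comparison
all.equal(sd_by_hand, sd(x, na.rm = TRUE))
```

If the last line prints TRUE, the by-hand formula and sd() agree on this vector.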
Note that student estimates might be off if they did not a) remove missing data from the sample size, and b) divide by n - 1 in the equation. This is fine (no penalty)! The key idea is that the SD is the square root of the average squared error (the SSE divided by n - 1).

So, both the mean (21.94 to 15.02) and standard deviation (10.52 to 6.06) of the number of interruptions went down after we operationalized an interruption, and the range narrowed as well (56 to 49). This makes sense because our definition a) clarified what an interruption was (ensuring that people’s answers would be more similar to each other = less variation = lower SD) and b) focused on counting the number of times one individual was interrupted (whereas before the operationalization, some students were counting interruptions from both people).

Note: I didn’t remove outliers here; student answers might be slightly different if they did.

Problem 2

Download the “cal_mini_data_SP23.csv” dataset from bCourses. Load the data into R, check to make sure it loaded correctly, and report the sample size.

mini <- read.csv("~/Dropbox/Teaching Datasets/cal_mini_data_SP23.csv", stringsAsFactors = TRUE)
head(mini)
##   fb.friends insta.followers insta.follow bored thirsty tired satisfied
## 1          3            1477         1815     4       7     7         7
## 2          0             945         1001     3       8     9         8
## 3        802             561          571     5       6    10         9
## 4         69             134          149     4       7     7         7
## 5          0             285          882     3       3     5         6
## 6          0            1200          667     7       4    10         8
##   oski.love r.love socmeduse data.pow corp.pow hard.work privilege catdogpref
## 1         3      6         7        2        8         8         6       cats
## 2         5      3         8        8        9         8         7       dogs
## 3         4      7         8        5        8         5         6       cats
## 4         4      7         5        3        8         8         8       dogs
## 5        10      4        10       10       10         8         8       dogs
## 6         4      1         1       10        6        10         9       dogs
##   tuhoburapref calsports is.female long.hair have.water shoe.size height
## 1       horses       Yes       Yes       Yes        Yes       8.0     64
## 2       horses       Yes       Yes       Yes        Yes       9.5     70
## 3       horses       Yes       Yes       Yes        Yes       6.0     64
## 4       horses        No       Yes        No         No       7.0     63
## 5         rats       Yes       Yes       Yes         No       8.5     66
## 6  butterflies       Yes       Yes       Yes        Yes       8.0     68
nrow(mini)
## [1] 123

Choose one continuous (numeric) variable from this dataset. Graph the variable using the hist() function, and use arguments to change the color of this graph, the labels of this graph, and the title of the graph. Paste this graph into your lab. Below the histogram, use R to report the mean, median, standard deviation, and range for this variable. Then describe the shape of this distribution (e.g., normal / skew / kurtosis) and what you learn about our class from this graph.
names(mini)
##  [1] "fb.friends"      "insta.followers" "insta.follow"    "bored"
##  [5] "thirsty"         "tired"           "satisfied"       "oski.love"
##  [9] "r.love"          "socmeduse"       "data.pow"        "corp.pow"
## [13] "hard.work"       "privilege"       "catdogpref"      "tuhoburapref"
## [17] "calsports"       "is.female"       "long.hair"       "have.water"
## [21] "shoe.size"       "height"

hist(mini$corp.pow, main = "A Histogram", xlab = "Perceptions of Corporate Power Over User Data", col = "black", border = "white")

[Figure: histogram titled "A Histogram"; x-axis "Perceptions of Corporate Power Over User Data", y-axis Frequency]

describe(mini$corp.pow)
##    vars   n mean   sd median trimmed  mad min max range  skew kurtosis   se
## X1    1 123 7.93 1.75      8    8.15 1.48   2  10     8 -1.04     1.29 0.16

The graph is very negatively / left skewed - this means the majority of students believe that corporations hold power over our user data.

Now, choose a categorical (non-numeric) variable from this dataset. Graph the variable using the plot() function, and use the summary() function to report the number of people in each group. Below the graph, describe what you learn about our class from the graph.
plot(mini$tuhoburapref)

[Figure: bar plot of mini$tuhoburapref with bars for butterflies, horses, rats, and turtles]

summary(mini$tuhoburapref)
## butterflies      horses        rats     turtles
##          31          44          15          33

Our class is a fan of horses and does not appear to like rats very much.

Problem 3

Load the covid_behavior_data.csv dataset into R, check to make sure it loaded correctly, and report the sample size.

covid <- read.csv("~/Dropbox/Teaching Datasets/covid_behavior_data.csv", stringsAsFactors = TRUE)
head(covid)
##   age Handwash Mask Sanitize SocialDistance SelfIsolate gender ethnicity
## 1  NA       NA   NA       NA             NA          NA   <NA>      <NA>
## 2  41       NA   NA       NA             NA          NA      M         W
## 3  52        4    4        4              4           4      W         W
## 4  60       NA   NA       NA             NA          NA      M         W
## 5  39       NA   NA       NA             NA          NA      W         W
## 6  28        1    3        0              2           0      M         W
##   political_party     EXTRA     AGREE    CONSC    NEGEM OPENN
## 1            <NA>        NA        NA       NA       NA    NA
## 2               R        NA        NA       NA       NA    NA
## 3               R 0.3333333 4.0000000 3.333333 2.666667     4
## 4               R        NA        NA       NA       NA    NA
## 5               R        NA        NA       NA       NA    NA
## 6               R 2.3333333 0.3333333 2.000000 1.000000     1
nrow(covid)
## [1] 842

Looks like the data loaded correctly.

Choose one continuous (numeric) variable from the dataset and one categorical variable from the dataset - use the codebook as a guide for what these variables mean. Graph each variable. Report descriptive statistics for the continuous variable and frequency of the levels for the categorical variable. Then, describe what you learn about the people in this dataset from each graph.

par(mfrow = c(1, 2)) # splitting my graphics window into a 1x2 grid
hist(covid$CONSC)
plot(covid$ethnicity)

[Figure: left panel "Histogram of covid$CONSC" (x-axis covid$CONSC, y-axis Frequency); right panel bar plot of covid$ethnicity with levels AA, EA, O, and W]
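The prompt also asks for the frequency of each level of the categorical variable. A minimal sketch of one way to get those counts in base R follows; the factor eth is hypothetical (its level codes mimic the plot labels, but the values are invented and are not the covid data), and counts and props are illustrative names.

```r
# Hypothetical factor; level codes copy the plot labels, values are made up.
eth <- factor(c("W", "W", "AA", "EA", "W", NA, "O", "W"))

counts <- table(eth, useNA = "ifany")  # counts per level, with NA as its own cell
props  <- prop.table(table(eth))       # proportions among the non-missing values

counts
round(props, 2)
summary(eth)                           # same counts; missing values listed as NA's
```

With a real factor column the same calls would be applied directly, e.g. to covid$ethnicity, assuming it was read in with stringsAsFactors = TRUE as in the key.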
library(psych) # I installed this package in a previous R sesh.
describe(covid$CONSC)
##    vars   n mean   sd median trimmed  mad  min max range skew kurtosis   se
## X1    1 405 2.96 0.93      3    3.06 0.99 0.33   4  3.67 -0.7    -0.42 0.05
summary(covid$CONSC)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
##  0.3333  2.3333  3.0000  2.9646  3.6667  4.0000     437

Participants seem very skewed in terms of their conscientiousness; I would have expected something looking more normal given what I know about personality differences (which tend to be normal). This makes me think there’s something weird going on in the data (maybe the data are incorrect?) or the sample (maybe there’s something different about these people that makes them super conscientious...).

Participants also do not seem representative of the ethnic breakdown of the US; lots of white people in the data, not many people of color, latino/a/x people not represented (or lumped into some other category??). Seems problematic if we want to use these data to make a claim about what people in general are like. Tho TBH this is fairly standard in psychological research; will talk more about this in a few weeks so stay tuned!

Problem 4

I’m not going to do this for the lab key. But cool that you did!

Problem 5

OPTIONAL CHALLENGE: Use Google to find a way to calculate the mode in R. Check that your method works by defining two variables in R - one with a set of numbers that has one mode, and one with a set of numbers that has two modes. [For example: variable1 <- c(1, 1, 2) has one mode]. Use the mode function on each variable to confirm that the mode function works. Screenshot your code and output to calculate and test the mode, with a link to where you found the code.
I found the following code by googling “calculate mode in R”: https://stackoverflow.com/questions/2547402/how-to-find-the-statistical-mode/8189441#8189441

Modes <- function(x) {
  ux <- unique(x)
  tab <- tabulate(match(x, ux))
  ux[tab == max(tab)]
}

I’ll test the code to see if it works.

modetest1 <- c(1, 1, 1, 3, 4, 6, 42)
modetest2 <- c(1, 1, 1, 3, 4, 6, 42, 42, 42)
Modes(modetest1)
## [1] 1
Modes(modetest2)
## [1]  1 42

It works. Yay.
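For comparison, here is a second way to get the same answer using table(). This is an alternative sketch, not the Stack Overflow code above; modes_tab, one_mode, and two_modes are hypothetical names, and the function assumes numeric input. (Base R's own mode() is no help here: it reports an object's storage type, not its most frequent value.)

```r
# Alternative sketch using table(); assumes x is numeric.
modes_tab <- function(x) {
  counts <- table(x)                           # frequency of each distinct value
  top <- names(counts)[counts == max(counts)]  # values tied for the highest count
  as.numeric(top)                              # table() keeps values as names, so convert back
}

one_mode  <- c(1, 1, 1, 3, 4, 6, 42)
two_modes <- c(1, 1, 1, 3, 4, 6, 42, 42, 42)

# Same test cases as the key: one vector with a single mode, one with two.
stopifnot(identical(modes_tab(one_mode), 1))
stopifnot(identical(modes_tab(two_modes), c(1, 42)))
```

If both stopifnot() calls pass silently, the table()-based version agrees with Modes() on these two test vectors.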