STAT 405 Homework_2

txt

School

University of Pennsylvania *

*We aren’t endorsed by this school

Course

405

Subject

Statistics

Date

Feb 20, 2024

Type

txt

Pages

8

Uploaded by BaronNarwhal3261

Report
#### STAT 4050/7050 Homework 2, Spring 2023 #### Name: #### Instructions: #### 1. If a question refers to a function not seen in class, #### use the help facility in R to learn about it. #### 2. Insert your answers and the R code you used to generate them, beneath each question. #### Even though for some questions you could find the answer "by inspection", for this homework you need #### to write code to get the answers, unless it is explicitly stated that no code is required. #### 3. When you are asked to print an answer, it is NOT ENOUGH to simply write a print statement, #### for example print(2*3) as an answer, you also need to #### cut and paste into this document the value(s) that is actually printed (i.e.,6). On the other hand, pasting #### long lines of code from the console when you are not asked to do so is unnecessary. Please ask if you have questions on this. #### 4. When submitting to Canvas, please change the file extension to .txt, otherwise it cannot be submitted. ## ------------------------------------------------------------------------ #### Q1. Linear regression (40 points) ### Use the hr.data data frame which you can read in directly from this URL (as in class) ### "http://mathmba.com/richardw/HR.csv" hr.data <- read.csv(file = "http://mathmba.com/richardw/HR.csv") head(hr.data) summary(hr.data) #a 8pts. First convert Gender to a factor variable # Then Using the hr.data data frame, create a data frame called "hr.data.train" consisting of all # rows of hr.data with an EmployeeID that's not a multiple of 2 # and a data frame called "hr.data.test" consisting of all rows with EmployeeID numbers # that are a multiple of 2. # (Hint: Refer back to how we calculated whether something was an odd or even #) # Print the dimensions of the hr.data.train and hr.data.test data frames hr.data$Gender <- factor(hr.data$Gender) train_rows <- hr.data$EmployeeId%%2 != 0 test_rows <- hr.data$EmployeeId%%2 == 0 hr.data.train <- hr.data[train_rows,] hr.data.test <- hr.data[test_rows,] dim(hr.data.train) dim(hr.data.test) # > dim(hr.data.train) # [1] 2824 6 # > dim(hr.data.test) # [1] 2823 6 #b 2pts. The test set accounts for what proportion of the total 'hr.data' data # frame rows? Include your code and print the solution.
prop_test <- nrow(hr.data.test)/nrow(hr.data) prop_test # [1] 0.4999115 #c 3pts. Create a new column 'FiftyK' which is TRUE if the salary of each row is > 50,000, FALSE otherwise # Display the the frequency of FiftyK in the hr.data.train dataset. That is, how many TRUEs and how many FALSEs # Include your code and result. hr.data.train$FiftyK <- hr.data.train$salary > 50000 table(hr.data.train$FiftyK) # FALSE TRUE # 1175 1649 #d 2pts. Create a new column 'UnderEightyK' which is FALSE if the salary on each row is > 80,000, TRUE otherwise # Display the the frequency of UnderEightyK in the hr.data.train dataset. Include your code and result. hr.data.train$UnderEightyK <- hr.data.train$salary < 80000 table(hr.data.train$UnderEightyK) # FALSE TRUE # 482 2342 #e 1pt. Let's call the set of employees that make at least $50,000 but no more than $80,000 the 'Middle' group. # How many employees fall into this subset of hr.data.train? Include your code and result. middle_group <- hr.data.train$FiftyK == TRUE & hr.data.train$UnderEightyK == TRUE sum(middle_group) # [1] 1167 #f 1pt. What is the median salary of the entire hr.data.train dataframe? median(hr.data.train$salary) # [1] 54574.62 #g 4pts. Using the hr.data.train data frame, run a linear regression of salary # against age and gender. Then store the resulting *output object* of the regression as a new object # Write code to extract and print the coefficient and significance level of Age. model <- lm(salary ~ age + Gender, data = hr.data.train) summary(model) summary(model)$coefficients[2,1] summary(model)$coefficients[2,4] # > summary(model)$coefficients[2,1] # [1] 623.1717- Coefficient of Age # > summary(model)$coefficients[2,4] # [1] 3.454659e-42- Significance Level of Age #h 4pts. Write the code that would run a linear regression of salary using each variable in the hr.data.train set. # Do you get an error when running the code, yes or no? What effect does including an arbitrary identifier # such as 'EmployeeID' have on the model? model2 <- lm(salary ~ ., data = hr.data.train) model2 # Effect of arbitrary identifier Employee ID on model: since it is an arbitrary identifier it will practically not have a effect on the model since it is unrelated to the salary and has no meaningful relation with salary.
# Additionally, such identifiers complicate the model further, without adding any useful information. # No Error: # > model2 <- lm(salary ~ ., data = hr.data.train) # > model2 # # Call: # lm(formula = salary ~ ., data = hr.data.train) # # Coefficients: # (Intercept) EmployeeId age volturn GenderMale # 75370.334 -1.075 -4.195 -476.384 -1112.585 # npjc FiftyKTRUE UnderEightyKTRUE # -732.915 23473.668 -32533.656 # #i 5pts. Using the model that you fit in 1g, write code to predict the salary all rows in **hr.data.test**, which you also created # in 1a and store them in a vector named "predicted.salary". Print the first 5 observations in this dataset. # (Hint: ?predict) predicted.salary <- predict(model, newdata = hr.data.test) head(predicted.salary) # > head(predicted.salary) # 2 4 6 8 10 12 # 71223.55 62838.36 63461.54 62499.15 60629.63 58136.95 #j 10pts. We would like to evaluate the quality of predicted.salary # For each prediction above, update the "predicted.salary" object by adding the 95% confidence intervals. # Now using this dataframe, evaluate the proportion of times when the confidence interval did NOT contain # the actual value of the Salary from hr.data.test. # Note: this question asks you for the confidence interval not prediction interval (you do not need to know the difference # but please make sure to use the appropriate one) predicted.salary <- predict(model, newdata = hr.data.test, interval = "confidence") predicted.salary # > predicted.salary # fit lwr upr # 2 71223.55 68941.82 73505.29 # 4 62838.36 61615.90 64060.83 # 6 63461.54 62192.42 64730.66 # 8 62499.15 61116.56 63881.74 # 10 60629.63 59363.47 61895.80 # 12 58136.95 56949.40 59324.50 # 14 60006.46 58768.90 61244.02 # 16 64991.84 63397.77 66585.91 predicted.df <- data.frame(predicted.salary) predicted.df$lwr <- predicted.salary[, 2] predicted.df$upr <- predicted.salary[, 3] predicted.df$actual.salary <- hr.data.test$salary predicted.df$answer <- (predicted.df$actual.salary >= predicted.df$lwr) & (predicted.df$actual.salary <= predicted.df$upr) predicted.df$answer mean(predicted.df$answer)
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
# [1] 0.04463337 this means 4.46% are True and 95.54% are FALSE prop.outside.ci <- 1 - mean(predicted.df$answer) prop.outside.ci # [1] 0.9553666 - ANSWER ##### Q2. 20 points #### In statistics, Kurtosis is a statistical measure that defines how heavily the tails #### of a distribution differ from the tails of a normal distribution. #### In other words, kurtosis identifies whether the tails of a given distribution contain extreme values. #a. 10pts. Write a function in R that takes a single numeric vector as argument, and finds the kurtosis of the vector. # The definition of kurtosis we will use is at the URL: # https://cdn.educba.com/academy/wp-content/uploads/2019/12/Kurtosis- Formula.jpg.webp. # Call your function "kurtosis" and cut and paste your function here. # As a test case kurtosis(c(1:10)) should result in 1.946844 kurtosis <- function(x) { n <- length(x) m <- mean(x) diff_mean_x <- x - m num <- sum(diff_mean_x^4) diff_2 <- x - (m^2) denom <- sum(diff_2^2) kurt <- (n*num) /(denom) return(kurt) } kurtosis(c(1:10)) # > kurtosis(c(1:10)) # [1] 1.946844 #b. 5pts. Use your kurtosis function on the following vector and report the result: twoB.vec <- c(0.238, 0.190, 9.631, -2.265, 0.365, 0.712, -0.392, -0.884, 1.111, 1.137) kurtosis(twoB.vec) # > kurtosis(twoB.vec) # [1] 618.8587 #c. 5pts. Refine your kurtosis function, so that if it is called with anything other than a numeric argument # it stops and produces an informative error message saying "This function is only meant to work on numbers!" # Use the "if", "is.numeric" and "stop" functions to achieve this goal. # Cut and paste your refined function here and show that the function works as intended by passing the following vector as input # kurtosis(c("Joanne","Yash")) kurtosis <- function(x) { if(!is.numeric(x)) { stop("This function is only meant to work on numbers!") } n <- length(x) m <- mean(x) diff_mean_x <- x - m num <- sum(diff_mean_x^4) diff_2 <- x - (m^2) denom <- sum(diff_2^2) kurt <- (n*num) /(denom)
return(kurt) } kurtosis(c("Joanne", "Yash")) # > kurtosis(c("Joanne", "Yash")) # Error in kurtosis(c("Joanne", "Yash")) : # This function is only meant to work on numbers! ##### Q3. 20 points # It is very common to need to combine data sources into a single analysis dataset for a variety # of reasons. In this problem we will hone this skill: #a. 3pts. Load the 'students.csv' and 'schools.csv' files from the HW4 folder in Canvas into R. # Name each dataset 'students' and 'schools' respectively. The first dataset contains information # on student demographics and test scores, while the second dataset contains information on # school names and average test scores. # note: read.csv() and setwd() will be useful #setwd not required since pathway is already known (windows laptop) schools <- read.csv(file = "C:\\Users\\Varyam\\Downloads\\schools.csv") students <- read.csv(file = "C:\\Users\\Varyam\\Downloads\\students.csv") #b. 2pts. Inspect the first six rows of each dataset. Print the output below. head(students) # student_id gender race SES academic_interest math_score reading_score # 1 1 F White Low High 70 85 # 2 2 M Black High Low 60 75 # 3 3 F Hispanic Middle High 80 90 # 4 4 M White High Middle 85 80 # 5 5 F Black Low High 75 80 # 6 6 M Hispanic Middle Middle 90 95 # head(schools) # id school_name avg_math_score avg_reading_score # 1 1 East High 78 82 # 2 2 West High 80 85 # 3 3 South High 85 87 # 4 4 North High 83 86 # 5 5 Central High 79 80 # 6 6 West High 80 85 #c. 4pts. Rename the "student_id" column in the 'students.csv' dataset to "id" names(students)[1] <- "id" head(students) # id gender race SES academic_interest math_score reading_score # 1 1 F White Low High 70 85 # 2 2 M Black High Low 60 75 # 3 3 F Hispanic Middle High 80 90 # 4 4 M White High Middle 85 80 # 5 5 F Black Low High 75 80 # 6 6 M Hispanic Middle Middle 90 95 # > #d. 6pts. Merge the two datasets into a single data frame using the "merge()" function,
# with "id" as the common column. Print the first six rows of the merged dataset. merged <- merge(students, schools, by = "id") head(merged) # > head(merged) # id gender race SES academic_interest math_score reading_score school_name # 1 1 F White Low High 70 85 East High # 2 2 M Black High Low 60 75 West High # 3 3 F Hispanic Middle High 80 90 South High # 4 4 M White High Middle 85 80 North High # 5 5 F Black Low High 75 80 Central High # 6 6 M Hispanic Middle Middle 90 95 West High # avg_math_score avg_reading_score # 1 78 82 # 2 80 85 # 3 85 87 # 4 83 86 # 5 79 80 # 6 80 85 #e. 5pts. Run a regression on the merged dataset to predict a student's math score based on their reading score. # Print out the summary of the regression results. regmodel <- lm(avg_math_score ~ avg_reading_score, data = merged) summary(regmodel) # > summary(regmodel) # # Call: # lm(formula = avg_math_score ~ avg_reading_score, data = merged) # # Residuals: # Min 1Q Median 3Q Max # -2.03644 -1.26856 0.04094 1.11831 1.96356 # # Coefficients: # Estimate Std. Error t value Pr(>|t|) # (Intercept) 3.6131 8.6662 0.417 0.681 # avg_reading_score 0.9226 0.1032 8.939 6.07e-09 *** # --- # Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 # # Residual standard error: 1.376 on 23 degrees of freedom # Multiple R-squared: 0.7765, Adjusted R-squared: 0.7668 # F-statistic: 79.9 on 1 and 23 DF, p-value: 6.072e-09 ##### Q4. 20 points # You are a leading European football (i.e. soccer) statistician working for your favorite childhood # club, AFC Richmond. You have three star goal scorers that go by the name 'Sam,' 'Dani,' and 'Jamie' who # wear the numbers 24, 14, and 9 respectively. The average number of goals scored per game by Sam is 0.9,
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
# compared to 1.0 for Dani, and 1.2 for Jamie. The remaining twenty-two players on the team average a # combined .7 goals per game. Assume that each player's goal scoring ability follow an independent Poisson # distribution. You and your colleagues have determined that your team must score a certain total number # of goals in any given game in order to win that game. # a. 10pts. Write a R function called "my.vec.goals.calc" that takes in the following inputs: # 1. a dataframe with 'S24', 'D14, 'J09', 'O22' representing the number of goals score by # Sam, Dani, Jamie and the Other 22 players on the roster respectively in a given game # 2. a number (call it "min_G") for the minimum number of goals your club must score to win a given game. # It should output a probability of your team winning the game, based on the dataframe and the min_G # threshold. # For example, if you input this data frame into the function along with a min_G value = 3, # S24 D14 J09 O22 # 0 1 1 1 # 1 1 2 0 # 1 1 1 0 # 0 0 2 1 # you should get a probability of 0.25 as output, since only 1 of the 4 rows (Row 2) has a total goal # score > min_G. Notice that all other rows have a row total of <= 3. my.vec.goals.calc <- function(df, min_G) { df$row_sum <- apply(df, 1, sum) prob_win <- sum(df$row_sum > min_G) / nrow(df) return(prob_win) } # b. 5pts. Set seed to 20230101. Create a data frame called my.vec.data.goals with 10000 replicates of goal # counts for each player group (i.e. column) next game. Do this for all 4 columns. This data frame should have a # dimension of 10000 x 4 and the column names should be the same as before. set.seed(20230101) S_mean <- 0.9 D_mean <- 1.0 J_mean <- 1.2 O_mean <- 0.7 my.vec.data.goals <- data.frame( S24 = rpois(10000, S_mean), D14 = rpois(10000, D_mean), J09 = rpois(10000, J_mean), O22 = rpois(10000, O_mean)) my.vec.data.goals
# c. 10pts. Using this data, estimate the probability of your team scoring less than or equal to: # i. 4 goals # ii. 2 goals # Hint: if X and Y are binomial then X = 1 - Y 1-(my.vec.goals.calc(my.vec.data.goals,4)) # [1] 0.6598 1-(my.vec.goals.calc(my.vec.data.goals,2)) # [1] 0.2601 # d. 5pts. Print the probability of scoring less than the given number of goals in a formatted statement # using the paste command. # Eg: for question 3ci, the output should say "AFC Richmond has a 0.xxx chance of scoring less than or # equal to 4 goals in their next matchup. prob4 <- 1-(my.vec.goals.calc(my.vec.data.goals,4)) prob2 <- 1-(my.vec.goals.calc(my.vec.data.goals,2)) paste("AFC Richmond has a", round(prob4, 4), "chance of scoring less than or equal to 4 goals in their next matchup.") paste("AFC Richmond has a", round(prob2, 4), "chance of scoring less than or equal to 2 goals in their next matchup.") # > paste("AFC Richmond has a", round(prob4, 4), "chance of scoring less than or equal to 4 goals in their next matchup.") # [1] "AFC Richmond has a 0.6598 chance of scoring less than or equal to 4 goals in their next matchup." # > paste("AFC Richmond has a", round(prob2, 4), "chance of scoring less than or equal to 2 goals in their next matchup.") # [1] "AFC Richmond has a 0.2601 chance of scoring less than or equal to 2 goals in their next matchup.