STAT3032_HW9_Section001_S2023_Solution

School

University of Minnesota-Twin Cities *

Course

3032

Subject

Statistics

Date

Feb 20, 2024

STAT 3032 Regression and Correlated Data
Homework 9

Please show your work on each problem for full credit. A correct answer, unsupported by the necessary explanation, R code or output will receive very little if any credit. Your work needs to be organized in a reasonably neat and coherent way, and submitted as a pdf file on Canvas. Please do not share this handout outside the class.

Problem 1

On April 15, 1912, during her maiden voyage, the ship Titanic sank after colliding with an iceberg, killing many passengers and crew. Here, we will use a subset of the data to analyze the survival rates for different groups of people. Please download the dataset TitanicPartial_v2.csv from Canvas and work through the following questions.

Variables used in this analysis:
Survival: survival status, 1 = survived, 0 = did not survive
Pclass: passenger class, 1 = First, 2 = Second, 3 = Third
Age: age in years
SibSp: number of siblings/spouses aboard

(a) Explore the data. How many passengers are included in the dataset? How many of them survived and how many of them did not survive? Please explain how you obtain the answers.

> dat = read.csv('TitanicPartial_v2.csv')
> table(dat$Survival)

  0   1
424 290

424 passengers did not survive; 290 passengers survived; the total number of passengers is 714 (= 290 + 424).

Alternatively, you can use the following code to obtain the answers.

> nrow(dat)
[1] 714
> sum(dat$Survival)
[1] 290
> 714 - 290
[1] 424

424 passengers did not survive; 290 passengers survived; the total number of passengers is 714 (= 290 + 424).

(b) See below for the scatterplots of the survival status of the passengers in the different classes. Each dot represents a passenger. Based on the scatterplots only, which passenger class has the lowest odds for survival for those who are in the age group 40-50? Please explain your answer.

Hint: look at the relative number of survival and non-survival in each group. For a particular group of passengers,

\[ \text{estimated odds of survival} = \frac{\text{proportion of survivors}}{\text{proportion of the deceased}} \]

In the red box (age group 40-50), we can see that the proportion of survivors is lower in the third class than any other. So the third class has the lowest odds of survival.

(c) Fit the following regression model using a suitable generalized linear model and provide the model summary. Explain your choice of GLM.
mod1: Survival ~ 1 + as.factor(Pclass) + Age

We will use a logistic regression model. The reason for this choice is that the response is binary (survive or not survive) and we are interested in probability/odds of survival.

> mod1 = glm(Survival ~ 1 + as.factor(Pclass) + Age, data=dat, family=binomial)
> summary(mod1)

Call:
glm(formula = Survival ~ 1 + as.factor(Pclass) + Age, family = binomial, data = dat)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.1524  -0.8466  -0.6083   1.0031   2.3929

Coefficients:
                    Estimate Std. Error z value Pr(>|z|)
(Intercept)         2.296012   0.317629   7.229 4.88e-13 ***
as.factor(Pclass)2 -1.137533   0.237578  -4.788 1.68e-06 ***
as.factor(Pclass)3 -2.469561   0.240182 -10.282  < 2e-16 ***
Age                -0.041755   0.006736  -6.198 5.70e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 964.52 on 713 degrees of freedom
Residual deviance: 827.16 on 710 degrees of freedom
AIC: 835.16

Number of Fisher Scoring iterations: 4

(d) What happens if we don’t apply the as.factor( ) function to Pclass? Try fitting mod1 without as.factor( ). You can call this new model mod2. How do the summary outputs of mod1 and mod2 differ?

Hint: how many slope(s) is/are associated with Pclass when you don’t use as.factor( )?

> mod2 = glm(Survival ~ 1 + Pclass + Age, data=dat, family=binomial)
> summary(mod2)

Call:
glm(formula = Survival ~ 1 + Pclass + Age, family = binomial, data = dat)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.1712  -0.8550  -0.6136   1.0127   2.3883

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)  3.585448   0.406761   8.815  < 2e-16 ***
Pclass      -1.243853   0.119060 -10.447  < 2e-16 ***
Age         -0.042006   0.006725  -6.246  4.2e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 964.52 on 713 degrees of freedom
Residual deviance: 827.43 on 711 degrees of freedom
AIC: 833.43

Number of Fisher Scoring iterations: 4

Only 1 slope is associated with Pclass when we don’t use as.factor( ), whereas in Part (c) Pclass has 2 slopes, one for each of its dummy variables. Without as.factor( ), R treats Pclass as a quantitative variable.

(e) Interpret the slope of Age in mod1 in context.

Solution 1 (note that Solutions 2 and 3 are preferred but not required):
Controlling for the passenger class, when the age increases by 1 year, the log of the odds of survival decreases by 0.0418.

> exp(-0.041755)-1
[1] -0.04089527

Solution 2:
Controlling for the passenger class, when the age increases by 1 year, the odds of survival decrease by 4.1%.

Solution 3:
Controlling for the passenger class, when the age increases by 1 year, the odds of survival are multiplied by 0.96 ($= e^{-0.041755}$).

(f) Based on mod1, which passenger class has the highest probability for survival for passengers of the same age? Which class has the lowest probability for survival for passengers of the same age? Please explain your answer.

After controlling for age, the first class has the highest probability of survival. The third class has the lowest probability of survival.

Explanation: Based on the model summary in Part (c),

\[ \widehat{\log(\text{odds})} = 2.296 - 1.138\,\text{as.factor(Pclass)2} - 2.470\,\text{as.factor(Pclass)3} - 0.042\,\text{Age} \]

Let $\widehat{\log(\text{odds})}_1$, $\widehat{\log(\text{odds})}_2$, and $\widehat{\log(\text{odds})}_3$ be the estimated log odds of survival for first class, second class, and third class passengers.

\[ \widehat{\log(\text{odds})}_1 = 2.296 - 0.042\,\text{Age} \]
\[ \widehat{\log(\text{odds})}_2 = 2.296 - 1.138 - 0.042\,\text{Age} \]
\[ \widehat{\log(\text{odds})}_3 = 2.296 - 2.470 - 0.042\,\text{Age} \]

Therefore, when age is the same, $\widehat{\log(\text{odds})}_1 > \widehat{\log(\text{odds})}_2 > \widehat{\log(\text{odds})}_3$. Since the probability follows the same order as the log odds, we know that the First class passengers have the highest probability of survival and Third class passengers have the lowest probability of survival.

(g) Fit the following regression model (using the same type of GLM as mod1) and provide the model summary.

mod3: Survival ~ 1 + as.factor(Pclass) * Age

> mod3 = glm(Survival ~ 1 + as.factor(Pclass)*Age, data=dat, family=binomial)
> summary(mod3)

Call:
glm(formula = Survival ~ 1 + as.factor(Pclass) * Age, family = binomial, data = dat)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.1310  -0.8475  -0.6070   1.0001   2.3994

Coefficients:
                       Estimate Std. Error z value Pr(>|z|)
(Intercept)             2.24252    0.49128   4.565 5.00e-06 ***
as.factor(Pclass)2     -1.05325    0.63260  -1.665 0.095921 .
as.factor(Pclass)3     -2.40716    0.56489  -4.261 2.03e-05 ***
Age                    -0.04044    0.01143  -3.537 0.000405 ***
as.factor(Pclass)2:Age -0.00236    0.01687  -0.140 0.888748
as.factor(Pclass)3:Age -0.00172    0.01605  -0.107 0.914654
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 964.52 on 713 degrees of freedom
Residual deviance: 827.14 on 708 degrees of freedom
AIC: 839.14

Number of Fisher Scoring iterations: 4

(h) In the 1997 movie Titanic, the main character Jack Dawson was a 20-year-old male passenger in the 3rd class. Please predict his probability of survival based on mod3.

> predict(mod3, newdata = data.frame(Age=20, Pclass=3), type="response")
        1
0.2674082

Jack’s estimated probability of survival is 0.267.

You can also use the equation of the model. Let $P_2$ represent as.factor(Pclass)2 and $P_3$ represent as.factor(Pclass)3.

\[ \widehat{\text{Survival}} = \frac{1}{1 + \exp\bigl(-(2.24252 - 1.05325\,P_2 - 2.40716\,P_3 - 0.04044\,\text{Age} - 0.00236\,P_2 \times \text{Age} - 0.00172\,P_3 \times \text{Age})\bigr)} \]
\[ = \frac{1}{1 + \exp\bigl(-(2.24252 - 2.40716 - 0.04044 \times 20 - 0.00172 \times 1 \times 20)\bigr)} = 0.267 \]

> 1/(1+exp(-(2.24252-2.40716-0.04044*20-0.00172*1*20)))
[1] 0.2674028
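The hand calculation for Jack can be wrapped in a small function so that the same arithmetic is reusable for any class and age. This is only a sketch: the coefficients are typed in from the printed mod3 summary (so they carry its rounding), and the helper name mod3_prob is hypothetical, not part of the original solution.

```r
# Coefficients copied from the printed mod3 summary (rounded as shown there).
b <- c(intercept = 2.24252, class2 = -1.05325, class3 = -2.40716,
       age = -0.04044, class2_age = -0.00236, class3_age = -0.00172)

# Hypothetical helper: estimated survival probability for a class (1, 2, 3) and age.
mod3_prob <- function(pclass, age) {
  p2 <- as.numeric(pclass == 2)  # dummy variable for second class
  p3 <- as.numeric(pclass == 3)  # dummy variable for third class
  eta <- b[["intercept"]] + b[["class2"]] * p2 + b[["class3"]] * p3 +
    b[["age"]] * age +
    b[["class2_age"]] * p2 * age + b[["class3_age"]] * p3 * age
  plogis(eta)  # inverse logit, i.e. 1 / (1 + exp(-eta))
}

# Jack: third class, 20 years old.
p_jack <- mod3_prob(pclass = 3, age = 20)
print(round(p_jack, 3))

# Estimated probabilities for a 20-year-old in each of the three classes.
p_by_class <- sapply(1:3, mod3_prob, age = 20)
print(round(p_by_class, 3))
```

Because the coefficients are rounded, the result agrees with predict() only to about four decimal places, as in the hand calculation.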

(i) In the 1997 movie Titanic, Jack Dawson did not survive in the end. Is the prediction in Part (h) consistent with the truth of what happened to Jack?

The estimated probability of Jack’s survival is 0.267, which is less than 0.5. This means that he was more likely to die than survive. This low probability of survival is consistent with the reality that Jack did not survive.

(j) Based on mod3, does class affect the impact of age on passenger survival? Please explain your answer.

The best way to answer this question is by using the chi-squared test done in the next part, and the answer will be no because the p-value is very large (0.989).

Suppose $\beta_4$ and $\beta_5$ are the slopes of the interaction terms in mod3.

$H_0: \beta_4 = \beta_5 = 0$
$H_A$: at least one of $\beta_4$ and $\beta_5$ is not 0.

OR

$H_0$: mod1
$H_A$: mod3

The test statistic follows the $\chi^2_2$ distribution (the chi-squared distribution with 2 degrees of freedom).

As an alternative, the individual z-tests in the model summary could be used, but this is potentially misleading, because the individual tests can all be non-significant while the joint test is significant.

(k) Use the chi-squared test to compare mod1 (fitted in Part c) and mod3 (fitted in Part g):

mod1: Survival ~ 1 + as.factor(Pclass) + Age
mod3: Survival ~ 1 + as.factor(Pclass) * Age

Please note that in mod3, the term as.factor(Pclass) * Age is equivalent to as.factor(Pclass) + Age + as.factor(Pclass):Age. Conduct this test, noting the null and alternative hypotheses. Interpret your result with a significance level of 0.05.
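As a rough cross-check of this comparison, the test statistic can be rebuilt by hand from the residual deviances and degrees of freedom printed in the mod1 and mod3 summaries. The sketch below assumes only those printed values; because they are rounded to two decimals, the statistic and p-value will differ slightly from what anova() reports.

```r
# Residual deviances and degrees of freedom as printed in the two summaries.
dev_mod1 <- 827.16; df_mod1 <- 710   # reduced model (no interaction)
dev_mod3 <- 827.14; df_mod3 <- 708   # full model (with interaction)

# Drop in deviance and the number of extra parameters in the full model.
lr_stat <- dev_mod1 - dev_mod3
lr_df   <- df_mod1 - df_mod3

# Upper-tail probability of a chi-squared distribution with lr_df degrees of freedom.
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

print(c(statistic = lr_stat, df = lr_df, p.value = p_value))
```

A p-value far above 0.05 means the two interaction slopes add essentially nothing beyond mod1.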

> anova(mod1, mod3, test='Chisq')
Analysis of Deviance Table

Model 1: Survival ~ 1 + as.factor(Pclass) + Age
Model 2: Survival ~ 1 + as.factor(Pclass) * Age
  Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1       710     827.16
2       708     827.14  2 0.021596   0.9893

Since the p-value (0.9893) is large, we prefer the simpler model (mod1).

More completely: since the p-value is larger than the significance level of 0.05, we do not have sufficient evidence to conclude that class affects the impact of age on passenger survival.

Problem 2

We will now use the same data, but predict the number of siblings/spouses the passengers have.

(a) Fit the following regression model using a suitable generalized linear model and provide the model summary. Justify your choice of GLM.

mod4: SibSp ~ 1 + as.factor(Pclass) + Age

We should choose a Poisson regression model because the response is a count and it is not out of some known number of trials.

> mod4<-glm(SibSp~1+as.factor(Pclass)+Age, family="poisson", data=TitanicPartial_v2)
> summary(mod4)

Call:
glm(formula = SibSp ~ 1 + as.factor(Pclass) + Age, family = "poisson", data = TitanicPartial_v2)

Coefficients:
                    Estimate Std. Error z value Pr(>|z|)
(Intercept)         0.730628   0.163692   4.463 8.07e-06 ***
as.factor(Pclass)2 -0.415390   0.162645  -2.554   0.0107 *
as.factor(Pclass)3 -0.269089   0.136201  -1.976   0.0482 *
Age                -0.045604   0.004222 -10.800  < 2e-16 ***
---

(b) Interpret the slope/coefficient of Age from mod4 in context.

Interpretations of Poisson regression coefficients are similar to those for a model with a log-transformed response. One solution is: controlling for passenger class, a one-year increase in passenger age is associated with a decrease of 0.0456 in the log of the average number of siblings and spouses.

(bonus 2pts) Plot SibSp by Age, coloring by passenger class. Include the fit lines from mod4 with colors corresponding to the class they model.

There are a lot of ways to do this. One simple but rather inelegant approach is given below.

> plot(jitter(SibSp)~Age, col=Pclass, data=TitanicPartial_v2, pch=16)
> x<-0:80
> lines(exp(0.730628-0.045604*x)~x, col=1, lwd=3)
> lines(exp(0.730628-0.045604*x-0.415390)~x, col=2, lwd=3)
> lines(exp(0.730628-0.045604*x-0.269089)~x, col=3, lwd=3)
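The log-scale interpretation of the Age slope can also be restated on the original count scale by exponentiating, in the same way as the odds interpretation in Problem 1(e). The sketch below assumes only the coefficients printed in the mod4 summary; the helper mod4_mean is hypothetical and simply evaluates the fitted curves that the plotting code draws.

```r
# Coefficients copied from the printed mod4 summary (log link).
b0       <- 0.730628
b_class2 <- -0.415390
b_class3 <- -0.269089
b_age    <- -0.045604

# Multiplicative change in the mean number of siblings/spouses per extra year of age.
rate_ratio_age <- exp(b_age)
pct_change_age <- 100 * (rate_ratio_age - 1)
print(round(c(ratio = rate_ratio_age, percent = pct_change_age), 3))

# Hypothetical helper: fitted mean SibSp for a class (1, 2, 3) and age.
mod4_mean <- function(pclass, age) {
  exp(b0 + b_class2 * (pclass == 2) + b_class3 * (pclass == 3) + b_age * age)
}

# Fitted means at age 30 for the three classes.
means_age30 <- sapply(1:3, mod4_mean, age = 30)
print(round(means_age30, 3))
```

Under these assumptions, each additional year of age multiplies the expected number of siblings/spouses by roughly 0.955, a decrease of about 4.5%, holding passenger class fixed.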