STAT3032_004_HW6_S2023_Solution (Shared)


University of Minnesota-Twin Cities

STAT 3032 Regression and Correlated Data
Homework 6 (Solution)

Please show your work on each problem for full credit. A correct answer unsupported by the necessary explanation, R code, or output will receive very little, if any, credit. Your work needs to be organized in a reasonably neat and coherent way, and submitted as a pdf file on Canvas. Please do not share this handout outside the class.

Problem 1

The Current Population Survey (CPS) is used to supplement census information between census years. The data file cps1985.csv contains a random sample of 534 persons from the CPS data collected in 1985, with information on wages and other characteristics of the workers. The variables we will use in the analyses are listed below:

wage     Wage (dollars per hour).
age      The age of the worker.
union    Whether the worker has union membership. The possible values are “Yes” and “No”.

Download the cps1985.csv data file from Canvas. Import the dataset into R and answer the following questions.

(a) [1 pt] Fit Model A (wage ~ 1 + age + union) and provide the model summary.

Solution:

> modA = lm(wage~1+age+union,data = cps1985)
> summary(modA)

Call:
lm(formula = wage ~ 1 + age + union, data = cps1985)

Residuals:
   Min     1Q Median     3Q    Max
-8.862 -3.352 -1.187  1.977 36.929

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  6.09967    0.71624   8.516  < 2e-16 ***
age          0.07009    0.01866   3.757 0.000191 ***
unionYes     1.90745    0.56921   3.351 0.000862 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 5.015 on 531 degrees of freedom
Multiple R-squared:  0.05138,  Adjusted R-squared:  0.04781
F-statistic: 14.38 on 2 and 531 DF,  p-value: 8.282e-07

(b) [2 pts] Interpret the slope of age in Model A in context.

Solution: Holding union membership constant, for every 1 year increase in age, the wage increases by 0.07009 dollars per hour on average.

(c) [2 pts] Interpret the slope of unionYes in Model A in context.

Solution: Holding age constant, workers who have union membership earn 1.90745 dollars per hour more on average than those without union membership.

Alternatively: holding age constant, the wage increases by 1.90745 dollars per hour on average when switching from not being in the union to being in the union.

(d) [1 pt] Fit Model B (wage ~ 1 + age + union + age:union) and provide the model summary.

Solution:

> modB = lm(wage~1+age+union+age:union,data = cps1985)
> summary(modB)

Call:
lm(formula = wage ~ 1 + age + union + age:union, data = cps1985)

Residuals:
   Min     1Q Median     3Q    Max
-8.123 -3.342 -1.167  1.879 37.137

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)   5.60183    0.78091   7.173 2.48e-12 ***
age           0.08385    0.02055   4.081 5.18e-05 ***
unionYes      4.93788    1.99130   2.480   0.0135 *
age:unionYes -0.07736    0.04872  -1.588   0.1129
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 5.008 on 530 degrees of freedom
Multiple R-squared:  0.05587,  Adjusted R-squared:  0.05053
F-statistic: 10.45 on 3 and 530 DF,  p-value: 1.084e-06

(e) [2 pts] Based on Model B, write down the equations of the fitted models (wage ~ 1 + age) for the workers with union membership and without union membership. Does age have a larger impact on the wage for the workers with union membership or without union membership? Please explain your answer.

Hint: Compare the slopes of age in these two equations.

Solution: Denote the expected wage with and without union membership as $\widehat{\text{wage}}_U$ and $\widehat{\text{wage}}_N$, respectively. Then

$\widehat{\text{wage}}_U = (5.60183 + 4.93788) + (0.08385 - 0.07736)\,\text{age} = 10.53971 + 0.00649\,\text{age}$

$\widehat{\text{wage}}_N = 5.60183 + 0.08385\,\text{age}$

Since the slope is larger and positive for those without union membership, age has a larger impact on wage for workers without union membership.

(f) [2 pts] Interpret the coefficient of age in Model B in context.

Hint: Consider the workers without union membership.

Solution: For every 1 year increase in age of a worker without union membership, the wage increases by 0.08385 dollars per hour on average.

Problem 2

In this problem, we continue to use the cps1985.csv data. The variables we will use in the analyses are listed below:

wage      Wage (dollars per hour).
exper     Number of years of work experience.
sector    Worker sector. The values are “clerical”, “const”, “manag”, “manuf”, “other”, “prof”, “sales”, and “service”.

(a) [1 pt] Fit Model C that uses exper to predict wage and provide the model summary.

Solution:

> modC = lm(wage~1 + exper,data = cps1985)
> summary(modC)

Call:
lm(formula = wage ~ 1 + exper, data = cps1985)

Residuals:
   Min     1Q Median     3Q    Max
-8.247 -3.601 -1.111  2.332 36.084

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  8.37997    0.38895  21.545   <2e-16 ***
exper        0.03614    0.01793   2.016   0.0443 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 5.124 on 532 degrees of freedom
Multiple R-squared:  0.007579,  Adjusted R-squared:  0.005714
F-statistic: 4.063 on 1 and 532 DF,  p-value: 0.04433

(b) [1 pt] Fit Model D that uses exper and sector (main effects only) to predict wage and provide the model summary.

Solution:

> modD = lm(wage~1+exper+sector,data = cps1985)
> summary(modD)

Call:
lm(formula = wage ~ 1 + exper + sector, data = cps1985)

Residuals:
    Min      1Q  Median      3Q     Max
-12.011  -3.001  -0.945   1.924  32.679

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)    6.51341    0.55243  11.791  < 2e-16 ***
exper          0.05172    0.01643   3.147  0.00174 **
sectorconst    1.87394    1.14080   1.643  0.10105
sectormanag    5.25580    0.78286   6.714 4.94e-11 ***
sectormanuf    0.49955    0.73441   0.680  0.49667
sectorother    1.19459    0.73445   1.627  0.10444
sectorprof     4.63452    0.65406   7.086 4.47e-12 ***
sectorsales    0.12505    0.88767   0.141  0.88802
sectorservice -1.02039    0.69479  -1.469  0.14253
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.638 on 525 degrees of freedom
Multiple R-squared:  0.1978,  Adjusted R-squared:  0.1856
F-statistic: 16.18 on 8 and 525 DF,  p-value: < 2.2e-16
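As a reading aid for the Model D coefficient table, the fitted equations for a few sectors can be written out by hand. This is only a sketch using the rounded coefficients printed in the summary. The sector clerical has no dummy variable in the output, so it is taken here to be the baseline level, with the intercept belonging to it. Only three of the eight sectors are shown; the others follow the same pattern.

```latex
% Fitted Model D for selected sectors (coefficients copied from the summary output).
% Each sector shifts the intercept; the slope of exper is shared by all sectors.
\begin{align*}
\text{clerical (baseline):}\quad \widehat{\text{wage}} &= 6.51341 + 0.05172\,\text{exper} \\
\text{manag:}\quad \widehat{\text{wage}} &= (6.51341 + 5.25580) + 0.05172\,\text{exper}
  = 11.76921 + 0.05172\,\text{exper} \\
\text{service:}\quad \widehat{\text{wage}} &= (6.51341 - 1.02039) + 0.05172\,\text{exper}
  = 5.49302 + 0.05172\,\text{exper}
\end{align*}
```

Every equation has the same slope for exper and differs only in its intercept.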
(c) [1 pt] What would Model D look like in space? Please explain.

Solution: Model D would be 8 parallel lines. Since the model contains one quantitative variable (exper), the shape will be a line. Since there is one categorical predictor variable (sector) with 8 levels, there will be 8 lines. Since there is no interaction, the lines are parallel to each other.

(d) [1 pt] What is the difference between the degrees of freedom of RSS in Model C and Model D? Your answer should be a number. Briefly explain why this is not 1.

Solution: The difference between the degrees of freedom is 532 − 525 = 7. This is because the additional predictor variable in Model D (sector) has 8 levels and therefore requires 7 dummy variables, and we need to use 7 degrees of freedom to estimate their slopes.

(e) [3 pts] Use the Partial F test to compare Model C and Model D.
(i) [1 pt] What are the null and alternative hypotheses? Please define the parameters.
(ii) [1 pt] What is the test statistic value?
(iii) [1 pt] Based on the test result, do you prefer Model C or Model D? Please explain. You may use 0.05 as the significance level.

Solution:

(i) The hypotheses are

$H_0: \beta_2 = \beta_3 = \beta_4 = \beta_5 = \beta_6 = \beta_7 = \beta_8 = 0$
$H_A:$ at least one $\beta_j \neq 0$ for $j = 2, 3, \dots, 7, 8$

where $\beta_2, \dots, \beta_8$ are the coefficients of the dummy variables for sector.

OR:

$H_0$: wage ~ 1 + exper
$H_A$: wage ~ 1 + exper + sector

OR:

$H_0: E(\text{wage}) = \beta_0 + \beta_1\,\text{exper}$
$H_A: E(\text{wage}) = \beta_0 + \beta_1\,\text{exper} + \beta_2\,\text{sectorconst} + \beta_3\,\text{sectormanag} + \beta_4\,\text{sectormanuf} + \beta_5\,\text{sectorother} + \beta_6\,\text{sectorprof} + \beta_7\,\text{sectorsales} + \beta_8\,\text{sectorservice}$

(ii) From R, the anova output is

> anova(modC,modD)
Analysis of Variance Table

Model 1: wage ~ 1 + exper
Model 2: wage ~ 1 + exper + sector
  Res.Df   RSS Df Sum of Sq      F    Pr(>F)
1    532 13970
2    525 11292  7      2678 17.787 < 2.2e-16 ***

Thus, the test statistic value is F = 17.787 with a corresponding p value less than $2.2 \times 10^{-16}$, which is almost 0.

(iii) Since the p value is less than the significance level, we reject the null and conclude there should be a different intercept for each level of sector, i.e., that Model C is insufficient. Model D is preferred.

(f) [1 pt] Review the summary of Model D in Part (b). The coefficient table contains the results of the t tests for the coefficients ($H_0$: coefficient = 0 vs. $H_A$: coefficient ≠ 0). Let’s focus on the 7 coefficients of the dummy variables of sector. Do all of these t tests reject the null hypotheses (coefficient = 0)? Please use 0.05 as the significance level.

Solution: No, only the t tests for the slopes of sectormanag and sectorprof reject the null hypotheses.

(g) [2 pts] Explain in words why the conclusion of the Partial F test in Part (e) does not contradict the results of the t tests in Part (f).

Solution: The partial F test is a test of whether all of the slopes of the dummy variables for sector are zero vs. whether at least one is nonzero. We rejected the null hypothesis in the partial F test, which means that some slopes are not zero; it does not require all 7 slopes to be nonzero. This conclusion is consistent with the results of the t tests, since the t tests show that 2 slopes are significantly different from 0.
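For Part (e)(ii), the F statistic reported by anova() can be reproduced by hand from the residual sums of squares and degrees of freedom in the table. The sketch below uses the rounded RSS values printed in the output, so the result agrees with R only up to rounding. The subscripts C and D are labels introduced here for Model C and Model D.

```latex
% Partial F statistic for nested models C (reduced) and D (full),
% using the rounded RSS and Res.Df values from the anova table.
\begin{align*}
F &= \frac{(RSS_C - RSS_D)/(df_C - df_D)}{RSS_D/df_D} \\
  &= \frac{(13970 - 11292)/(532 - 525)}{11292/525} \\
  &= \frac{2678/7}{11292/525}
  \approx \frac{382.57}{21.51}
  \approx 17.79
\end{align*}
```

The numerator degrees of freedom (7) match the number of sector dummy variables tested in Part (e)(i), and the denominator degrees of freedom (525) are the residual degrees of freedom of Model D.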