S23 - Assignment #3 - Solutions

STAT 371 S23 Assignment #3 (Submission deadline: 11:59 pm Fri. Jul. 14th) Solutions ( /70)

In this assignment, we will continue with developing a suitable regression model for the CEO data from Assignment #2, beginning with your fitted model used in 2e) of Assignment #2 (i.e. the model without the Background variate).

1) [5] Plot the residuals vs the fitted values, as well as a QQ plot. Comment on the adequacy of the fitted model, in terms of the model assumptions.

We do not appear to have an adequate model. The pattern evident in the plot of the residuals vs the fitted values reveals a misspecification of the functional form and/or non-constant variance. The departure from a straight-line relationship in the QQ plot contradicts the assumption of normal errors.

2) One approach to stabilizing the variance of the residuals and/or more adequately describing the relationship between a response variate and the explanatory variates is an appropriate transformation of the response variate.

a) [3] Create a histogram of CEO compensation. What characteristic of this variate might lead you to suspect that a log transformation may be suitable?

The right-skewness in the distribution suggests that a log transformation might help to normalize the response.
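These diagnostics can be produced with a few base-R calls. The sketch below assumes the CEO data are in a data frame named ceo and that the Assignment #2 fit (without the Background variate) is stored as ceo.lm; both names are assumptions, not taken from the solution output.

# Fit from Assignment #2 2e), without the Background variate (names assumed)
ceo.lm <- lm(COMP ~ AGE + EDUCATN + TENURE + EXPER + SALES + VAL + PCNTOWN + PROF, data = ceo)

# 1) Residuals vs fitted values, and a QQ plot of the residuals
plot(fitted(ceo.lm), resid(ceo.lm), xlab = "Fitted values", ylab = "Residuals", pch = 19)
abline(h = 0, lty = 2)
qqnorm(resid(ceo.lm))
qqline(resid(ceo.lm))

# 2a) Histogram of the untransformed response
hist(ceo$COMP, xlab = "COMP", main = "CEO compensation")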
b) [2] Refit the data using the (natural) log transformation of compensation.

Call:
lm(formula = log(COMP) ~ AGE + EDUCATN + TENURE + EXPER + SALES + VAL + PCNTOWN + PROF)

Coefficients:
              Estimate  Std. Error t value Pr(>|t|)
(Intercept)  6.897e+00   7.211e-01   9.565 7.02e-13
AGE         -1.938e-03   1.208e-02  -0.160  0.87324
EDUCATN     -3.082e-01   1.160e-01  -2.658  0.01054
TENURE       7.004e-03   6.981e-03   1.003  0.32051
EXPER        1.533e-02   9.554e-03   1.605  0.11489
SALES        2.508e-05   1.636e-05   1.533  0.13151
VAL          1.236e-03   6.158e-04   2.008  0.05011
PCNTOWN     -7.308e-02   2.699e-02  -2.708  0.00924
PROF         2.325e-04   3.502e-04   0.664  0.50968
---
Residual standard error: 0.4705 on 50 degrees of freedom
Multiple R-squared: 0.4178, Adjusted R-squared: 0.3246
F-statistic: 4.485 on 8 and 50 DF, p-value: 0.0003771

c) [2] Compare the overall fit of the model and the significance of the individual parameters with that of the original (untransformed) model.

An R-squared value of .4178 indicates that less than 42% of the variation in (log) compensation is accounted for by the variables in the model. This is a slight improvement over the fit of the untransformed model (.4031). PCNTOWN and EDUCATN appear to be the only variables with a significant relationship with compensation, after accounting for the other variables. Note that EXPER has been rendered insignificant by the transformation.

d) [4] Replot the two residual plots in 1). Has the transformation helped to address the issues with the adequacy of the (untransformed) model?

Yes, the transformation has certainly helped to address the model adequacy issues. The plot of the residuals vs fitted values is more randomly scattered. Improvement is also seen in the QQ plot.
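The refit in 2b) could be obtained along these lines (a sketch; the object name ceo.logC.lm and the data frame name ceo are assumptions):

# 2b) Same explanatory variates, with the natural log of compensation as the response
ceo.logC.lm <- lm(log(COMP) ~ AGE + EDUCATN + TENURE + EXPER + SALES + VAL + PCNTOWN + PROF, data = ceo)
summary(ceo.logC.lm)
# 2d) The residuals-vs-fitted and qqnorm()/qqline() calls from 1), applied to
#     ceo.logC.lm, give the replotted diagnostics.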
3) We can also investigate the suitability of transformations of one or more of the explanatory variates by looking at scatterplots of the variates vs the response (log(COMP), in this case).

a) [3] Create a scatterplot of SALES vs log(COMP). Does a linear model seem appropriate for these two variates?

No, the relationship between log(COMP) and SALES is not linear.

b) [3] Create a scatterplot of log(SALES) vs log(COMP). Comment.

The relationship between log(COMP) and log(SALES) appears much more linear (although there appears to be some non-linearity in the relationship for high sales).

c) [4] Refit the model once again, this time taking the log transformation of compensation as well as of the variates SALES, VAL, PCNTOWN and PROF. We will use this model going forward. Comment on the effect these transformations have on the overall fit of the model, and on the p-values of the associated variates.
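A possible R sketch for 3a)-3c): the data frame name ceo is an assumption, while the model object name ceo.log.lm mirrors the name that appears in the output for question 8).

# 3a), 3b) Scatterplots of the response against SALES and log(SALES)
plot(ceo$SALES, log(ceo$COMP), xlab = "SALES", ylab = "log(COMP)", pch = 19)
plot(log(ceo$SALES), log(ceo$COMP), xlab = "log(SALES)", ylab = "log(COMP)", pch = 19)

# 3c) Log-transform the response and the variates SALES, VAL, PCNTOWN and PROF
ceo.log.lm <- lm(log(COMP) ~ AGE + EDUCATN + TENURE + EXPER + log(SALES) + log(VAL) +
                 log(PCNTOWN) + log(PROF), data = ceo)
summary(ceo.log.lm)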
Call:
lm(formula = log(COMP) ~ AGE + EDUCATN + TENURE + EXPER + log(SALES) + log(VAL) + log(PCNTOWN) + log(PROF))

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)   5.531845   0.897256   6.165 1.21e-07
AGE           0.002864   0.012122   0.236  0.81418
EDUCATN      -0.300500   0.114331  -2.628  0.01137
TENURE       -0.003343   0.006511  -0.513  0.60993
EXPER         0.015146   0.010056   1.506  0.13830
log(SALES)    0.188393   0.080064   2.353  0.02260
log(VAL)      0.315447   0.096467   3.270  0.00195
log(PCNTOWN) -0.351228   0.105022  -3.344  0.00157
log(PROF)    -0.221603   0.104300  -2.125  0.03858
---
Residual standard error: 0.4428 on 50 degrees of freedom
Multiple R-squared: 0.4842, Adjusted R-squared: 0.4017
F-statistic: 5.867 on 8 and 50 DF, p-value: 2.732e-05

The transformations of the explanatory variables appear to have improved the fit of the model substantially, as indicated by the increased R-squared value of .4842 (some of you may not experience the same increase, depending on your sample). There are several variables with associated p-values < .05, including education level and all the log-transformed variables (SALES, VAL, PCNTOWN, PROF).

4) [4] Plot the residuals vs the fitted values and the QQ plot for the model in 3). Comment on the effect of the transformations on the model assumptions.

The transformations have improved model adequacy considerably. The model appears to be well specified with a relatively constant variance (based on the plot of the residuals vs fitted values), and the QQ plot suggests that the assumption of normal errors is more reasonably met than with the untransformed variables.
5) [4] Replot the plots in 4) using the studentized residuals. Do you notice any major changes in these plots? Are there any outliers present?

There does not appear to be any substantial difference in the plots of the studentized residuals; the only difference is in the scale of the residuals. There do not appear to be any major outliers present. There is one studentized residual of approximately 2.5, but nothing to be too concerned about, based on the plot.

6) [3] Plot the hat values vs index (observation number). Are there any high leverage points?

There does appear to be one observation (#39) in particular that has high leverage relative to the other observations. A few other observations are also above the established threshold of 2(p + 1)/n = 18/59 ≈ .31.

7) [3] Investigate the observation with the highest leverage for a possible cause.

The observation associated with the highest leverage is #39. Looking at observations #37-#41 for reference:

     COMP AGE EDUCATN TENURE EXPER SALES VAL PCNTOWN PROF
37    514  57       2      3   3.0   661 4.1    0.17   79
38    466  48       2     17   1.0  1539 0.2    0.01  189
39   2244  64       2     41   5.0  4451 4.0    0.04   30
40    476  50       1     20   0.5  3148 1.3    0.05  260
41    809  59       1     38   0.5 19196 0.4    0.01  505

The relatively large values of some variables (e.g. AGE, TENURE, EXPER, VAL), along with the relatively small values of others (e.g. PROF), contributed to the relatively high leverage.
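The diagnostics in 5)-7) could be produced along these lines (a sketch, continuing with the assumed object and data frame names above):

# 5) Studentized residuals vs fitted values, and their QQ plot
stud <- rstudent(ceo.log.lm)
plot(fitted(ceo.log.lm), stud, xlab = "Fitted values", ylab = "Studentized residuals", pch = 19)
qqnorm(stud)
qqline(stud)

# 6) Hat values vs observation number, with the 2(p+1)/n threshold
h <- hatvalues(ceo.log.lm)
plot(h, pch = 19, ylab = "Hat values")
abline(h = 2 * 9 / 59, lty = 2)   # 2(p+1)/n = 18/59, approx. 0.31

# 7) Identify the highest-leverage observation and inspect it with its neighbours
which.max(h)
ceo[37:41, ]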
8) [3] Plot the Cook's Distance values. Are there any influential cases?

> plot(cooks.distance(ceo.log.lm), pch=19)

Although there does appear to be one observation with a relatively large influence, there are no influential cases based on the established threshold of > 1.

9) Now that we have obtained a more adequate model through transformation of the response and some of the explanatory variables, we can further improve the model by using model selection methods to select which subset of variables to include.

a) [4] Use backward selection to arrive at a reasonable model (use α = .15). Show your work.

Starting with the full model:

              Estimate Std. Error t value Pr(>|t|)
(Intercept)   5.531845   0.897256   6.165 1.21e-07
AGE           0.002864   0.012122   0.236  0.81418
EDUCATN      -0.300500   0.114331  -2.628  0.01137
TENURE       -0.003343   0.006511  -0.513  0.60993
EXPER         0.015146   0.010056   1.506  0.13830
log(SALES)    0.188393   0.080064   2.353  0.02260
log(VAL)      0.315447   0.096467   3.270  0.00195
log(PCNTOWN) -0.351228   0.105022  -3.344  0.00157
log(PROF)    -0.221603   0.104300  -2.125  0.03858
---
Residual standard error: 0.4428 on 50 degrees of freedom
Multiple R-squared: 0.4842, Adjusted R-squared: 0.4017
F-statistic: 5.867 on 8 and 50 DF, p-value: 2.732e-05

Removing AGE:

              Estimate Std. Error t value Pr(>|t|)
(Intercept)   5.702274   0.528686  10.786 9.19e-15
EDUCATN      -0.307618   0.109265  -2.815 0.006909
TENURE       -0.002878   0.006150  -0.468 0.641734
EXPER         0.016290   0.008732   1.866 0.067868
log(SALES)    0.189969   0.079043   2.403 0.019920
log(VAL)      0.316256   0.095510   3.311 0.001711
log(PCNTOWN) -0.356238   0.101903  -3.496 0.000988
log(PROF)    -0.229686   0.097613  -2.353 0.022516
Removing TENURE:

              Estimate Std. Error t value Pr(>|t|)
(Intercept)   5.714843   0.524024  10.906 4.78e-15
EDUCATN      -0.295128   0.105158  -2.807 0.007032
EXPER         0.014959   0.008194   1.826 0.073656
log(SALES)    0.184355   0.077539   2.378 0.021139
log(VAL)      0.312138   0.094387   3.307 0.001715
log(PCNTOWN) -0.352216   0.100775  -3.495 0.000978
log(PROF)    -0.236776   0.095704  -2.474 0.016661

Residual standard error: 0.4354 on 52 degrees of freedom
Multiple R-squared: 0.4814, Adjusted R-squared: 0.4216
F-statistic: 8.046 on 6 and 52 DF, p-value: 3.651e-06

This is our selected model, based on backward selection.

b) [6] Use the leaps function in R to select a model, based on Mallow's Cp and adjusted R-squared. (You may first need to install and load the leaps package.) Select the model that yields the largest adjusted R-squared and meets the Mallow's Cp criterion. Comment on the overall fit and the significance of the model parameters.

> library(leaps)
> leaps(cbind(AGE, EDUCATN, …, log(PROF)), log(COMP), method=c("adjr2"), nbest=2,
        names=c("AGE", "EDUCATN", "log(PCNTOWN)", "log(PROF)"))

  AGE   EDUCATN TENURE EXPER log(SALES) log(VAL) log(PCNTOWN) log(PROF)
1 FALSE FALSE   FALSE  FALSE TRUE       FALSE    FALSE        FALSE
1 FALSE FALSE   TRUE   FALSE FALSE      FALSE    FALSE        FALSE
2 FALSE FALSE   FALSE  TRUE  TRUE       FALSE    FALSE        FALSE
2 FALSE TRUE    FALSE  FALSE TRUE       FALSE    FALSE        FALSE
3 FALSE TRUE    FALSE  FALSE FALSE      TRUE     TRUE         FALSE
3 FALSE TRUE    FALSE  TRUE  TRUE       FALSE    FALSE        FALSE
4 FALSE TRUE    FALSE  FALSE FALSE      TRUE     TRUE         TRUE
4 TRUE  TRUE    FALSE  FALSE FALSE      TRUE     TRUE         FALSE
5 FALSE TRUE    FALSE  FALSE TRUE       TRUE     TRUE         TRUE
5 FALSE TRUE    FALSE  TRUE  FALSE      TRUE     TRUE         TRUE
6 FALSE TRUE    FALSE  TRUE  TRUE       TRUE     TRUE         TRUE
6 TRUE  TRUE    FALSE  FALSE TRUE       TRUE     TRUE         TRUE
7 FALSE TRUE    TRUE   TRUE  TRUE       TRUE     TRUE         TRUE
7 TRUE  TRUE    FALSE  TRUE  TRUE       TRUE     TRUE         TRUE
8 TRUE  TRUE    TRUE   TRUE  TRUE       TRUE     TRUE         TRUE

$adjr2
 [1] 0.21301031 0.09380483 0.28349450 0.27501624 0.33457695 0.32424269
 [7] 0.35899106 0.34617966 0.39612056 0.37080026 0.42157977 0.39737544
[13] 0.41276082 0.41032346 0.40168408

With Cp values:

$Cp
 [1] 19.974458 31.330854 14.062076 14.855607 10.168802 11.118774  8.853187
 [8] 10.009458  6.492828  8.735749  5.270853  7.374466  7.055827  7.263585
[15]  9.000000
The model selected is the one that includes all the variables except AGE and TENURE:

              Estimate Std. Error t value Pr(>|t|)
(Intercept)   5.714843   0.524024  10.906 4.78e-15
EDUCATN      -0.295128   0.105158  -2.807 0.007032
EXPER         0.014959   0.008194   1.826 0.073656
log(SALES)    0.184355   0.077539   2.378 0.021139
log(VAL)      0.312138   0.094387   3.307 0.001715
log(PCNTOWN) -0.352216   0.100775  -3.495 0.000978
log(PROF)    -0.236776   0.095704  -2.474 0.016661
---
Residual standard error: 0.4354 on 52 degrees of freedom
Multiple R-squared: 0.4814, Adjusted R-squared: 0.4216
F-statistic: 8.046 on 6 and 52 DF, p-value: 3.651e-06

The adjusted R-squared value (0.4216) is increased from that of the full model (.4017). All the variables but experience are associated with a p-value < .05.

c) [4] Confirm the Mallow's Cp value for this model by calculating the value from information in the summary output of this model and of the full model.

Cp = SS(Res)_p / MS(Res)_full + 2(k + 1) - n

SS(Res) of the selected model: .4354² × 52 = 9.8578
MS(Res) of the full model (see the full model of the backward selection in 9a): .4428²

Cp = 9.8578 / .4428² + 2(7) - 59 = 5.2765

(slight discrepancy from the leaps value of 5.270853 due to round-off error)

d) [1] Did the model selection procedures in a) and b) arrive at the same model?

In this case, yes (however, different model selection methods will not always yield the same model).

e) [3] Confirm that the model you obtained in b) is preferred to the full model (from 3c)) by using the anova function to perform an additional sum of squares test.

Analysis of Variance Table

Model 1: log(COMP) ~ EDUCATN + EXPER + log(SALES) + log(VAL) + log(PCNTOWN) + log(PROF)
Model 2: log(COMP) ~ AGE + EDUCATN + BACKf + TENURE + EXPER + log(SALES) + log(VAL) + log(PCNTOWN) + log(PROF)
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1     52 9.8588
2     46 9.3004  6   0.55845 0.4604 0.8339

With a p-value of .8339, we do not reject the null hypothesis that the additional coefficients are zero, so the reduced model is preferred. After accounting for the other variables, none of AGE, TENURE, or BACKGROUND is significantly related to compensation.
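The arithmetic in 9c) and the comparison in 9e) can be checked directly in R. In the sketch below, the reduced and full model objects (ceo.red.lm, ceo.full.lm), the data frame name ceo, and the Background factor BACKf are assumed names rather than ones given in the solution output.

# 9c) Mallow's Cp for the selected model, computed from the two summaries:
#     Cp = SS(Res)_p / MS(Res)_full + 2(k + 1) - n
(0.4354^2 * 52) / 0.4428^2 + 2 * (6 + 1) - 59   # approx. 5.28

# 9e) Extra sum of squares test comparing the reduced model with the full model
ceo.red.lm  <- lm(log(COMP) ~ EDUCATN + EXPER + log(SALES) + log(VAL) +
                  log(PCNTOWN) + log(PROF), data = ceo)
ceo.full.lm <- lm(log(COMP) ~ AGE + EDUCATN + BACKf + TENURE + EXPER + log(SALES) +
                  log(VAL) + log(PCNTOWN) + log(PROF), data = ceo)
anova(ceo.red.lm, ceo.full.lm)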
f) [4] Plot the studentized residuals vs the fitted values and the QQ plot of the studentized residuals to confirm that your preferred model is adequate in terms of the model assumptions.

The selected model appears adequate. There is no observable pattern in the plot of the residuals vs the fitted values, no outliers, and the variance appears reasonably constant. The normal plot suggests that the assumption of normality of the errors is reasonably well met, although there is some curvature in the plot.

g) [5] Finally, recalculate the 95% prediction interval for the CEO in Assignment #2, 2f) based on your preferred model. Be sure to back transform to the original units. Is the predicted compensation for this CEO more precise? Why or why not?

> new_x_ceo75p.log = data.frame(EDUCATN=1, EXPER=8, SALES=3250, VAL=8.2, PCNTOWN=2, PROF=112)
> predict(ceo.bs3.lm, new_x_ceo75p.log, interval='prediction', level=.95)
       fit      lwr      upr
1 6.325579 5.378192 7.272965

Back-transforming to the original scale yields a 95% prediction interval for this CEO's compensation of (e^5.378192, e^7.272965) = ($216,630, $1,440,816).

Despite having fewer variables, the interval based on our preferred model is much narrower than the interval (-$2,091,111, $2,092,008) based on the original (untransformed) model with all variables included. The increase in precision is a consequence of both the removal of insignificant variables and the improvement in fit resulting from the appropriate transformations.
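The back-transformation itself can be done in one step by exponentiating the interval on the log scale (a short sketch, reusing the object names from the output above):

# 9g) Prediction interval for log(COMP), back-transformed to the original units of COMP
exp(predict(ceo.bs3.lm, new_x_ceo75p.log, interval = "prediction", level = 0.95))
# COMP appears to be recorded in $000s, so e.g. 216.63 corresponds to $216,630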