ISOM835_Lab 7_LR II_solution

School

Suffolk University

Course

835

Subject

Statistics

Date

Jan 9, 2024

Lab 7: Linear Regression – Advanced Topics – Solution
ISOM 835 Predictive Analytics
Sawyer Business School
Dr. Kate Li

This lab consists of two portions:
1. SAS Enterprise Guide Demonstration: Linear Regression – Advanced Topics (20 pts)
2. Questions 1-4 (80 pts)

While you work through each lab, please take notes on anything that you don’t understand or are not sure about. I will give you time to ask questions during the following class.

Complete the following questions by yourself.

Question 1 (22 pts): Examining Residuals

a. (4 pts) Import BodyFat2.csv, run a regression of PctBodyFat2 on Weight, Abdomen, Forearm, and Wrist, and create diagnostic plots. Provide screenshots of the regression result.
b. (4 pts) Do the residual plots indicate any problems with the constant variance assumption?

It does not appear that the data violate the assumption of constant variance.
c. (4 pts) Is there any indication of a nonlinear relationship between the response variable and the predictors in the residual plots?

The residual plots do not suggest any obvious nonlinear relationship.

d. (4 pts) Does the Quantile-Quantile plot indicate any problems with the normality assumption?

The normality assumption seems to be met, although some points at the two ends deviate more noticeably from the reference line.
e. (6 pts) Generate the plots of (1) studentized residuals, (2) Cook’s D, (3) DFFITS, and (4) DFBETAS, and make sure that the observation numbers of potential influential points are printed on the plots. Provide the plots and comment on which observations are identified as potential influential observations based on the suggested cutoff values of the statistics.

Steps:
1) Modify the previous task.
2) With Plots selected at the left, select Custom list of plots and then check the boxes for Studentized residuals by predicted values plot, Plot Cook’s D statistic, DFFITS plots, and DFBETAS plots. Uncheck the box next to Diagnostic Plots.
3) With Predictions selected at the left:
   a. Check the box for Original sample under Data to predict.
   b. Check Predictions and Diagnostic statistics under Save output data.
   c. Check the box for Residuals under Additional statistics.
4) Click Save. Do not replace the results from the previous run.
5) Right-click the previous task and select Add as a Code Template.
6) Double-click the node for the code in order to edit it and find the PROC REG section of the code.
7) Make the following changes:
      PLOTS(ONLY LABEL)=RSTUDENTBYPREDICTED
      PLOTS(ONLY LABEL)=COOKSD
      PLOTS(ONLY LABEL)=DFFITS
      PLOTS(ONLY LABEL)=DFBETAS
   Note: Add the option LABEL within the parentheses after the word PLOTS.
8) Click Save and then select Save As to name and locate the file.
9) Click Run above the code window.

Studentized residual plot: Only a modest number of observations lie farther than two standard error units from the mean of 0.

Cook’s D plot: There are 10 labeled outliers, but observation 39 is clearly the most extreme.
DFFITS plot: The same observations are shown to be influential by the DFFITS statistic.

DFBETAS plot: DFBETAS are extreme for observation 39 on the parameters for Weight and Forearm circumference.
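
The "suggested cutoff values" mentioned in part (e) are commonly taken to be |RSTUDENT| > 2, Cook's D > 4/n, |DFFITS| > 2·sqrt(p/n), and |DFBETAS| > 2/sqrt(n), where n is the number of observations and p is the number of parameters including the intercept. The Python sketch below is only an illustration of those rules of thumb. It is not part of the SAS Enterprise Guide workflow, and the function names, the sample size n = 100, and the Cook's D values are hypothetical.

```python
import math


def influence_cutoffs(n, p):
    """Rule-of-thumb cutoffs; n = observations, p = parameters incl. intercept."""
    return {
        "rstudent": 2.0,                    # |studentized residual| > 2
        "cooks_d": 4.0 / n,                 # Cook's D > 4/n
        "dffits": 2.0 * math.sqrt(p / n),   # |DFFITS| > 2*sqrt(p/n)
        "dfbetas": 2.0 / math.sqrt(n),      # |DFBETAS| > 2/sqrt(n)
    }


def flag(values, cutoff):
    """Return 1-based observation numbers whose |value| exceeds the cutoff."""
    return [i for i, v in enumerate(values, start=1) if abs(v) > cutoff]


# Hypothetical inputs: n = 100 observations, 4 predictors + intercept (p = 5).
cutoffs = influence_cutoffs(n=100, p=5)
hypothetical_cooks_d = [0.01, 0.003, 0.25, 0.02]
flagged = flag(hypothetical_cooks_d, cutoffs["cooks_d"])
print(cutoffs)
print(flagged)
```

With these made-up values only the third observation exceeds the Cook's D cutoff of 4/n. The plots in the lab apply the same kind of threshold to every observation and label the ones that cross it.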
Question 2 (18 pts): Using Model Building Techniques

Use the BodyFat2 data set to identify a set of “best” models.

a. (6 pts) Using the Mallows’ Cp criterion, use the best subset selection technique to identify a set of candidate models that predict PctBodyFat2 as a function of the variables Age, Weight, Height, Neck, Chest, Abdomen, Hip, Thigh, Knee, Ankle, Biceps, Forearm, and Wrist. In the plots, only show the best 30 models. Save a screenshot of the result and the plot showing Mallows’ Cp of the models. Which predictors would you have in a candidate model based on the result? Why?

From the Mallows’ Cp plot, we can see that a good candidate model would be one with eight or nine parameters (counting the intercept), i.e., with seven or eight predictors, respectively. From the result table, we can see that the lowest Mallows’ Cp is for the model with seven predictors: Age, Weight, Neck, Abdomen, Thigh, Forearm, and Wrist. This would be a candidate model for further consideration.
Partial output:
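
For readers who want to see how the statistic in part (a) is built, the sketch below uses the standard definition Cp = SSE_p / MSE_full - n + 2p, where p counts the intercept (so a model with p parameters has p - 1 predictors). This is a hedged Python illustration, not the SAS output. The sample size and the SSE values are hypothetical; only the count of 13 candidate predictors comes from the lab.

```python
def mallows_cp(sse_subset, mse_full, n, p):
    """Mallows' Cp for a subset model with p parameters (intercept included)."""
    return sse_subset / mse_full - n + 2 * p


def n_predictors(p):
    """Number of predictors in a model with p parameters (one is the intercept)."""
    return p - 1


# Hypothetical numbers: n = 100 observations, full model with 13 predictors.
n = 100
p_full = 13 + 1
sse_full = 1720.0
mse_full = sse_full / (n - p_full)   # error variance estimate from the full model

cp_full = mallows_cp(sse_full, mse_full, n, p_full)   # equals p_full by construction
cp_subset = mallows_cp(1760.0, mse_full, n, p=8)      # a hypothetical 7-predictor model
print(cp_full, cp_subset, n_predictors(8))
```

A subset model is usually considered a reasonable candidate when its Cp is small and close to or below its own p, which is how the plot in part (a) is read.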
b. (6 pts) Use Forward Selection to select a candidate model. Save a screenshot of the result and the plot showing AIC of the models. Which predictors would you have in a candidate model based on the result? Why?
Based on the AIC criterion, the best candidate model is the one identified in Step 8. The result table shows that it includes eight predictors: Age, Weight, Neck, Abdomen, Hip, Thigh, Forearm, and Wrist.

c. (6 pts) Perform Forward Selection again, but change the significance level for entry criterion to 0.05 instead of the default of 0.50. Save a screenshot of the result and the plot showing AIC of the models. Based on AIC, which predictors would you have in a candidate model? Is it different from the candidate model identified in Part (b)?
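
As a reference for reading the AIC plots in parts (b) and (c), the sketch below assumes the common least-squares form AIC = n·ln(SSE/n) + 2p, with p counting the intercept. It is a Python illustration only. The model names, the sample size, and the SSE values are hypothetical and are not taken from the lab output.

```python
import math


def aic(sse, n, p):
    """Least-squares AIC: n * ln(SSE / n) + 2 * p, with p including the intercept."""
    return n * math.log(sse / n) + 2 * p


# Hypothetical selection steps: (label, SSE, number of parameters).
n = 100
candidates = [
    ("model_a", 1900.0, 5),    # few predictors, larger SSE
    ("model_b", 1700.0, 9),    # more predictors, smaller SSE
    ("model_c", 1690.0, 14),   # all predictors, only slightly smaller SSE
]

scored = [(name, aic(sse, n, p)) for name, sse, p in candidates]
best = min(scored, key=lambda pair: pair[1])   # lower AIC is better
for name, value in scored:
    print(name, round(value, 2))
print("best:", best[0])
```

The penalty term 2p is why adding predictors stops paying off once the drop in SSE becomes small, which is the trade-off the AIC plot visualizes across the selection steps.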
Partial output:
Based on AIC, the best candidate model is the model identified in Step 4, which includes four predictors: Weight, Abdomen, Forearm, and Wrist. As we can see, when the significance level for entry becomes stricter (from 0.50 to 0.05), fewer predictors are admitted to the candidate model.

Question 3 (18 pts): Use the Advertising.csv data set to answer the following questions.

The Advertising data set contains four variables:
sales: the sales of a product in a market, measured in 1,000 units
TV, radio, newspaper: advertising spending on TV, radio, and newspaper for that product in that market, respectively, measured in $1,000

a. (5 pts) Fit a multiple linear regression model in which sales is the response variable and the other three variables are the predictors. Provide a screenshot of the regression result.
b. (4 pts) Is the model useful for understanding and/or predicting sales based on advertising spending? Why?

The model is useful because: (1) the p-value of the F statistic is <0.0001, so we reject the null hypothesis that all slope coefficients are zero; (2) the adjusted R² is 0.8956, which means that almost 90% of the variation in sales is explained by advertising spending on TV, radio, and newspaper.

c. (4 pts) Based on the data, which advertising channels contribute to sales? Why?

Statistically speaking, TV and radio advertising contribute to sales because both have a p-value of <0.0001. Newspaper advertising does not, because its p-value is large (0.8599).

d. (5 pts) Provide the residual plot in which the predicted value is on the x-axis and the residual is on the y-axis. Comment on the residual plot. Do you observe any pattern? If so, what does it suggest?

The residual plot exhibits a nonlinear pattern (a convex pattern, to be more precise), which suggests that the relationship between sales and the advertising channels is not linear. We should try adding nonlinear terms to the regression model.
Question 4 (22 pts): Model Comparison

a. (5 pts) Generate three new variables (i.e., columns) in the Advertising data set imported in Question 3: TV2 (= TV*TV), radio2 (= radio*radio), and TV_radio (= TV*radio). Keep all original variables as well. Provide a screenshot of the first 10 rows of the data set after generating the variables.

Answer:

Key steps:

Generate TV2:
All three variables:

b. (5 pts) Divide the data set generated in (a) into a training and a test data set: the first 150 observations (i.e., observations with an F1 value less than or equal to 150, where F1 is a variable in the same Advertising data set that labels the observations) belong to the training set, and the rest go to the test set. Name the training and test sets Ad_tr and Ad_ts, respectively, and save them in the WORK library. Provide screenshots of the last five observations in each data set.

Ad_tr:

Ad_ts:

Key steps:

Generate Ad_tr:
Generate Ad_ts:
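
The logic behind parts (a) and (b) can be sketched outside SAS Enterprise Guide as follows. This is a minimal Python illustration that assumes rows are held as dictionaries. The helper names `add_terms` and `split_by_f1` and the four data rows are hypothetical; only the column names and the F1 <= 150 rule come from the lab.

```python
def add_terms(row):
    """Return a copy of the row with TV2, radio2, and TV_radio added."""
    out = dict(row)                       # keep all original variables
    out["TV2"] = row["TV"] * row["TV"]
    out["radio2"] = row["radio"] * row["radio"]
    out["TV_radio"] = row["TV"] * row["radio"]
    return out


def split_by_f1(rows, cutoff=150):
    """Training set: F1 <= cutoff. Test set: every remaining row."""
    train = [r for r in rows if r["F1"] <= cutoff]
    test = [r for r in rows if r["F1"] > cutoff]
    return train, test


# Hypothetical rows around the split point (values are made up).
rows = [
    {"F1": 149, "TV": 10.0, "radio": 2.0, "newspaper": 1.0, "sales": 5.0},
    {"F1": 150, "TV": 20.0, "radio": 3.0, "newspaper": 4.0, "sales": 9.0},
    {"F1": 151, "TV": 30.0, "radio": 1.0, "newspaper": 2.0, "sales": 8.0},
    {"F1": 152, "TV": 40.0, "radio": 5.0, "newspaper": 3.0, "sales": 15.0},
]

ad_tr, ad_ts = split_by_f1([add_terms(r) for r in rows])
print([r["F1"] for r in ad_tr], [r["F1"] for r in ad_ts])
```

Creating the new columns before splitting guarantees that Ad_tr and Ad_ts carry identical variables, which matters when the fitted model is later applied to the test set.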
c. (5 pts) Define Model 1: sales is the response variable, and there are four predictors: TV, radio, TV², and radio². Use the training set, Ad_tr, to estimate Model 1, and then use the estimated Model 1 to make predictions for the test set, Ad_ts. Provide screenshots of (1) the regression result of Model 1, (2) the first 10 rows of the prediction result of Ad_ts, and (3) the training and test MSEs.

Regression result of Model 1:
The first 10 rows of the prediction result of Ad_ts:
Training MSE:

Test MSE:

Key steps:

Save predicted values of sales in the output data set of Model 1’s regression result:
Calculate training MSE:

First: calculate the squared error of each observation in the training set:
Second: calculate the mean of all the squared errors:

Generate predictions for the test set and calculate test MSE:

When making predictions for the test set (i.e., Ad_ts), make sure that the regression model is still fit using the training set (i.e., Ad_tr); otherwise, it defeats the purpose of dividing the data set into training and test sets.
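
The two-step MSE calculation and the fit-on-training, predict-on-test rule can be illustrated with the short Python sketch below. To stay self-contained it fits a one-predictor least-squares line rather than the lab's multi-predictor models, and all data values and function names are hypothetical.

```python
def fit_simple_ols(x, y):
    """Least-squares intercept and slope for a single predictor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict(model, x):
    """Apply a fitted (intercept, slope) pair to new predictor values."""
    intercept, slope = model
    return [intercept + slope * xi for xi in x]


def mse(actual, predicted):
    """Step 1: squared error per observation. Step 2: mean of those errors."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sum(squared_errors) / len(squared_errors)


# Hypothetical training and test data.
x_tr, y_tr = [1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 10.0]
x_ts, y_ts = [5.0, 6.0], [12.0, 14.5]

model = fit_simple_ols(x_tr, y_tr)            # fit on the training set only
train_mse = mse(y_tr, predict(model, x_tr))
test_mse = mse(y_ts, predict(model, x_ts))    # same fitted model, new data
print(train_mse, test_mse)
```

The test set never enters `fit_simple_ols`, so the test MSE measures out-of-sample error. With so few made-up points the test MSE happens to be smaller than the training MSE, which can occur by chance with a single split.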
Then select the test set for prediction:

d. (5 pts) Define Model 2: sales is the response variable, and there are three predictors: TV, radio, and TV*radio. Use the training set, Ad_tr, to estimate Model 2, and then use the estimated Model 2 to make predictions for the test set, Ad_ts. Provide screenshots of (1) the regression result of Model 2, (2) the first 10 rows of the prediction result of Ad_ts, and (3) the training and test MSEs.

Regression result of Model 2:
The first 10 rows of the prediction result of Ad_ts:
Training MSE:

Test MSE:

(The training MSE is 0.92 while the test MSE is 0.75. It is somewhat unusual for the test MSE to be smaller than the training MSE. But since we divide the data into a training and a test set only once, this could happen just by chance.)

e. (2 pts) Which model would you choose to make predictions? Why?

For the test set Ad_ts, Model 1 has a test MSE of 2.42 while Model 2 has a test MSE of 0.75. When we plan to use a model for out-of-sample prediction, we should choose the one with the lower test MSE. Therefore, Model 2 is the better choice.

*************************************************************************************

This is the end of Lab 7. Please submit the ISOM835_Lab7_[your initials].docx Word document via Blackboard.