Homework-5

School: University of South Florida
Course: 4606
Subject: Industrial Engineering
Date: Feb 20, 2024
Type: docx
Pages: 16
Uploaded by Heynsider
Stephanie Padin
ESI 4606 Analytics I - Foundations of Data Science
Homework 5
Due: Nov. 8 (11:00 AM), 2023

Problem 1 (1 point)

Explain how to utilize forward stepwise selection and 5-fold cross-validation to select the best combination of predictors for multiple linear regression, y_i = β0 + β1 x_i1 + ... + β8 x_i8 + ε_i, i = 1, ..., N, where N is the sample size. (Use words, figures and mathematical notations to provide a clear description.)

Step 1: Initialize the model
Begin with a model that contains no predictors, only the intercept term β0:
y_i = β0 + ε_i,
where y_i is the response variable, β0 is the intercept, and ε_i is the error term.

Step 2: Iterative selection
Grow the model by one predictor at each step. At every step, fit the current model augmented with each remaining predictor individually, and add the predictor whose inclusion improves performance the most according to a metric such as mean squared error or R². Repeat until a stopping criterion is met: a predetermined number of predictors, or no further significant improvement in the model's performance. After the first step, the model is
y = β0 + β1 x1 + ε.
Step 3: 5-fold cross-validation
5-fold cross-validation estimates how well a candidate model performs on unseen data. Divide the dataset into 5 equal-sized folds. In each of 5 iterations, train the model on 4 folds and evaluate it on the remaining fold, recording a performance metric such as the test mean squared error; the fold used for testing rotates each iteration. The average of the 5 test errors estimates the model's out-of-sample performance. Apply this to each candidate model produced in Step 2, e.g.
y = β0 + β1 x1 + β2 x2 + ε.

Step 4: Record the performance
Record the average 5-fold cross-validation error for each combination of predictors considered, so the candidates can be compared on how well they generalize to unseen data.

Step 5: Select the best combination
Choose the combination of predictors with the lowest average cross-validation error as the best model under the chosen evaluation metric.

Step 6: Fit the final model
Train the final model using the chosen predictors on the complete dataset.

Step 7: Evaluate the model performance
If an independent dataset is available, assess the final model on it to validate its ability to generalize:
y = β0 + β1 x1 + β2 x2 + ... + βn xn + ε.

Problem 2 (3 points)
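Steps 1-5 can be sketched in code. The following is a minimal, self-contained Python illustration (the assignment itself uses R); the synthetic data, the `ols_fit` helper, and the greedy loop are stand-ins for the procedure above, scoring each candidate predictor set by its average 5-fold cross-validation MSE:

```python
import random

def ols_fit(X, y):
    """Least-squares coefficients via the normal equations (X'X)b = X'y.
    Each row of X is assumed to already carry a leading 1 for the intercept."""
    p = len(X[0])
    A = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
    b = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
    for c in range(p):  # Gaussian elimination with partial pivoting
        piv = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[piv], b[c], b[piv] = A[piv], A[c], b[piv], b[c]
        for r in range(c + 1, p):
            f = A[r][c] / A[c][c]
            A[r] = [arv - f * acv for arv, acv in zip(A[r], A[c])]
            b[r] -= f * b[c]
    beta = [0.0] * p
    for c in reversed(range(p)):
        beta[c] = (b[c] - sum(A[c][j] * beta[j] for j in range(c + 1, p))) / A[c][c]
    return beta

def cv_mse(cols, X, y, k=5, seed=0):
    """Average held-out MSE of the model using predictor columns `cols`,
    estimated by k-fold cross-validation (Step 3)."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    folds = [set(idx[i::k]) for i in range(k)]
    errs = []
    for fold in folds:
        te = sorted(fold)
        tr = [i for i in idx if i not in fold]
        design = lambda rows: [[1.0] + [X[i][c] for c in cols] for i in rows]
        beta = ols_fit(design(tr), [y[i] for i in tr])
        preds = [sum(b * v for b, v in zip(beta, row)) for row in design(te)]
        errs.append(sum((y[i] - yh) ** 2 for i, yh in zip(te, preds)) / len(te))
    return sum(errs) / k

def forward_stepwise(X, y, k=5):
    """Greedy forward selection (Step 2) scored by k-fold CV error;
    returns the best subset found anywhere along the path (Steps 4-5)."""
    p = len(X[0])
    chosen, remaining = [], list(range(p))
    best_cols, best_err = [], cv_mse([], X, y, k)  # intercept-only baseline
    while remaining:
        cand = min(remaining, key=lambda c: cv_mse(chosen + [c], X, y, k))
        chosen = chosen + [cand]
        remaining.remove(cand)
        err = cv_mse(chosen, X, y, k)
        if err < best_err:
            best_cols, best_err = list(chosen), err
    return best_cols, best_err

# Demo on synthetic data: y depends on columns 0 and 1 only; 2 and 3 are noise.
rng = random.Random(1)
X = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(80)]
y = [2.0 * row[0] - row[1] + rng.gauss(0, 0.3) for row in X]
best_cols, best_err = forward_stepwise(X, y)
print("selected predictors:", sorted(best_cols))
```

On this synthetic example the procedure should recover the two informative predictors, since adding a pure-noise column typically increases the cross-validation error.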
Questions in this problem should be answered using the data set file "HM5.txt". x1, x2, ..., x10 are predictors and y is the response variable.

(a) Perform the best subset selection to choose the best model from the possible predictors, x1, x2, ..., x10. What is the best model obtained according to Cp, BIC and adjusted R², respectively? Show some plots to provide evidence for your answer, and report the coefficients of the corresponding best model obtained.
To determine the best model according to Cp, BIC, and adjusted R² for the "HM5.txt" data, I used the regsubsets output shown in the appendix. The best model according to Cp includes the intercept, x2, x3, x7, x8, and x10. The best model according to BIC includes the intercept, x1, and x2. The best model according to adjusted R² includes the intercept, x1, x2, x3, x5, x7, and x9. Plots of the Cp, BIC, and adjusted R² values against the number of variables provide evidence for these choices, and the coefficients of each best model come from fitting the corresponding linear regression to the data.

(b) Perform the forward stepwise selection to choose the best model from the possible predictors, x1, x2, ..., x10. What is the best model obtained according to Cp, BIC and adjusted R², respectively? Show some plots to provide evidence for your answer and report the coefficients of the corresponding best model obtained.
Model selected by BIC (y ~ x1 + x2):

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.96281    0.09251  10.407  < 2e-16 ***
x1           0.56270    0.09369   6.006 3.31e-08 ***
x2          -2.18096    0.09369 -23.280  < 2e-16 ***

Model selected by Cp and adjusted R² (y ~ x1 + x2 + x5 + x6 + x7 + x10):

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.96281    0.08914  10.801  < 2e-16 ***
x1           0.35756    0.17906   1.997  0.04876 *
x2          -1.70398    0.26440  -6.445 5.08e-09 ***
x5           2.49725    1.04584   2.388  0.01897 *
x6          -2.22214    0.79423  -2.798  0.00625 **
x7          -3.07218    1.13934  -2.696  0.00832 **
x10          2.57529    0.78834   3.267  0.00152 **

The forward stepwise selection results show that the best model depends on the criterion. According to BIC, the best model includes variables x1 and x2. According to Cp, the best model includes variables x1, x2, x5, x6, x7, and x10, and the adjusted R² criterion agrees with Cp: this model has the lowest Cp and the highest adjusted R².

(c) Perform the backward stepwise selection to choose the best model from the possible predictors, x1, x2, ..., x10. What is the best model obtained according to Cp, BIC and adjusted R², respectively? Show some plots to provide evidence for your answer, and report the coefficients of the corresponding best model obtained.
The backward stepwise selection method was used to choose the best model under three criteria: Cp, BIC, and adjusted R². The best model by Cp and BIC includes predictors x1, x2, x3, x5, x7, and x9. The best model by adjusted R² includes predictors x1, x2, x3, x4, x5, x7, and x9. The coefficients and model statistics for both models are reported in the appendix. Overall, the model with predictors x1, x2, x3, x5, x7, and x9 strikes the best balance between model complexity and goodness of fit.

Note: To get full points, include R code in the appendix sections.

Problem 3 (1 point)

Let p = 9 and n = 50, where p is the number of predictors and n is the sample size. When linear regression is considered for data fitting, answer the following questions.
(a) To determine the best model by selecting the best subset of relevant predictors, how many models in total need to be compared if best subset selection method is used?

In best subset selection for linear regression, every possible combination of predictors is considered. Counting only the non-empty subsets, the number of models to compare is 2^p - 1, where p is the number of predictors:

2^9 - 1 = 512 - 1 = 511 models.

(If the intercept-only null model is also counted, the total is 2^9 = 512.)

(b) To determine M4, i.e., the best model with 4 predictors, how many models with 4 predictors need to be compared if best subset selection method is used?

To determine the best model with 4 predictors, all possible combinations of 4 predictors chosen from the 9 available must be compared. This is a combinatorial problem: the count is the binomial coefficient, or "n choose k",

C(n, k) = n! / (k!(n - k)!).

With n = 9 and k = 4:

C(9, 4) = 9! / (4! * 5!) = (9 * 8 * 7 * 6) / (4 * 3 * 2 * 1) = 126.

So 126 different models with 4 predictors need to be compared to determine
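The counts in parts (a) and (b) can be verified with a short calculation, sketched here in Python:

```python
from math import comb

p = 9  # number of predictors

# (a) Best subset selection: one model per non-empty subset of the predictors.
total_subsets = 2 ** p - 1          # 511 (2^9 = 512 if the null model is counted)

# (b) Models with exactly 4 predictors: all 4-element subsets of the 9.
four_predictor_models = comb(p, 4)  # C(9, 4)

print(total_subsets, four_predictor_models)
```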
the best model with 4 predictors using the best subset selection method.

(c) To determine M4, i.e., the best model with 4 predictors, how many models with 4 predictors need to be compared if forward stepwise selection method is used?

Forward stepwise selection iteratively adds predictors to a model, starting with zero predictors and stopping when the desired number of predictors is reached. At each step, every remaining predictor is evaluated, and the one that improves the model fit the most, by a statistical criterion such as AIC or BIC, is added. With 9 predictors, the first step fits 9 one-predictor models, the second fits 8, then 7, then 6, so

9 + 8 + 7 + 6 = 30

models are fit in total to reach M4; of these, the 6 models fit in the final step are the candidates with exactly 4 predictors.

(d) To determine M4, i.e., the best model with 4 predictors, how many models with 4 predictors need to be compared if backward stepwise selection method is used?

Backward stepwise selection starts with the model containing all 9 predictors and removes one predictor at a time, based on a chosen criterion, until the desired size is reached. Dropping from 9 predictors to 8 requires fitting 9 candidate models, then 8 models to go from 8 to 7, then 7, then 6, and finally 5 candidate models with exactly 4 predictors:

9 + 8 + 7 + 6 + 5 = 35

models are fit in total to reach M4, of which the 5 fit in the final step have exactly 4 predictors. (By contrast, best subset selection would compare all C(9, 4) = 126 four-predictor models; the stepwise path examines only a small fraction of them.)
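The stepwise path totals can be checked with a quick sketch. This counts every model fit along each path, the same convention used for the forward count of 30; note how much smaller the backward total is than the C(9, 4) = 126 subsets best subset selection would examine:

```python
# Models fit along each stepwise path to the 4-predictor model M4, p = 9.
p, m = 9, 4

# Forward: the step growing the model from size j to j+1 fits p - j candidates.
forward_fits = sum(p - j for j in range(m))             # 9 + 8 + 7 + 6

# Backward: dropping from size s to s-1 fits s candidates, for s = 9 down to 5.
backward_fits = sum(s for s in range(m + 1, p + 1))     # 5 + 6 + 7 + 8 + 9

print(forward_fits, backward_fits)
```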
Appendix:

2a)
> # Import "HM5.txt" and call it "data"
> data <- read.table("HM5.txt", header = TRUE)  # Load your data from a file
> library(leaps)
> # Best subset selection
> m.exhaustive <- regsubsets(y ~ ., data = data, nvmax = 10)
> summary(m.exhaustive)
Subset selection object
Call: regsubsets.formula(y ~ ., data = data, nvmax = 10)
10 Variables  (and intercept)
    Forced in Forced out
x1      FALSE      FALSE
x2      FALSE      FALSE
x3      FALSE      FALSE
x4      FALSE      FALSE
x5      FALSE      FALSE
x6      FALSE      FALSE
x7      FALSE      FALSE
x8      FALSE      FALSE
x9      FALSE      FALSE
x10     FALSE      FALSE
1 subsets of each size up to 10
Selection Algorithm: exhaustive
          x1  x2  x3  x4  x5  x6  x7  x8  x9  x10
1  ( 1 )  " " "*" " " " " " " " " " " " " " " " "
2  ( 1 )  "*" "*" " " " " " " " " " " " " " " " "
3  ( 1 )  "*" "*" " " " " " " " " "*" " " " " " "
4  ( 1 )  " " "*" " " " " "*" " " "*" " " "*" " "
5  ( 1 )  " " "*" "*" " " " " " " "*" "*" " " "*"
6  ( 1 )  "*" "*" "*" " " "*" " " "*" " " "*" " "
7  ( 1 )  "*" "*" "*" " " "*" " " "*" "*" " " "*"
8  ( 1 )  "*" "*" "*" " " "*" "*" "*" "*" "*" " "
9  ( 1 )  "*" "*" "*" " " "*" "*" "*" "*" "*" "*"
10 ( 1 )  "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
> # Plot BIC values
> plot(summary(m.exhaustive)$bic, xlab = "# of variables", ylab = "BIC", pch = 19)
> # Find the model with the lowest BIC
> best_model <- which.min(summary(m.exhaustive)$bic)
> # Summary of the best model
> summary(m.exhaustive)$which[best_model, ]
(Intercept)          x1          x2          x3          x4          x5          x6
       TRUE        TRUE        TRUE       FALSE       FALSE       FALSE       FALSE
         x7          x8          x9         x10
      FALSE       FALSE       FALSE       FALSE
> # Plot Cp values
> plot(summary(m.exhaustive)$cp, xlab = "# of variables", ylab = "cp", pch = 19)
> # Find the model with the lowest cp
> best_model_cp <- which.min(summary(m.exhaustive)$cp)
> # Summary of the best model by Cp
> summary(m.exhaustive)$which[best_model_cp, ]
(Intercept)          x1          x2          x3          x4          x5          x6
       TRUE       FALSE        TRUE        TRUE       FALSE       FALSE       FALSE
         x7          x8          x9         x10
       TRUE        TRUE       FALSE        TRUE
> # Plot Adjusted R-squared values
> plot(summary(m.exhaustive)$adjr2, xlab = "# of variables", ylab = "Adjusted R2", pch = 19)
> # Find the model with the highest Adjusted R-squared
> best_model_adjr2 <- which.max(summary(m.exhaustive)$adjr2)
> # Summary of the best model by Adjusted R-squared
> summary(m.exhaustive)$which[best_model_adjr2, ]
(Intercept)          x1          x2          x3          x4          x5          x6
       TRUE        TRUE        TRUE        TRUE       FALSE        TRUE       FALSE
         x7          x8          x9         x10
       TRUE       FALSE        TRUE       FALSE

2b)
> # Forward stepwise selection
> lm.forward <- regsubsets(y ~ ., data = data, nvmax = 10, method = "forward")
> summary(lm.forward)
Subset selection object
Call: regsubsets.formula(y ~ ., data = data, nvmax = 10, method = "forward")
10 Variables  (and intercept)
    Forced in Forced out
x1      FALSE      FALSE
x2      FALSE      FALSE
x3      FALSE      FALSE
x4      FALSE      FALSE
x5      FALSE      FALSE
x6      FALSE      FALSE
x7      FALSE      FALSE
x8      FALSE      FALSE
x9      FALSE      FALSE
x10     FALSE      FALSE
1 subsets of each size up to 10
Selection Algorithm: forward
          x1  x2  x3  x4  x5  x6  x7  x8  x9  x10
1  ( 1 )  " " "*" " " " " " " " " " " " " " " " "
2  ( 1 )  "*" "*" " " " " " " " " " " " " " " " "
3  ( 1 )  "*" "*" " " " " " " " " "*" " " " " " "
4  ( 1 )  "*" "*" " " " " " " " " "*" " " " " "*"
5  ( 1 )  "*" "*" " " " " " " "*" "*" " " " " "*"
6  ( 1 )  "*" "*" " " " " "*" "*" "*" " " " " "*"
7  ( 1 )  "*" "*" " " "*" "*" "*" "*" " " " " "*"
8  ( 1 )  "*" "*" "*" "*" "*" "*" "*" " " " " "*"
9  ( 1 )  "*" "*" "*" "*" "*" "*" "*" " " "*" "*"
10 ( 1 )  "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
> plot(summary(lm.forward)$bic, xlab = "# of variables", ylab = "BIC", pch = 19)
> which.min(summary(lm.forward)$bic)
[1] 2
> summary(lm(y ~ x1 + x2, data = data))

Call:
lm(formula = y ~ x1 + x2, data = data)

Residuals:
     Min       1Q   Median       3Q      Max
-2.15547 -0.66177  0.05938  0.62141  1.80622

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.96281    0.09251  10.407  < 2e-16 ***
x1           0.56270    0.09369   6.006 3.31e-08 ***
x2          -2.18096    0.09369 -23.280  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9251 on 97 degrees of freedom
Multiple R-squared:  0.8506,    Adjusted R-squared:  0.8475
F-statistic:   276 on 2 and 97 DF,  p-value: < 2.2e-16

> plot(summary(lm.forward)$cp, xlab = "# of variables", ylab = "cp", pch = 19)
> which.min(summary(lm.forward)$cp)
[1] 6
> summary(lm(y ~ x1 + x2 + x5 + x6 + x7 + x10, data = data))

Call:
lm(formula = y ~ x1 + x2 + x5 + x6 + x7 + x10, data = data)

Residuals:
     Min       1Q   Median       3Q      Max
-2.07607 -0.63658  0.06472  0.57064  1.79178

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.96281    0.08914  10.801  < 2e-16 ***
x1           0.35756    0.17906   1.997  0.04876 *
x2          -1.70398    0.26440  -6.445 5.08e-09 ***
x5           2.49725    1.04584   2.388  0.01897 *
x6          -2.22214    0.79423  -2.798  0.00625 **
x7          -3.07218    1.13934  -2.696  0.00832 **
x10          2.57529    0.78834   3.267  0.00152 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.8914 on 93 degrees of freedom
Multiple R-squared:  0.867,     Adjusted R-squared:  0.8584
F-statistic:   101 on 6 and 93 DF,  p-value: < 2.2e-16

> plot(summary(lm.forward)$adjr2, xlab = "# of variables", ylab = "Adjusted R2", pch = 19)
> which.max(summary(lm.forward)$adjr2)
[1] 6
> summary(lm(y ~ x1 + x2 + x5 + x6 + x7 + x10, data = data))

Call:
lm(formula = y ~ x1 + x2 + x5 + x6 + x7 + x10, data = data)

Residuals:
     Min       1Q   Median       3Q      Max
-2.07607 -0.63658  0.06472  0.57064  1.79178

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.96281    0.08914  10.801  < 2e-16 ***
x1           0.35756    0.17906   1.997  0.04876 *
x2          -1.70398    0.26440  -6.445 5.08e-09 ***
x5           2.49725    1.04584   2.388  0.01897 *
x6          -2.22214    0.79423  -2.798  0.00625 **
x7          -3.07218    1.13934  -2.696  0.00832 **
x10          2.57529    0.78834   3.267  0.00152 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.8914 on 93 degrees of freedom
Multiple R-squared:  0.867,     Adjusted R-squared:  0.8584
F-statistic:   101 on 6 and 93 DF,  p-value: < 2.2e-16

2c)
> # Perform backward stepwise selection
> lm.backward <- regsubsets(y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10,
+     data = data, nvmax = 10, method = "backward")
> # Print the summary of backward selection
> summary(lm.backward)
Subset selection object
Call: regsubsets.formula(y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10,
    data = data, nvmax = 10, method = "backward")
10 Variables  (and intercept)
    Forced in Forced out
x1      FALSE      FALSE
x2      FALSE      FALSE
x3      FALSE      FALSE
x4      FALSE      FALSE
x5      FALSE      FALSE
x6      FALSE      FALSE
x7      FALSE      FALSE
x8      FALSE      FALSE
x9      FALSE      FALSE
x10     FALSE      FALSE
1 subsets of each size up to 10
Selection Algorithm: backward
          x1  x2  x3  x4  x5  x6  x7  x8  x9  x10
1  ( 1 )  " " "*" " " " " " " " " " " " " " " " "
2  ( 1 )  " " "*" " " " " "*" " " " " " " " " " "
3  ( 1 )  " " "*" " " " " "*" " " "*" " " " " " "
4  ( 1 )  " " "*" " " " " "*" " " "*" " " "*" " "
5  ( 1 )  "*" "*" " " " " "*" " " "*" " " "*" " "
6  ( 1 )  "*" "*" "*" " " "*" " " "*" " " "*" " "
7  ( 1 )  "*" "*" "*" " " "*" " " "*" "*" "*" " "
8  ( 1 )  "*" "*" "*" " " "*" "*" "*" "*" "*" " "
9  ( 1 )  "*" "*" "*" " " "*" "*" "*" "*" "*" "*"
10 ( 1 )  "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
> # Plot BIC
> plot(summary(lm.backward)$bic, xlab = "# of variables", ylab = "BIC", pch = 19)
> # Find the model with the minimum BIC
> best_model_bic <- lm(y ~ x1 + x2 + x3 + x4 + x5 + x7 + x9, data = data)
> # Print the coefficients of the best model (BIC)
> summary(best_model_bic)

Call:
lm(formula = y ~ x1 + x2 + x3 + x4 + x5 + x7 + x9, data = data)

Residuals:
    Min      1Q  Median      3Q     Max
-2.0799 -0.5823 -0.0333  0.6160  2.0196

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)   0.96281    0.08932  10.780  < 2e-16 ***
x1            1.03611    0.49157   2.108  0.03777 *
x2           -2.10106    0.31359  -6.700 1.63e-09 ***
x3           -6.50914    3.43169  -1.897  0.06100 .
x4           -0.10439    0.37925  -0.275  0.78374
x5           25.57737   10.11095   2.530  0.01312 *
x7          -36.69365   12.60449  -2.911  0.00452 **
x9           17.26926    5.54261   3.116  0.00245 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.8932 on 92 degrees of freedom
Multiple R-squared:  0.8679,    Adjusted R-squared:  0.8578
F-statistic: 86.34 on 7 and 92 DF,  p-value: < 2.2e-16

> # Plot Cp
> plot(summary(lm.backward)$cp, xlab = "# of variables", ylab = "Cp", pch = 19)
> # Find the model with the minimum Cp
> best_model_cp <- lm(y ~ x1 + x2 + x3 + x5 + x7 + x9, data = data)
> # Print the coefficients of the best model (Cp)
> summary(best_model_cp)

Call:
lm(formula = y ~ x1 + x2 + x3 + x5 + x7 + x9, data = data)

Residuals:
     Min       1Q   Median       3Q      Max
-2.09653 -0.57411 -0.04178  0.61981  2.03994

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)   0.96281    0.08887  10.834  < 2e-16 ***
x1            1.04303    0.48848   2.135  0.03537 *
x2           -2.18226    0.10585 -20.617  < 2e-16 ***
x3           -6.58106    3.40468  -1.933  0.05629 .
x5           25.77460   10.03529   2.568  0.01181 *
x7          -36.81267   12.53432  -2.937  0.00418 **
x9           17.22748    5.51293   3.125  0.00237 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.8887 on 93 degrees of freedom
Multiple R-squared:  0.8678,    Adjusted R-squared:  0.8592
F-statistic: 101.7 on 6 and 93 DF,  p-value: < 2.2e-16

> # Plot Adjusted R-squared
> plot(summary(lm.backward)$adjr2, xlab = "# of variables", ylab = "Adjusted R2", pch = 19)
> # Find the model with the maximum Adjusted R-squared
> best_model_adjr2 <- lm(y ~ x1 + x2 + x3 + x4 + x5 + x7 + x9, data = data)
> # Print the coefficients of the best model (Adjusted R2)
> summary(best_model_adjr2)

Call:
lm(formula = y ~ x1 + x2 + x3 + x4 + x5 + x7 + x9, data = data)

Residuals:
    Min      1Q  Median      3Q     Max
-2.0799 -0.5823 -0.0333  0.6160  2.0196

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)   0.96281    0.08932  10.780  < 2e-16 ***
x1            1.03611    0.49157   2.108  0.03777 *
x2           -2.10106    0.31359  -6.700 1.63e-09 ***
x3           -6.50914    3.43169  -1.897  0.06100 .
x4           -0.10439    0.37925  -0.275  0.78374
x5           25.57737   10.11095   2.530  0.01312 *
x7          -36.69365   12.60449  -2.911  0.00452 **
x9           17.26926    5.54261   3.116  0.00245 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.8932 on 92 degrees of freedom
Multiple R-squared:  0.8679,    Adjusted R-squared:  0.8578
F-statistic: 86.34 on 7 and 92 DF,  p-value: < 2.2e-16