Assignment 6: Linear Model Selection
SDS293 - Machine Learning
Due: 1 November 2017 by 11:59pm
Conceptual Exercises
6.8.2 (p. 259 ISLR)
For each of the following parts, indicate whether the method is more or less flexible than least squares. Describe how each method's trade-off between bias and variance impacts its prediction accuracy. Justify your answers.
(a) The lasso
Solution:
The lasso puts a budget constraint on least squares, so it is less flexible. The lasso will have improved prediction accuracy when its increase in bias is less than its decrease in variance.
(b) Ridge regression
Solution:
For the same reason as above, this method is also less flexible. Ridge regression
will have improved prediction accuracy when its increase in bias is less than its decrease in
variance.
(c) Non-linear methods (PCR and PLS)
Solution:
Non-linear methods are more flexible and will give improved prediction accuracy when their increase in variance is less than their decrease in bias.
6.8.5 (p. 261)
Ridge regression tends to give similar coefficient values to correlated variables, whereas the lasso
may give quite different coefficient values to correlated variables. We will now explore this property
in a very simple setting.
Suppose that n = 2, p = 2, x_11 = x_12, and x_21 = x_22. Furthermore, suppose that y_1 + y_2 = 0, x_11 + x_21 = 0, and x_12 + x_22 = 0, so that the estimate for the intercept in a least squares, ridge regression, or lasso model is zero: β̂_0 = 0.
(a) Write out the ridge regression optimization problem in this setting.
Solution:
In general, the ridge regression optimization problem looks like:

    min  Σ_{i=1..n} ( y_i − β̂_0 − Σ_{j=1..p} β̂_j x_ij )²  +  λ Σ_{j=1..p} β̂_j²

In this case, β̂_0 = 0 and n = p = 2, so the optimization simplifies to:
    min [ (y_1 − β̂_1 x_11 − β̂_2 x_12)² + (y_2 − β̂_1 x_21 − β̂_2 x_22)² + λ(β̂_1² + β̂_2²) ]
(b) Argue that in this setting, the ridge coefficient estimates satisfy β̂_1 = β̂_2.
Solution:
We know the following: x_11 = x_12, so we'll call that x_1, and x_21 = x_22, so we'll call that x_2. Plugging this into the above, we get:
    min [ (y_1 − β̂_1 x_1 − β̂_2 x_1)² + (y_2 − β̂_1 x_2 − β̂_2 x_2)² + λ(β̂_1² + β̂_2²) ]
Taking the partial derivatives of the above with respect to β̂_1 and β̂_2 and setting them equal to 0 will give us the point at which the function is minimized. Doing this, we find:
    β̂_1 (x_1² + x_2² + λ) + β̂_2 (x_1² + x_2²) − y_1 x_1 − y_2 x_2 = 0

and

    β̂_1 (x_1² + x_2²) + β̂_2 (x_1² + x_2² + λ) − y_1 x_1 − y_2 x_2 = 0
Since the right-hand side of both equations is identical, we can set the two left-hand sides
equal to one another:
    β̂_1 (x_1² + x_2² + λ) + β̂_2 (x_1² + x_2²) − y_1 x_1 − y_2 x_2 = β̂_1 (x_1² + x_2²) + β̂_2 (x_1² + x_2² + λ) − y_1 x_1 − y_2 x_2
and then cancel out common terms:
    β̂_1 (x_1² + x_2²) + β̂_1 λ + β̂_2 (x_1² + x_2²) = β̂_1 (x_1² + x_2²) + β̂_2 (x_1² + x_2²) + β̂_2 λ

    β̂_1 λ + β̂_2 (x_1² + x_2²) = β̂_2 (x_1² + x_2²) + β̂_2 λ

    β̂_1 λ = β̂_2 λ

Thus, since λ > 0, β̂_1 = β̂_2.
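As a sanity check, the stationarity equations above can also be verified numerically. Below is a minimal Python sketch (Python rather than the assignment's R, purely for illustration) that runs plain gradient descent on the toy ridge objective with made-up data satisfying the problem's constraints; starting from an asymmetric point, the two coefficients converge to the same value, as the derivation predicts.

```python
# Made-up data satisfying the problem's constraints:
# x11 = x12 = x1, x21 = x22 = x2, y1 + y2 = 0, x11 + x21 = 0.
y1, y2 = 2.0, -2.0
x1, x2 = 1.0, -1.0
lam = 1.0

def ridge_objective(b1, b2):
    # (y1 - b1*x1 - b2*x1)^2 + (y2 - b1*x2 - b2*x2)^2 + lam*(b1^2 + b2^2)
    return ((y1 - (b1 + b2) * x1) ** 2
            + (y2 - (b1 + b2) * x2) ** 2
            + lam * (b1 ** 2 + b2 ** 2))

# Plain gradient descent from a deliberately asymmetric start.
b1, b2 = 0.0, 0.5
for _ in range(5000):
    shared = -2 * x1 * (y1 - (b1 + b2) * x1) - 2 * x2 * (y2 - (b1 + b2) * x2)
    g1 = shared + 2 * lam * b1          # d/db1 of the objective
    g2 = shared + 2 * lam * b2          # d/db2 of the objective
    b1, b2 = b1 - 0.01 * g1, b2 - 0.01 * g2

print(round(b1, 4), round(b2, 4))       # both converge to the same value
```

With these numbers the common value is (y_1 x_1 + y_2 x_2) / (2(x_1² + x_2²) + λ) = 4/5 = 0.8, matching the closed-form solution of the two stationarity equations.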
(c) Write out the lasso optimization problem in this setting.
Solution:
    min [ (y_1 − β̂_1 x_11 − β̂_2 x_12)² + (y_2 − β̂_1 x_21 − β̂_2 x_22)² + λ(|β̂_1| + |β̂_2|) ]
(d) Argue that in this setting, the lasso coefficients β̂_1 and β̂_2 are not unique; in other words, there are many possible solutions to the optimization problem in (c). Describe these solutions.
Solution:
One way to demonstrate that these solutions are not unique is to make a geometric argument. To make things easier, we'll use the alternate form of the lasso constraint that we saw in class, namely |β̂_1| + |β̂_2| ≤ s. If we were to plot this constraint, it takes the familiar shape of a diamond centered at the origin (0, 0).
Next we'll consider the squared-error part of the objective, namely:

    (y_1 − β̂_1 x_11 − β̂_2 x_12)² + (y_2 − β̂_1 x_21 − β̂_2 x_22)²
Using the facts we were given regarding the equivalence of many of the variables (x_12 = x_11, x_21 = x_22 = −x_11, and y_2 = −y_1), we can simplify down to the following optimization:

    min [ 2 (y_1 − (β̂_1 + β̂_2) x_11)² ]
This optimization problem has a minimum at β̂_1 + β̂_2 = y_1 / x_11, which defines a line parallel to one edge of the lasso diamond, β̂_1 + β̂_2 = s.
As β̂_1 and β̂_2 vary along the line β̂_1 + β̂_2 = y_1 / x_11, these contours touch the lasso-diamond edge β̂_1 + β̂_2 = s at different points. As a result, the entire edge β̂_1 + β̂_2 = s is a potential solution to the lasso optimization problem!
A similar argument holds for the opposite lasso-diamond edge, defined by β̂_1 + β̂_2 = −s.
Thus, the lasso coefficients are not unique. The general form of the solution is given by two line segments:

    β̂_1 + β̂_2 = s,  β̂_1 ≥ 0,  β̂_2 ≥ 0    and    β̂_1 + β̂_2 = −s,  β̂_1 ≤ 0,  β̂_2 ≤ 0
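The flatness of the objective along a diamond edge is easy to check numerically. The following Python sketch (with made-up numbers chosen to satisfy the constraints x_11 = x_12 = 1, x_21 = x_22 = −1, y_2 = −y_1) evaluates the simplified lasso objective at several different splits of the same non-negative sum s between the two coefficients:

```python
y1, x11 = 2.0, 1.0   # made-up data; y2 = -y1 and x21 = x22 = -x11
lam = 0.5

def lasso_objective(b1, b2):
    # Simplified objective: 2*(y1 - (b1 + b2)*x11)^2 + lam*(|b1| + |b2|)
    return 2 * (y1 - (b1 + b2) * x11) ** 2 + lam * (abs(b1) + abs(b2))

# Split the same non-negative sum s between b1 and b2 in different ways.
s = 1.2
vals = [lasso_objective(t * s, (1 - t) * s) for t in (0.0, 0.25, 0.5, 0.75, 1.0)]
print(vals)  # every split gives (essentially) the same objective value
```

Because both the squared-error term and the penalty depend only on b1 + b2 when both coefficients are non-negative, every point on the edge achieves the same objective value.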
Applied Exercises
6.8.9 (p. 263 ISLR)
In this exercise, we will predict the number of applications received using the other variables in the College data set. For consistency, please use set.seed(11) before beginning.

(a) Split the data set into a training set and a test set.

(b) Fit a linear model using least squares on the training set, and report the test error obtained.

(c) Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.

(d) Fit a lasso model on the training set, with λ chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.

(e) Fit a PCR model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.

(f) Fit a PLS model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.

(g) Comment on the results you obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?
A6 Applied Solutions
6.8.9 (a)
library(ISLR)
library(dplyr)

Check to make sure we don't have any null values:

sum(is.na(College))
## [1] 0

Split the data set into a training set and a test set.

set.seed(1)
train = College %>% sample_frac(0.5)
test  = College %>% setdiff(train)
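For readers not using dplyr, the 50/50 split can be sketched with nothing but the standard library (Python here purely for illustration; the College data set has 777 rows):

```python
import random

random.seed(1)
n_rows = 777                              # rows in ISLR's College data
rows = list(range(n_rows))
train_idx = set(random.sample(rows, n_rows // 2))   # half the rows, at random
test_idx = [i for i in rows if i not in train_idx]  # everything else
print(len(train_idx), len(test_idx))      # 388 389
```

As with sample_frac followed by setdiff, the two index sets are disjoint and together cover the whole data set.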
6.8.9 (b)
Fit a linear model using least squares on the training set, and report the test error obtained.

lm_fit = lm(Apps ~ ., data = train)
lm_pred = predict(lm_fit, test)
mean((test[, "Apps"] - lm_pred)^2)
## [1] 1108531
6.8.9 (c)
Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.

library(glmnet)

# Build model matrices for test and training data
train_mat = model.matrix(Apps ~ ., data = train)
test_mat  = model.matrix(Apps ~ ., data = test)

# Find best lambda using cross-validation;
# alpha = 0 --> use ridge regression
grid = 10^seq(4, -2, length = 100)
mod_ridge = cv.glmnet(train_mat, train[, "Apps"],
                      alpha = 0, lambda = grid, thresh = 1e-12)
lambda_best_ridge = mod_ridge$lambda.min

# Predict on test data, report error
ridge_pred = predict(mod_ridge, newx = test_mat, s = lambda_best_ridge)
mean((test[, "Apps"] - ridge_pred)^2)
## [1] 1108512
6.8.9 (d)
Fit a lasso model on the training set, with λ chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.

# Find best lambda using cross-validation;
# alpha = 1 --> use lasso
mod_lasso = cv.glmnet(train_mat, train[, "Apps"],
                      alpha = 1, lambda = grid, thresh = 1e-12)
lambda_best_lasso = mod_lasso$lambda.min

# Predict on test data, report error
lasso_pred = predict(mod_lasso, newx = test_mat, s = lambda_best_lasso)
mean((test[, "Apps"] - lasso_pred)^2)
## [1] 1028718

predict(mod_lasso, newx = test_mat, s = lambda_best_lasso, type = "coefficients")
## 19 x 1 sparse Matrix of class "dgCMatrix"
##                         1
## (Intercept) -4.248125e+02
## (Intercept)  .
## PrivateYes  -4.955003e+02
## Accept       1.540306e+00
## Enroll      -3.900157e-01
## Top10perc    4.779689e+01
## Top25perc   -7.926581e+00
## F.Undergrad -9.846932e-03
## P.Undergrad  .
## Outstate    -5.231286e-02
## Room.Board   1.880308e-01
## Books        1.265938e-03
## Personal     .
## PhD         -4.137294e+00
## Terminal    -3.184316e+00
## S.F.Ratio    .
## perc.alumni -2.181304e+00
## Expend       3.193679e-02
## Grad.Rate    2.877667e+00
6.8.9 (e)
Results for OLS, ridge, and lasso are comparable. Lasso reduces the P.Undergrad, Personal, and S.F.Ratio variables to zero and shrinks the coefficients of the other variables. Below are the test R² values for all models.
test_avg = mean(test[, "Apps"])

lm_test_r2    = 1 - mean((test[, "Apps"] - lm_pred)^2)    / mean((test[, "Apps"] - test_avg)^2)
ridge_test_r2 = 1 - mean((test[, "Apps"] - ridge_pred)^2) / mean((test[, "Apps"] - test_avg)^2)
lasso_test_r2 = 1 - mean((test[, "Apps"] - lasso_pred)^2) / mean((test[, "Apps"] - test_avg)^2)

barplot(c(lm_test_r2, ridge_test_r2, lasso_test_r2),
        ylim = c(0, 1),
        names.arg = c("OLS", "Ridge", "Lasso"),
        main = "Test R-squared")
abline(h = 0.9, col = "red")
[Figure: bar plot of test R-squared for OLS, Ridge, and Lasso; all three bars exceed the red reference line at 0.9.]
Since the test R² values for all three models are above 0.90, they all predict the number of college applications with high accuracy.