HW8

11.1 Using the crime data set uscrime.txt from Questions 8.2, 9.1, and 10.1, build a regression model using:
1. Stepwise regression
2. Lasso
3. Elastic net

11.1.1 Stepwise Regression

Below is the implementation of stepwise regression, using the olsrr package in R. I first built a linear regression model with all the predictors to check how it performs; as shown below, it has an adjusted R2 of 0.7078. To improve the model, I then applied backward stepwise regression based on p-value: any predictor with a p-value above 0.05 was excluded from the model, and the adjusted R2 rose to 0.731. The various metrics for the model are plotted and shown in the table output below.

#removing all environment variables and importing libraries
rm(list = ls())
library(olsrr)
##
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
##
##     rivers

#importing data into a df
df = read.table("USCrime.txt", header = TRUE)

#apply linear regression model on all the variables in the dataset
model = lm(Crime ~ ., data = df)

#model summary
summary(model)
##
## Call:
## lm(formula = Crime ~ ., data = df)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -395.74  -98.09   -6.69  112.99  512.67
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.984e+03  1.628e+03  -3.675 0.000893 ***
## M            8.783e+01  4.171e+01   2.106 0.043443 *
## So          -3.803e+00  1.488e+02  -0.026 0.979765
## Ed           1.883e+02  6.209e+01   3.033 0.004861 **
## Po1          1.928e+02  1.061e+02   1.817 0.078892 .
## Po2         -1.094e+02  1.175e+02  -0.931 0.358830
## LF          -6.638e+02  1.470e+03  -0.452 0.654654
## M.F          1.741e+01  2.035e+01   0.855 0.398995
## Pop         -7.330e-01  1.290e+00  -0.568 0.573845
## NW           4.204e+00  6.481e+00   0.649 0.521279
## U1          -5.827e+03  4.210e+03  -1.384 0.176238
## U2           1.678e+02  8.234e+01   2.038 0.050161 .
## Wealth       9.617e-02  1.037e-01   0.928 0.360754
## Ineq         7.067e+01  2.272e+01   3.111 0.003983 **
## Prob        -4.855e+03  2.272e+03  -2.137 0.040627 *
## Time        -3.479e+00  7.165e+00  -0.486 0.630708
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 209.1 on 31 degrees of freedom
## Multiple R-squared:  0.8031, Adjusted R-squared:  0.7078
## F-statistic: 8.429 on 15 and 31 DF,  p-value: 3.539e-07

#step wise backward regression using p value
back <- ols_step_backward_p(model, pent = 0.05, prem = 0.05, progress = TRUE)
## Backward Elimination Method
## ---------------------------
##
## Candidate Terms:
##
## 1. M
## 2. So
## 3. Ed
## 4. Po1
## 5. Po2
## 6. LF
## 7. M.F
## 8. Pop
## 9. NW
## 10. U1
## 11. U2
## 12. Wealth
## 13. Ineq
## 14. Prob
## 15. Time
##
## We are eliminating variables based on p value...
##
## Variables Removed:
##
## - So
## - Time
## - LF
## - NW
## - Po2
## - Pop
## - Wealth
## - M.F
## - U1
##
## No more variables satisfy the condition of p value = 0.05
##
##
## Final Model Output
## ------------------
##
##                         Model Summary
## -----------------------------------------------------------------
## R                  0.875       RMSE                   200.690
## R-Squared          0.766       Coef. Var               22.174
## Adj. R-Squared     0.731       MSE                  40276.421
## Pred R-Squared     0.666       MAE                    138.674
## -----------------------------------------------------------------
##  RMSE: Root Mean Square Error
##  MSE: Mean Square Error
##  MAE: Mean Absolute Error
##
##                                ANOVA
## -----------------------------------------------------------------------
##                  Sum of
##                 Squares    DF    Mean Square       F       Sig.
## -----------------------------------------------------------------------
## Regression  5269870.803     6     878311.801    21.807    0.0000
## Residual    1611056.856    40      40276.421
## Total       6880927.660    46
## -----------------------------------------------------------------------
##
##                               Parameter Estimates
## ---------------------------------------------------------------------------------------
##       model         Beta    Std. Error    Std. Beta        t      Sig        lower        upper
## ---------------------------------------------------------------------------------------
## (Intercept)    -5040.505       899.843                -5.602    0.000    -6859.156    -3221.854
##           M      105.020        33.299        0.341    3.154    0.003       37.719      172.320
##          Ed      196.471        44.754        0.568    4.390    0.000      106.019      286.923
##         Po1      115.024        13.754        0.884    8.363    0.000       87.227      142.821
##          U2       89.366        40.906        0.195    2.185    0.035        6.693      172.039
##        Ineq       67.653        13.936        0.698    4.855    0.000       39.488       95.818
##        Prob    -3801.836      1528.097       -0.224   -2.488    0.017    -6890.236     -713.436
## ---------------------------------------------------------------------------------------

plot(back)

Below is the linear model built by the backward elimination technique. The important predictors are M, Ed, Po1, U2, Ineq and Prob.

back$model
##
## Call:
## lm(formula = paste(response, "~", paste(preds, collapse = " + ")),
##     data = l)
##
## Coefficients:
## (Intercept)            M           Ed          Po1           U2         Ineq
##    -5040.50       105.02       196.47       115.02        89.37        67.65
##        Prob
##    -3801.84
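To double-check that the nine dropped predictors add little beyond the six retained ones, the reduced model can be compared against the full model with base R. This is only a minimal sketch (it assumes df and the full model object from above are still in the workspace); anova() runs a partial F-test between the nested models and AIC() compares their information criteria.

#sketch: compare the reduced backward-selection model against the full model
reduced <- lm(Crime ~ M + Ed + Po1 + U2 + Ineq + Prob, data = df)
anova(reduced, model)   #partial F-test; a large p-value means the dropped terms add little
AIC(model)              #AIC of the full 15-predictor model
AIC(reduced)            #AIC of the 6-predictor model (lower is better)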
Similar to the backward elimination above, I also implemented forward selection based on p-value, so that only attributes with p < 0.05 were entered into the model. As shown below, forward selection gives the same adjusted R2 of 0.731. I have also plotted the corresponding metrics (AIC, R2, Adj. R2, Cp, SBC, SBIC).

#step wise forward regression using p value
forward <- ols_step_forward_p(model, pent = 0.05, prem = 0.05, progress = TRUE)
## Forward Selection Method
## ---------------------------
##
## Candidate Terms:
##
## 1. M
## 2. So
## 3. Ed
## 4. Po1
## 5. Po2
## 6. LF
## 7. M.F
## 8. Pop
## 9. NW
## 10. U1
## 11. U2
## 12. Wealth
## 13. Ineq
## 14. Prob
## 15. Time
##
## We are selecting variables based on p value...
##
## Variables Entered:
##
## - Po1
## - Ineq
## - Ed
## - M
## - Prob
## - U2
##
## No more variables to be added.
##
## Final Model Output
## ------------------
##
##                         Model Summary
## -----------------------------------------------------------------
## R                  0.875       RMSE                   200.690
## R-Squared          0.766       Coef. Var               22.174
## Adj. R-Squared     0.731       MSE                  40276.421
## Pred R-Squared     0.666       MAE                    138.674
## -----------------------------------------------------------------
##  RMSE: Root Mean Square Error
##  MSE: Mean Square Error
##  MAE: Mean Absolute Error
##
##                                ANOVA
## -----------------------------------------------------------------------
##                  Sum of
##                 Squares    DF    Mean Square       F       Sig.
## -----------------------------------------------------------------------
## Regression  5269870.803     6     878311.801    21.807    0.0000
## Residual    1611056.856    40      40276.421
## Total       6880927.660    46
## -----------------------------------------------------------------------
##
##                               Parameter Estimates
## ---------------------------------------------------------------------------------------
##       model         Beta    Std. Error    Std. Beta        t      Sig        lower        upper
## ---------------------------------------------------------------------------------------
## (Intercept)    -5040.505       899.843                -5.602    0.000    -6859.156    -3221.854
##         Po1      115.024        13.754        0.884    8.363    0.000       87.227      142.821
##        Ineq       67.653        13.936        0.698    4.855    0.000       39.488       95.818
##          Ed      196.471        44.754        0.568    4.390    0.000      106.019      286.923
##           M      105.020        33.299        0.341    3.154    0.003       37.719      172.320
##        Prob    -3801.836      1528.097       -0.224   -2.488    0.017    -6890.236     -713.436
##          U2       89.366        40.906        0.195    2.185    0.035        6.693      172.039
## ---------------------------------------------------------------------------------------

#plotting the different measures like aic,r2,adjr2,cp,sbc,sbic
plot(forward)
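As an optional cross-check on the olsrr results, forward selection can also be run with base R's step() function, which selects on AIC rather than p-values. This is only a sketch, assuming df and the full model object are still in memory; it is not part of the output above.

#sketch: forward selection with base R step(), driven by AIC instead of p values
null_model <- lm(Crime ~ 1, data = df)                   #intercept-only starting point
step(null_model, scope = formula(model),                 #upper scope = all 15 predictors
     direction = "forward", trace = FALSE)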
Below is the model and the attributes used after applying forward stepwise regression.

forward$model
##
## Call:
## lm(formula = paste(response, "~", paste(preds, collapse = " + ")),
##     data = l)
##
## Coefficients:
## (Intercept)          Po1         Ineq           Ed            M         Prob
##    -5040.50       115.02        67.65       196.47       105.02     -3801.84
##          U2
##       89.37

Next, to see whether anything changes when forward and backward selection are used together, I implemented the combined (both-direction) stepwise technique based on p-value, again entering or removing attributes at the p = 0.05 threshold. As shown below, this gives the same adjusted R2 of 0.731, so no change. I have also plotted the corresponding metrics (AIC, R2, Adj. R2, Cp, SBC, SBIC).
#step wise both regression using p value
k <- ols_step_both_p(model, pent = 0.05, prem = 0.05, progress = TRUE)
## Stepwise Selection Method
## ---------------------------
##
## Candidate Terms:
##
## 1. M
## 2. So
## 3. Ed
## 4. Po1
## 5. Po2
## 6. LF
## 7. M.F
## 8. Pop
## 9. NW
## 10. U1
## 11. U2
## 12. Wealth
## 13. Ineq
## 14. Prob
## 15. Time
##
## We are selecting variables based on p value...
##
## Variables Entered/Removed:
##
## - Po1 added
## - Ineq added
## - Ed added
## - M added
## - Prob added
## - U2 added
##
## No more variables to be added/removed.
##
##
## Final Model Output
## ------------------
##
##                         Model Summary
## -----------------------------------------------------------------
## R                  0.875       RMSE                   200.690
## R-Squared          0.766       Coef. Var               22.174
## Adj. R-Squared     0.731       MSE                  40276.421
## Pred R-Squared     0.666       MAE                    138.674
## -----------------------------------------------------------------
##  RMSE: Root Mean Square Error
##  MSE: Mean Square Error
##  MAE: Mean Absolute Error
##
##                                ANOVA
## -----------------------------------------------------------------------
##                  Sum of
##                 Squares    DF    Mean Square       F       Sig.
## -----------------------------------------------------------------------
## Regression  5269870.803     6     878311.801    21.807    0.0000
## Residual    1611056.856    40      40276.421
## Total       6880927.660    46
## -----------------------------------------------------------------------
##
##                               Parameter Estimates
## ---------------------------------------------------------------------------------------
##       model         Beta    Std. Error    Std. Beta        t      Sig        lower        upper
## ---------------------------------------------------------------------------------------
## (Intercept)    -5040.505       899.843                -5.602    0.000    -6859.156    -3221.854
##         Po1      115.024        13.754        0.884    8.363    0.000       87.227      142.821
##        Ineq       67.653        13.936        0.698    4.855    0.000       39.488       95.818
##          Ed      196.471        44.754        0.568    4.390    0.000      106.019      286.923
##           M      105.020        33.299        0.341    3.154    0.003       37.719      172.320
##        Prob    -3801.836      1528.097       -0.224   -2.488    0.017    -6890.236     -713.436
##          U2       89.366        40.906        0.195    2.185    0.035        6.693      172.039
## ---------------------------------------------------------------------------------------

#plotting the different measures like aic,r2,adjr2,cp,sbc,sbic
plot(k)
Below is the final model using the suggested variables.

#building final model using only the variables as suggested by stepwise regression
k$model
##
## Call:
## lm(formula = paste(response, "~", paste(preds, collapse = " + ")),
##     data = l)
##
## Coefficients:
## (Intercept)          Po1         Ineq           Ed            M         Prob
##    -5040.50       115.02        67.65       196.47       105.02     -3801.84
##          U2
##       89.37

To check whether using another metric such as AIC, instead of p-values, can improve the model, I implemented the combined forward/backward stepwise procedure again, this time based on AIC, and plotted the variables it suggests below. As shown below, refitting the model with the suggested variables again gives the same adjusted R2 of 0.7307.
#step wise both regression using aic value
k1 <- ols_step_both_aic(model)
plot(k1)

#both the p value and aic approaches suggest similar variables for prediction
final_model = lm(Crime ~ Po1 + Ineq + Ed + M + Prob + U2, data = df)
summary(final_model)
##
## Call:
## lm(formula = Crime ~ Po1 + Ineq + Ed + M + Prob + U2, data = df)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -470.68  -78.41  -19.68  133.12  556.23
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5040.50     899.84  -5.602 1.72e-06 ***
## Po1           115.02      13.75   8.363 2.56e-10 ***
## Ineq           67.65      13.94   4.855 1.88e-05 ***
## Ed            196.47      44.75   4.390 8.07e-05 ***
## M             105.02      33.30   3.154  0.00305 **
## Prob        -3801.84    1528.10  -2.488  0.01711 *
## U2             89.37      40.91   2.185  0.03483 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 200.7 on 40 degrees of freedom
## Multiple R-squared:  0.7659, Adjusted R-squared:  0.7307
## F-statistic: 21.81 on 6 and 40 DF,  p-value: 3.418e-11

As we can observe, all the stepwise techniques give the same result, and the variance explained by the model (adjusted R2) increased by roughly 2-3 percentage points while using a reduced number of variables.

11.1.2 Lasso Regression (alpha=1)

Lasso is also a technique for variable selection. I have used the glmnet package as suggested, and split the data into train and test sets to check model performance. I first ran cv.glmnet to find the optimal lambda value for the lasso (cv.glmnet fits glmnet internally), then predicted the test data using that optimal lambda. The resulting R2 on the test data was only about 0.077. The cross-validation plot suggests that around 6 attributes yield the lowest mean squared error for this model.

#removing all environment variables and importing libraries
rm(list = ls())
library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.1-4

#importing data into a df
df = read.table("USCrime.txt", header = TRUE)
vars <- model.matrix(Crime ~ ., data = df)[, -1]
response <- df$Crime

# Splitting the data into test and train
set.seed(101)
train = sample(1:nrow(vars), nrow(vars)/2)
x_test = (-train)
y_test = response[x_test]

#For Lasso alpha=1, to find the optimal value of lambda
cv_lasso <- cv.glmnet(vars[train,], response[train], family = 'gaussian',
                      alpha = 1, nlambda = 25, type.measure = 'mse', nfolds = 3)
plot(cv_lasso)
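For reference, glmnet fits the penalised least-squares objective (1/2n)*RSS + lambda*[(1-alpha)/2*||beta||2^2 + alpha*||beta||1], so with alpha = 1 the pure L1 penalty is what shrinks some coefficients exactly to zero. A minimal sketch (assuming the cv_lasso object from the chunk above) for visualising which coefficients survive as lambda grows:

#sketch: coefficient paths of the underlying glmnet fit stored inside cv_lasso
plot(cv_lasso$glmnet.fit, xvar = "lambda", label = TRUE)  #each curve is one predictor
abline(v = log(cv_lasso$lambda.min), lty = 2)             #lambda chosen by cross-validation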
Looking at the coefficients of the model, we can see that the variables that are less relevant for explaining the variance have been shrunk to 0, and the model predicts using only the remaining coefficients.

# identifying best lambda
best_lam <- cv_lasso$lambda.min
best_lam
## [1] 22.84079

# predicting with the glmnet model at the best lambda value identified
pred <- predict(cv_lasso, s = best_lam, newx = vars[x_test,])
coef(cv_lasso, s = 'lambda.min')
## 16 x 1 sparse Matrix of class "dgCMatrix"
##                        s1
## (Intercept) -1203.2315772
## M              62.3809880
## So             62.1505537
## Ed             40.5885099
## Po1            56.3101897
## Po2             .
## LF              .
## M.F             .
## Pop             .
## NW              0.7876006
## U1              .
## U2              7.0318148
## Wealth          .
## Ineq            .
## Prob            .
## Time            5.8429018

final <- cbind(y_test, pred)
# Checking the first six obs
head(final)
##    y_test        s1
## 2    1635  908.6346
## 4    1969 1183.7273
## 5    1234  922.1215
## 7     963  722.0461
## 8    1555  948.5296
## 10    705  681.4241

#calculating the R2 value
actual <- y_test
rss <- sum((pred - actual) ^ 2)
tss <- sum((actual - mean(actual)) ^ 2)
rsq <- 1 - rss/tss
rsq
## [1] 0.07654224

11.1.3 Elastic Net (alpha=0.5)

Similar to the Lasso, for Elastic Net I used the glmnet package. After the usual train/test split, I ran cv.glmnet to get the optimal lambda value. The plot below shows that Elastic Net needs 12 variables to reach the lowest mean squared error. Using this model to predict the test data gives an R2 of about 0.216 on the test set.

#removing all environment variables and importing libraries
rm(list = ls())
library(glmnet)

#importing data into a df
df = read.table("USCrime.txt", header = TRUE)
inputs <- model.matrix(Crime ~ ., data = df)[, -1]
response <- df$Crime

# Splitting the data into test and train
set.seed(51)
train = sample(1:nrow(inputs), nrow(inputs)/2)
x_test = (-train)
y_test = response[x_test]
#For Elastic Net we need to find the optimal value of lambda
cv_elastic <- cv.glmnet(inputs[train,], response[train], family = 'gaussian',
                        alpha = 0.5, nlambda = 25, type.measure = 'mse', nfolds = 3)
plot(cv_elastic)

Looking at the coefficients of this model, we again see that the variables that are less relevant for explaining the variance have been shrunk to 0, and the model predicts using only the remaining coefficients.

# identifying best lambda
best_lam <- cv_elastic$lambda.min
best_lam
## [1] 19.99938

pred <- predict(cv_elastic, s = best_lam, newx = inputs[x_test,])
coef(cv_elastic, s = "lambda.min")
## 16 x 1 sparse Matrix of class "dgCMatrix"
##                       s1
## (Intercept) -6337.443614
## M              12.465075
## So              .
## Ed            152.864071
## Po1            45.910346
## Po2            26.195996
## LF            -62.382724
## M.F            39.254080
## Pop             .
## NW             15.773117
## U1           -176.699076
## U2            137.505163
## Wealth          .
## Ineq           43.290013
## Prob        -6669.213248
## Time           -3.926477

final <- cbind(y_test, pred)
# Checking the first six obs
head(final)
##    y_test        s1
## 3     578  505.5841
## 5    1234  982.3946
## 6     682  776.0849
## 7     963 1013.2904
## 9     856  671.9770
## 11   1674 1136.7451

#calculating the R2 value
actual <- y_test
rss <- sum((pred - actual) ^ 2)
tss <- sum((actual - mean(actual)) ^ 2)
rsq <- 1 - rss/tss
rsq
## [1] 0.2159214

In summary, stepwise regression and the Lasso did a better job of variable selection on this data set than Elastic Net. This is likely because of the highly correlated variables in the data set: Elastic Net is a combination of Lasso and Ridge, so it tends to keep groups of correlated predictors, which is why it retained most of the variables (12 of 15) to reach its lowest mean squared error.
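The claim about correlated predictors can be checked directly. This is an optional sketch, assuming the data frame is reloaded and the car package is installed; it inspects pairwise correlations among the predictors and the variance inflation factors of the full linear model.

#sketch: quantify multicollinearity among the predictors
df <- read.table("USCrime.txt", header = TRUE)
round(cor(df[, names(df) != "Crime"]), 2)  #pairwise correlations; look for pairs such as
                                           #Po1/Po2 with very high correlation
library(car)                               #assumption: car package installed
vif(lm(Crime ~ ., data = df))              #VIF well above 10 flags problematic collinearity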