HW8

11.1 Using the crime data set uscrime.txt from Questions 8.2, 9.1, and 10.1, build a regression model using:
1. Stepwise regression
2. Lasso
3. Elastic net

11.1.1 Stepwise Regression

Below is the implementation of stepwise regression, for which I used the olsrr package in R. I first built a linear regression model with all the predictors to see how it performs; as shown below, it gives an adjusted R-squared of 0.7078. To improve the model, I then applied backward stepwise regression based on p-values: any predictor with p >= 0.05 was excluded from the model, and the adjusted R-squared rose to 0.731. The various model metrics are plotted and shown in the table output below.

#removing all environment variables and importing libraries
rm(list = ls())
library(olsrr)

## 
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
## 
##     rivers
#importing data into a df
df = read.table("USCrime.txt", header = TRUE)

#apply linear regression model on all the variables in the dataset
model = lm(Crime ~ ., data = df)

#model summary
summary(model)
## 
## Call:
## lm(formula = Crime ~ ., data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -395.74  -98.09   -6.69  112.99  512.67 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5.984e+03  1.628e+03  -3.675 0.000893 ***
## M            8.783e+01  4.171e+01   2.106 0.043443 *  
## So          -3.803e+00  1.488e+02  -0.026 0.979765    
## Ed           1.883e+02  6.209e+01   3.033 0.004861 ** 
## Po1          1.928e+02  1.061e+02   1.817 0.078892 .  
## Po2         -1.094e+02  1.175e+02  -0.931 0.358830    
## LF          -6.638e+02  1.470e+03  -0.452 0.654654    
## M.F          1.741e+01  2.035e+01   0.855 0.398995    
## Pop         -7.330e-01  1.290e+00  -0.568 0.573845    
## NW           4.204e+00  6.481e+00   0.649 0.521279    
## U1          -5.827e+03  4.210e+03  -1.384 0.176238    
## U2           1.678e+02  8.234e+01   2.038 0.050161 .  
## Wealth       9.617e-02  1.037e-01   0.928 0.360754    
## Ineq         7.067e+01  2.272e+01   3.111 0.003983 ** 
## Prob        -4.855e+03  2.272e+03  -2.137 0.040627 *  
## Time        -3.479e+00  7.165e+00  -0.486 0.630708    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 209.1 on 31 degrees of freedom
## Multiple R-squared:  0.8031, Adjusted R-squared:  0.7078 
## F-statistic: 8.429 on 15 and 31 DF,  p-value: 3.539e-07
#step wise backward regression using p value
back <- ols_step_backward_p(model, pent = 0.05, prem = 0.05, progress = TRUE)
## Backward Elimination Method 
## ---------------------------
## 
## Candidate Terms: M, So, Ed, Po1, Po2, LF, M.F, Pop, NW, U1, U2, Wealth, Ineq, Prob, Time
## 
## We are eliminating variables based on p value...
## 
## Variables Removed: So, Time, LF, NW, Po2, Pop, Wealth, M.F, U1
## 
## No more variables satisfy the condition of p value = 0.05
## 
## 
## Final Model Output
## ------------------
## 
## Model Summary
## -----------------------------------------------------------------
## R                  0.875       RMSE          200.690 
## R-Squared          0.766       Coef. Var      22.174 
## Adj. R-Squared     0.731       MSE         40276.421 
## Pred R-Squared     0.666       MAE           138.674 
## -----------------------------------------------------------------
## RMSE: Root Mean Square Error 
## MSE: Mean Square Error 
## MAE: Mean Absolute Error 
## 
## ANOVA
## -----------------------------------------------------------------------
##                Sum of Squares    DF    Mean Square       F        Sig. 
## -----------------------------------------------------------------------
## Regression        5269870.803     6     878311.801    21.807    0.0000 
## Residual          1611056.856    40      40276.421 
## Total             6880927.660    46 
## -----------------------------------------------------------------------
## 
## Parameter Estimates
## ----------------------------------------------------------------------------------------------
##       model       Beta    Std. Error    Std. Beta       t       Sig        lower        upper 
## ----------------------------------------------------------------------------------------------
## (Intercept)   -5040.505      899.843                -5.602    0.000    -6859.156    -3221.854 
##           M     105.020       33.299       0.341     3.154    0.003       37.719      172.320 
##          Ed     196.471       44.754       0.568     4.390    0.000      106.019      286.923 
##         Po1     115.024       13.754       0.884     8.363    0.000       87.227      142.821 
##          U2      89.366       40.906       0.195     2.185    0.035        6.693      172.039 
##        Ineq      67.653       13.936       0.698     4.855    0.000       39.488       95.818 
##        Prob   -3801.836     1528.097      -0.224    -2.488    0.017    -6890.236     -713.436 
## ----------------------------------------------------------------------------------------------
plot(back)
Below is the linear model built using the backward regression technique. The important predictors are M, Ed, Po1, U2, Ineq, and Prob.

back$model
## 
## Call:
## lm(formula = paste(response, "~", paste(preds, collapse = " + ")), 
##     data = l)
## 
## Coefficients:
## (Intercept)            M           Ed          Po1           U2         Ineq         Prob  
##    -5040.50       105.02       196.47       115.02        89.37        67.65     -3801.84  
Similar to the backward regression above, I also implemented forward regression based on p-values, so that only attributes with p < 0.05 enter the model. As shown below, forward selection gives the same adjusted R-squared of 0.731. I have also plotted the corresponding metrics (AIC, R2, Adj. R2, Cp, SBC, SBIC).

#step wise forward regression using p value
forward <- ols_step_forward_p(model, pent = 0.05, prem = 0.05, progress = TRUE)
## Forward Selection Method 
## ---------------------------
## 
## Candidate Terms: M, So, Ed, Po1, Po2, LF, M.F, Pop, NW, U1, U2, Wealth, Ineq, Prob, Time
## 
## We are selecting variables based on p value...
## 
## Variables Entered: Po1, Ineq, Ed, M, Prob, U2
## 
## No more variables to be added.
## 
## Final Model Output
## ------------------
## 
## Model Summary
## -----------------------------------------------------------------
## R                  0.875       RMSE          200.690 
## R-Squared          0.766       Coef. Var      22.174 
## Adj. R-Squared     0.731       MSE         40276.421 
## Pred R-Squared     0.666       MAE           138.674 
## -----------------------------------------------------------------
## RMSE: Root Mean Square Error 
## MSE: Mean Square Error 
## MAE: Mean Absolute Error 
## 
## ANOVA
## -----------------------------------------------------------------------
##                Sum of Squares    DF    Mean Square       F        Sig. 
## -----------------------------------------------------------------------
## Regression        5269870.803     6     878311.801    21.807    0.0000 
## Residual          1611056.856    40      40276.421 
## Total             6880927.660    46 
## -----------------------------------------------------------------------
## 
## Parameter Estimates
## ----------------------------------------------------------------------------------------------
##       model       Beta    Std. Error    Std. Beta       t       Sig        lower        upper 
## ----------------------------------------------------------------------------------------------
## (Intercept)   -5040.505      899.843                -5.602    0.000    -6859.156    -3221.854 
##         Po1     115.024       13.754       0.884     8.363    0.000       87.227      142.821 
##        Ineq      67.653       13.936       0.698     4.855    0.000       39.488       95.818 
##          Ed     196.471       44.754       0.568     4.390    0.000      106.019      286.923 
##           M     105.020       33.299       0.341     3.154    0.003       37.719      172.320 
##        Prob   -3801.836     1528.097      -0.224    -2.488    0.017    -6890.236     -713.436 
##          U2      89.366       40.906       0.195     2.185    0.035        6.693      172.039 
## ----------------------------------------------------------------------------------------------
#plotting the different measures like aic,r2,adjr2,cp,sbc,sbic
plot(forward)
Below is the model and the attributes selected by stepwise forward regression.

forward$model
## 
## Call:
## lm(formula = paste(response, "~", paste(preds, collapse = " + ")), 
##     data = l)
## 
## Coefficients:
## (Intercept)          Po1         Ineq           Ed            M         Prob           U2  
##    -5040.50       115.02        67.65       196.47       105.02     -3801.84        89.37  

Next, to see whether anything changes when forward and backward stepwise are used together, I implemented the combined (both-direction) technique, again based on p-values so that only attributes with p < 0.05 remain in the model. As shown below, we get the same adjusted R-squared of 0.731, i.e. no change. I have also plotted the corresponding metrics (AIC, R2, Adj. R2, Cp, SBC, SBIC).
#step wise both regression using p value
k <- ols_step_both_p(model, pent = 0.05, prem = 0.05, progress = TRUE)
## Stepwise Selection Method 
## ---------------------------
## 
## Candidate Terms: M, So, Ed, Po1, Po2, LF, M.F, Pop, NW, U1, U2, Wealth, Ineq, Prob, Time
## 
## We are selecting variables based on p value...
## 
## Variables Entered/Removed: Po1 added, Ineq added, Ed added, M added, Prob added, U2 added
## 
## No more variables to be added/removed.
## 
## 
## Final Model Output
## ------------------
## 
## Model Summary
## -----------------------------------------------------------------
## R                  0.875       RMSE          200.690 
## R-Squared          0.766       Coef. Var      22.174 
## Adj. R-Squared     0.731       MSE         40276.421 
## Pred R-Squared     0.666       MAE           138.674 
## -----------------------------------------------------------------
## RMSE: Root Mean Square Error 
## MSE: Mean Square Error 
## MAE: Mean Absolute Error 
## 
## ANOVA
## -----------------------------------------------------------------------
##                Sum of Squares    DF    Mean Square       F        Sig. 
## -----------------------------------------------------------------------
## Regression        5269870.803     6     878311.801    21.807    0.0000 
## Residual          1611056.856    40      40276.421 
## Total             6880927.660    46 
## -----------------------------------------------------------------------
## 
## Parameter Estimates
## ----------------------------------------------------------------------------------------------
##       model       Beta    Std. Error    Std. Beta       t       Sig        lower        upper 
## ----------------------------------------------------------------------------------------------
## (Intercept)   -5040.505      899.843                -5.602    0.000    -6859.156    -3221.854 
##         Po1     115.024       13.754       0.884     8.363    0.000       87.227      142.821 
##        Ineq      67.653       13.936       0.698     4.855    0.000       39.488       95.818 
##          Ed     196.471       44.754       0.568     4.390    0.000      106.019      286.923 
##           M     105.020       33.299       0.341     3.154    0.003       37.719      172.320 
##        Prob   -3801.836     1528.097      -0.224    -2.488    0.017    -6890.236     -713.436 
##          U2      89.366       40.906       0.195     2.185    0.035        6.693      172.039 
## ----------------------------------------------------------------------------------------------
#plotting the different measures like aic,r2,adjr2,cp,sbc,sbic
plot(k)
Below is the final model using the suggested variables.

#building final model using only the variables as suggested by stepwise regression
k$model
## 
## Call:
## lm(formula = paste(response, "~", paste(preds, collapse = " + ")), 
##     data = l)
## 
## Coefficients:
## (Intercept)          Po1         Ineq           Ed            M         Prob           U2  
##    -5040.50       115.02        67.65       196.47       105.02     -3801.84        89.37  

To check whether using a criterion other than p-values, such as AIC, improves the model, I ran the combined forward/backward stepwise search again, this time based on AIC. The variables suggested by the AIC-based search are plotted below. Once I fit a model with the suggested variables, I got the same adjusted R-squared of 0.7307.
#step wise both regression using aic value
k1 <- ols_step_both_aic(model)
plot(k1)
#Both p value and AIC suggest similar variables for prediction
final_model = lm(Crime ~ Po1 + Ineq + Ed + M + Prob + U2, data = df)
summary(final_model)
## 
## Call:
## lm(formula = Crime ~ Po1 + Ineq + Ed + M + Prob + U2, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -470.68  -78.41  -19.68  133.12  556.23 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5040.50     899.84  -5.602 1.72e-06 ***
## Po1           115.02      13.75   8.363 2.56e-10 ***
## Ineq           67.65      13.94   4.855 1.88e-05 ***
## Ed            196.47      44.75   4.390 8.07e-05 ***
## M             105.02      33.30   3.154  0.00305 ** 
## Prob        -3801.84    1528.10  -2.488  0.01711 *  
## U2             89.37      40.91   2.185  0.03483 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 200.7 on 40 degrees of freedom
## Multiple R-squared:  0.7659, Adjusted R-squared:  0.7307 
## F-statistic: 21.81 on 6 and 40 DF,  p-value: 3.418e-11
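One could also cross-check the p-value-based selection with base R's step(), which searches by AIC; a minimal sketch, assuming model is the full lm(Crime ~ ., data = df) fit from above. Because step() scores candidates by AIC rather than p-values, it may keep a slightly larger subset than the six variables selected here.

# Sketch: AIC-based stepwise search with base R's step(), as a cross-check (not part of the olsrr workflow above)
step_model <- step(model, direction = "both", trace = FALSE)
formula(step_model)                   # predictors retained by the AIC search
summary(step_model)$adj.r.squared     # adjusted R-squared of that fit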
As we can observe, all of the stepwise variants selected the same set of predictors, and the adjusted R-squared improved by about 3% (from 0.7078 with all predictors to 0.7307 with the reduced set of variables).

11.1.2 Lasso Regression (alpha=1)

Lasso is also a variable-selection technique. I used the glmnet package as suggested and split the data into train and test sets to check model performance. I first ran cv.glmnet to find the optimal value of lambda (cv.glmnet runs glmnet internally), then predicted the test data using that optimal lambda; the resulting model gave an R-squared of about 0.0765 on the test set. The cross-validation plot suggests that around 6 attributes yield the lowest mean squared error for this model.

#removing all environment variables and importing libraries
rm(list = ls())
library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.1-4
#importing data into a df
df = read.table("USCrime.txt", header = TRUE)

vars <- model.matrix(Crime ~ ., data = df)[, -1]
response <- df$Crime
# Splitting the data into test and train
set.seed(101)
train = sample(1:nrow(vars), nrow(vars)/2)
x_test = (-train)
y_test = response[x_test]
#For Lasso alpha=1, to find the optimal value of lambda
cv_lasso <- cv.glmnet(vars[train,], response[train], family = 'gaussian',
                      alpha = 1, nlambda = 25, type.measure = 'mse', nfolds = 3)
plot(cv_lasso)
Looking at the coefficients of the model, we can see that the variables less relevant for explaining the variance have been shrunk to 0, and the model uses only the remaining coefficients for prediction.

# identifying best lambda
best_lam <- cv_lasso$lambda.min
best_lam
## [1] 22.84079
# predicting with the model at the best lambda value identified
pred <- predict(cv_lasso, s = best_lam, newx = vars[x_test,])
coef(cv_lasso, s = 'lambda.min')
## 16 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) -1203.2315772
## M 62.3809880
## So 62.1505537
## Ed 40.5885099
## Po1 56.3101897
## Po2             .        
## LF              .        
## M.F             .        
## Pop             .        
## NW              0.7876006
## U1              .        
## U2              7.0318148
## Wealth          .        
## Ineq            .        
## Prob            .        
## Time            5.8429018
final <- cbind(y_test, pred)
# Checking the first six obs
head(final)
## y_test s1
## 2 1635 908.6346
## 4 1969 1183.7273
## 5 1234 922.1215
## 7 963 722.0461
## 8 1555 948.5296
## 10 705 681.4241
#calculating the R2 value
actual <- y_test
rss <- sum((pred - actual) ^ 2)
tss <- sum((actual - mean(actual)) ^ 2)
rsq <- 1 - rss/tss
rsq
## [1] 0.07654224
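To make the Lasso result easier to compare with the stepwise models above, one could refit an ordinary least-squares model on just the predictors Lasso kept; a minimal sketch, where the variable list is simply the set of nonzero coefficients at lambda.min printed above, and df is the data frame loaded earlier in this section.

# Sketch: refit OLS on the predictors Lasso kept (nonzero at lambda.min above),
# using the full data set so the adjusted R-squared is comparable to the stepwise models
lasso_vars <- c("M", "So", "Ed", "Po1", "NW", "U2", "Time")
lasso_fit  <- lm(Crime ~ ., data = df[, c(lasso_vars, "Crime")])
summary(lasso_fit)$adj.r.squared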
11.1.3 Elastic Net (alpha=0.5)

As with Lasso, I used the glmnet package for the Elastic net. After the usual train/test split, I ran cv.glmnet to get the optimal lambda value. The cross-validation plot below suggests that the Elastic net needs about 12 variables to reach the lowest mean squared error. Using this model to predict the test data gives an R-squared of about 0.216 on the test set.

#removing all environment variables and importing libraries
rm(list = ls())
library(glmnet)
#importing data into a df
df = read.table("USCrime.txt", header = TRUE)

inputs <- model.matrix(Crime ~ ., data = df)[, -1]
response <- df$Crime
# Splitting the data into test and train
set.seed(51)
train = sample(1:nrow(inputs), nrow(inputs)/2)
x_test = (-train)
y_test = response[x_test]
#For Elastic we need to find the optimal value of lambda
cv_elastic <- cv.glmnet(inputs[train,], response[train], family = 'gaussian',
                        alpha = 0.5, nlambda = 25, type.measure = 'mse', nfolds = 3)
plot(cv_elastic)
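The run above fixes alpha at 0.5. To check whether a different lasso/ridge mix fits the training data better, alpha itself can be tuned by cross-validation; a minimal sketch reusing inputs, response, and train from above (the alpha grid and fold count are arbitrary illustration choices).

# Sketch: tune the elastic-net mixing parameter alpha by cross-validation on the same training split
alphas <- seq(0, 1, by = 0.25)
cv_mse <- sapply(alphas, function(a) {
  cv <- cv.glmnet(inputs[train, ], response[train], family = "gaussian",
                  alpha = a, nfolds = 3, type.measure = "mse")
  min(cv$cvm)   # lowest cross-validated MSE over the lambda path
})
data.frame(alpha = alphas, min_cv_mse = cv_mse)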
Looking at the coefficients of the elastic-net model, we again see that some of the less relevant variables have been shrunk to 0, and the model uses only the remaining coefficients for prediction.

# identifying best lambda
best_lam <- cv_elastic$lambda.min
best_lam
## [1] 19.99938
pred <- predict(cv_elastic, s = best_lam, newx = inputs[x_test,])
coef(cv_elastic, s = "lambda.min")
## 16 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) -6337.443614
## M 12.465075
## So              .       
## Ed            152.864071
## Po1 45.910346
## Po2 26.195996
## LF -62.382724
## M.F 39.254080
## Pop             .       
## NW             15.773117
## U1 -176.699076
## U2 137.505163
## Wealth .
## Ineq 43.290013
## Prob -6669.213248
## Time -3.926477
final <- cbind(y_test, pred)
# Checking the first six obs
head(final)
## y_test s1
## 3 578 505.5841
## 5 1234 982.3946
## 6 682 776.0849
## 7 963 1013.2904
## 9 856 671.9770
## 11 1674 1136.7451
#calculating the R2 value
actual <- y_test
rss <- sum((pred - actual) ^ 2)
tss <- sum((actual - mean(actual)) ^ 2)
rsq <- 1 - rss/tss
rsq
## [1] 0.2159214

In summary, the stepwise models explained the most variance on this data set, and both stepwise regression and Lasso produced far more parsimonious models than the Elastic net. This might be because of the highly correlated variables in the data set: since the Elastic net is a combination of lasso and ridge, the ridge component tends to keep groups of correlated predictors, which is why it retained most of the variables (12 of 15) to reach its lowest mean squared error.
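To sanity-check the multicollinearity explanation, one could list the strongly correlated predictor pairs; a minimal sketch, assuming USCrime.txt is in the working directory (the 0.8 cutoff is an arbitrary illustration threshold).

# Sketch: list strongly correlated predictor pairs to support the multicollinearity argument
df <- read.table("USCrime.txt", header = TRUE)
cors <- cor(df[, setdiff(names(df), "Crime")])
high <- which(abs(cors) > 0.8 & upper.tri(cors), arr.ind = TRUE)
data.frame(var1 = rownames(cors)[high[, 1]],
           var2 = colnames(cors)[high[, 2]],
           corr = round(cors[high], 3))

Po1 and Po2 in particular are nearly collinear, which lines up with the coefficient listings above: Lasso kept only Po1, while the Elastic net retained both Po1 and Po2.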