Hw2 - solutions
Dr. Isuru Dassanayake
Instructions: Include all codes and R outputs that are necessary for each question. Submission: You MUST
submit a PDF file along with the R Markdown file (optional).
## Question:
Is there a relationship between student math test scores and socioeconomic variables? The dataset CASchools
in the AER package contains data on test performance, school demographics, and student demographic
background for school districts in California.
library(AER)
library(tidyverse)
data("CASchools")
a). Fit a linear regression model to predict math test scores from all the quantitative variables in the dataset.
Also, don't use the read test score in your regression. Discuss your results, making sure to cover the following
points:
(40 points)
# First, let's clean the data set by removing unnecessary data columns
head(CASchools)
##     district                          school  county grades students teachers
## 1      75119              Sunol Glen Unified Alameda  KK-08      195    10.90
## 2      61499            Manzanita Elementary   Butte  KK-08      240    11.15
## 3      61549     Thermalito Union Elementary   Butte  KK-08     1550    82.90
## 4      61457 Golden Feather Union Elementary   Butte  KK-08      243    14.00
## 5      61523        Palermo Union Elementary   Butte  KK-08     1335    71.50
## 6      62042         Burrel Union Elementary  Fresno  KK-08      137     6.40
##   calworks   lunch computer expenditure    income   english  read  math
## 1   0.5102  2.0408       67    6384.911 22.690001  0.000000 691.6 690.0
## 2  15.4167 47.9167      101    5099.381  9.824000  4.583333 660.5 661.9
## 3  55.0323 76.3226      169    5501.955  8.978000 30.000002 636.3 650.9
## 4  36.4754 77.0492       85    7101.831  8.978000  0.000000 651.9 643.5
## 5  33.1086 78.4270      171    5235.988  9.080333 13.857677 641.8 639.9
## 6  12.3188 86.9565       25    5580.147 10.415000 12.408759 605.7 605.4
## most of the removed columns are strings
data1 = CASchools %>% select(-district, -county, -grades, -school, -read)
head(data1)  # checking the data subset
##   students teachers calworks   lunch computer expenditure    income   english
## 1      195    10.90   0.5102  2.0408       67    6384.911 22.690001  0.000000
## 2      240    11.15  15.4167 47.9167      101    5099.381  9.824000  4.583333
## 3     1550    82.90  55.0323 76.3226      169    5501.955  8.978000 30.000002
## 4      243    14.00  36.4754 77.0492       85    7101.831  8.978000  0.000000
## 5     1335    71.50  33.1086 78.4270      171    5235.988  9.080333 13.857677
## 6      137     6.40  12.3188 86.9565       25    5580.147 10.415000 12.408759
##    math
## 1 690.0
## 2 661.9
## 3 650.9
## 4 643.5
## 5 639.9
## 6 605.4
# fitting the first model
fit1 = lm(math ~ ., data = data1)
summary(fit1)
##
## Call:
## lm(formula = math ~ ., data = data1)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -31.2893  -6.9982   0.2331   5.9637  30.3968
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  6.561e+02  4.487e+00 146.227  < 2e-16 ***
## students    -6.938e-04  1.735e-03  -0.400  0.68941
## teachers     5.077e-03  3.817e-02   0.133  0.89426
## calworks    -1.330e-01  6.834e-02  -1.947  0.05226 .
## lunch       -3.306e-01  4.273e-02  -7.735 8.05e-14 ***
## computer     4.541e-03  3.266e-03   1.390  0.16519
## expenditure  9.776e-04  8.645e-04   1.131  0.25877
## income       6.953e-01  1.057e-01   6.576 1.47e-10 ***
## english     -1.471e-01  4.097e-02  -3.591  0.00037 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.93 on 411 degrees of freedom
## Multiple R-squared:  0.725,  Adjusted R-squared:  0.7197
## F-statistic: 135.5 on 8 and 411 DF,  p-value: < 2.2e-16
(i). What do you observe about the relationship between these predictors and math test scores?
According to the fitted model, the most significant variables for predicting math test scores are lunch,
income, and english. The variable calworks is marginally significant at the 10% level.
(ii). Are there any insignificant variables?
The variables students, teachers, computer and expenditure are not significant.
(iii). Which predictor explains math scores the best?
The variable lunch is the best at explaining math test scores, because it has the lowest p-value among all
the predictors.
(iv). Explain the interpretation of the coefficient on lunch in terms of the response variable.
A one-percentage-point increase in the proportion of students qualifying for reduced-price lunch is associated
with a reduction in the math test score of about 0.33 points, holding the other predictors fixed.
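As a quick sanity check on this interpretation, we can compare fit1's predictions for two hypothetical districts that differ only by one percentage point in lunch; a minimal sketch (the predictor values below are made up purely for illustration):
# two identical made-up districts, differing only by 1 point in lunch
d0 = data.frame(students = 500, teachers = 25, calworks = 15, lunch = 40,
                computer = 100, expenditure = 5500, income = 15, english = 10)
d1 = transform(d0, lunch = lunch + 1)
predict(fit1, d1) - predict(fit1, d0)  # equals the lunch coefficient, about -0.33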
b). Do you think you can fit a better model than the model in part (a)? Fit a new model that you think is a
better fit for the data. What do you observe about this new model?
(After this point answers can vary from student to student. If they have valid reasoning behind their work,
give points. 10 points)
There are several different ways of answering this question:
i). remove the insignificant variables and fit a better model;
ii). include only the most significant variables;
iii). use best subset selection to come up with a better model;
iv). use forward/backward selection methods to fit a better model (a minimal stepwise sketch is given below).
Any method will work here, and some of these methods might yield the same results. Here I am using best subset
selection.
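For completeness, option iv). can be run with base R's step() function; a minimal sketch, using the BIC penalty k = log(n) so it is comparable to the BIC-based choice below (fit.step is an illustrative name):
# backward elimination from the full model, penalized by BIC
n = nrow(data1)
fit.step = step(fit1, direction = "backward", k = log(n), trace = 0)
formula(fit.step)  # compare the retained variables with the best-subset result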
library(leaps)
# fitting all possible models for different numbers of variables
regfit.full = regsubsets(math ~ ., data1)
(reg.summary = summary(regfit.full))  # summary of the outputs
## Subset selection object
## Call: regsubsets.formula(math ~ ., data1)
## 8 Variables  (and intercept)
##             Forced in Forced out
## students        FALSE      FALSE
## teachers        FALSE      FALSE
## calworks        FALSE      FALSE
## lunch           FALSE      FALSE
## computer        FALSE      FALSE
## expenditure     FALSE      FALSE
## income          FALSE      FALSE
## english         FALSE      FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: exhaustive
##          students teachers calworks lunch computer expenditure income english
## 1  ( 1 ) " "      " "      " "      "*"   " "      " "         " "    " "
## 2  ( 1 ) " "      " "      " "      "*"   " "      " "         "*"    " "
## 3  ( 1 ) " "      " "      " "      "*"   " "      " "         "*"    "*"
## 4  ( 1 ) " "      " "      "*"      "*"   " "      " "         "*"    "*"
## 5  ( 1 ) " "      " "      "*"      "*"   " "      "*"         "*"    "*"
## 6  ( 1 ) "*"      " "      "*"      "*"   "*"      " "         "*"    "*"
## 7  ( 1 ) "*"      " "      "*"      "*"   "*"      "*"         "*"    "*"
## 8  ( 1 ) "*"      "*"      "*"      "*"   "*"      "*"         "*"    "*"
which.min(reg.summary$rss)    ## identify the location of the minimum RSS
## [1] 8
which.max(reg.summary$adjr2)  ## identify the location of the maximum Adj. R-sq
## [1] 7
which.min(reg.summary$cp)     ## identify the location of the minimum Cp
## [1] 4
which.min(reg.summary$bic)    ## identify the location of the minimum BIC value
## [1] 3
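Before committing to a model size, these criteria can also be compared visually; a minimal sketch:
# plot BIC and adjusted R-squared against the number of variables
par(mfrow = c(1, 2))
plot(reg.summary$bic, type = "b", xlab = "Number of variables", ylab = "BIC")
plot(reg.summary$adjr2, type = "b", xlab = "Number of variables", ylab = "Adj. R-sq")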
# After careful consideration of all 4 criteria above, according to the
# lowest BIC value, the model with 3 variables is the better model.
# (error rates and other criteria are almost similar across the different models,
# but this model has only 3 variables, so it is less complex)
# the variables are as follows:
coef(regfit.full, 3)
## (Intercept)       lunch      income     english
## 660.6932413  -0.3761052   0.7494847  -0.1278717
# re-creating the model with only 3 variables
fit2 = lm(math ~ lunch + income + english, data = data1)
summary(fit2)
##
## Call:
## lm(formula = math ~ lunch + income + english, data = data1)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -31.878  -6.847   0.069   5.909  31.071
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 660.69324    2.46316 268.229  < 2e-16 ***
## lunch        -0.37611    0.03192 -11.782  < 2e-16 ***
## income        0.74948    0.09536   7.859 3.34e-14 ***
## english      -0.12787    0.03628  -3.525 0.000471 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.95 on 416 degrees of freedom
## Multiple R-squared:  0.7205, Adjusted R-squared:  0.7185
## F-statistic: 357.5 on 3 and 416 DF,  p-value: < 2.2e-16
c). Write the equation for the least square fit from part b).
(5 points)
$$\widehat{\text{math}} = 660.6932 - 0.3761 \cdot \text{lunch} + 0.7494 \cdot \text{income} - 0.1278 \cdot \text{english}$$
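This fitted equation can be verified against R's predict(); a minimal sketch (the district values in new.d are made up for illustration):
new.d = data.frame(lunch = 50, income = 12, english = 20)  # hypothetical district
660.6932 - 0.3761*50 + 0.7494*12 - 0.1278*20  # manual computation from the equation
predict(fit2, new.d)  # agrees up to rounding of the coefficients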
d). Compare the R-squared and RSE for the models from part (a) and (b).
(comparison between the models would be fine 10 points)
fit1: RSE = 9.93 and adj R-sq = 0.7197
fit2: RSE = 9.95 and adj R-sq = 0.7185
Additional: when comparing the F-statistics of the two models, the fit2 F-statistic is much larger than the
fit1 F-statistic. A larger F-statistic indicates higher overall model significance. Therefore fit2 is a better
model than fit1, even though fit2 has a slightly higher RSE and a slightly lower adjusted R-squared than fit1.
This can be shown with the following test.
anova(fit1, fit2)
## Analysis of Variance Table
##
## Model 1: math ~ students + teachers + calworks + lunch + computer + expenditure +
##     income + english
## Model 2: math ~ lunch + income + english
##   Res.Df   RSS Df Sum of Sq      F Pr(>F)
## 1    411 40523
## 2    416 41184 -5   -660.24 1.3393 0.2466
# The high p-value indicates that the larger model is not significantly better than the smaller model.
e). Using residual plots for the part a) and part b) models, state the appropriateness of using a linear model
for this problem.
(should create the residual plots, at least the first one 10 points)
# Residual plots for fit1
par(mfrow = c(2, 2))
plot(fit1, main = "plots for fit1")
[Figure: 2x2 diagnostic plots for fit1: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage (with Cook's distance contours). Observations 6, 180, and 419 are flagged; 49 and 135 also stand out on the leverage panel.]
# Residual plots for fit2
plot(fit2, main = "plots for fit2")
[Figure: 2x2 diagnostic plots for fit2: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage (with Cook's distance contours). Observations 180, 273, and 419 are flagged; 6, 387, and 417 stand out on the leverage panel.]
According to the residual plots from both fitted models, fit1 and fit2, a linear model appears appropriate
for this data set. There are a few data points that deviate from linearity.
f). Incorporate an interaction term into your multiple linear regression and respond to the following:
(any interaction term can be added here, but when students include an interaction term they MUST
include the main effects (the base variables) as well; reduce points if they haven't done that.
15 points)
For this question students can incorporate any interaction term that they like. But they must provide a valid
justification for including that particular interaction term. When including an interaction term they should
include the main effects as well, regardless of how insignificant they are (hierarchical rule).
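Note that in R's formula syntax, lunch*income already expands to the main effects plus the interaction, so the hierarchical rule is satisfied automatically; a minimal sketch of the two equivalent specifications (m1 and m2 are illustrative names):
# a*b expands to a + b + a:b, so these two fits are identical
m1 = lm(math ~ lunch + income + lunch:income, data = data1)
m2 = lm(math ~ lunch * income, data = data1)
all.equal(coef(m1), coef(m2))  # TRUE: same main effects and interaction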
# Here, as an example, I am going to include two separate models with
# different interaction terms.

# expenditure and income might be related and can have
# an interaction effect on math scores
fit3 = lm(math ~ lunch + income + english + expenditure + expenditure*income, data = data1)
summary(fit3)
##
## Call:
## lm(formula = math ~ lunch + income + english + expenditure +
##     expenditure * income, data = data1)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -32.622  -6.380   0.059   5.976  30.570
##
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)
## (Intercept)         6.755e+02  9.512e+00  71.016  < 2e-16 ***
## lunch              -3.952e-01  3.338e-02 -11.840  < 2e-16 ***
## income             -3.730e-01  5.197e-01  -0.718  0.47329
## english            -1.209e-01  3.657e-02  -3.306  0.00103 **
## expenditure        -2.215e-03  1.636e-03  -1.353  0.17665
## income:expenditure  1.797e-04  8.444e-05   2.128  0.03389 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.909 on 414 degrees of freedom
## Multiple R-squared:  0.7241, Adjusted R-squared:  0.7208
## F-statistic: 217.4 on 5 and 414 DF,  p-value: < 2.2e-16
# lunch and income might be related and can have an interaction effect
# on math scores, because lunch eligibility is based on low income
fit4 = lm(math ~ lunch + income + english + lunch*income, data = data1)
summary(fit4)
##
## Call:
## lm(formula = math ~ lunch + income + english + lunch * income,
##     data = data1)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -33.109  -6.629   0.179   6.199  30.546
##
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)
## (Intercept)  660.516346   2.450995 269.489  < 2e-16 ***
## lunch         -0.298659   0.045731  -6.531 1.92e-10 ***
## income         0.811946   0.098493   8.244 2.22e-15 ***
## english       -0.130111   0.036096  -3.605 0.000351 ***
## lunch:income  -0.007636   0.003245  -2.353 0.019084 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.896 on 415 degrees of freedom
## Multiple R-squared:  0.7242, Adjusted R-squared:  0.7216
## F-statistic: 272.5 on 4 and 415 DF,  p-value: < 2.2e-16
(i). Provide analytical justification for your interaction term.
(ii). Compare your results with the models from parts a) and b).
Comparing the models here: fit3 and fit4 have increased adjusted R-squared values compared to fit1 and fit2.
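One way to justify an interaction analytically is a partial F-test of the interaction model against its nested part b) model; a minimal sketch:
# fit2 (no interaction) is nested inside fit4 (adds lunch:income)
anova(fit2, fit4)  # a small p-value supports keeping the interaction term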
g). Fit a ‘Ridge Regression’ Model for the dataset and interpret the results.
(Fitting a ridge regression
would be enough 10 Points)
library(glmnet)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
##     expand, pack, unpack
## Loaded glmnet 4.1-2
x = model.matrix(math ~ ., data1)[, -1]  # creating model matrix with predictor variables
y = data1$math                           # saving the response variable
grid = 10^seq(10, -2, length = 100)      ## grid of lambda values to choose from

# creating training and testing datasets: 80% for training and 20% for testing
set.seed(1)
train = sample(1:nrow(x), nrow(x)*0.8)
test = (-train)
y.test = y[test]

# fitting the ridge regression model
ridge.mod = glmnet(x[train, ], y[train], alpha = 0, lambda = grid, thresh = 1e-12)
## deciding on an optimal lambda value by cross-validation
set.seed(1)
cv.out = cv.glmnet(x[train, ], y[train], alpha = 0)
plot(cv.out)  ## plot of MSE values against log(lambda)
[Figure: cross-validated mean-squared error against log(lambda) for ridge regression; all 8 predictors remain in the model along the entire path.]
bestlam = cv.out$lambda.min
bestlam  # the best lambda value out of the grid
## [1] 1.50185
# testing the fitted model with the best lambda value
ridge.pred = predict(ridge.mod, s = bestlam, newx = x[test, ])
mean((ridge.pred - y.test)^2)  # test MSE for the model
## [1] 124.7387
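For context, this ridge test MSE can be compared against an ordinary least-squares fit on the same training split; a minimal sketch (ols.fit and ols.pred are illustrative names):
# baseline: least-squares test MSE on the same train/test split
ols.fit = lm(math ~ ., data = data1, subset = train)
ols.pred = predict(ols.fit, newdata = data1[test, ])
mean((ols.pred - y.test)^2)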
# Finally, refit our ridge regression model on the full data set with the best lambda
out = glmnet(x, y, alpha = 0)
predict(out, type = "coefficients", s = bestlam)[1:9, ]
##   (Intercept)      students      teachers      calworks         lunch
## 654.085476458  -0.000161436  -0.001877834  -0.198167386  -0.273949447
##      computer   expenditure        income       english
##   0.002911589   0.001086239   0.712658457  -0.175896886
According to the ridge regression model, all the variables remain in the model, and the coefficients for
teachers, students, computer, and expenditure are close to zero.
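The shrinkage is easy to see by plotting the coefficient paths of the full-data fit against log(lambda); a minimal sketch:
# coefficient paths: every coefficient shrinks toward zero as lambda grows,
# but ridge never sets any of them exactly to zero
plot(out, xvar = "lambda", label = TRUE)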
h). Fit a ‘Lasso Regression’ model for the dataset and interpret the results.
(Fitting a lasso regression
would be enough 10 Points)
# fitting a lasso model
lasso.mod = glmnet(x[train, ], y[train], alpha = 1, lambda = grid)

# perform cross-validation to select the best lambda
set.seed(1)
cv.out = cv.glmnet(x[train, ], y[train], alpha = 1)
plot(cv.out)  ## plot of MSE values against log(lambda)
[Figure: cross-validated mean-squared error against log(lambda) for the lasso; the number of nonzero coefficients shrinks from 8 toward 1 as lambda increases.]
bestlam = cv.out$lambda.min
bestlam  # best lambda value
## [1] 0.4804818
lasso.pred = predict(lasso.mod, s = bestlam, newx = x[test, ])
mean((lasso.pred - y.test)^2)  # test MSE, slightly larger than ridge
## [1] 127.2308
# Finally, refit our lasso regression model on the full data set with the best lambda
out = glmnet(x, y, alpha = 1, lambda = grid)
lasso.coef = predict(out, type = "coefficients", s = bestlam)[1:9, ]
lasso.coef
##   (Intercept)      students      teachers      calworks         lunch
##  6.597432e+02  0.000000e+00  0.000000e+00 -7.433561e-02 -3.497275e-01
##      computer   expenditure        income       english
##  0.000000e+00  2.885583e-04  6.894050e-01 -1.188187e-01
We can use the lasso model as a variable-selection tool. According to the output, we can remove the variables
students, teachers, and computer. Although its test error rate is a tad higher than the ridge regression
model's, the lasso model has fewer variables.
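A common follow-up is to refit an unpenalized model on just the variables the lasso kept nonzero; a minimal sketch (fit.lasso.ols is an illustrative name, and the variable set comes from the lasso output above):
# refit least squares using only the lasso-selected variables
fit.lasso.ols = lm(math ~ calworks + lunch + expenditure + income + english, data = data1)
summary(fit.lasso.ols)$adj.r.squared  # compare against fit1 and fit2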