# Hw2 - Solutions

Dr. Isuru Dassanayake

Instructions: Include all code and R output that is necessary for each question.

Submission: You MUST submit a PDF file along with the R Markdown file (optional).

## Question: Is there a relationship between student math test scores and socioeconomic variables?

The dataset `CASchools` in the `AER` package contains data on test performance, school demographics, and student demographic background for school districts in California.

```r
library(AER)
library(tidyverse)
data("CASchools")
```

a). Fit a linear regression model to predict math test scores on all the quantitative variables in the dataset. Also, do not use the reading test score in your regression. Discuss your results, making sure to cover the following points: (40 points)

```r
# First let's clean the data set by removing unnecessary data columns
head(CASchools)
##   district                          school  county grades students teachers
## 1    75119              Sunol Glen Unified Alameda  KK-08      195    10.90
## 2    61499            Manzanita Elementary   Butte  KK-08      240    11.15
## 3    61549     Thermalito Union Elementary   Butte  KK-08     1550    82.90
## 4    61457 Golden Feather Union Elementary   Butte  KK-08      243    14.00
## 5    61523        Palermo Union Elementary   Butte  KK-08     1335    71.50
## 6    62042         Burrel Union Elementary  Fresno  KK-08      137     6.40
##   calworks   lunch computer expenditure    income   english  read  math
## 1   0.5102  2.0408       67    6384.911 22.690001  0.000000 691.6 690.0
## 2  15.4167 47.9167      101    5099.381  9.824000  4.583333 660.5 661.9
## 3  55.0323 76.3226      169    5501.955  8.978000 30.000002 636.3 650.9
## 4  36.4754 77.0492       85    7101.831  8.978000  0.000000 651.9 643.5
## 5  33.1086 78.4270      171    5235.988  9.080333 13.857677 641.8 639.9
## 6  12.3188 86.9565       25    5580.147 10.415000 12.408759 605.7 605.4

# Most of the removed columns are strings; also drop the read score
data1 = CASchools %>% select(-district, -county, -grades, -school, -read)
head(data1)  # checking the data subset
##   students teachers calworks   lunch computer expenditure    income   english
## 1      195    10.90   0.5102  2.0408       67    6384.911 22.690001  0.000000
## 2      240    11.15  15.4167 47.9167      101    5099.381  9.824000  4.583333
## 3     1550    82.90  55.0323 76.3226      169    5501.955  8.978000 30.000002
## 4      243    14.00  36.4754 77.0492       85    7101.831  8.978000  0.000000
## 5     1335    71.50  33.1086 78.4270      171    5235.988  9.080333 13.857677
## 6      137     6.40  12.3188 86.9565       25    5580.147 10.415000 12.408759
##    math
## 1  690.0
## 2  661.9
## 3  650.9
## 4  643.5
## 5  639.9
## 6  605.4
```
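As a side note (a sketch, not part of the original solution): since the goal is simply "all numeric columns except `read`", the same subset can be built without naming each column. This assumes dplyr >= 1.0, which provides the `where()` selection helper.

```r
# Sketch: an equivalent, more automatic subset (assumes dplyr >= 1.0)
data1_alt = CASchools %>% select(where(is.numeric), -read)
identical(names(data1_alt), names(data1))  # should be TRUE if column order matches
```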
```r
# Fitting the first model
fit1 = lm(math ~ ., data = data1)
summary(fit1)
##
## Call:
## lm(formula = math ~ ., data = data1)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -31.2893  -6.9982   0.2331   5.9637  30.3968
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  6.561e+02  4.487e+00 146.227  < 2e-16 ***
## students    -6.938e-04  1.735e-03  -0.400  0.68941
## teachers     5.077e-03  3.817e-02   0.133  0.89426
## calworks    -1.330e-01  6.834e-02  -1.947  0.05226 .
## lunch       -3.306e-01  4.273e-02  -7.735 8.05e-14 ***
## computer     4.541e-03  3.266e-03   1.390  0.16519
## expenditure  9.776e-04  8.645e-04   1.131  0.25877
## income       6.953e-01  1.057e-01   6.576 1.47e-10 ***
## english     -1.471e-01  4.097e-02  -3.591  0.00037 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.93 on 411 degrees of freedom
## Multiple R-squared:  0.725,  Adjusted R-squared:  0.7197
## F-statistic: 135.5 on 8 and 411 DF,  p-value: < 2.2e-16
```

(i). What do you observe about the relationship between these predictors and math test scores?

According to the fitted model, the most significant variables for predicting math test scores are lunch, income, and english. The variable calworks is marginally significant at the 10% level.

(ii). Are there any insignificant variables?

The variables students, teachers, computer, and expenditure are not significant.

(iii). Which predictor is explaining math scores the best?

The variable lunch best explains math test scores, because it has the smallest p-value among the predictors.

(iv). Explain the interpretation of the coefficient on lunch in terms of the response variable.

A one-percentage-point increase in the proportion of students qualifying for reduced-price lunch is associated with a decrease of about 0.33 points in the math test score, holding the other predictors fixed.
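As a numerical check of this interpretation (the district values below are made up purely for illustration), predicting two districts that differ only by one point in `lunch` should reproduce the coefficient:

```r
# Sketch: two hypothetical districts differing only in lunch by one point
nd = data.frame(students = 500, teachers = 25, calworks = 15,
                lunch = c(40, 41), computer = 100,
                expenditure = 5500, income = 12, english = 10)
diff(predict(fit1, newdata = nd))  # approximately -0.33, matching the coefficient
```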
b). Do you think you can fit a better model than the model in part (a)? Fit a new model that you think is a better fit for the data. What do you observe about this new model? (After this point answers can vary from student to student. If they have valid reasoning behind their work, give points. 10 points)

There are several different ways of answering this question:

i). Remove the insignificant variables and fit a better model.
ii). Include only the most significant variables.
iii). Use best subset selection to arrive at a better model.
iv). Use forward/backward selection methods to fit a better model.

Any of these methods will work, and some of them may yield the same results. Here I am using best subset selection (a sketch of the forward-selection alternative follows the output below).

```r
library(leaps)
# Fit all possible models for each number of variables
regfit.full = regsubsets(math ~ ., data1)
(reg.summary = summary(regfit.full))  # summary of the outputs
## Subset selection object
## Call: regsubsets.formula(math ~ ., data1)
## 8 Variables (and intercept)
##             Forced in Forced out
## students        FALSE      FALSE
## teachers        FALSE      FALSE
## calworks        FALSE      FALSE
## lunch           FALSE      FALSE
## computer        FALSE      FALSE
## expenditure     FALSE      FALSE
## income          FALSE      FALSE
## english         FALSE      FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: exhaustive
##          students teachers calworks lunch computer expenditure income english
## 1  ( 1 ) " "      " "      " "      "*"   " "      " "         " "    " "
## 2  ( 1 ) " "      " "      " "      "*"   " "      " "         "*"    " "
## 3  ( 1 ) " "      " "      " "      "*"   " "      " "         "*"    "*"
## 4  ( 1 ) " "      " "      "*"      "*"   " "      " "         "*"    "*"
## 5  ( 1 ) " "      " "      "*"      "*"   " "      "*"         "*"    "*"
## 6  ( 1 ) "*"      " "      "*"      "*"   "*"      " "         "*"    "*"
## 7  ( 1 ) "*"      " "      "*"      "*"   "*"      "*"         "*"    "*"
## 8  ( 1 ) "*"      "*"      "*"      "*"   "*"      "*"         "*"    "*"

which.min(reg.summary$rss)    # location of the minimum RSS
## [1] 8
which.max(reg.summary$adjr2)  # location of the maximum adjusted R-squared
## [1] 7
which.min(reg.summary$cp)     # location of the minimum Cp
## [1] 4
which.min(reg.summary$bic)    # location of the minimum BIC value
## [1] 3

# After careful consideration of all 4 criteria above, the lowest BIC value
# points to the 3-variable model as the better model.
```
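As a sketch of alternative iv) above, forward stepwise selection can be run with `regsubsets()`'s documented `method = "forward"` argument; for these data it is likely to agree with the exhaustive search:

```r
# Sketch: forward stepwise selection, judged by the same BIC criterion
regfit.fwd = regsubsets(math ~ ., data = data1, method = "forward")
which.min(summary(regfit.fwd)$bic)  # compare with the exhaustive-search choice
```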
```r
# (Error rates and other criteria are almost the same across these models,
#  but this model has only 3 variables, so it is less complex.)
# The selected variables are as follows:
coef(regfit.full, 3)
## (Intercept)       lunch      income     english
## 660.6932413  -0.3761052   0.7494847  -0.1278717

# Re-creating the model with only 3 variables
fit2 = lm(math ~ lunch + income + english, data = data1)
summary(fit2)
##
## Call:
## lm(formula = math ~ lunch + income + english, data = data1)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -31.878  -6.847   0.069   5.909  31.071
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 660.69324    2.46316 268.229  < 2e-16 ***
## lunch        -0.37611    0.03192 -11.782  < 2e-16 ***
## income        0.74948    0.09536   7.859 3.34e-14 ***
## english      -0.12787    0.03628  -3.525 0.000471 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.95 on 416 degrees of freedom
## Multiple R-squared:  0.7205, Adjusted R-squared:  0.7185
## F-statistic: 357.5 on 3 and 416 DF,  p-value: < 2.2e-16
```

c). Write the equation for the least squares fit from part b). (5 points)

$$\widehat{\text{math}} = 660.6932 - 0.3761 \cdot \text{lunch} + 0.7494 \cdot \text{income} - 0.1279 \cdot \text{english}$$

d). Compare the R-squared and RSE for the models from parts (a) and (b). (A comparison between the models is fine. 10 points)

fit1: RSE = 9.93 and adjusted R-squared = 0.7197
fit2: RSE = 9.95 and adjusted R-squared = 0.7185

Additional: comparing the F-statistics of the two models, the F-statistic of fit2 is much larger than that of fit1, and a larger F-statistic indicates stronger overall model significance. Therefore fit2 is arguably the better model, even though its RSE is slightly higher and its adjusted R-squared slightly lower than fit1's. This can also be shown with the following test:

```r
anova(fit1, fit2)
## Analysis of Variance Table
##
## Model 1: math ~ students + teachers + calworks + lunch + computer + expenditure +
##     income + english
## Model 2: math ~ lunch + income + english
##   Res.Df   RSS Df Sum of Sq      F Pr(>F)
## 1    411 40523
## 2    416 41184 -5   -660.24 1.3393 0.2466

# The high p-value indicates that the model with more variables does not fit
# significantly better than the reduced model.
```
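As an additional check (a sketch, not required by the rubric), base R's `AIC()` and `BIC()` should point the same way; BIC in particular penalizes the five extra parameters of fit1 heavily, so lower values are expected for fit2:

```r
# Sketch: compare the two fits with information criteria (lower is better)
AIC(fit1, fit2)
BIC(fit1, fit2)
```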
e). Using residual plots for the models from parts a) and b), state the appropriateness of using a linear model for this problem. (Students should create the residual plots, at least the first one. 10 points)

```r
# Residual plots for fit1
par(mfrow = c(2, 2))
plot(fit1, main = "plots for fit1")
```

[Figure: 2x2 diagnostic plots for fit1: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage (with Cook's distance contours); observations 6, 180, and 419 are flagged in the first three panels.]
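Since the diagnostics flag the same few districts repeatedly, a natural follow-up (a sketch, not in the original solution) is to look at those rows directly:

```r
# Sketch: inspect the observations flagged in the fit1 diagnostic plots
data1[c(6, 180, 419), ]
```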
```r
# Residual plots for fit2
plot(fit2, main = "plots for fit2")
```

[Figure: 2x2 diagnostic plots for fit2: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage; observations 180, 273, and 419 are flagged in the first three panels.]

According to the residual plots for both fitted models, fit1 and fit2, a linear model appears appropriate for this data set. A few data points deviate from linearity.

f). Incorporate an interaction term into your multiple linear regression and respond to the following: (Students can add any interaction term here, but when they include an interaction term they MUST also include the main effects, i.e. the base variables; reduce points if they haven't done that. 15 points)

For this question students can incorporate any interaction term they like, but they must provide a valid justification for including that particular term. When including an interaction term they should also include the main effects, regardless of how insignificant those are (the hierarchical rule).

```r
# As an example, I am going to fit two separate models with different
# interaction terms.

# expenditure and income might be related and could have an interaction
# effect on math scores
fit3 = lm(math ~ lunch + income + english + expenditure + expenditure*income,
          data = data1)
summary(fit3)
##
## Call:
## lm(formula = math ~ lunch + income + english + expenditure +
##     expenditure * income, data = data1)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -32.622  -6.380   0.059   5.976  30.570
##
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)
## (Intercept)         6.755e+02  9.512e+00  71.016  < 2e-16 ***
## lunch              -3.952e-01  3.338e-02 -11.840  < 2e-16 ***
## income             -3.730e-01  5.197e-01  -0.718  0.47329
## english            -1.209e-01  3.657e-02  -3.306  0.00103 **
## expenditure        -2.215e-03  1.636e-03  -1.353  0.17665
## income:expenditure  1.797e-04  8.444e-05   2.128  0.03389 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.909 on 414 degrees of freedom
## Multiple R-squared:  0.7241, Adjusted R-squared:  0.7208
## F-statistic: 217.4 on 5 and 414 DF,  p-value: < 2.2e-16
```
```r
# lunch and income might be related and could have an interaction effect on
# math scores, because reduced-price lunch targets low-income families
fit4 = lm(math ~ lunch + income + english + lunch*income, data = data1)
summary(fit4)
##
## Call:
## lm(formula = math ~ lunch + income + english + lunch * income,
##     data = data1)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -33.109  -6.629   0.179   6.199  30.546
##
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)
## (Intercept)  660.516346   2.450995 269.489  < 2e-16 ***
## lunch         -0.298659   0.045731  -6.531 1.92e-10 ***
## income         0.811946   0.098493   8.244 2.22e-15 ***
## english       -0.130111   0.036096  -3.605 0.000351 ***
## lunch:income  -0.007636   0.003245  -2.353 0.019084 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.896 on 415 degrees of freedom
## Multiple R-squared:  0.7242, Adjusted R-squared:  0.7216
## F-statistic: 272.5 on 4 and 415 DF,  p-value: < 2.2e-16
```

(i). Provide analytical justification for your interaction term.

(ii). Compare your results with the models from parts a) and b).

Compare the models here: fit3 and fit4 both have higher adjusted R-squared values than fit1 and fit2.
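A formal way to make comparison (ii), sketched here as an extension: fit2 is nested in both interaction models, so nested F-tests apply directly:

```r
# Sketch: nested-model F-tests of the interaction fits against fit2
anova(fit2, fit3)  # does expenditure + income:expenditure improve on fit2?
anova(fit2, fit4)  # does lunch:income improve on fit2?
```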
g). Fit a 'Ridge Regression' model for the dataset and interpret the results. (Fitting a ridge regression is enough. 10 points)

```r
library(glmnet)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
##     expand, pack, unpack
## Loaded glmnet 4.1-2

x = model.matrix(math ~ ., data1)[, -1]  # model matrix of predictor variables
y = data1$math                           # saving the response variable
grid = 10^seq(10, -2, length = 100)      # grid of lambda values to choose from

# Creating training and testing sets: 80% for training and 20% for testing
set.seed(1)
train = sample(1:nrow(x), nrow(x) * 0.8)
test = (-train)
y.test = y[test]

# Fitting the ridge regression model
ridge.mod = glmnet(x[train, ], y[train], alpha = 0, lambda = grid, thresh = 1e-12)

# Deciding on an optimal lambda value by cross-validation
set.seed(1)
cv.out = cv.glmnet(x[train, ], y[train], alpha = 0)
plot(cv.out)  # plot of MSE values against log(lambda)
```

[Figure: ridge cross-validation curve, mean-squared error against log(lambda); all 8 predictors remain in the model across the lambda path.]

```r
bestlam = cv.out$lambda.min
bestlam  # the best lambda value out of the grid
## [1] 1.50185
```
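A useful companion view (a sketch; `xvar = "lambda"` is a documented option of `plot.glmnet()`) shows how each ridge coefficient shrinks toward zero as lambda grows:

```r
# Sketch: ridge coefficient paths, with the cross-validated lambda marked
plot(ridge.mod, xvar = "lambda", label = TRUE)
abline(v = log(bestlam), lty = 2)
```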
```r
# Testing the fitted model with the best lambda value
ridge.pred = predict(ridge.mod, s = bestlam, newx = x[test, ])
mean((ridge.pred - y.test)^2)  # test MSE for the model
## [1] 124.7387

# Finally, refit the ridge regression model on the full data set with the best lambda
out = glmnet(x, y, alpha = 0)
predict(out, type = "coefficients", s = bestlam)[1:9, ]
##   (Intercept)      students      teachers      calworks         lunch
## 654.085476458  -0.000161436  -0.001877834  -0.198167386  -0.273949447
##      computer   expenditure        income       english
##   0.002911589   0.001086239   0.712658457  -0.175896886
```

According to the ridge regression model, all of the variables remain in the model, but the coefficients for students, teachers, computer, and expenditure are shrunk very close to zero.
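A side-by-side view (a sketch, not part of the original solution): since `fit1` and the ridge refit use the same predictors in the same order, their coefficients can be compared directly to see the shrinkage:

```r
# Sketch: OLS (fit1) versus ridge coefficients at the cross-validated lambda
cbind(OLS   = coef(fit1),
      ridge = as.vector(predict(out, type = "coefficients", s = bestlam)))
```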
h). Fit a 'Lasso Regression' model for the dataset and interpret the results. (Fitting a lasso regression is enough. 10 points)

```r
# Fitting a lasso model
lasso.mod = glmnet(x[train, ], y[train], alpha = 1, lambda = grid)

# Perform cross-validation to select the best lambda
set.seed(1)
cv.out = cv.glmnet(x[train, ], y[train], alpha = 1)
plot(cv.out)  # plot of MSE values against log(lambda)
```

[Figure: lasso cross-validation curve, mean-squared error against log(lambda); the number of nonzero coefficients drops from 8 to 1 as lambda grows.]

```r
bestlam = cv.out$lambda.min
bestlam  # best lambda value
## [1] 0.4804818

lasso.pred = predict(lasso.mod, s = bestlam, newx = x[test, ])
mean((lasso.pred - y.test)^2)  # test MSE, slightly larger than ridge
## [1] 127.2308

# Finally, refit the lasso regression model on the full data set with the best lambda
out = glmnet(x, y, alpha = 1, lambda = grid)
lasso.coef = predict(out, type = "coefficients", s = bestlam)[1:9, ]
lasso.coef
##   (Intercept)      students      teachers      calworks         lunch
##  6.597432e+02  0.000000e+00  0.000000e+00 -7.433561e-02 -3.497275e-01
##      computer   expenditure        income       english
##  0.000000e+00  2.885583e-04  6.894050e-01 -1.188187e-01
```

We can use the lasso model as a variable-selection tool. According to the output, we can remove the variables students, teachers, and computer. Although its test error is slightly higher than the ridge regression model's, the lasso model uses fewer variables.
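As a final sketch (an extension beyond the assignment): refitting OLS on the variables the lasso kept, often called a "post-lasso" fit, recovers unshrunken coefficients for the reduced model:

```r
# Sketch: post-lasso OLS on the lasso-selected variables
# (calworks, lunch, expenditure, income, english had nonzero coefficients)
fit.post = lm(math ~ calworks + lunch + expenditure + income + english,
              data = data1)
summary(fit.post)
```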