HW2

#Q1. Do Question 14 of Chapter 3 of the ISLR book (page 125).
#This problem focuses on the collinearity problem.

#(a) Perform the following commands in R:
set.seed(1)
x1 <- runif(100)
x2 <- 0.5 * x1 + rnorm(100) / 10
y <- 2 + 2 * x1 + 0.3 * x2 + rnorm(100)
#The last line corresponds to creating a linear model in which y is a function of x1 and x2. Write out the form of the linear model. What are the regression coefficients?

The model is Y = β0 + β1*X1 + β2*X2 + ε = 2 + 2*X1 + 0.3*X2 + ε.
The regression coefficients are β0 = 2, β1 = 2, and β2 = 0.3.

#(b) What is the correlation between x1 and x2? Create a scatterplot displaying the relationship between the variables.
cor(x1, x2)
plot(x1, x2, xlab = "x1", ylab = "x2")

#(c) Using this data, fit a least squares regression to predict y using x1 and x2. Describe the results obtained. What are β̂0, β̂1, and β̂2? How do these relate to the true β0, β1, and β2? Can you reject the null hypothesis H0: β1 = 0? How about the null hypothesis H0: β2 = 0?
model <- lm(y ~ x1 + x2)
summary(model)
coefficients <- coef(model)
cat("Intercept (β̂0):", coefficients[1], "\n")
cat("Coefficient for x1 (β̂1):", coefficients[2], "\n")
cat("Coefficient for x2 (β̂2):", coefficients[3], "\n")
true_beta0 <- 2
true_beta1 <- 2
true_beta2 <- 0.3
cat("True β0:", true_beta0, "\n")
cat("True β1:", true_beta1, "\n")
cat("True β2:", true_beta2, "\n")
p_value_x1 <- summary(model)$coefficients["x1", "Pr(>|t|)"]
if (p_value_x1 < 0.05) {
  cat("Reject null hypothesis H0: β1 = 0 (x1 is a significant predictor)\n")
} else {
  cat("Cannot reject null hypothesis H0: β1 = 0 (x1 is not a significant predictor)\n")
}
p_value_x2 <- summary(model)$coefficients["x2", "Pr(>|t|)"]
if (p_value_x2 < 0.05) {
  cat("Reject null hypothesis H0: β2 = 0 (x2 is a significant predictor)\n")
} else {
  cat("Cannot reject null hypothesis H0: β2 = 0 (x2 is not a significant predictor)\n")
}

#(d) Now fit a least squares regression to predict y using only x1. Comment on your results. Can you reject the null hypothesis H0: β1 = 0?
model_x1 <- lm(y ~ x1)
summary(model_x1)
p_value_x1 <- summary(model_x1)$coefficients["x1", "Pr(>|t|)"]
if (p_value_x1 < 0.05) {
  cat("Reject null hypothesis H0: β1 = 0 (x1 is a significant predictor of y)\n")
} else {
  cat("Cannot reject null hypothesis H0: β1 = 0 (x1 is not a significant predictor of y)\n")
}

#(e) Now fit a least squares regression to predict y using only x2. Comment on your results. Can you reject the null hypothesis H0: β1 = 0?
model_x2 <- lm(y ~ x2)
summary(model_x2)
p_value_x2 <- summary(model_x2)$coefficients["x2", "Pr(>|t|)"]
if (p_value_x2 < 0.05) {
  cat("Reject null hypothesis H0: β1 = 0 (x2 is a significant predictor of y)\n")
} else {
  cat("Cannot reject null hypothesis H0: β1 = 0 (x2 is not a significant predictor of y)\n")
}

#(f) Do the results obtained in (c)-(e) contradict each other? Explain your answer.
The results are not contradictory; they stem from collinearity. Collinearity arises when two predictors, x1 and x2 in this case, are closely related and carry largely the same information about the response. When both are included in the model it is hard to separate the effect of each predictor, and the coefficient estimates become much less precise (their standard errors inflate). As a result, we may fail to reject H0: β2 = 0 in the joint model even though x2 is a useful predictor on its own, which is exactly what the individual regressions in (d) and (e) show.
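As a quick numerical check of the collinearity described above (a sketch, not part of the original answer; it only reuses the x1 and x2 simulated in part (a)), the variance inflation factor for x2 can be computed by hand as 1 / (1 - R^2) from a regression of x2 on x1:

# Sketch: hand-computed variance inflation factor (VIF) for x2.
# Values well above roughly 5-10 indicate strong collinearity.
r2_x2_on_x1 <- summary(lm(x2 ~ x1))$r.squared
vif_x2 <- 1 / (1 - r2_x2_on_x1)
cat("Approximate VIF for x2:", vif_x2, "\n")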
#(g) Now suppose we obtain one additional observation, which was unfortunately mismeasured.
#x1 = c(x1, 0.1)
#x2 = c(x2, 0.8)
#y = c(y, 6)
#Re-fit the linear models from (c) to (e) using this new data. What effect does this new observation have on each of the models? In each model, is this observation an outlier? A high-leverage point? Both? Explain your answers.
x1_new <- c(x1, 0.1)
x2_new <- c(x2, 0.8)
y_new <- c(y, 6)
model_c_new <- lm(y_new ~ x1_new + x2_new)
summary(model_c_new)
model_d_new <- lm(y_new ~ x1_new)
summary(model_d_new)
model_e_new <- lm(y_new ~ x2_new)
summary(model_e_new)
par(mfrow = c(2, 2))
plot(model_c_new)
plot(model_d_new)
plot(model_e_new)

In the model with both x1 and x2 and in the model with x2 only, the new observation is a high-leverage point. In the model with x1 only it is not high-leverage. (A numeric check of this claim is sketched below, after the Q2(a) setup.)

#Q2. Before attempting this question, set the seed number in R by using set.seed(1) to ensure consistent results.
#a. Simulate a training data set of n = 25 observations as y = exp(x) + E, where x and E are generated via a normal distribution with mean zero and standard deviation one (use rnorm() to simulate these variables). Then do the following:
set.seed(1)
n <- 25
x <- rnorm(n, mean = 0, sd = 1)
E <- rnorm(n, mean = 0, sd = 1)
y <- exp(x) + E
simulated_data <- data.frame(x, y)
head(simulated_data)
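The following is a minimal numeric check of the Q1(g) leverage claim above (a sketch, not part of the original submission; it reuses model_c_new, model_d_new, model_e_new, and y_new from Q1):

# Sketch: hat values and studentized residuals for the added 101st observation.
# A common rule of thumb flags a point as high-leverage when its hat value exceeds
# 2*(p+1)/n, and as an outlier when its studentized residual is large in magnitude.
n_obs <- length(y_new)
for (m in list(model_c_new, model_d_new, model_e_new)) {
  p <- length(coef(m)) - 1
  cat("hat value:", round(hatvalues(m)[n_obs], 3),
      "(threshold ~", round(2 * (p + 1) / n_obs, 3), ")",
      "| studentized residual:", round(rstudent(m)[n_obs], 3), "\n")
}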
#b. Fit the following four linear regression models to the above training data set (using the lm() function in R): (i) y = β0 + β1*x + E, (ii) y = β0 + β1*x + β2*x^2 + E, (iii) y = β0 + β1*x + β2*x^2 + β3*x^3 + E, (iv) y = β0 + β1*x + β2*x^2 + β3*x^3 + β4*x^4 + E.
data <- simulated_data
model1 <- lm(y ~ x, data = data)
model2 <- lm(y ~ x + I(x^2), data = data)
model3 <- lm(y ~ x + I(x^2) + I(x^3), data = data)
model4 <- lm(y ~ x + I(x^2) + I(x^3) + I(x^4), data = data)
summary(model1)
summary(model2)
summary(model3)
summary(model4)

#c. Now simulate a testing data set with n = 500 observations from the model in part (a), by generating new values of x and E.
n_test <- 500
x_test <- rnorm(n_test, mean = 0, sd = 1)
E_test <- rnorm(n_test, mean = 0, sd = 1)
y_test <- exp(x_test) + E_test
testing_data <- data.frame(x = x_test, y = y_test)
head(testing_data)

#d. Use the estimated coefficients in Part (b) to compute the test error, i.e. the MSE = (1/n) * Σ(yi − ŷi)^2, of the testing data set for each of the four models computed in part (b).
# Predict on the testing data using the models fitted to the training data in part (b).
y_pred_model1_test <- predict(model1, newdata = testing_data)
y_pred_model2_test <- predict(model2, newdata = testing_data)
y_pred_model3_test <- predict(model3, newdata = testing_data)
y_pred_model4_test <- predict(model4, newdata = testing_data)
mse_model1_test <- mean((testing_data$y - y_pred_model1_test)^2)
mse_model2_test <- mean((testing_data$y - y_pred_model2_test)^2)
mse_model3_test <- mean((testing_data$y - y_pred_model3_test)^2)
mse_model4_test <- mean((testing_data$y - y_pred_model4_test)^2)
cat("MSE for Model 1:", mse_model1_test, "\n")
cat("MSE for Model 2:", mse_model2_test, "\n")
cat("MSE for Model 3:", mse_model3_test, "\n")
cat("MSE for Model 4:", mse_model4_test, "\n")

#e. Based on your results of Part (b), which model would you recommend as the 'best fit model'? Is the conclusion surprising?
I would recommend Model 2. It captures the curvature of exp(x) with a linear and a quadratic term while remaining simple; the cubic and quartic terms in Models 3 and 4 add complexity without a clear payoff and risk overfitting the 25 training observations.

#Q3. Consider the Hitters data in the ISLR package. Our objective here is to predict the Salary variable as the response using the remaining variables.

#a. Split the data into a training and testing data set.
library(ISLR)
data("Hitters")
Hitters <- na.omit(Hitters)   # drop players with missing Salary
set.seed(1)
split <- sample(nrow(Hitters), 0.75 * nrow(Hitters))
train <- Hitters[split, ]
test <- Hitters[-split, ]
head(train)
head(test)

#b. Fit a linear model using least squares on the training set and report the test error obtained.
Test error obtained = 115647.6
lm.fit <- lm(Salary ~ ., data = train)
summary(lm.fit)
lm.pred <- predict(lm.fit, newdata = test)
lm.err <- mean((test$Salary - lm.pred)^2)
lm.err

#c. Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.
Test error obtained = 110234.3
library(glmnet)
set.seed(1)
xtrain <- model.matrix(Salary ~ ., data = train)[, -1]   # drop the intercept column
ytrain <- train$Salary
xtest <- model.matrix(Salary ~ ., data = test)[, -1]
ytest <- test$Salary
ridge.mod <- glmnet(xtrain, ytrain, alpha = 0)
set.seed(1)
cv.out <- cv.glmnet(xtrain, ytrain, alpha = 0)
plot(cv.out)
bestlam <- cv.out$lambda.min
bestlam
ridge.pred <- predict(ridge.mod, s = bestlam, newx = xtest)
ridge.err <- mean((ridge.pred - ytest)^2)
ridge.err

#d. Fit a lasso model on the training set, with λ chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates. (A sketch for counting the non-zero coefficients is given after part (e) below.)
Test error obtained = 110165.7
lasso.mod <- glmnet(xtrain, ytrain, alpha = 1)
set.seed(1)
cv.out <- cv.glmnet(xtrain, ytrain, alpha = 1)
plot(cv.out)
bestlam <- cv.out$lambda.min
lasso.pred <- predict(lasso.mod, s = bestlam, newx = xtest)
lasso.err <- mean((lasso.pred - ytest)^2)
lasso.err

#e. Comment on the results obtained. How accurately can we predict the salaries of players? Is there much difference among the test errors resulting from these three approaches?
The three approaches give similar test errors: lasso (110165.7) is slightly more accurate than ridge (110234.3), and both improve on ordinary least squares (115647.6), but the differences are modest. With test MSEs around 110,000, the typical prediction error is roughly sqrt(110,000) ≈ 330; since Salary in the Hitters data is recorded in thousands of dollars, salaries can be predicted only to within a few hundred thousand dollars.
barplot(c(lm.err, ridge.err, lasso.err), col = "blue",
        xlab = "Regression Methods", ylab = "Test Error",
        main = "Test Errors for All Methods", names.arg = c("OLS", "Ridge", "Lasso"))
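As referenced in part (d), here is a minimal sketch (not part of the original submission) for counting the non-zero lasso coefficient estimates at the cross-validated λ; it reuses lasso.mod and bestlam from the code above:

# Extract the lasso coefficients at the chosen lambda and count the non-zero ones
# (the intercept is excluded from the count).
lasso.coef <- predict(lasso.mod, type = "coefficients", s = bestlam)
nonzero <- sum(lasso.coef[-1] != 0)
cat("Number of non-zero coefficient estimates:", nonzero, "\n")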