Stat 432 Homework 2 Solution

Assigned: Jan 24, 2021; Due: 11:59 PM CT, Feb 3, 2021

Instruction

Students are encouraged to work together on homework. However, sharing, copying, or providing any part of a homework solution or code is an infraction of the University's rules on Academic Integrity (https://studentcode.illinois.edu/article1/part4/1-401/). Any violation will be punished as severely as possible. Final submissions must be uploaded to Gradescope (https://www.gradescope.com/). No email or hardcopy will be accepted. For the late submission policy and grading rubrics, please refer to the course website (https://teazrq.github.io/stat432/syllabus.html).

  • You are required to submit the rendered file HWx_yourNetID.pdf. For example, HW01_rqzhu.pdf. Please note that this must be a .pdf file; .html format cannot be accepted.
  • Make all of your R code chunks visible for grading.
  • Include your Name and NetID in the report.
  • If you use this file or the example homework .Rmd file as a template, be sure to remove this instruction section.
  • Your .Rmd file should be written such that, if it is placed in a folder with any data you utilize, it will knit properly without modification.
  • Make sure that you set the seed properly so that the results can be replicated.
  • For some questions, there will be restrictions on what packages/functions you can use. Please read the requirements carefully. As long as the question does not specify such restrictions, you can use anything.

Question 1 (linear regression)

Let's use the real estate data as an example. The data can be obtained from our course website.

a. Construct a new categorical variable called season in the real estate dataset. You should utilize the original variable date to perform this task and read the definition provided in our lecture notes. The season variable should be defined as: spring (Mar - May), summer (Jun - Aug), fall (Sep - Nov), and winter (Dec - Feb). Show a summary table to demonstrate that your variable conversion is correct.

Answer: We process the date values by building a reference table of the distinct month codes and then matching each observation to a season.

```r
realestate = read.csv("realestate.csv", row.names = 1)

# Construct a reference table of the distinct month codes
month = realestate$date - floor(realestate$date)
valuetable = sort(unique(month))

# match date values into the table as seasons
realestate$season = ifelse(month %in% valuetable[4:6], "spring",
                    ifelse(month %in% valuetable[7:9], "summer",
                    ifelse(month %in% valuetable[10:12], "fall", "winter")))

# convert character to factor
realestate$season = as.factor(realestate$season)

# a new table showing the match result
table(realestate$date, realestate$season)
##           
##            fall spring summer winter
##   2012.667    0      0     30      0
##   2012.75    27      0      0      0
##   2012.833   31      0      0      0
##   2012.917   38      0      0      0
##   2013        0      0      0     28
##   2013.083    0      0      0     46
##   2013.167    0      0      0     25
##   2013.25     0     32      0      0
##   2013.333    0     29      0      0
##   2013.417    0     58      0      0
##   2013.5      0      0     47      0
##   2013.583    0      0     23      0
```

The cross-table shows the correct matching between month values and the new season values.
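As an optional sanity check (not part of the required solution), the seasons can also be reconstructed independently from calendar months. The sketch below assumes the encoding date = year + month/12, so a fractional part of 0 corresponds to December; m and season2 are hypothetical names.

```r
# Optional sanity check (assumes date = year + month/12; a fractional
# part of 0 is taken to mean December)
m = round(12 * (realestate$date - floor(realestate$date)))
m[m == 0] = 12
season2 = ifelse(m %in% 3:5, "spring",
          ifelse(m %in% 6:8, "summer",
          ifelse(m %in% 9:11, "fall", "winter")))

# if the two constructions agree, all off-diagonal cells are zero
table(season2, realestate$season)
```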
b. Split your data into two parts: a testing dataset that contains 100 observations, and the rest as training data. For this question, you need to set a random seed while generating this split so that the result can be replicated. Use your UIN as the random seed. Report the mean price of your testing data and training data, respectively.

Answer: We randomly sample 100 observations as the testing data and use the rest as training data.

```r
set.seed(1)
testid = sample(1:nrow(realestate), 100)
train = realestate[-testid, ]
test = realestate[testid, ]

mean(train$price)
## [1] 37.47357
mean(test$price)
## [1] 39.571
```

c. Use the training dataset to perform a linear regression. The goal is to model price with season, age, distance and stores. Make sure to treat season as a categorical variable. Use this model to calculate both the training error and the testing error (i.e., prediction error), using the mean squared error:

$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

Which error is higher? Is this expected?

Answer: First we fit the linear regression.

```r
lm.fit = lm(price ~ age + distance + season + stores, data = train)
summary(lm.fit)
## 
## Call:
## lm(formula = price ~ age + distance + season + stores, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -36.882  -5.236  -1.293   4.577  77.578 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  41.1832647  1.7458358  23.589  < 2e-16 ***
## age          -0.2933397  0.0467334  -6.277 1.18e-09 ***
## distance     -0.0051956  0.0004983 -10.427  < 2e-16 ***
## seasonspring  1.9259603  1.4199450   1.356   0.1760    
## seasonsummer  3.8545468  1.4950422   2.578   0.0104 *  
## seasonwinter  2.4297464  1.4970757   1.623   0.1056    
## stores        1.2928859  0.2205653   5.862 1.18e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.078 on 307 degrees of freedom
## Multiple R-squared:  0.5676, Adjusted R-squared:  0.5591 
## F-statistic: 67.16 on 6 and 307 DF,  p-value: < 2.2e-16
```
```r
# training error
mean(lm.fit$residuals^2)
## [1] 80.56778

# testing error
pred = predict(lm.fit, test)
mean( (test$price - pred)^2 )
## [1] 90.38297
```

The testing error is slightly higher. This is expected, since the training error is optimistically biased: the model is fit to minimize error on the training data, so it can over-fit.

d. In many R packages, factors are not allowed, and you must convert the data into a numerical matrix. For factors with multiple categories, this is usually done by creating dummy variables. First, identify the reference category of season in your previous linear regression, then create three dummy variables for the other three seasons. By doing this, you can create a new design matrix with all numerical values that contains the same information as the original one. Perform linear regression using this new data matrix to predict price. Is your training error the same as in the previous question?

Answer: In the previous lm() model, fall was the reference category (it is the first factor level alphabetically and receives no coefficient). Hence, we create dummies for the other three seasons.

```r
# construct the new data
newdata = data.frame("age" = train$age,
                     "distance" = train$distance,
                     "stores" = train$stores,
                     "spring" = train$season == "spring",
                     "summer" = train$season == "summer",
                     "winter" = train$season == "winter",
                     "price" = train$price)

# fit lm()
summary(lm(price ~ ., data = newdata))
## 
## Call:
## lm(formula = price ~ ., data = newdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -36.882  -5.236  -1.293   4.577  77.578 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 41.1832647  1.7458358  23.589  < 2e-16 ***
## age         -0.2933397  0.0467334  -6.277 1.18e-09 ***
## distance    -0.0051956  0.0004983 -10.427  < 2e-16 ***
## stores       1.2928859  0.2205653   5.862 1.18e-08 ***
## springTRUE   1.9259603  1.4199450   1.356   0.1760    
## summerTRUE   3.8545468  1.4950422   2.578   0.0104 *  
## winterTRUE   2.4297464  1.4970757   1.623   0.1056    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.078 on 307 degrees of freedom
## Multiple R-squared:  0.5676, Adjusted R-squared:  0.5591 
## F-statistic: 67.16 on 6 and 307 DF,  p-value: < 2.2e-16
```

All parameter estimates are the same as in the previous question, so the training error is also identical.
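As a side note (not required by the question), the same dummy coding can be generated automatically with base R's model.matrix(), which expands a factor into dummy columns and drops the reference level on its own; Xd is a hypothetical name.

```r
# Illustrative alternative: model.matrix() builds the dummy-coded
# design matrix directly from the factor, dropping the reference
# level (fall) automatically
Xd = model.matrix(~ age + distance + stores + season, data = train)
head(Xd)
```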
e. Using this new numerical data matrix, we can perform linear regression based on its closed-form solution:

$$\widehat{\boldsymbol{\beta}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}$$

However, to perform this, we also need to add a column of 1s as the intercept term to the design matrix. Write R code to properly define both $\mathbf{X}$ and $\mathbf{y}$, and then perform the linear regression using the above formula. Check your training error against the previous questions. You cannot use lm() in this question.

Answer: We add a column of 1s to the design matrix and follow the formula to calculate $\widehat{\boldsymbol{\beta}}$:

```r
X = data.matrix(cbind(1, newdata[, -7]))
y = newdata[, 7]
b = solve(t(X) %*% X) %*% t(X) %*% y

# training error
mean((y - X %*% b)^2)
## [1] 80.56778
```

The parameter estimates and the training error match our previous results.
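A brief numerical aside, not required by the question: explicitly inverting $\mathbf{X}^\top\mathbf{X}$ works fine here, but solving the least-squares problem through a QR decomposition is generally more stable. The sketch below reuses the X, y, and b objects defined above; b_qr is a hypothetical name.

```r
# Illustrative: qr.solve() computes the least-squares solution via a
# QR decomposition instead of forming the inverse explicitly
b_qr = qr.solve(X, y)
max(abs(b_qr - b))   # should be numerically negligible
```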
Question 2 (model selection)

For this question, use the original six variables of the realestate data (without converting date to season), and treat all of them as continuous variables. However, you should perform the same training/testing split as in Question 1b). Fit models using the training data and use the testing data for evaluation.

a. Calculate the Mallows' $C_p$ criterion using the full model, i.e., with all variables included. Compare this result with a model that contains only age, distance and stores. Which model is better based on this criterion? Compare their corresponding testing errors. Does that match your expectation? If yes, explain why you expect this to happen. If not, what could be the causes?

Answer: Calculate the $C_p$ result for the full model. The model has 7 parameters, including the intercept.

```r
# reset the dataset, then split into training and testing
realestate = read.csv("realestate.csv", row.names = 1)
train = realestate[-testid, ]
test = realestate[testid, ]

# fit the full linear model
lm.fit = lm(price ~ ., data = train)
p = 7
RSS = sum(residuals(lm.fit)^2)
Cp = RSS + 2*p*summary(lm.fit)$sigma^2
Cp
## [1] 25032.13

pred = predict(lm.fit, test)
mean( (test$price - pred)^2 )
## [1] 82.12065
```

Then we calculate the $C_p$ statistic of the reduced model.

```r
# the sub-model
lm.fit2 = lm(price ~ age + distance + stores, data = train)
p = 4
RSS = sum(residuals(lm.fit2)^2)
Cp = RSS + 2*p*summary(lm.fit2)$sigma^2
Cp
## [1] 26530.03

pred = predict(lm.fit2, test)
mean( (test$price - pred)^2 )
## [1] 93.53929
```

Mallows' $C_p$ aims at selecting the best model for prediction. Since the full model has the smaller $C_p$, we expect it to also give the smaller prediction error, which is indeed what we observe here (82.12 versus 93.54). However, this can still vary because of the randomness of the data.

b. Use best subset selection with the AIC criterion to obtain the best model. Perform the following steps:

  • Properly define the design matrix and outcome.
  • Use the leaps package to report the best model for each model size.
  • Use the AIC criterion to compare different models and select the best one. What variable(s) are included in and removed from the best model? Use a plot to intuitively compare different model sizes.
  • Based on the best model, calculate and report the prediction error on the testing data.

Answer:

```r
library(leaps)

# We need to specify the X matrix and outcome y vector
RSSleaps = regsubsets(x = as.matrix(train[, 1:6]), y = train$price)
sumleaps = summary(RSSleaps, matrix = T)
sumleaps
## Subset selection object
## 6 Variables  (and intercept)
##           Forced in Forced out
## date          FALSE      FALSE
## age           FALSE      FALSE
## distance      FALSE      FALSE
## stores        FALSE      FALSE
## latitude      FALSE      FALSE
## longitude     FALSE      FALSE
## 1 subsets of each size up to 6
## Selection Algorithm: exhaustive
##          date age distance stores latitude longitude
## 1  ( 1 ) " "  " " "*"      " "    " "      " "
## 2  ( 1 ) " "  "*" "*"      " "    " "      " "
## 3  ( 1 ) " "  "*" "*"      "*"    " "      " "
## 4  ( 1 ) " "  "*" "*"      "*"    "*"      " "
## 5  ( 1 ) "*"  "*" "*"      "*"    "*"      " "
## 6  ( 1 ) "*"  "*" "*"      "*"    "*"      "*"

# the model size (number of coefficients, including the intercept)
modelsize = apply(sumleaps$which, 1, sum)
n = nrow(train)

# calculate AIC and plot
AIC = n*log(sumleaps$rss/n) + modelsize*2
plot(modelsize, AIC, xlab = "Model Size")
```

[Plot: AIC versus model size; the minimum occurs at the five-variable model.]
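As an optional cross-check, not part of the original solution, the manually computed AIC can be verified with extractAIC(), which for lm objects uses the same $n\log(\text{RSS}/n) + 2\,\text{edf}$ formula; best, vars, and fit_best are hypothetical names.

```r
# Illustrative cross-check: refit the AIC-minimizing subset and verify
# with extractAIC(); edf counts all coefficients, like modelsize above
best = which.min(AIC)
vars = names(which(sumleaps$which[best, ]))[-1]   # drop "(Intercept)"
fit_best = lm(reformulate(vars, response = "price"), data = train)
extractAIC(fit_best, k = 2)   # returns c(edf, AIC); AIC should equal min(AIC)
```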
The best model removes longitude from the full model and keeps all other covariates.

```r
fit = lm(price ~ . - longitude, data = train)
pred = predict(fit, newdata = test)

# prediction error
mean((pred - test$price)^2)
## [1] 81.69408
```

The prediction error is 81.694.
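Before moving to the next part, note that the same regsubsets output can be scored under BIC simply by swapping the penalty 2 for log(n); this sketch is illustrative only, and BIC_vals is a hypothetical name.

```r
# Illustrative: BIC version of the criterion from part (b), reusing
# n, sumleaps, and modelsize defined above
BIC_vals = n * log(sumleaps$rss / n) + modelsize * log(n)
which.min(BIC_vals)   # index of the BIC-preferred model size
```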
c. Use step-wise regression with BIC to select the best model. Clearly state:

  • What is your initial model?
  • What is the upper/lower limit of the model?
  • Are you doing forward/backward or both?
  • Is your result the same as in question b)? Provide a brief discussion of their similarity or dissimilarity and the reason for it.

Answer: The initial model contains only age; the lower limit of the scope is age and the upper limit is the full model; and we use forward selection.

```r
lm.fit = lm(price ~ ., data = train)

# initial model: age only; scope: from age up to the full model;
# forward selection with the BIC penalty k = log(n)
step(lm(price ~ age, data = train),
     scope = list(upper = lm.fit, lower = ~age),
     direction = "forward", trace = 0, k = log(nrow(train)))
## 
## Call:
## lm(formula = price ~ age + distance + stores + latitude, data = train)
## 
## Coefficients:
## (Intercept)          age     distance       stores     latitude  
##  -5.576e+03   -3.026e-01   -4.006e-03    1.197e+00    2.250e+02
```

This BIC model selects one fewer variable than the AIC model in question b): date is dropped. This could be due to several reasons. First, BIC uses a larger penalty (log(n) instead of 2), which usually results in a smaller model. Second, step-wise regression does not always find the globally best model; it can get stuck at a local minimum.
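For comparison (illustrative, not part of the graded answer), one could also start the step-wise search from the full model and allow moves in both directions; because step() is greedy, a different starting point or direction can land on a different model.

```r
# Illustrative: same BIC penalty, but starting from the full model and
# allowing both forward and backward moves
step(lm.fit, scope = list(upper = lm.fit, lower = ~age),
     direction = "both", trace = 0, k = log(nrow(train)))
```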