STAT3032_001_HW3_Solution_S2023
School: University of Minnesota-Twin Cities
Course: STAT 3032, Statistics
Date: Feb 20, 2024

STAT 3032 Regression and Correlated Data
Homework 3 (Solution)

Please show your work on each problem for full credit. A correct answer unsupported by the necessary explanation, R code, or output will receive very little credit, if any. Your work needs to be organized in a reasonably neat and coherent way and submitted as a PDF file on Canvas. Please do not share this handout outside the class.

Problem 1 [10 pts]

We took two random samples from the election dataset we used in lecture. Each observation is a US county. These two samples are saved in the data files election20_samp1.csv (Sample 1) and election20_samp2.csv (Sample 2). Recall that we were trying to build a model that uses the percentage of voters that chose the Republican candidate in 2012 to predict the percentage of voters that chose the Republican candidate in 2016. The variables:

per2012: the percentage of voters that chose the Republican candidate in 2012
per2016: the percentage of voters that chose the Republican candidate in 2016

(a) [1 pt] Use R to generate two scatterplots (one for each sample) with per2012 on the horizontal axis and per2016 on the vertical axis. Please show your work.

Solution: I have imported the two samples as dat1 and dat2.

> plot(per2016 ~ per2012, data = dat1)
> plot(per2016 ~ per2012, data = dat2)

[Scatterplots for Sample 1 and Sample 2 omitted.]
(b) [1 pt] Based on the scatterplots in Part (a), you will notice that per2012 and per2016 have a linear trend, although Sample 2 has an unusual county that stands out, with a per2012 value close to 90. We will keep this unusual county in the data for now. If the description above doesn't match your scatterplots, something is wrong in Part (a). Fit two linear regression models (one for each sample) that use the 2012 percentage to predict the 2016 percentage. What are the equations of the fitted models? Please round your coefficients to the nearest thousandth and show your work.

Solution:

> mod1 = lm(per2016 ~ per2012, data = dat1)
> mod1

Call:
lm(formula = per2016 ~ per2012, data = dat1)

Coefficients:
(Intercept)      per2012
    10.8595       0.8822

> mod2 = lm(per2016 ~ per2012, data = dat2)
> mod2

Call:
lm(formula = per2016 ~ per2012, data = dat2)

Coefficients:
(Intercept)      per2012
    18.8997       0.7621

The fitted model based on Sample 1 is per2016-hat = 10.860 + 0.882 × per2012.
The fitted model based on Sample 2 is per2016-hat = 18.900 + 0.762 × per2012.

(c) [1 pt] Find the 90% confidence interval for β1 (the slope of per2012 in the population model) based on Sample 1. You must use the confint() function. Please show your work.

Solution:

> confint(mod1, level = 0.9)
                  5 %      95 %
(Intercept) 3.0917141 18.627211
per2012     0.7473272  1.017097

Our 90% confidence interval for the slope based on Sample 1 is (0.75, 1.02).
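The coefficients that lm() reports come from the closed-form least-squares formulas: slope-hat = Sxy / Sxx and intercept-hat = ybar − slope-hat × xbar. As a cross-check of the mechanics only (the dataset below is invented for illustration, not the course samples), a minimal Python sketch:

```python
# Cross-check of lm()'s coefficients using the closed-form
# least-squares formulas: slope = Sxy / Sxx, intercept = ybar - slope * xbar.
# The data below are made up for illustration; they are NOT the course data.

def ols_fit(x, y):
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    return intercept, slope

# Points lying exactly on y = 2 + 3x, so the fit recovers the line exactly.
b0, b1 = ols_fit([1, 2, 3, 4], [5, 8, 11, 14])
```

Running lm(y ~ x) in R on the same four points would return the same intercept 2 and slope 3.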
(d) [3 pts] Find the 90% confidence interval for β1 (the slope of per2012 in the population model) based on Sample 2. This time, you are not allowed to use the confint() function. Please show your work.

Solution: First, we generate the summary of the fitted model from Sample 2.

> summary(mod2)

Call:
lm(formula = per2016 ~ per2012, data = dat2)

Residuals:
    Min      1Q  Median      3Q     Max
-9.7522 -1.2659 -0.0143  2.8760  6.8342

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  18.8997     7.2166   2.619   0.0174 *
per2012       0.7621     0.1072   7.109 1.26e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.782 on 18 degrees of freedom
Multiple R-squared: 0.7374, Adjusted R-squared: 0.7228
F-statistic: 50.54 on 1 and 18 DF, p-value: 1.262e-06

The realized value of β1-hat is 0.76 and the estimated standard error of β1-hat is 0.11. For a 90% confidence interval, our critical value is t(0.95, df = 18) = 1.73, which we compute with

> qt(.95, df = 18, lower.tail = T)
[1] 1.734064

Our confidence interval is β1-hat ± t(0.95, df = 18) × se-hat(β1-hat). Assembling everything:

Lower bound = 0.76 - 0.11 × 1.73 = 0.57
Upper bound = 0.76 + 0.11 × 1.73 = 0.95

Our 90% confidence interval for the slope based on Sample 2 is (0.57, 0.95).
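The assembly step in Part (d) is just arithmetic on three numbers read off the R output: the estimate 0.7621, the standard error 0.1072, and the critical value 1.734064 from qt(). A small Python check of that arithmetic (note that carrying the unrounded estimate and standard error gives a lower bound of about 0.576, which rounds to 0.58; the 0.57 above comes from rounding the estimate and standard error before multiplying):

```python
# 90% CI for the slope, assembled from the summary(mod2) numbers:
# estimate and standard error are read off the R output,
# and 1.734064 is the qt(.95, df = 18) value computed in R.
est, se, tcrit = 0.7621, 0.1072, 1.734064
lower = est - tcrit * se   # about 0.576
upper = est + tcrit * se   # about 0.948
```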
(e) [2 pts] Compare the intervals in Part (c) and Part (d). Which interval is wider? If both intervals are for β1 (the slope of per2012 in the population model) and have the same confidence level (90%), how come they have different widths? Please explain your answer.

Solution: The width of the first interval is 0.27 (= 1.02 - 0.75). The width of the second interval is 0.38 (= 0.95 - 0.57). The interval created from Sample 2 is wider.

Explanation 1: The two intervals have different widths because the endpoints of confidence intervals are random variables, and thus will vary from sample to sample.

Explanation 2: The width of each interval is 2 × t(0.95, df = 18) × se-hat(β1-hat). The two intervals have the same critical value but different values of se-hat(β1-hat), since se-hat(β1-hat) is a statistic computed from the sample.

There may be other reasonable explanations.

(f) [2 pts] From the lecture, we know that β1 = 0.99. You should discover that the confidence interval in Part (c) captures 0.99 but the confidence interval in Part (d) misses it. If the above description is inconsistent with your intervals, something is wrong. Assume that both samples are randomly selected. How come one of the 90% confidence intervals captures the true value of β1 while the other misses it? Hint: what is the meaning of the confidence level, 90%?

Solution: A 90% confidence level means that across many repeated samples, we would expect about 90 out of every 100 confidence intervals to capture the true slope parameter. The interval in Part (c) is among the roughly 90% of intervals that capture the true slope value, while the interval in Part (d) is among the unfortunate 10% that miss it.

Problem 2 [10 pts]

Download the data file Rateprof.csv from Canvas. The variables we will use are quality and clarity of a professor or their class.
Both variables are ratings made by students on a scale of 1 to 5, with 1 being the worst and 5 the best.

(a) [2 pts] Draw a scatterplot with clarity on the horizontal axis and quality on the vertical axis.

Solution:
> plot(quality ~ clarity, data = Rateprof)

[Scatterplot omitted.]

(b) [2 pts] Based on this dataset, what is the fitted model that uses clarity to predict quality? Please provide the equation and pay attention to the notation.

Solution:

> mod = lm(quality ~ clarity, data = Rateprof)
> mod

Coefficients:
(Intercept)     clarity
     0.2210      0.9516

quality-hat = 0.2210 + 0.9516 × clarity

(c) [2 pts] Based on the model fitted in Part (b), are there outliers? Please show your work. Use standardized residuals to identify potential outliers.

Solution:

> par(mfrow = c(2, 2))
> plot(mod)
[Diagnostic plots omitted.]

Using the fourth plot (residuals vs. leverage), we can see that the 76th observation has a standardized residual below -4. Alternatively, you can identify the outlier using the following method:

> rlist = rstandard(mod)
> which(rlist > 4)
named integer(0)

There is no observation with a standardized residual above 4.

> which(rlist < -4)
76
76

The 76th observation has a standardized residual below -4. Therefore the 76th observation is the only outlier.

(d) [2 pts] Based on the model fitted in Part (b), are there influential points? Please show your work. You may use 1 as the threshold for Cook's Distance.

Solution:

> cooksList = cooks.distance(mod)
> sort(cooksList)[366]
      76
0.129732
Even the largest Cook's Distance (0.129732, for observation 76) is less than 1. Therefore there are no influential points.

(e) [2 pts] Let's remove the outlier(s) you identified in Part (c) and refit the model. What is the equation of the fitted model? Compare the coefficients of the new model to those of the fitted model in Part (b). How different are they? Please show your work.

Solution:

> mod2 = lm(quality ~ clarity, data = Rateprof[-76, ])
> mod2

Call:
lm(formula = quality ~ clarity, data = Rateprof[-76, ])

Coefficients:
(Intercept)     clarity
     0.2295      0.9506

The new fitted model is quality-hat = 0.2295 + 0.9506 × clarity.

Compared to the fitted model in Part (b), quality-hat = 0.2210 + 0.9516 × clarity: the intercepts (0.2295 and 0.2210) are close, and the slopes (0.9506 and 0.9516) are close.
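For completeness, the two diagnostics R computed in Parts (c) and (d) have simple closed forms for simple linear regression: the standardized residual is r_i = e_i / (s × sqrt(1 - h_ii)) and Cook's distance is D_i = (r_i^2 / p) × h_ii / (1 - h_ii), where h_ii = 1/n + (x_i - xbar)^2 / Sxx is the leverage, s^2 = SSE / (n - 2), and p = 2 coefficients. A self-contained Python sketch on an invented five-point dataset (not the Rateprof data) whose last point is pulled off the line:

```python
# Hand computation of the diagnostics used in Parts (c) and (d):
# standardized residuals (R's rstandard) and Cook's distance
# (R's cooks.distance) for a simple linear regression.
# The tiny dataset is invented for illustration: the first four
# points lie on y = 2x - 2 and the last point is well off that line.
import math

def diagnostics(x, y):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    intercept = ybar - slope * xbar
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    s2 = sum(e ** 2 for e in resid) / (n - 2)           # sigma-hat squared
    lev = [1 / n + (xi - xbar) ** 2 / sxx for xi in x]  # leverages h_ii
    rstd = [e / math.sqrt(s2 * (1 - h)) for e, h in zip(resid, lev)]
    p = 2  # number of estimated coefficients (intercept + slope)
    cooks = [r ** 2 / p * h / (1 - h) for r, h in zip(rstd, lev)]
    return rstd, cooks

rstd, cooks = diagnostics([1, 2, 3, 4, 5], [1, 2, 3, 4, 10])
```

On this toy data the off-line fifth point has the largest standardized residual (sqrt(3), about 1.73) and the largest Cook's distance (2.25, above the threshold of 1), illustrating how a single point can be flagged by both diagnostics.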