STAT 151A Final Review Session
December 6, 2023

This worksheet doesn't include everything you need to review for the final exam. Please see the final study guide posted on bCourses for a more comprehensive list of concepts, examples and exercises.

1 Regression output

For this question we consider the seatpos dataset from R. Here is a description from the R help file:

    Car drivers like to adjust the seat position for their own comfort. Car designers would find
    it helpful to know where different drivers will position the seat depending on their size and
    age. Researchers at the HuMoSim laboratory at the University of Michigan collected data
    on 38 drivers.

We focus on a random subset of 33 drivers. The dataset contains the following variables:

- Age (in years)
- Weight (in lbs)
- HtShoes (height in shoes, in cm)
- Ht (height bare foot, in cm)
- Seated (seated height, in cm)
- Arm (lower arm length, in cm)
- Thigh (thigh length, in cm)
- Leg (lower leg length, in cm)
- hipcenter (horizontal distance of the midpoint of the hips from a fixed location in the car, in mm)

Using the variables given in the dataset, you decide to create four new variables v1, v2, v3 and v4 via

> v1 = seatpos$HtShoes - 171.3
> v2 = seatpos$Arm - 0.2252*seatpos$HtShoes + 6.346
> v3 = seatpos$Thigh - 0.1662*seatpos$HtShoes - 0.3376*seatpos$Arm + 0.6548
> v4 = seatpos$Leg - 0.2374*seatpos$HtShoes - 0.2280*seatpos$Arm + 0.0746*seatpos$Thigh + 8.872

You then fit the model

    hipcenter = β0 + β1 v1 + β2 v2 + β3 v3 + β4 v4 + e

to the data using R, which gives the following output:
> model = lm(seatpos$hipcenter ~ v1 + v2 + v3 + v4)
> summary(model)

Call:
lm(formula = seatpos$hipcenter ~ v1 + v2 + v3 + v4)

Residuals:
   Min     1Q Median     3Q    Max
-89.50 -23.21  -4.82  24.41  60.39

Coefficients:
             Estimate Std. Error
(Intercept) -163.201       6.654
v1            -4.206       0.580
v2             0.112       3.095
v3            -0.613       2.516
v4            -8.927        XXXX
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 38.23 on 28 degrees of freedom
Multiple R-squared: 0.666,    Adjusted R-squared: XXXX
F-statistic: XXXX on 4 and 28 DF

The XᵀX matrix for this linear model is given by (approximately)

> X = model.matrix(model)
> t(X) %*% X
            (Intercept)     v1     v2     v3    v4
(Intercept)          33      0      0      0     0
v1                    0 4343.7      0      0     0
v2                    0      0 152.54      0     0
v3                    0      0      0 230.76     0
v4                    0      0      0      0 58.53

a) Fill in the three missing values in the R output above, giving appropriate reasons.
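As a hint for part a): since XᵀX is diagonal, SE(β̂4) = σ̂ / √((XᵀX)[5,5]), and the adjusted R² and F-statistic follow from R² = 0.666 with n = 33 and 4 predictors. A minimal sketch of the arithmetic, with all inputs taken from the output above:

sigma_hat <- 38.23                      # residual standard error on 28 df, from the summary
se_v4  <- sigma_hat / sqrt(58.53)       # SE(beta4_hat) = sigma_hat / sqrt((X'X)[5,5])
r2     <- 0.666
adj_r2 <- 1 - (1 - r2) * (33 - 1) / 28  # n - 1 = 32 and n - p - 1 = 28
f_stat <- (r2 / 4) / ((1 - r2) / 28)    # F-statistic on 4 and 28 df
c(se_v4, adj_r2, f_stat)                # approximately 5.00, 0.618, 13.96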
b) You decide to drop the variables v3 and v4 from the above model, which results in the following fit:

Call:
lm(formula = seatpos$hipcenter ~ v1 + v2)

Residuals:
     Min       1Q   Median       3Q      Max
-101.318  -27.873    7.407   23.938   71.741

Coefficients:
            Estimate Std. Error
(Intercept)   XXXXXX     XXXXXX
v1            XXXXXX     XXXXXX
v2            XXXXXX     XXXXXX
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 39.02 on 30 degrees of freedom
Multiple R-squared: 0.6273,    Adjusted R-squared: 0.6024
F-statistic: 25.24 on 2 and 30 DF

Fill in the six missing values above, explaining your answers.
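A hint for part b): because XᵀX is diagonal, the columns of the design matrix are mutually orthogonal, so dropping v3 and v4 leaves the remaining estimates unchanged; only the standard errors change, because σ̂ is re-estimated. A sketch using the numbers already printed:

est <- c(-163.201, -4.206, 0.112)                # unchanged from the full model (orthogonal columns)
sigma_hat <- 39.02                               # new residual standard error, on 30 df
se  <- sigma_hat / sqrt(c(33, 4343.7, 152.54))   # SE = sigma_hat / sqrt(diagonal entry of X'X)
rbind(estimate = est, std_error = se)            # SEs approximately 6.79, 0.59, 3.16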
2 Hypothesis Testing

Let's say I have collected the following data: a single categorical variable with 3 categories, a continuous independent variable (call this x1), and a proportion outcome y ∈ (0, 1). I assume there are no other relevant variables for the data-generating process. Design a linear model and an F-distributed test statistic to jointly test these null hypotheses:

- There is no relationship between x1 and y within categorical group 1.
- The relationship between x1 and y within categorical group 2 is the same as the relationship between x1 and y within categorical group 3.

Toward this end, please answer the following questions:

a) What is the data matrix for the linear model?

b) Write out the model formulation.

c) Note any assumptions that make the F-test valid (i.e. our canonical assumptions for linear modeling in the course thus far).

d) Construct a matrix L for the general linear hypothesis to jointly test the two null hypotheses. How do you compute the relevant F-test statistic (including its degrees-of-freedom parameters, as a function of the sample size n), and what distribution does it follow?
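One possible construction, sketched below with simulated placeholder data (this is one valid design, not the only one): code the model as y = β0 + β1·1{g=2} + β2·1{g=3} + β3 x1 + β4 x1·1{g=2} + β5 x1·1{g=3} + ε. Then group 1's slope is β3 and groups 2 and 3 have slopes β3 + β4 and β3 + β5, so the two null hypotheses become β3 = 0 and β4 = β5, i.e. Lβ = 0 for a 2 × 6 matrix L:

# Simulated placeholders: g has 3 levels, x1 is continuous, y is a proportion in (0, 1)
set.seed(1)
n  <- 90
g  <- factor(rep(1:3, each = n / 3))
x1 <- rnorm(n)
y  <- plogis(rnorm(n))

fit <- lm(y ~ g * x1)    # coefficients: (Intercept), g2, g3, x1, g2:x1, g3:x1
b   <- coef(fit)

# L encodes H0: slope in group 1 is 0, and slope in group 2 equals slope in group 3
L <- rbind(c(0, 0, 0, 1, 0, 0),     # beta_x1 = 0
           c(0, 0, 0, 0, 1, -1))    # beta_{g2:x1} - beta_{g3:x1} = 0

# F = (Lb)' [L (X'X)^{-1} L']^{-1} (Lb) / (q * sigma2_hat), on q and n - 6 df
X  <- model.matrix(fit)
q  <- nrow(L)
s2 <- sum(resid(fit)^2) / (n - ncol(X))
Fstat <- drop(t(L %*% b) %*% solve(L %*% solve(crossprod(X)) %*% t(L)) %*% (L %*% b)) / (q * s2)
pf(Fstat, q, n - ncol(X), lower.tail = FALSE)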
3 Categorical regression

Let's imagine that 80 students took a particular course: 20 were freshmen, 20 were sophomores, 20 were juniors and 20 were seniors. In R, I have saved the final scores (out of 100) for the 20 freshmen in the vector g1, for the 20 sophomores in g2, the juniors in g3 and the seniors in g4. Also, for i = 1, ..., 80, let

- y_i: final score of the i-th student in the class
- x_{i1}: takes the value 1 if the i-th student is a freshman and 0 otherwise
- x_{i2}: takes the value 1 if the i-th student is a sophomore and 0 otherwise
- x_{i3}: takes the value 1 if the i-th student is a junior and 0 otherwise
- x_{i4}: takes the value 1 if the i-th student is a senior and 0 otherwise

I fit the linear model

    y_i = β0 + β1 x_{i1} + β2 x_{i2} + β3 x_{i3} + β4 x_{i4} + ε_i

to this data via R to obtain the following output:

[R summary output not reproduced in this extract]

a) Why does the R output above say "1 not defined because of singularities"? How would you fix the problem?

b) Fill in the 3 missing values in the R output with proper reasoning.

c) Explain why the standard error estimates for the coefficients of x1, x2, x3 are all the same.
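Regarding part a): the singularity arises because x_{i1} + x_{i2} + x_{i3} + x_{i4} = 1 for every student, so the four dummies sum to the intercept column. A minimal sketch reproducing the symptom and one standard fix; the scores below are simulated placeholders, not the actual class data:

# Placeholder scores for the four groups of 20 students each
g1 <- rnorm(20, 75, 10); g2 <- rnorm(20, 78, 10)
g3 <- rnorm(20, 80, 10); g4 <- rnorm(20, 82, 10)
y  <- c(g1, g2, g3, g4)
x1 <- rep(c(1, 0, 0, 0), each = 20)
x2 <- rep(c(0, 1, 0, 0), each = 20)
x3 <- rep(c(0, 0, 1, 0), each = 20)
x4 <- rep(c(0, 0, 0, 1), each = 20)

lm(y ~ x1 + x2 + x3 + x4)    # x4 reported as NA: x1 + x2 + x3 + x4 equals the intercept column
lm(y ~ x1 + x2 + x3)         # one standard fix: drop one dummy, making seniors the baseline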
4 Bootstrap

4.1 Bootstrap basics

Consider the following two observed vectors:

    y = (50, 60, 90, 120, 140),    x = (20, 100, 180, 260, 340)

The fitted model lm(y ~ x) returns

Coefficients:
(Intercept)            x
       38.0          0.3

a) Suppose we create a new sample by bootstrapping cases, and let z_i = (y_i, x_i) be the ordered pair of values for the i-th observation in this bootstrap sample. Write down the exact distribution of z_1.

b) Suppose now that we generate the bootstrap sample by bootstrapping residuals instead of cases. Is the distribution of z_1 the same as in part (a)? If so, explain why. If not, provide the new distribution.
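The two schemes differ in what gets resampled; a minimal sketch of each makes the contrast concrete:

y <- c(50, 60, 90, 120, 140)
x <- c(20, 100, 180, 260, 340)
fit <- lm(y ~ x)    # fitted values 44, 68, 92, 116, 140; residuals 6, -8, -2, 4, 0

# Bootstrapping cases: z*_1 is one of the five observed (y_i, x_i) pairs,
# each drawn with probability 1/5
idx <- sample(1:5, size = 5, replace = TRUE)
cases_star <- cbind(y = y[idx], x = x[idx])

# Bootstrapping residuals: x_1 stays fixed at 20, and y*_1 = 44 + e*,
# where e* is drawn uniformly from the five observed residuals
e_star <- sample(resid(fit), size = 5, replace = TRUE)
resid_star <- cbind(y = fitted(fit) + e_star, x = x)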
4.2 Bootstrap CIs

The summary of a fitted model is:

[model summary not reproduced in this extract]

Below are the percentiles for a bootstrapped estimate of the sampling distribution of the studentized β̂ in this simple linear model:

[percentile table not reproduced in this extract]

Construct a 95% confidence interval for β, using the bootstrapped studentized distribution.
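The studentized (bootstrap-t) interval replaces the usual t quantiles with bootstrap percentiles of t* = (β̂* − β̂) / SE(β̂*). A sketch of the arithmetic; all four inputs below are hypothetical placeholders standing in for the omitted summary and percentile table:

beta_hat <- 0.3      # placeholder: coefficient estimate from the (omitted) model summary
se_hat   <- 0.05     # placeholder: its standard error
q_lo     <- -2.4     # placeholder: 2.5th percentile of the studentized bootstrap distribution
q_hi     <-  2.1     # placeholder: 97.5th percentile

# Note the quantile reversal: the upper percentile sets the lower endpoint and vice versa
ci <- c(beta_hat - q_hi * se_hat, beta_hat - q_lo * se_hat)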
5 Logistic regression

For this question we use a dataset consisting of 4000 emails, of which 1556 emails were identified as spam. The response variable, named yesno, takes the value y if the email is spam and n otherwise. The explanatory variables include crl.tot (total length of words in capitals), dollar (number of occurrences of the $ symbol), bang (number of occurrences of the ! symbol), money (number of occurrences of the word money), n000 (number of occurrences of the string 000) and make (number of occurrences of the word make). We fit a logistic regression model to the data via

s = 0.001
M1 <- glm(yesno ~ log(crl.tot) + log(dollar+s) + log(bang+s) + log(money+s)
          + log(n000+s) + log(make+s), family=binomial, data=spam)
summary(M1)

This gave me the following output:

Call:
glm(formula = yesno ~ log(crl.tot) + log(dollar + s) + log(bang + s)
    + log(money + s) + log(n000 + s) + log(make + s),
    family = binomial, data = spam)

Deviance Residuals:
    Min      1Q  Median      3Q     Max
-3.0938 -0.4468 -0.2895  0.3731  2.7087

Coefficients:
                Estimate Std. Error z value
(Intercept)         3.86       0.38    XXXX
log(crl.tot)        0.31       0.04    7.81
log(dollar + s)     0.31       0.03   12.59
log(bang + s)       0.40       0.02   23.70
log(money + s)      0.32       0.03   10.84
log(n000 + s)       0.20       0.03    6.33
log(make + s)      -0.11       0.02   -4.88
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 5346.4 on 3999 degrees of freedom
Residual deviance: 2870.9 on 3993 degrees of freedom
AIC: 2884.9

Number of Fisher Scoring iterations: 6
a) Fill in the first missing value in the above output, giving appropriate reasons.

b) Suppose a new email comes in for which

crl.tot  dollar   bang  money  n000  make
    157   0.868  2.894      0     0     0

Using the above logistic regression model, do we predict that this email is most likely spam or not?

c) It may be noted that in the model M1, I took logarithms of the explanatory variables. I decided to fit another logistic regression model without taking logarithms of the explanatory variables:

M2 = glm(yesno ~ crl.tot+dollar+bang+money+n000+make, family=binomial, data=spam)

The residual deviance for this model turned out to be 3508.9. On the basis of this, which of the two models M1 and M2 would you use, and why?
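A sketch of the calculations behind parts a) and b), using the rounded coefficients printed above (with these values the linear predictor comes out near 3.0, so the fitted probability of spam is well above 0.5):

z_intercept <- 3.86 / 0.38     # z value = Estimate / Std. Error, roughly 10.2

s   <- 0.001
eta <- 3.86 + 0.31*log(157) + 0.31*log(0.868 + s) + 0.40*log(2.894 + s) +
       0.32*log(0 + s) + 0.20*log(0 + s) - 0.11*log(0 + s)
p_spam <- plogis(eta)          # about 0.95: predict spam, since p_spam > 0.5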
6 Model selection

We analyze a subset of the bodyfat dataset used in lab. Consider the following R code and R output:

> lmod = lm(bodyfat ~ Age + Weight + Height + Knee + Biceps + Wrist, data = body_subset)
> summary(lmod)

Call:
lm(formula = bodyfat ~ Age + Weight + Height + Knee + Biceps +
    Wrist, data = body_subset)

Residuals:
     Min       1Q   Median       3Q      Max
-18.7488  -3.7902   0.0035   3.8497  15.0668

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  48.41126   13.72912   3.526 0.000527 ***
Age           0.22626    0.03491   6.482 7.37e-10 ***
Weight        0.22758    0.03352   6.790 1.35e-10 ***
Height       -0.47390    0.11390  -4.161 4.77e-05 ***
Knee          0.05696    0.32633   0.175 0.861613
Biceps        0.22976    0.22809   1.007 0.315043
Wrist        -3.10337    0.72066  -4.306 2.64e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 5.69 on 193 degrees of freedom
Multiple R-squared: 0.5447,    Adjusted R-squared: 0.5305
F-statistic: 38.48 on 6 and 193 DF,  p-value: < 2.2e-16

> library(leaps)
> vs <- regsubsets(BODYFAT ~ AGE + WEIGHT + HEIGHT + KNEE + BICEPS + WRIST, body)
> rs <- summary(vs)
> rs$which
  (Intercept)   AGE WEIGHT HEIGHT  KNEE BICEPS WRIST
1        TRUE FALSE   TRUE  FALSE FALSE  FALSE XXXXX
2        TRUE FALSE   TRUE   TRUE FALSE  FALSE FALSE
3        TRUE  TRUE   TRUE  XXXXX FALSE  FALSE FALSE
4        TRUE XXXXX   TRUE  XXXXX FALSE  FALSE  TRUE
5        TRUE  TRUE   TRUE   TRUE XXXXX   TRUE  TRUE
6        TRUE  TRUE   TRUE   TRUE  TRUE   TRUE  TRUE
> rs$cp
[1]     XXXXX 45.256690 19.651287  4.035405  5.030469  7.000000
> rs$adjr2
[1] 0.3312390 0.4298340 0.4930463 0.5328571 0.5328810 0.5305348
> rs$rss
[1] 9131.252 7745.719 6852.019 6281.721 6249.187 6248.200

a) There are five missing values (indicated by XXXXX) in the rs$which output above. Fill them in, explaining your reasoning.

b) Fill in the missing value in the output of rs$cp, explaining your reasoning.

c) Based on the output above, which model would you use for predicting bodyfat based on the explanatory variables, and why?
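For part b), Mallows' Cp can be recomputed from rs$rss and the full-model error variance via Cp = RSS_p / σ̂² + 2(p + 1) − n. A sketch using the printed numbers, assuming n = 200 (the full fit has 193 = n − 7 residual df) and σ̂² = 5.69² from that fit; small discrepancies are rounding:

n      <- 200
sigma2 <- 5.69^2                       # residual variance of the full six-predictor model
rss    <- c(9131.252, 7745.719, 6852.019, 6281.721, 6249.187, 6248.200)
p      <- 1:6                          # number of predictors in each best submodel
cp     <- rss / sigma2 + 2 * (p + 1) - n
cp                                     # entries 2-6 match 45.26, 19.65, 4.04, 5.03, 7.00;
                                       # the first entry fills in the XXXXX (about 86)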
7 Shrinkage: Applied analysis

The Online News Popularity dataset contains information on news articles from the website Mashable.com, including the number of times each article was shared. Using a subset of 30,000 news articles, we model the number of shares as a function of the 58 other explanatory variables using shrinkage methods. The following two plots show cross-validation error (top) and coefficient values (bottom) against log-lambda values for ridge regression (at left) and lasso regression (at right).

[cross-validation and coefficient-path plots not reproduced in this extract]

a) What is the meaning of the sequences of numbers written across the top of the two cross-validation plots? Interpret and explain the differences between the two sequences.

b) A different statistician fits an OLS model to this same data. Based on the plots above, do you think the R² for this model will be closer to one or closer to zero? Explain your answer.

c) Two of the 58 explanatory variables can be written as linear combinations of the other 56 variables. Comment on the respective implications for OLS and ridge regression.

d) Suppose a new article is observed. Do you expect the lasso or the OLS model to have lower MSE in predicting how many times it will be shared? Why?
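Plots of this kind are typically produced with the glmnet package; a hedged sketch of how they might have been generated, where x and y are hypothetical stand-ins for the design matrix and share counts (neither is shown in this extract):

library(glmnet)
# x: 30000 x 58 numeric matrix of explanatory variables; y: number of shares
cv_ridge <- cv.glmnet(x, y, alpha = 0)          # alpha = 0 gives ridge
cv_lasso <- cv.glmnet(x, y, alpha = 1)          # alpha = 1 gives lasso

par(mfrow = c(2, 2))
plot(cv_ridge); plot(cv_lasso)                  # numbers across the top count nonzero coefficients
plot(cv_ridge$glmnet.fit, xvar = "lambda")      # coefficient paths against log(lambda)
plot(cv_lasso$glmnet.fit, xvar = "lambda")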
8 Conceptual review: T/F

1. The residuals from a least-squares regression fit are always orthogonal to the fitted values, even when the linearity assumption E(ϵ) = 0 is violated.

2. The maximum likelihood estimator for β in the Normal linear model is a biased estimator.

3. In simple regression, rejecting the null hypothesis that β = 0 is equivalent to concluding that x has a causal effect on y.

4. The leverage for the i-th subject measures how far the i-th subject is from the rest of the subjects in terms of the explanatory variable values.

5. The semiparametric bootstrap from lecture does not implicitly assume that the errors in the model are identically distributed.

6. R² is an effective model selection criterion for deciding the best size for a linear model.

7. In a model with p continuous explanatory variables, Mallows' Cp must be p + 1.

8. Transformations of the explanatory variables can potentially improve the fit of the model in logistic regression.

9. The MLE of β in a logistic regression model can always be computed in closed form.

10. The logit transformation is good for proportion data, because it is a monotonic transformation from the real numbers to the range (0, 1).

11. The log-likelihood for a saturated logistic regression model is 0.
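Several of these statements can be checked numerically. For instance, a quick sketch of statement 1, using deliberately misspecified simulated data (illustrative only):

set.seed(2)
x <- runif(100)
y <- exp(3 * x) + rnorm(100)     # linearity is clearly violated here
fit <- lm(y ~ x)
sum(resid(fit) * fitted(fit))    # numerically zero: the orthogonality is algebraic, not model-based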