STAT 151A Lab 7: Midterm Review Session
October 6, 2023

Note: there is no submission required for Lab 7. This worksheet doesn't include everything you need to review for the midterm. Please see the midterm study guide posted on bCourses for a more comprehensive list of concepts, examples, and exercises.

1 Data transformation

Problem 1 Conceptual Review

(a) Why do we transform data?
(b) What is the Box-Cox transformation of $X$?
(c) What $p$ do you use to correct positive skewness (right skew)? What $p$ do you use to correct negative skewness (left skew)?
(d) A good transformation will make the ratio
$$\frac{UQ - M}{M - LQ}$$
close to 1, where $M$ is the median and $LQ$ and $UQ$ are the lower and upper quartiles.
(e) What is Tukey and Mosteller's bulging rule, and how is it used to correct monotone non-linearity?

Problem 2 Exercise 4.1 - Fox

Create a graph for the ordinary power transformations $X \to X^p$ for $p = -1, 0, 1, 2, 3$. (When $p = 0$, however, use the log transformation.) Compare the graph to Figure 4.1, and comment on the similarities and differences between the two families of transformations $x^p$ and $(x^p - 1)/p$.
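As a rough numerical companion to Problems 1 and 2, the sketch below applies the family $(x^p - 1)/p$ (with the log at $p = 0$) to a made-up right-skewed sample and reports the ratio $(UQ - M)/(M - LQ)$ for several powers. The sample, the function names `box_cox` and `quartile_ratio`, and the choice of powers are assumptions for illustration only; the quartiles come from Python's `statistics.quantiles` with its default method, which may differ slightly from the hinges used in the textbook.

```python
import math
import statistics

def box_cox(x, p):
    """Box-Cox family: (x**p - 1) / p for p != 0, log(x) for p == 0 (x must be > 0)."""
    if x <= 0:
        raise ValueError("Box-Cox requires strictly positive values")
    return math.log(x) if p == 0 else (x ** p - 1) / p

def quartile_ratio(values):
    """(UQ - M) / (M - LQ); a value near 1 suggests the middle half is symmetric."""
    lq, m, uq = statistics.quantiles(values, n=4)
    return (uq - m) / (m - lq)

# Hypothetical right-skewed sample: the exponential of an evenly spaced grid.
sample = [math.exp(k / 10) for k in range(1, 40)]

# Moving down the ladder of powers pulls in the long right tail.
for p in (1, 0.5, 0, -1):
    transformed = [box_cox(x, p) for x in sample]
    print(f"p = {p:>4}: (UQ - M)/(M - LQ) = {quartile_ratio(transformed):.3f}")
```

For this particular sample the log transformation makes the values evenly spaced, so the ratio is essentially 1 at $p = 0$; a ratio above 1 indicates remaining right skew and a ratio below 1 indicates that the transformation has overshot into left skew.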

2 Simple linear regression

Problem 3 SLR review

Consider simple linear regression $y_i = \beta_0 + \beta_1 x_i + \epsilon_i$.

(a) What are the assumptions?
(b) Derive the least squares estimates of $\beta_0$ and $\beta_1$.
(c) Show that $\hat{\beta}_0$ and $\hat{\beta}_1$ are unbiased. What assumptions are used?
(d) Derive $\mathrm{var}(\hat{\beta}_0)$, $\mathrm{var}(\hat{\beta}_1)$, and $\mathrm{cov}(\hat{\beta}_0, \hat{\beta}_1)$. What assumptions are used?
(e) What is an unbiased estimator for $\sigma^2$?

Problem 4 TSS, RSS and $R^2$ review

Consider simple linear regression $y_i = \beta_0 + \beta_1 x_i + \epsilon_i$ under standard linear model assumptions:

(a) What is the residual standard error, and how do you interpret it?
(b) What are the total sum of squares, regression sum of squares, and residual sum of squares?
(c) What is the definition of R-squared, and what does it represent?

Problem 5 (SP23 HW)

Consider simple linear regression where there is one response variable $y$ and one explanatory variable $x$, and there are $n$ subjects with values $y_1, \cdots, y_n$ and $x_1, \cdots, x_n$.

(a) What are the estimates for $\alpha_0$ and $\alpha_1$ if we regress $x$ on $y$?
(b) Let $\hat{\beta}_0$ and $\hat{\beta}_1$ be the estimates from regressing $y$ on $x$. Intuition might suggest that $\hat{\alpha}_1 = 1/\hat{\beta}_1$. Is this true?

Problem 6 Exercise 5.9

Show that in simple-regression analysis, the standardized slope coefficient $B$ is equal to the correlation coefficient $r$. (In general, however, standardized slope coefficients are not correlations and can be outside of the range $[-1, 1]$.)
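The quantities in Problems 3, 5, and 6 can be checked numerically before attempting the derivations. The sketch below is a minimal illustration on made-up data, using only the closed-form simple-regression formulas (slope $= S_{xy}/S_{xx}$, intercept $= \bar{y} - \text{slope} \cdot \bar{x}$); the helper names `centered_sums` and `simple_ols` and the data values are assumptions, not part of the worksheet.

```python
import math

def centered_sums(x, y):
    """Return the means and the centered sums of squares and cross-products."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return x_bar, y_bar, sxx, syy, sxy

def simple_ols(x, y):
    """Least squares (intercept, slope) for regressing y on x."""
    x_bar, y_bar, sxx, _, sxy = centered_sums(x, y)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Hypothetical data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1, 6.3]

b0, b1 = simple_ols(x, y)   # regress y on x
a0, a1 = simple_ols(y, x)   # regress x on y

_, _, sxx, syy, sxy = centered_sums(x, y)
r = sxy / math.sqrt(sxx * syy)

# Standardized slope: the y-on-x slope rescaled by sd(x) / sd(y).
standardized_slope = b1 * math.sqrt(sxx / syy)

# Residuals and the usual RSS / (n - 2) estimate of sigma^2.
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
sigma2_hat = sum(e ** 2 for e in residuals) / (len(x) - 2)

print("beta1_hat * alpha1_hat:", b1 * a1, " r^2:", r ** 2)
print("standardized slope:", standardized_slope, " r:", r)
print("sigma^2 estimate:", sigma2_hat)
```

In this check the product of the two slopes matches $r^2$ rather than 1, so $\hat{\alpha}_1 = 1/\hat{\beta}_1$ holds only when the points lie exactly on a line.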

3 Multiple regression

Problem 7 MR Review

Consider multiple regression $\vec{y} = X\beta + \vec{\epsilon}$.

(a) What are the assumptions?
(b) Derive the least squares estimate of $\beta$.
(c) Show that $\hat{\beta}$ is unbiased. What assumptions are used?
(d) Derive $\mathrm{cov}(\hat{\beta})$. What assumptions are used?
(e) What is an unbiased estimator for $\sigma^2$?

Problem 8 Other concepts of MR

(a) What is adjusted R-squared? Why can $R^2$ only rise when covariates are added?
(b) How do correlated variables impact the regression coefficients?
(c) What are the standardized coefficients, and how do you interpret them?

Problem 9 True/False (Past midterm)

(a) $R^2$ is an effective model selection criterion for deciding the best size for a linear model.
(b) If I assume the data-generating process is $\vec{y} = X\beta + \vec{\epsilon}$ with full rank matrix $X$ treated as fixed, then the following is true:
$$\arg\min_{\beta} \|X\beta - \vec{y}\|_2^2 = (X^T X)^{-1} X^T \vec{y}$$
regardless of the distribution of $\vec{\epsilon}$.
(c) The R-squared in the summary output will always increase if I add more covariates to the regression.

Problem 10 SP23 midterm

In many data analyses, $\vec{y}$ observations are collected from various sensors with different measurement variabilities. Let's say that I know the variability of each sensor such that I can safely assume the following model:

$$\vec{y} = X\beta + \vec{\epsilon}, \qquad \vec{\epsilon} \sim N\left(0,\ \sigma^2 \begin{pmatrix} w_1^2 & 0 & \cdots & 0 \\ 0 & w_2^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & w_n^2 \end{pmatrix}\right)$$

(a) What is the solution (call this $\hat{\beta}_{OLS}$) to the following optimization problem?
$$\arg\min_{\beta} \|X\beta - \vec{y}\|_2^2$$
(b) Let
$$\vec{w} = \begin{pmatrix} w_1 \\ w_2 \\ \vdots \\ w_n \end{pmatrix}.$$
Show the following:
$$Var(\hat{\beta}_{OLS}) = \sigma^2 (X^T X)^{-1} X^T \vec{w}\vec{w}^T X (X^T X)^{-1}$$
(c) Let $\vec{w}\vec{w}^T = W$, and note
$$W^{-1} = \begin{pmatrix} 1/w_1 & 1/w_2 & \cdots & 1/w_n \end{pmatrix} \begin{pmatrix} 1/w_1 \\ 1/w_2 \\ \vdots \\ 1/w_n \end{pmatrix}.$$
Consider the transformed model:
$$W^{-1/2}\vec{y} = W^{-1/2}X\beta + W^{-1/2}\vec{\epsilon}$$
Show that the least squares estimator (call this $\hat{\beta}_{WLS}$) for the model above is
$$\hat{\beta}_{WLS} = (X^T W^{-1} X)^{-1} X^T W^{-1} \vec{y}$$
(d) Show that $\hat{\beta}_{WLS}$ is unbiased for $\beta$.
(e) Compute the variance of $\hat{\beta}_{WLS}$ as an expression involving one instance of each of the following: $X$, $X^T$, $W^{-1}$, $\sigma^2$.

Problem 11 Partial coefficient - FWL theorem

(a) How do you compute a partial coefficient and its standard error?
(b) What is the variance-inflation factor, and how does it relate to the coefficient variance?
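The claim in Problem 10(c) can be sanity-checked numerically. The sketch below assumes that $W$ is read as the diagonal matrix with entries $w_i^2$ (the covariance structure in the stated model), so that $W^{-1/2}$ divides row $i$ by $w_i$; that reading, the tiny pure-Python matrix helpers, and the data are all assumptions for illustration. It compares the closed-form $\hat{\beta}_{WLS}$ with ordinary least squares on the rescaled data for a model with an intercept and one covariate.

```python
def transpose(a):
    return [list(col) for col in zip(*a)]

def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]

def inverse_2x2(m):
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def ols(X, y):
    """(X^T X)^{-1} X^T y for a design matrix with two columns."""
    Xt = transpose(X)
    return matmul(inverse_2x2(matmul(Xt, X)), matmul(Xt, y))

def wls(X, y, w):
    """(X^T W^{-1} X)^{-1} X^T W^{-1} y, with W assumed to be diag(w_i^2)."""
    n = len(w)
    W_inv = [[1.0 / w[i] ** 2 if i == j else 0.0 for j in range(n)] for i in range(n)]
    Xt_W_inv = matmul(transpose(X), W_inv)
    return matmul(inverse_2x2(matmul(Xt_W_inv, X)), matmul(Xt_W_inv, y))

# Hypothetical data: one covariate, five sensors with known relative noise levels w_i.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
obs = [1.1, 2.3, 2.8, 4.5, 4.9]
w = [1.0, 1.0, 2.0, 2.0, 4.0]

X = [[1.0, xi] for xi in x]   # intercept column plus covariate
y = [[yi] for yi in obs]      # response as a column vector

# Transformed model: divide each row of X and y by w_i, then run plain OLS.
X_scaled = [[v / wi for v in row] for row, wi in zip(X, w)]
y_scaled = [[row[0] / wi] for row, wi in zip(y, w)]

beta_wls = wls(X, y, w)
beta_transformed = ols(X_scaled, y_scaled)
beta_ols = ols(X, y)

print("WLS closed form:     ", beta_wls)
print("OLS on scaled model: ", beta_transformed)
print("OLS on original data:", beta_ols)
```

The first two results should agree up to floating-point error, while OLS on the original data generally differs because it gives the noisier sensors the same weight as the precise ones.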

4 Geometry and matrix form of linear models

Problem 12 Gram-Schmidt (SP23 practice midterm)

Consider running a multiple linear regression of $\vec{y}$ on $X = [\vec{1}\ \ \vec{x}\ \ \vec{x}^2]$, where $\vec{x}^2$ is a vector of the squared corresponding elements of $\vec{x}$, and the elements of $\vec{x}$ are larger than 10.

(a) Is there a unique solution to the following optimization problem? Explain why or why not.
(b) Find an orthogonal basis for $X$.
(c) Write the OLS predictions, $\hat{y}$, as a function of the orthogonal basis in (b). No need to fully simplify.

Problem 13 Hat matrix (SP23 practice midterm)

In this problem we will analyze some properties of the "hat matrix" from the linear model. Specifically, consider the multiple linear regression model $\vec{y} = X\vec{\beta} + \vec{\epsilon}$, with $\vec{\epsilon} \sim N(0, \sigma^2 I)$. Recall the hat matrix is defined as
$$H = X (X^T X)^{-1} X^T,$$
where $X \in \mathbb{R}^{n \times (p+1)}$ is full column rank.

(a) Consider the predicted values $\hat{y} = X\hat{\beta}$. Show that $\hat{y}$ has variance $\sigma^2 H$.
(b) Let $\vec{e} = \vec{y} - \hat{y}$. Show that $\vec{e} = (I - H)\vec{y}$.
(c) Show that $(I - H)$ is symmetric and idempotent.
(d) Show that $\mathrm{var}(\vec{e}) = \sigma^2 (I - H)$.
(e) Show that $\hat{\beta}$ and $\vec{e}$ are independent.

5 Sample dataset questions

Problem 14 Modeling Sugar Cane Production (SP23 midterm)

Suppose you have been hired as a consultant by the sugar company that operates these sugarcane fields. Your job is to build a linear model to predict the sugarcane production in tons per hectare. You are provided a dataset with columns:

• Region: region (defined by physical position and average rainfall) in which each paddock is located.
• Position: geographic position of each paddock in the general area according to the compass directions (E = east, W = west, N = north, S = south, C = central).
• Area: size of the paddock in hectares.
• Age: years elapsed since the paddock was plowed out and planted with new sugarcane seeds.
• HarvestMonth: month of the year in which the harvest took place (1 = January, 2 = February, etc.).
• HarvestDuration: time taken to harvest the sugarcane in days.
• Tonn.Hect: tons per hectare of sugarcane produced by this paddock.
• Rainfall.96: total rainfall for the district from July 1996 through December 1996 (millimeters).

(a) Which variable is your response variable? Which variables are continuous/categorical?
(b) You plot the distribution for Tonn.Hect. How would you describe the spread of the variable? For the purposes of your model, would you transform this variable, and if yes, how so?

(c) You create the following correlation matrix. Based on the following correlations, in a linear regression model where Tonn.Hect is the response variable, how would the covariates Age and HarvestDuration impact the coefficient of Area?

(d) HarvestMonth can be used as a continuous or categorical variable for our model. What are some drawbacks of using HarvestMonth as continuous over categorical?

This question is a little beyond the scope of this midterm, so here is the solution if you're curious: A categorical variable captures a non-linear relationship between the variable and the response (i.e., if there is a difference in the response variable between various months). Additionally, treating Month as continuous is uninterpretable, as our model will consider non-integer values for Month.

(e) You fit the following linear model, and the R summary is as follows:

Tonn.Hect ∼ Area + HarvestMonth + Position + Region + Rainfall.96

Based on this output, is it safe to assume that Region is unimportant to our model due to most of the categories having a low t-value? Explain your reasoning.

This question is a little beyond the scope of this midterm, so here is the solution if you're curious: No, we cannot look at the individual t-values of each category to make an overall claim about the entire variable. We would need to use an F-test where we test the significance of all categories together.

Problem 15 Model diagnostic (SP23 practice exam)

Below are the residual vs. fitted plot and the Q-Q plot from a model. Describe what problems you see, if any, in the assumptions of the model. If you see problems in these diagnostic plots, describe what you might suggest to get an improved regression model.