STAT 151A
Lab 7: Midterm Review Session
October 6, 2023
Note: there is no submission required for lab 7. This worksheet doesn’t include everything
you need to review for the midterm. Please see the midterm study guide posted on bCourses
for a more comprehensive list of concepts, examples and exercises.
1 Data transformation
Problem 1 Conceptual Review
(a) Why do we transform data?
(b) What is the Box-Cox transformation of $X$?
(c) What $p$ do you use to correct positive skewness (right skew)? What $p$ do you use to correct negative skewness (left skew)?
(d) A good transformation will make the ratio $\frac{UQ - M}{M - LQ}$ close to 1.
(e) What is Tukey and Mosteller's bulging rule, and how is it used to correct monotone non-linearity?
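As a rough illustration of parts (b) through (d), the sketch below applies the Box-Cox family, $(x^p - 1)/p$ with $\log x$ at $p = 0$, to a small right-skewed sample and compares the quartile ratio before and after. The sample values and the helper names `box_cox` and `quartile_ratio` are made up for this example, and the quartiles use Python's "inclusive" interpolation rule, which may differ slightly from the quartile definition used in the course text.

```python
import math
import statistics


def box_cox(x, p):
    """Box-Cox family: (x**p - 1) / p for p != 0, log(x) for p == 0 (needs x > 0)."""
    if x <= 0:
        raise ValueError("Box-Cox requires positive values")
    return math.log(x) if p == 0 else (x ** p - 1) / p


def quartile_ratio(values):
    """Return (UQ - M) / (M - LQ); a value near 1 suggests a symmetric middle half."""
    lq, m, uq = statistics.quantiles(values, n=4, method="inclusive")
    return (uq - m) / (m - lq)


# Hypothetical right-skewed sample (illustrative only).
data = [1, 2, 4, 8, 16, 32, 64, 128, 256]

raw_ratio = quartile_ratio(data)                              # well above 1: right skew
log_ratio = quartile_ratio([box_cox(x, 0) for x in data])     # close to 1 after the log

print(f"raw ratio = {raw_ratio:.3f}, log ratio = {log_ratio:.3f}")
```

Here a power below 1 (the log) pulls in the long right tail; for a left-skewed sample one would try a power above 1 instead.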
Problem 2 Exercise 4.1 - Fox
Create a graph for the ordinary power transformations $X \to X^p$ for $p = -1, 0, 1, 2, 3$. (When $p = 0$, however, use the log transformation.) Compare the graph to Figure 4.1, and comment on the similarities and differences between the two families of transformations $x^p$ and $(x^p - 1)/p$.
2 Simple linear regression
Problem 3 SLR review
Consider simple linear regression $y_i = \beta_0 + \beta_1 x_i + \epsilon_i$.
(a) What are the assumptions?
(b) Derive the least squares estimates of $\beta_0$ and $\beta_1$.
(c) Show that $\hat{\beta}_0$ and $\hat{\beta}_1$ are unbiased. What assumptions are used?
(d) Derive $\mathrm{var}(\hat{\beta}_0)$, $\mathrm{var}(\hat{\beta}_1)$ and $\mathrm{cov}(\hat{\beta}_0, \hat{\beta}_1)$. What assumptions are used?
(e) What is an unbiased estimator for $\sigma^2$?
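To check a derivation for parts (b) and (e) numerically, the following sketch computes the usual closed-form least squares estimates $\hat{\beta}_1 = S_{xy}/S_{xx}$ and $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$, together with $RSS/(n-2)$ as the estimate of $\sigma^2$. The data and the function name `fit_slr` are hypothetical and only serve as a sanity check, not as a substitute for the derivation.

```python
def fit_slr(x, y):
    """Return (b0, b1, s2): least squares intercept, slope, and RSS / (n - 2)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx                 # slope
    b0 = y_bar - b1 * x_bar        # intercept
    rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    s2 = rss / (n - 2)             # two estimated coefficients cost two degrees of freedom
    return b0, b1, s2


# Hypothetical data (illustrative only).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

b0, b1, s2 = fit_slr(x, y)
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, s2 = {s2:.3f}")
```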
Problem 4 TSS, RSS and $R^2$ review
Consider simple linear regression $y_i = \beta_0 + \beta_1 x_i + \epsilon_i$ under standard linear model assumptions:
(a) What is the residual standard error, and how is it interpreted?
(b) What are the total sum of squares, regression sum of squares, and residual sum of squares?
(c) What is the definition of R-squared, and what does it represent?
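The quantities in this problem can be computed directly for a small example. The sketch below, on hypothetical data, forms the total, regression, and residual sums of squares for a simple regression fit and then derives R-squared and the residual standard error from them; the function name `sums_of_squares` is an assumption of this example.

```python
import math


def sums_of_squares(x, y):
    """Fit y on x by least squares and return (TSS, RegSS, RSS)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
          / sum((xi - x_bar) ** 2 for xi in x))
    b0 = y_bar - b1 * x_bar
    fitted = [b0 + b1 * xi for xi in x]
    tss = sum((yi - y_bar) ** 2 for yi in y)                  # total variation in y
    reg_ss = sum((fi - y_bar) ** 2 for fi in fitted)          # variation captured by the fit
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))    # variation left in residuals
    return tss, reg_ss, rss


# Hypothetical data (illustrative only).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

tss, reg_ss, rss = sums_of_squares(x, y)
r_squared = reg_ss / tss                  # equivalently 1 - rss / tss
rse = math.sqrt(rss / (len(x) - 2))       # residual standard error, in the units of y

print(f"TSS = {tss:.2f}, RegSS = {reg_ss:.2f}, RSS = {rss:.2f}")
print(f"R^2 = {r_squared:.3f}, residual standard error = {rse:.3f}")
```

With an intercept in the model, the decomposition TSS = RegSS + RSS holds exactly, which is what makes the two expressions for R-squared agree.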
Problem 5 (SP23 HW)
Consider simple linear regression where there is one response variable $y$ and an explanatory variable $x$, and there are $n$ subjects with values $y_1, \cdots, y_n$ and $x_1, \cdots, x_n$.
(a) What are the estimates for $\alpha_0$ and $\alpha_1$ if we regress $x$ on $y$?
(b) Let $\hat{\beta}_0$ and $\hat{\beta}_1$ be the estimates from regressing $y$ on $x$. Intuition might suggest that $\hat{\alpha}_1 = 1/\hat{\beta}_1$. Is this true?
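One way to probe part (b) before attempting a proof is to compute both slopes on a small example. The sketch below uses hypothetical data and helper names (`ls_slope`, `correlation`) and compares $\hat{\alpha}_1$ with $1/\hat{\beta}_1$ and with the sample correlation.

```python
import math


def ls_slope(u, v):
    """Least squares slope from regressing v on u (with an intercept)."""
    n = len(u)
    u_bar = sum(u) / n
    v_bar = sum(v) / n
    suv = sum((ui - u_bar) * (vi - v_bar) for ui, vi in zip(u, v))
    suu = sum((ui - u_bar) ** 2 for ui in u)
    return suv / suu


def correlation(u, v):
    """Sample Pearson correlation coefficient."""
    n = len(u)
    u_bar = sum(u) / n
    v_bar = sum(v) / n
    suv = sum((ui - u_bar) * (vi - v_bar) for ui, vi in zip(u, v))
    suu = sum((ui - u_bar) ** 2 for ui in u)
    svv = sum((vi - v_bar) ** 2 for vi in v)
    return suv / math.sqrt(suu * svv)


# Hypothetical data (illustrative only).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

beta1 = ls_slope(x, y)     # slope from regressing y on x
alpha1 = ls_slope(y, x)    # slope from regressing x on y
r = correlation(x, y)

print(f"beta1 = {beta1:.3f}, 1/beta1 = {1 / beta1:.3f}, alpha1 = {alpha1:.3f}")
print(f"alpha1 * beta1 = {alpha1 * beta1:.3f}, r^2 = {r ** 2:.3f}")
```

In this example the product of the two slopes matches $r^2$ rather than 1, so the two slopes are reciprocals only when the points lie exactly on a line.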
Problem 6 Exercise 5.9
Show that in simple-regression analysis, the standardized slope coefficient $B$ is equal to the correlation coefficient $r$. (In general, however, standardized slope coefficients are not correlations and can be outside of the range [−1, 1].)
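A quick numerical check of this claim, under the assumption that "standardized" means both variables are converted to z-scores (using the sample standard deviation) before fitting: the slope of the regression on z-scores should coincide with $r$. The data and helper names below are hypothetical.

```python
import statistics


def z_scores(values):
    """Center by the mean and scale by the sample standard deviation."""
    m = statistics.mean(values)
    s = statistics.stdev(values)
    return [(v - m) / s for v in values]


def ls_slope(u, v):
    """Least squares slope from regressing v on u (with an intercept)."""
    u_bar = statistics.mean(u)
    v_bar = statistics.mean(v)
    suv = sum((ui - u_bar) * (vi - v_bar) for ui, vi in zip(u, v))
    suu = sum((ui - u_bar) ** 2 for ui in u)
    return suv / suu


# Hypothetical data (illustrative only).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

zx, zy = z_scores(x), z_scores(y)
standardized_slope = ls_slope(zx, zy)

# Sample correlation written as the average product of z-scores.
r = sum(a * b for a, b in zip(zx, zy)) / (len(x) - 1)

print(f"standardized slope = {standardized_slope:.6f}, r = {r:.6f}")
```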
3 Multiple regression
Problem 7 MR Review
Consider multiple regression $\vec{y} = X\beta + \vec{\epsilon}$.
(a) What are the assumptions?
(b) Derive the least squares estimates of $\beta$.
(c) Show that $\hat{\beta}$ is unbiased. What assumptions are used?
(d) Derive $\mathrm{cov}(\hat{\beta})$. What assumptions are used?
(e) What is an unbiased estimator for $\sigma^2$?
Problem 8 Other concepts of MR
(a) What is adjusted R-squared? Why can $R^2$ only rise when covariates are added?
(b) How do correlated variables impact the regression coefficients?
(c) What are the standardized coefficients, and how are they interpreted?
Problem 9 True/False (Past midterm)
(a) $R^2$ is an effective model selection criterion for deciding the best size for a linear model.
(b) If I assume the data-generating process is $\vec{y} = X\beta + \vec{\epsilon}$ with full rank matrix $X$ treated as fixed, then the following is true:
$$\arg\min_{\beta} \|X\beta - \vec{y}\|_2^2 = (X^T X)^{-1} X^T \vec{y}$$
regardless of the distribution of $\epsilon$.
(c) The R-squared summary output will always increase if I add more covariates to the
regression.
Problem 10 (SP23 midterm)
In many data analyses, $\vec{y}$ observations are collected from various sensors with different measurement variabilities. Let's say that I know the variability of each sensor such that I can safely assume the following model:
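$$\vec{y} = X\beta + \vec{\epsilon}, \qquad \vec{\epsilon} \sim N\left(0,\ \sigma^2 \begin{bmatrix} w_1^2 & 0 & \cdots & 0 \\ 0 & w_2^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & w_n^2 \end{bmatrix}\right)$$
(a) What is the solution (call this $\hat{\beta}_{OLS}$) to the following optimization problem:
$$\arg\min_{\beta} \|X\beta - \vec{y}\|_2^2$$
(b) Let $\vec{w} = \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_n \end{bmatrix}$. Show the following:
$$Var(\hat{\beta}_{OLS}) = \sigma^2 (X^T X)^{-1} X^T \vec{w}\vec{w}^T X (X^T X)^{-1}$$
(c) Let $\vec{w}\vec{w}^T = W$, and note $W^{-1} = \begin{bmatrix} 1/w_1 & 1/w_2 & \cdots & 1/w_n \end{bmatrix} \begin{bmatrix} 1/w_1 \\ 1/w_2 \\ \vdots \\ 1/w_n \end{bmatrix}$. Consider the transformed model:
$$W^{-1/2}\vec{y} = W^{-1/2}X\beta + W^{-1/2}\vec{\epsilon}$$
Show that the least squares estimator (call this $\hat{\beta}_{WLS}$) for the model above is
$$\hat{\beta}_{WLS} = (X^T W^{-1} X)^{-1} X^T W^{-1} \vec{y}$$
(d) Show that $\hat{\beta}_{WLS}$ is unbiased for $\beta$.
(e) Compute the variance of $\hat{\beta}_{WLS}$ as an expression involving one instance of each of the following: $X, X^T, W^{-1}, \sigma^2$.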
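A minimal numerical sketch of parts (a), (c) and (e) for the simplest possible case, an intercept-only design in which $X$ is a column of ones. It assumes the error covariance is the diagonal matrix $\sigma^2\,\mathrm{diag}(w_1^2, \ldots, w_n^2)$ from the model statement and uses that diagonal matrix in the role of $W$; the data, the weights, and the function names are hypothetical.

```python
def ols_estimate(y):
    """Intercept-only OLS: (X^T X)^{-1} X^T y reduces to the plain mean."""
    return sum(y) / len(y)


def wls_estimate(y, w):
    """Intercept-only WLS: (X^T W^{-1} X)^{-1} X^T W^{-1} y with W = diag(w_i^2)."""
    inv_var = [1.0 / wi ** 2 for wi in w]
    return sum(iv * yi for iv, yi in zip(inv_var, y)) / sum(inv_var)


def ols_variance(w, sigma2):
    """Variance of the plain mean when Var(eps_i) = sigma2 * w_i^2."""
    return sigma2 * sum(wi ** 2 for wi in w) / len(w) ** 2


def wls_variance(w, sigma2):
    """sigma2 * (X^T W^{-1} X)^{-1} for the intercept-only design."""
    return sigma2 / sum(1.0 / wi ** 2 for wi in w)


# Hypothetical readings from three sensors; the third is twice as noisy.
y = [10.0, 12.0, 20.0]
w = [1.0, 1.0, 2.0]
sigma2 = 1.0

beta_ols = ols_estimate(y)
beta_wls = wls_estimate(y, w)
var_ols = ols_variance(w, sigma2)
var_wls = wls_variance(w, sigma2)

print(f"OLS estimate = {beta_ols:.3f}, variance = {var_ols:.4f}")
print(f"WLS estimate = {beta_wls:.3f}, variance = {var_wls:.4f}")
```

The weighted estimate leans less on the noisy sensor, and its variance is no larger than that of the unweighted mean.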
Problem 11 Partial coefficient - FWL theorem
(a) How do you compute a partial coefficient and its standard error?
(b) What is the variance-inflation factor and how does it relate to the coefficient variance?
4 Geometry and matrix form of linear models
Problem 12 Gram-Schmidt (SP23 practice midterm)
Consider running a multiple linear regression of $\vec{y}$ on
$$X = [\vec{1} \ \ \vec{x} \ \ \vec{x}^2],$$
where $\vec{x}^2$ is a vector of the squared corresponding elements of $\vec{x}$, and the elements of $\vec{x}$ are larger than 10.
(a) Is there a unique solution to the following optimization problem? Explain why or why not.
(b) Find an orthogonal basis for $X$.
(c) Write the OLS predictions, $\hat{y}$, as a function of the orthogonal basis in (b). No need to fully simplify.
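The construction in parts (b) and (c) can be tried out on a tiny example. The sketch below runs classical Gram-Schmidt on the columns $\vec{1}$, $\vec{x}$, $\vec{x}^2$ and then writes the fitted values as a sum of projections onto the resulting orthogonal vectors. The numbers in `x` and `y` and the helper names are hypothetical, chosen only so that the elements of $\vec{x}$ exceed 10.

```python
def dot(u, v):
    """Euclidean inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def gram_schmidt(columns):
    """Classical Gram-Schmidt: orthogonal (not normalized) vectors with the same span."""
    basis = []
    for col in columns:
        v = list(col)
        for q in basis:
            coef = dot(col, q) / dot(q, q)       # projection coefficient onto q
            v = [vi - coef * qi for vi, qi in zip(v, q)]
        basis.append(v)
    return basis


def project(y, basis):
    """OLS fitted values as the sum of projections of y onto each orthogonal vector."""
    fitted = [0.0] * len(y)
    for q in basis:
        coef = dot(y, q) / dot(q, q)
        fitted = [fi + coef * qi for fi, qi in zip(fitted, q)]
    return fitted


# Hypothetical data (illustrative only).
x = [11.0, 12.0, 13.0, 14.0]
y = [1.0, 2.0, 2.0, 4.0]
ones = [1.0] * len(x)
x_sq = [xi ** 2 for xi in x]

q0, q1, q2 = gram_schmidt([ones, x, x_sq])
y_hat = project(y, [q0, q1, q2])
residual = [yi - fi for yi, fi in zip(y, y_hat)]

print("q1 =", q1)
print("q2 =", q2)
print("y_hat =", [round(v, 4) for v in y_hat])
```

Because the basis vectors are mutually orthogonal, each projection coefficient can be computed separately, and the residual is orthogonal to every basis vector.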
Problem 13 Hat matrix (SP23 practice midterm)
In this problem we will analyze some properties of the "hat matrix" from the linear model. Specifically, consider the multiple linear regression model $\vec{y} = X\vec{\beta} + \vec{\epsilon}$, with $\vec{\epsilon} \sim N(0, \sigma^2 I)$. Recall the hat matrix is defined as:
$$H = X (X^T X)^{-1} X^T,$$
where $X \in \mathbb{R}^{n \times (p+1)}$ is full column rank.
(a) Consider the predicted values $\hat{y} = X\hat{\beta}$. Show that $\hat{y}$ has variance $\sigma^2 H$.
(b) Let $\vec{e} = \vec{y} - \hat{y}$. Show that $\vec{e} = (I - H)\vec{y}$.
(c) Show that $(I - H)$ is symmetric and idempotent.
(d) Show that $\mathrm{var}(\vec{e}) = \sigma^2 (I - H)$.
(e) Show that $\hat{\beta}$ and $\vec{e}$ are independent.
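The properties in part (c) can be sanity-checked numerically in the special case of simple regression, where $X = [\vec{1} \ \vec{x}]$ and the hat matrix has the closed form $h_{ij} = 1/n + (x_i - \bar{x})(x_j - \bar{x})/S_{xx}$. The sketch below is restricted to that special case; the data and helper names are hypothetical, and a numerical check is of course not a proof.

```python
def hat_matrix_slr(x):
    """Hat matrix for X = [1, x]: h_ij = 1/n + (x_i - x_bar)(x_j - x_bar) / Sxx."""
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return [[1.0 / n + (xi - x_bar) * (xj - x_bar) / sxx for xj in x] for xi in x]


def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def max_abs_diff(a, b):
    """Largest entrywise absolute difference between two matrices."""
    return max(abs(a[i][j] - b[i][j])
               for i in range(len(a)) for j in range(len(a[0])))


# Hypothetical design points (illustrative only).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
n = len(x)

H = hat_matrix_slr(x)
M = [[(1.0 if i == j else 0.0) - H[i][j] for j in range(n)] for i in range(n)]  # I - H
M_t = [[M[j][i] for j in range(n)] for i in range(n)]

symmetric_gap = max_abs_diff(M, M_t)              # 0 if I - H is symmetric
idempotent_gap = max_abs_diff(matmul(M, M), M)    # ~0 if (I - H)^2 = I - H
trace_H = sum(H[i][i] for i in range(n))          # equals the number of columns of X

print(f"symmetry gap = {symmetric_gap:.2e}, idempotency gap = {idempotent_gap:.2e}")
print(f"trace(H) = {trace_H:.6f}")
```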
5 Sample dataset questions
Problem 14 Modeling Sugar Cane Production (SP23 midterm)
Suppose you have been hired as a consultant by the sugar company that operates these
sugarcane fields. Your job is to build a linear model to predict the sugarcane production in
tons per hectare. You are provided a dataset with columns:
• Region: region (defined by physical position and average rainfall) in which each paddock is located.
• Position: geographic position of each paddock in the general area according to the compass directions (E = east, W = west, N = north, S = south, C = central).
• Area: size of the paddock in hectares.
• Age: years elapsed since the paddock was plowed out and planted with new sugarcane seeds.
• HarvestMonth: month of the year in which the harvest took place (1 = January, 2 = February, etc.).
• HarvestDuration: time taken to harvest the sugarcane in days.
• Tonn.Hect: tons per hectare of sugarcane produced by this paddock.
• Rainfall.96: total rainfall for the district from July 1996 through December 1996 (millimeters).
(a) Which variable is your response variable? Which variables are continuous and which are categorical?
(b) You plot the distribution for Tonn.Hect. How would you describe the spread of the variable? For the purposes of your model, would you transform this variable, and if yes, how so?
(c) You create the following correlation matrix.
Based on the following correlations, in a linear regression model where Tonn.Hect is
the response variable, how would the covariates Age and HarvestDuration impact the
coefficient of Area?
(d) HarvestMonth can be used as a continuous or categorical variable for our model. What
are some drawbacks of using HarvestMonth as continuous over categorical?
This question is a little beyond the scope of this midterm, so here is the solution if you're curious:
A categorical variable captures a non-linear relationship between the variable and the response (i.e., if the response differs between months). Additionally, treating Month as continuous is uninterpretable, as our model will consider non-integer values for Month.
(e) You fit the following linear model, and the R summary is as follows:
Tonn.Hect ~ Area + HarvestMonth + Position + Region + Rainfall.96
Based on this output, is it safe to assume that Region is unimportant to our model due
to most of the categories having a low t-value? Explain your reasoning.
This question is a little beyond the scope of this midterm.
So here is the solution if
you’re curious:
No, we cannot look at the individual t-values of each category to make an overall claim
about the entire variable. We would need to use an F-test where we test the significance
of all categories together.
Problem 15 Model diagnostic (SP23 practice exam)
Below are the residual vs. fitted plot and the Q-Q plot from a model. Describe what
problems you see, if any, in the assumptions of the model. If you see problems in these
diagnostic plots, describe what you might suggest to get an improved regression model.