Assignment 5: Linear Model Selection
SDS293 - Machine Learning
Due: 24 Oct 2017 by 11:59pm
Conceptual Exercises
6.8.1 (p. 259 ISLR)
We perform best subset, forward stepwise, and backward stepwise selection on a single data set.
For each approach, we obtain p + 1 models, containing 0, 1, 2, ..., p predictors. Explain your answers:
(a) Which of the three models with k predictors has the smallest training RSS?
Solution:
Best subset selection has the smallest training RSS: for each k it examines every model with k predictors and keeps the one with the lowest RSS, so neither stepwise method can beat it on the training data. Forward and backward selection build the kth model from the predictors chosen in earlier steps, meaning that a poor choice early on cannot be undone.
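The argument in (a) can be checked numerically. The following is a minimal sketch in base R, not part of the original solution: the simulated data, the seed, and the helper name `rss` are all assumptions made for illustration. It enumerates every subset with `combn` and compares the result against a hand-written greedy forward search.

```r
# Sketch: training RSS of best subset vs. forward stepwise (base R only).
set.seed(42)                         # arbitrary seed (assumption)
n <- 60; p <- 5
X <- matrix(rnorm(n * p), n, p)
colnames(X) <- paste0("x", 1:p)
y <- X[, 1] - 2 * X[, 2] + rnorm(n)  # hypothetical true model
dat <- data.frame(y = y, X)
vars <- colnames(X)

# Training RSS of the least-squares fit on a given set of predictor names
rss <- function(v) {
  sum(resid(lm(reformulate(v, response = "y"), data = dat))^2)
}

# Best subset: for each size k, the minimum RSS over all choose(p, k) subsets
best_rss <- sapply(1:p, function(k) min(apply(combn(vars, k), 2, rss)))

# Forward stepwise: at each step add the predictor that lowers RSS the most
chosen <- character(0)
fwd_rss <- numeric(p)
for (k in 1:p) {
  candidates <- setdiff(vars, chosen)
  cand_rss <- sapply(candidates, function(v) rss(c(chosen, v)))
  chosen <- c(chosen, candidates[which.min(cand_rss)])
  fwd_rss[k] <- min(cand_rss)
}

print(rbind(best = best_rss, forward = fwd_rss))
```

For k = 1 and k = p the two rows must agree, since both methods then consider the same candidate models; for intermediate k the best subset row can only be smaller or equal.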
(b) Which of the three models with k predictors has the smallest test RSS?
Solution:
Best subset selection may have the smallest test RSS because it considers more models than the other methods. However, the stepwise methods might happen to pick a model that fits the test data better, since they search fewer models and are therefore less prone to overfitting. The outcome will depend more heavily on the choice of test set / validation method than on the selection method.
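To see how much the answer to (b) depends on the particular test set, one can repeat the comparison over many simulated train/test pairs. This is only an illustrative sketch in base R: the data-generating model, sample sizes, seed, and helper names (`simulate`, `fit`, `train_rss`, `test_rss`) are assumptions, not part of the assignment.

```r
# Sketch: how often does each method win on test RSS for a fixed size k?
set.seed(123)                        # arbitrary seed (assumption)
p <- 6; k <- 3; n <- 40; reps <- 50
vars <- paste0("x", 1:p)

# Draw one data set from a hypothetical linear model
simulate <- function() {
  X <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, vars))
  data.frame(y = X[, 1] + 0.5 * X[, 2] + 0.25 * X[, 3] + rnorm(n), X)
}

fit <- function(v, dat) lm(reformulate(v, response = "y"), data = dat)
train_rss <- function(v, dat) sum(resid(fit(v, dat))^2)
test_rss <- function(v, train, test) {
  sum((test$y - predict(fit(v, train), newdata = test))^2)
}

outcome <- replicate(reps, {
  train <- simulate()
  test <- simulate()
  # Best subset of size k, chosen by training RSS
  subsets <- combn(vars, k, simplify = FALSE)
  best <- subsets[[which.min(sapply(subsets, train_rss, dat = train))]]
  # Forward stepwise up to size k, chosen by training RSS
  fwd <- character(0)
  for (step in 1:k) {
    cand <- setdiff(vars, fwd)
    gains <- sapply(cand, function(v) train_rss(c(fwd, v), train))
    fwd <- c(fwd, cand[which.min(gains)])
  }
  if (setequal(best, fwd)) {
    "same model"
  } else if (test_rss(best, train, test) < test_rss(fwd, train, test)) {
    "best subset wins"
  } else {
    "forward wins"
  }
})

print(table(outcome))
```

The resulting counts vary with the seed and the simulated model, which is the point of the solution above: neither method is guaranteed the smaller test RSS.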
(c) True or False: the predictors in Model 1 are a subset of the predictors in Model 2:
      Model 1                              Model 2                                  T/F
i.    Forward selection, k variables       Forward selection, k + 1 variables       True
ii.   Backward selection, k variables      Backward selection, k + 1 variables      True
iii.  Backward selection, k variables      Forward selection, k + 1 variables       False
iv.   Forward selection, k variables       Backward selection, k + 1 variables      False
v.    Best subset selection, k variables   Best subset selection, k + 1 variables   False
Explain your reasoning.
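The nesting behaviour behind rows i and v can be illustrated with a small base R sketch. It is not taken from the original solution; the simulated data, seed, and helper names (`rss`, `is_nested`) are assumptions. Forward stepwise builds each model by adding one predictor to the previous one, so its models are nested by construction, whereas best subset picks each size independently and carries no such guarantee.

```r
# Sketch: are the k-variable models nested inside the (k+1)-variable models?
set.seed(7)                          # arbitrary seed (assumption)
n <- 80; p <- 6
X <- matrix(rnorm(n * p), n, p)
colnames(X) <- paste0("x", 1:p)
y <- X[, 1] + X[, 2] - X[, 3] + rnorm(n)   # hypothetical true model
dat <- data.frame(y = y, X)
vars <- colnames(X)

rss <- function(v) {
  sum(resid(lm(reformulate(v, response = "y"), data = dat))^2)
}

# Forward stepwise path: fwd_sets[[k]] holds the k predictors chosen so far
fwd_sets <- vector("list", p)
chosen <- character(0)
for (k in 1:p) {
  candidates <- setdiff(vars, chosen)
  cand_rss <- sapply(candidates, function(v) rss(c(chosen, v)))
  chosen <- c(chosen, candidates[which.min(cand_rss)])
  fwd_sets[[k]] <- chosen
}

# Best subset: the RSS-minimising set of each size, chosen independently
best_sets <- lapply(1:p, function(k) {
  subsets <- combn(vars, k, simplify = FALSE)
  subsets[[which.min(sapply(subsets, rss))]]
})

# TRUE if every set is contained in the next larger one
is_nested <- function(sets) {
  all(sapply(seq_len(length(sets) - 1),
             function(k) all(sets[[k]] %in% sets[[k + 1]])))
}

fwd_nested <- is_nested(fwd_sets)    # TRUE by construction
best_nested <- is_nested(best_sets)  # may be TRUE or FALSE, depending on the data
print(c(forward = fwd_nested, best_subset = best_nested))
```

A `TRUE` for best subset on one data set does not contradict row v: the claim is only that nesting is not guaranteed in general.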
Applied Exercises
6.8.8 parts a-d (p. 262-263 ISLR)
In this exercise, we will generate simulated data, and will then use this data to perform best subset
selection.
(a) Generate a predictor X of length n=100, as well as a noise vector $\epsilon$ of length n=100.
Solution:
> set.seed(1)
> X=rnorm(100)
> eps=rnorm(100)
(b) Generate a response vector Y of length n=100 according to the model
$Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3 + \epsilon$
where $\beta_0$, $\beta_1$, $\beta_2$, and $\beta_3$ are constants of your choice.
Solution:
Selecting $\beta_0 = 3$, $\beta_1 = 2$, $\beta_2 = -3$ and $\beta_3 = 0.3$:
> beta0=3
> beta1=2
> beta2=-3
> beta3=0.3
> Y=beta0 + beta1 * X + beta2 * X^2 + beta3 * X^3 + eps
(c) Perform best subset selection in order to choose the best model containing the predictors $X, X^2, \ldots, X^{10}$. What is the best model obtained according to Cp, BIC, and adjusted $R^2$? Show some plots to provide evidence for your answer, and report the coefficients of the best model obtained.
Solution:
> library(leaps)
> data.full=data.frame(y=Y, x=X)
> mod.full=regsubsets(y ~ poly(x, 10, raw=T), data=data.full, nvmax=10)
> mod.summary=summary(mod.full)
# Find the model size for best cp, BIC and adjr2
> min.cp=which.min(mod.summary$cp)
> min.bic=which.min(mod.summary$bic)
> max.adjr2=which.max(mod.summary$adjr2)
# Plot cp, BIC and adjr2
> plot(mod.summary$cp, xlab="Subset Size", ylab="Cp", pch=20, type="l")
> points(min.cp, mod.summary$cp[min.cp], pch=4, col="red", lwd=7)
> plot(mod.summary$bic, xlab="Subset Size", ylab="BIC", pch=20, type="l")
> points(min.bic, mod.summary$bic[min.bic], pch=4, col="red", lwd=7)
> plot(mod.summary$adjr2, xlab="Subset Size", ylab="adjr2", pch=20, type="l")
> points(max.adjr2, mod.summary$adjr2[max.adjr2], pch=4, col="red", lwd=7)
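We find that all three criteria (Cp, BIC and adjusted $R^2$) select a 3-variable model. Note that the selected model contains $X$, $X^2$ and $X^7$, rather than the $X^3$ term used to generate the data.
The coefficients of the best 3-variable model are: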
> coefficients(mod.full, id=3)
        (Intercept) poly(x, 10, raw=T)1 poly(x, 10, raw=T)2 poly(x, 10, raw=T)7
         3.07627412          2.35623596         -3.16514887          0.01046843
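For readers who want to see what the criteria measure, here is a hedged sketch that computes adjusted $R^2$ and Cp by hand for the selected terms, using the textbook (ISLR) definitions $\text{Adjusted } R^2 = 1 - \frac{RSS/(n-d-1)}{TSS/(n-1)}$ and $C_p = \frac{1}{n}(RSS + 2 d \hat\sigma^2)$. The variable names are illustrative, and estimating $\hat\sigma^2$ from the full degree-10 fit is an assumption; the `cp` values reported by `regsubsets` may be scaled differently from this definition.

```r
# Sketch: adjusted R^2 and Cp computed by hand for the model with X, X^2, X^7.
set.seed(1)
x <- rnorm(100)
eps <- rnorm(100)
y <- 3 + 2 * x - 3 * x^2 + 0.3 * x^3 + eps   # same constants as in part (b)
n <- length(y)

fit <- lm(y ~ x + I(x^2) + I(x^7))  # the three terms reported above
d <- 3                              # number of predictors in the model
rss <- sum(resid(fit)^2)
tss <- sum((y - mean(y))^2)

# Adjusted R^2 from its definition, compared with lm's own value
adjr2_manual <- 1 - (rss / (n - d - 1)) / (tss / (n - 1))
adjr2_lm <- summary(fit)$adj.r.squared

# Cp in the ISLR form; sigma^2 estimated from the full model (assumption)
full <- lm(y ~ poly(x, 10, raw = TRUE))
sigma2_hat <- summary(full)$sigma^2
cp_islr <- (rss + 2 * d * sigma2_hat) / n

print(c(adjr2_manual = adjr2_manual, adjr2_lm = adjr2_lm, cp_islr = cp_islr))
```

The two adjusted $R^2$ values should agree, which confirms that the criterion penalises RSS only through the degrees of freedom n - d - 1.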
(d) Repeat (c), using forward stepwise selection and also using backward stepwise selection. How
does your answer compare to the results in (c)?
Solution:
> mod.fwd=regsubsets(y ~ poly(x, 10, raw=T), data=data.full, nvmax=10, method="forward")
> mod.bwd=regsubsets(y ~ poly(x, 10, raw=T), data=data.full, nvmax=10, method="backward")
> fwd.summary=summary(mod.fwd)
> bwd.summary=summary(mod.bwd)
# Find best forward-selected model size
> min.cp.f=which.min(fwd.summary$cp)
> min.bic.f=which.min(fwd.summary$bic)
> max.adjr2.f=which.max(fwd.summary$adjr2)
# Find best backward-selected model size
> min.cp.b=which.min(bwd.summary$cp)
> min.bic.b=which.min(bwd.summary$bic)
> max.adjr2.b=which.max(bwd.summary$adjr2)
# Plot the statistics
> par(mfrow=c(3, 2))
# Forward Cp