Assignment 5: Linear Model Selection
SDS293 - Machine Learning
Due: 24 Oct 2017 by 11:59pm
Conceptual Exercises
6.8.1 (p. 259 ISLR)
We perform best subset, forward stepwise, and backward stepwise selection on a single data set. For each approach, we obtain p + 1 models, containing 0, 1, 2, ..., p predictors. Explain your answers:
(a) Which of the three models with k predictors has the smallest training RSS?
Solution:
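Best subset selection has the smallest training RSS. For each k it fits every possible model with k predictors and keeps the one with the lowest training RSS, so no other k-predictor model can do better on the training data. Forward and backward stepwise selection only search a nested path of models that depends on which predictors they add or drop first as they iterate toward the kth model, meaning that a poor choice early on cannot be undone. Their k-predictor models can match the best subset model but never beat it.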
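The claim in (a) can be checked numerically. The following is a minimal sketch in base R on simulated data; the data, the helper `rss`, and all variable names are illustrative assumptions and are not part of the assignment. It enumerates every k-predictor model for best subset selection and builds the forward stepwise path greedily.

```r
# Compare training RSS of best subset vs. forward stepwise (base R only).
set.seed(42)
n <- 60
p <- 5
X <- matrix(rnorm(n * p), n, p)
X[, 3] <- X[, 1] + X[, 2] + rnorm(n, sd = 0.1)  # a predictor correlated with two others
y <- X[, 1] + X[, 2] + rnorm(n)

# Training RSS of the least-squares fit using the predictor columns in `vars`.
rss <- function(vars) {
  sum(resid(lm(y ~ X[, vars, drop = FALSE]))^2)
}

# Best subset: minimum RSS over all choose(p, k) models of size k.
best_rss <- sapply(1:p, function(k) min(apply(combn(p, k), 2, rss)))

# Forward stepwise: add the single predictor that lowers RSS the most.
forward_rss <- numeric(p)
chosen <- integer(0)
for (k in 1:p) {
  candidates <- setdiff(1:p, chosen)
  cand_rss <- sapply(candidates, function(j) rss(c(chosen, j)))
  chosen <- c(chosen, candidates[which.min(cand_rss)])
  forward_rss[k] <- min(cand_rss)
}

print(round(rbind(best = best_rss, forward = forward_rss), 3))
```

For every k the `best` row should be no larger than the `forward` row; the two agree at k = 1 and k = p, where both methods consider the same candidate models.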
(b) Which of the three models with k predictors has the smallest test RSS?
Solution:
Best subset selection may have the smallest test RSS because it considers more models than the other methods. However, the stepwise methods might happen to pick a model that fits the test data better, since searching fewer models makes them less prone to overfitting the training data. The outcome will depend more heavily on the choice of test set / validation method than on the selection method.
(c) True or False: the predictors in Model 1 are a subset of the predictors in Model 2:

|      | Model 1                             | Model 2                                 | T/F   |
|------|-------------------------------------|-----------------------------------------|-------|
| i.   | Forward selection, k variables      | Forward selection, k + 1 variables      | True  |
| ii.  | Backward selection, k variables     | Backward selection, k + 1 variables     | True  |
| iii. | Backward selection, k variables     | Forward selection, k + 1 variables      | False |
| iv.  | Forward selection, k variables      | Backward selection, k + 1 variables     | False |
| v.   | Best subset selection, k variables  | Best subset selection, k + 1 variables  | False |

Explain your reasoning.
Applied Exercises
6.8.8 parts a-d (p. 262-263 ISLR)
In this exercise, we will generate simulated data, and will then use this data to perform best subset
selection.
(a) Generate a predictor X of length n=100, as well as a noise vector ε of length n=100.
Solution:
> set.seed(1)
> X=rnorm(100)
> eps=rnorm(100)
(b) Generate a response vector Y of length n=100 according to the model
$$Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3 + \epsilon$$
where $\beta_0$, $\beta_1$, $\beta_2$, and $\beta_3$ are constants of your choice.
Solution:
Selecting $\beta_0 = 3$, $\beta_1 = 2$, $\beta_2 = -3$ and $\beta_3 = 0.3$:
> beta0=3
> beta1=2
> beta2=-3
> beta3=0.3
> Y=beta0 + beta1 * X + beta2 * X^2 + beta3 * X^3 + eps
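As a quick sanity check on the simulated data, one can fit the true cubic model with `lm` and compare the estimates with the chosen constants. This is only a sketch in base R; the names `beta`, `fit.true`, and `est` are illustrative and do not appear in the assignment.

```r
# Regenerate the data from parts (a) and (b) and fit the true cubic model.
set.seed(1)
X <- rnorm(100)
eps <- rnorm(100)
beta <- c(3, 2, -3, 0.3)  # beta0, beta1, beta2, beta3 chosen in part (b)
Y <- beta[1] + beta[2] * X + beta[3] * X^2 + beta[4] * X^3 + eps

# I() protects ^ so that it means "power" rather than formula crossing.
fit.true <- lm(Y ~ X + I(X^2) + I(X^3))
est <- unname(coef(fit.true))

# Side-by-side view of the true and estimated coefficients.
print(round(cbind(true = beta, estimate = est), 3))
```

With n = 100 and unit-variance noise, the estimates should land close to the true values, which makes it easier to judge later whether a selected model has recovered the right terms.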
(c) Perform best subset selection in order to choose the best model containing the predictors $X, X^2, \ldots, X^{10}$. What is the best model obtained according to $C_p$, BIC, and adjusted $R^2$? Show some plots to provide evidence for your answer, and report the coefficients of the best model obtained.
Solution:
> library(leaps)
> data.full=data.frame(y=Y, x=X)
> mod.full=regsubsets(y ~ poly(x, 10, raw=T), data=data.full, nvmax=10)
> mod.summary=summary(mod.full)
# Find the model size for best cp, BIC and adjr2
> min.cp=which.min(mod.summary$cp)
> min.bic=which.min(mod.summary$bic)
> max.adjr2=which.max(mod.summary$adjr2)
# Plot cp, BIC and adjr2
> plot(mod.summary$cp, xlab="Subset Size", ylab="Cp", pch=20, type="l")
> points(min.cp, mod.summary$cp[min.cp], pch=4, col="red", lwd=7)
> plot(mod.summary$bic, xlab="Subset Size", ylab="BIC", pch=20, type="l")
> points(min.bic, mod.summary$bic[min.bic], pch=4, col="red", lwd=7)
> plot(mod.summary$adjr2, xlab="Subset Size", ylab="adjr2", pch=20, type="l")
> points(max.adjr2, mod.summary$adjr2[max.adjr2], pch=4, col="red", lwd=7)
We find that all three criteria ($C_p$, BIC, and adjusted $R^2$) select 3-variable models.
The coefficients of the best 3-variable model are:
> coefficients(mod.full, id=3)
        (Intercept) poly(x, 10, raw=T)1 poly(x, 10, raw=T)2 poly(x, 10, raw=T)7
         3.07627412          2.35623596         -3.16514887          0.01046843
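To make one of the criteria above concrete, adjusted $R^2$ for a least-squares model with d predictors is $1 - \frac{\mathrm{RSS}/(n-d-1)}{\mathrm{TSS}/(n-1)}$. The sketch below, in base R, computes it by hand and compares it with the value reported by `summary()` on an `lm` fit. For simplicity it uses full polynomial fits of degree 1 to 6 rather than the subsets searched by `regsubsets`; the helper `adj_r2` and the other names are illustrative assumptions.

```r
# Regenerate the simulated data from parts (a) and (b).
set.seed(1)
X <- rnorm(100)
eps <- rnorm(100)
Y <- 3 + 2 * X - 3 * X^2 + 0.3 * X^3 + eps

# Adjusted R^2 computed directly from RSS and TSS.
adj_r2 <- function(fit) {
  n <- length(resid(fit))
  d <- length(coef(fit)) - 1         # predictors, excluding the intercept
  rss <- sum(resid(fit)^2)
  y <- fitted(fit) + resid(fit)      # recover the response from the fit
  tss <- sum((y - mean(y))^2)
  1 - (rss / (n - d - 1)) / (tss / (n - 1))
}

# Polynomial fits of increasing degree (raw, non-orthogonal powers of X).
fits <- lapply(1:6, function(deg) lm(Y ~ poly(X, deg, raw = TRUE)))
manual <- sapply(fits, adj_r2)
builtin <- sapply(fits, function(f) summary(f)$adj.r.squared)

print(round(rbind(manual = manual, builtin = builtin), 4))
```

Unlike plain $R^2$, this quantity can decrease when a predictor that adds little is included, which is why it can be used to compare models of different sizes.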
(d) Repeat (c), using forward stepwise selection and also using backward stepwise selection. How
does your answer compare to the results in (c)?
Solution:
> mod.fwd=regsubsets(y ~ poly(x, 10, raw=T), data=data.full, nvmax=10, method="forward")
> mod.bwd=regsubsets(y ~ poly(x, 10, raw=T), data=data.full, nvmax=10, method="backward")
> fwd.summary=summary(mod.fwd)
> bwd.summary=summary(mod.bwd)
# Find best forward-selected model size
> min.cp.f=which.min(fwd.summary$cp)
> min.bic.f=which.min(fwd.summary$bic)
> max.adjr2.f=which.max(fwd.summary$adjr2)
# Find best backward-selected model size
> min.cp.b=which.min(bwd.summary$cp)
> min.bic.b=which.min(bwd.summary$bic)
> max.adjr2.b=which.max(bwd.summary$adjr2)
# Plot the statistics
> par(mfrow=c(3, 2))
# Forward Cp
> plot(fwd.summary$cp, xlab="Subset Size", ylab="Fwd Cp", pch=20, type="l")
> points(min.cp.f, fwd.summary$cp[min.cp.f], pch=4, col="red", lwd=7)
# Backward Cp
> plot(bwd.summary$cp, xlab="Subset Size", ylab="Bwd Cp", pch=20, type="l")
> points(min.cp.b, bwd.summary$cp[min.cp.b], pch=4, col="red", lwd=7)
# Forward BIC
> plot(fwd.summary$bic, xlab="Subset Size", ylab="Fwd BIC", pch=20, type="l")
> points(min.bic.f, fwd.summary$bic[min.bic.f], pch=4, col="red", lwd=7)
# Backward BIC
> plot(bwd.summary$bic, xlab="Subset Size", ylab="Bwd BIC", pch=20, type="l")
> points(min.bic.b, bwd.summary$bic[min.bic.b], pch=4, col="red", lwd=7)
# Forward Adj R^2
> plot(fwd.summary$adjr2, xlab="Subset Size", ylab="Fwd adjr2", pch=20, type="l")
> points(max.adjr2.f, fwd.summary$adjr2[max.adjr2.f], pch=4, col="red", lwd=7)
# Backward Adj R^2
> plot(bwd.summary$adjr2, xlab="Subset Size", ylab="Bwd adjr2", pch=20, type="l")
> points(max.adjr2.b, bwd.summary$adjr2[max.adjr2.b], pch=4, col="red", lwd=7)
We see that all statistics pick 3-variable models except backward stepwise selection with adjusted $R^2$.
Here are the coefficients:
> coefficients(mod.fwd, id = 3)
 (Intercept) poly(x, 10)1 poly(x, 10)2 poly(x, 10)7
  3.07627412   2.35623596  -3.16514887   0.01046843
> coefficients(mod.bwd, id = 3)
 (Intercept) poly(x, 10)1 poly(x, 10)2 poly(x, 10)9
 3.078881355  2.419817953 -3.177235617  0.001870457
> coefficients(mod.bwd, id = 4)
 (Intercept) poly(x, 10)1 poly(x, 10)2 poly(x, 10)4 poly(x, 10)5
  3.12902640   2.27105667  -3.32284363   0.04320229   0.05388957
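Here forward stepwise picks $X^7$ over $X^3$. Backward stepwise with 3 variables picks $X^9$, while backward stepwise with 4 variables picks $X^4$ and $X^5$, as the coefficient output above shows. In every case $X$ and $X^2$ are retained, with estimates close to the true values of 2 and -3.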