a3-solution
Assignment 3: Linear/Quadratic Discriminant Analysis and Comparing Classification Methods
SDS293 - Machine Learning
Due: 11 Oct 2017 by 11:59pm
Conceptual Exercises
4.5 (p. 169 ISLR)
This question examines the differences between LDA and QDA.
(a) If the Bayes decision boundary is linear, do we expect LDA or QDA to perform better on
the training set? On the test set?
Solution: We would expect QDA to perform better on the training set because its increased
flexibility will result in a closer fit. If the Bayes decision boundary is linear, we expect LDA
to perform better than QDA on the test set, as QDA could be subject to overfitting.
(b) If the Bayes decision boundary is non-linear, do we expect LDA or QDA to perform better
on the training set? On the test set?
Solution: If the Bayes decision boundary is non-linear, we expect QDA to perform better on
both the training and test sets.
(c) In general, as the sample size n increases, do we expect the test prediction accuracy of QDA
relative to LDA to improve, decline, or be unchanged? Why?
Solution: We expect the test prediction accuracy of QDA relative to LDA to improve as n
gets bigger. In general, as the sample size increases, a more flexible method will yield a
better fit, because its higher variance is offset by the larger sample size.
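One way to see why QDA needs more data is to count the parameters each method estimates. The sketch below assumes p predictors and K classes, counts only class means and covariance entries (class priors are ignored), and uses hypothetical helper names that are not part of the assignment.

```python
def covariance_params(p):
    """Number of free entries in a symmetric p x p covariance matrix."""
    return p * (p + 1) // 2


def lda_params(p, k):
    """LDA: one mean vector per class plus one shared covariance matrix."""
    return k * p + covariance_params(p)


def qda_params(p, k):
    """QDA: one mean vector and one covariance matrix per class."""
    return k * p + k * covariance_params(p)


if __name__ == "__main__":
    # Two classes, as in the Up/Down problems later in this assignment.
    for p in (1, 5, 50):
        print(f"p={p:>2}  LDA: {lda_params(p, 2):>5}  QDA: {qda_params(p, 2):>5}")
```

The gap between the two counts grows with p, which is why the extra variance of QDA matters most when n is small relative to the number of predictors.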
(d) True or False: Even if the Bayes decision boundary for a given problem is linear, we will
probably achieve a superior test error rate using QDA rather than LDA because QDA is
flexible enough to model a linear decision boundary. Justify your answer.
Solution: False. When the Bayes decision boundary is linear, the extra flexibility of QDA
removes no bias but adds variance. Particularly with fewer sample points, this would likely
result in overfitting, yielding a higher test error rate than LDA.
Applied Exercises
4.10 (p. 171 ISLR)
This question should be answered using the Weekly data set, which is part of the ISLR package.
This data is similar in nature to the Smarket data from this chapter’s lab, except that it contains
1,089 weekly returns for 21 years, from the beginning of 1990 to the end of 2010.
(a) Produce some numerical and graphical summaries of the Weekly data. Do there appear to
be any patterns?
Solution: Year and Volume appear to have a relationship. No other patterns are discernible.
(b) Use the full data set to perform a logistic regression with Direction as the response and the
five lag variables plus Volume as predictors, and use the summary() function to print the
results. Do any of the predictors appear to be statistically significant? If so, which ones?
Solution: Lag2 appears to have some statistical significance with Pr(>|z|) = 3%.
(c) Compute the confusion matrix and overall fraction of correct predictions. What is the
confusion matrix telling you about the types of mistakes made by your logistic model?
Solution: Percentage of correct predictions:
(54 + 557) / (54 + 557 + 48 + 430) = 56.1%
On weeks where the market goes up, the logistic regression is right most of the time:
557 / (557 + 48) = 92.1%
However, on weeks where the market goes down, the logistic regression is wrong most of the
time; it is right only
54 / (430 + 54) = 11.2%
of the time.
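The arithmetic in this solution can be checked with a few lines of Python. This is a minimal sketch using only the four counts quoted above; the assignment of each count to a cell of the confusion matrix is inferred from the fractions in the solution, and the variable names are illustrative.

```python
# Counts quoted in the solution to (c). Cell assignment is an assumption
# inferred from the fractions 557/(557+48) and 54/(430+54).
down_pred_down = 54   # market went Down, model predicted Down
down_pred_up = 430    # market went Down, model predicted Up
up_pred_down = 48     # market went Up, model predicted Down
up_pred_up = 557      # market went Up, model predicted Up

total = down_pred_down + down_pred_up + up_pred_down + up_pred_up

# Overall fraction of correct predictions.
accuracy = (down_pred_down + up_pred_up) / total

# Fraction correct within each true class.
up_correct = up_pred_up / (up_pred_up + up_pred_down)
down_correct = down_pred_down / (down_pred_down + down_pred_up)

print(f"weeks:              {total}")
print(f"overall accuracy:   {accuracy:.1%}")
print(f"correct on Up:      {up_correct:.1%}")
print(f"correct on Down:    {down_correct:.1%}")
```

Splitting the accuracy by true class makes the pattern explicit: the model is correct on most Up weeks largely because it predicts Up almost every week.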
(d) Now fit the logistic regression model using a training data period from 1990 to 2008, with
Lag2 as the only predictor. Report the confusion matrix and the overall fraction of correct
predictions for the test data (that is, the data from 2009 and 2010).
Solution:
glm.pred  Down  Up
Down         9   5
Up          34  56
mean: 0.625
(e) Repeat (d) using LDA.
Solution: Same as logistic regression.
(f) Repeat (d) using QDA.
Solution:
glm.pred  Down  Up
Down         0   0
Up          43  61
mean: 0.587
An accuracy of 58.7%, even though QDA predicted Up every week!
(g) Repeat (d) using KNN with K = 1.
Solution:
glm.pred  Down  Up
Down        21  30
Up          22  31
mean: 0.5
(h) Which of these methods appears to provide the best results on this data?
Solution: Logistic regression and LDA provide the lowest test error rate; both reach the same
62.5% accuracy (a 37.5% test error rate), ahead of QDA (58.7%) and KNN with K = 1 (50%).
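The comparison can be reproduced from the confusion matrices reported in (d), (f), and (g). The sketch below is illustrative, not the original solution code: it reads rows as predictions and columns as the true Direction, which is an assumption about the table layout, and the dictionary keys are labels chosen here.

```python
# Test-set confusion matrices as reported in parts (d), (f) and (g).
# Assumed layout: rows are predictions (Down, Up), columns are truth (Down, Up).
tables = {
    "logistic regression / LDA": ((9, 5), (34, 56)),
    "QDA": ((0, 0), (43, 61)),
    "KNN (K=1)": ((21, 30), (22, 31)),
}


def accuracy(table):
    """Fraction of test weeks on the diagonal of a 2x2 confusion matrix."""
    (pred_down_true_down, pred_down_true_up), (pred_up_true_down, pred_up_true_up) = table
    total = pred_down_true_down + pred_down_true_up + pred_up_true_down + pred_up_true_up
    return (pred_down_true_down + pred_up_true_up) / total


results = {name: accuracy(table) for name, table in tables.items()}
best = max(results, key=results.get)

for name, acc in results.items():
    print(f"{name:<28} accuracy {acc:.3f}  test error {1 - acc:.3f}")
print(f"best on held-out data: {best}")
```

Because QDA predicts Up for every test week, its accuracy equals the share of Up weeks in the test set (61 of 104), which is a useful baseline when reading the other results.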
(i) Experiment with different combinations of predictors, including possible transformations and
interactions, for each of the methods. You should also experiment with values for K in the
KNN classifier. Report the predictors, method, and associated confusion matrix that appears
to provide the best results on the held out data. Why do you think this one performed best?
Solution: This problem will have different solutions depending on which combinations you
tried.
Variation of 4.13 (p. 173 ISLR)
Using the Boston data set from ISLR, fit a classification model in order to predict whether a given
suburb has a crime rate above or below the median. You may want to explore logistic regression,
LDA, and KNN models using various subsets of the predictors.
Once you’re satisfied with your results, describe your model and findings:
• Why did you choose that type of model?
• How did you choose your predictors?
• What does your model tell you about the data?
• Where does it break down?
• Is there additional information that you would need to know to be able to make a better
model?
Solution: This problem will have different solutions depending on which combinations you tried.
Interesting solutions will be anonymized and made available after grading is complete.