Assignment 3: Linear/Quadratic Discriminant Analysis and Comparing Classification Methods
SDS293 - Machine Learning
Due: 11 Oct 2017 by 11:59pm

Conceptual Exercises

4.5 (p. 169 ISLR) This question examines the differences between LDA and QDA.

(a) If the Bayes decision boundary is linear, do we expect LDA or QDA to perform better on the training set? On the test set?

Solution: We would expect QDA to perform better on the training set because its increased flexibility will result in a closer fit. If the Bayes decision boundary is linear, we expect LDA to perform better than QDA on the test set, as QDA could be subject to overfitting.

(b) If the Bayes decision boundary is non-linear, do we expect LDA or QDA to perform better on the training set? On the test set?

Solution: If the Bayes decision boundary is non-linear, we expect QDA to perform better on both the training and test sets.

(c) In general, as the sample size n increases, do we expect the test prediction accuracy of QDA relative to LDA to improve, decline, or be unchanged? Why?

Solution: We expect the test prediction accuracy of QDA relative to LDA to improve as n increases. In general, as the sample size grows, a more flexible method yields a better fit because its higher variance is offset by the larger sample size.

(d) True or False: Even if the Bayes decision boundary for a given problem is linear, we will probably achieve a superior test error rate using QDA rather than LDA because QDA is flexible enough to model a linear decision boundary. Justify your answer.

Solution: False. With fewer sample points, the variance from using a more flexible method, such as QDA, would likely result in overfitting, yielding a higher test error rate than LDA.
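The reasoning in 4.5(a) can be checked empirically. Below is a minimal R simulation sketch (not part of the original solution): two Gaussian classes share an identity covariance matrix, so the Bayes decision boundary is linear. The helper names (make_data, simulate_once), class means, and sample sizes are illustrative assumptions; lda() and qda() come from the MASS package. Averaged over repetitions, QDA typically shows the lower training error while LDA shows the lower test error, consistent with the answers above.

# Illustrative simulation (not from the original assignment): the Bayes
# boundary is linear because both classes share the identity covariance.
library(MASS)

set.seed(1)

make_data <- function(n) {
  y <- sample(c(0, 1), n, replace = TRUE)
  means <- rbind(c(0, 0), c(1, 1))[y + 1, , drop = FALSE]  # assumed class means
  x <- means + matrix(rnorm(2 * n), ncol = 2)              # shared N(0, I) noise
  data.frame(X1 = x[, 1], X2 = x[, 2], y = factor(y))
}

simulate_once <- function(n_train = 50, n_test = 1000) {
  train <- make_data(n_train)
  test  <- make_data(n_test)
  lda_fit <- lda(y ~ X1 + X2, data = train)
  qda_fit <- qda(y ~ X1 + X2, data = train)
  c(lda_train = mean(predict(lda_fit, train)$class != train$y),
    qda_train = mean(predict(qda_fit, train)$class != train$y),
    lda_test  = mean(predict(lda_fit, test)$class  != test$y),
    qda_test  = mean(predict(qda_fit, test)$class  != test$y))
}

# Average error rates over 100 simulated data sets.
rowMeans(replicate(100, simulate_once()))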
Applied Exercises

4.10 (p. 171 ISLR) This question should be answered using the Weekly data set, which is part of the ISLR package. This data is similar in nature to the Smarket data from this chapter's lab, except that it contains 1,089 weekly returns for 21 years, from the beginning of 1990 to the end of 2010.

(a) Produce some numerical and graphical summaries of the Weekly data. Do there appear to be any patterns?

Solution: Year and Volume appear to have a relationship. No other patterns are discernible.

(b) Use the full data set to perform a logistic regression with Direction as the response and the five lag variables plus Volume as predictors, and use the summary() function to print the results. Do any of the predictors appear to be statistically significant? If so, which ones?

Solution: Lag2 appears to have some statistical significance, with Pr(>|z|) ≈ 3%.

(c) Compute the confusion matrix and overall fraction of correct predictions. What does the confusion matrix tell you about the types of mistakes made by your logistic model?

Solution: Percentage of correct predictions: (54 + 557) / (54 + 557 + 48 + 430) = 56.1%. On weeks when the market goes up, the logistic regression is right most of the time: 557 / (557 + 48) = 92.1%. However, on weeks when the market goes down, it is right only 54 / (430 + 54) = 11.2% of the time, i.e., wrong most of the time.

(d) Now fit the logistic regression model using a training data period from 1990 to 2008, with Lag2 as the only predictor. Report the confusion matrix and the overall fraction of correct predictions for the test data (that is, the data from 2009 and 2010).

Solution: The confusion matrix on the test data (rows are predictions, columns are the true Direction) is:

glm.pred   Down   Up
  Down        9    5
  Up         34   56

Overall fraction of correct predictions: (9 + 56) / 104 = 0.625.

(e) Repeat (d) using LDA.

Solution: Same as logistic regression.
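For reference, the following R sketch, written in the style of the ISLR lab, reproduces the computations in parts (b) through (e). The object names (glm.fit, glm.pred2, Weekly.2009, and so on) are illustrative choices rather than names from the original write-up; only functions from base R, ISLR, and MASS are used.

# Hedged sketch of the workflow behind (b)-(e); reported numbers in the
# comments come from the solution text above.
library(ISLR)   # provides the Weekly data set
library(MASS)   # provides lda()

# (b) Logistic regression on the full data set.
glm.fit <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
               data = Weekly, family = binomial)
summary(glm.fit)   # Lag2 is the only predictor with Pr(>|z|) near 0.03

# (c) Confusion matrix and overall accuracy on the full data set.
glm.probs <- predict(glm.fit, type = "response")
glm.pred <- ifelse(glm.probs > 0.5, "Up", "Down")
table(glm.pred, Weekly$Direction)
mean(glm.pred == Weekly$Direction)   # about 0.561

# (d) Refit using only Lag2, training on 1990-2008 and testing on 2009-2010.
train <- Weekly$Year < 2009
Weekly.2009 <- Weekly[!train, ]
glm.fit2 <- glm(Direction ~ Lag2, data = Weekly, family = binomial,
                subset = train)
glm.probs2 <- predict(glm.fit2, Weekly.2009, type = "response")
glm.pred2 <- ifelse(glm.probs2 > 0.5, "Up", "Down")
table(glm.pred2, Weekly.2009$Direction)
mean(glm.pred2 == Weekly.2009$Direction)   # 0.625

# (e) LDA with the same training/test split.
lda.fit <- lda(Direction ~ Lag2, data = Weekly, subset = train)
lda.pred <- predict(lda.fit, Weekly.2009)$class
table(lda.pred, Weekly.2009$Direction)
mean(lda.pred == Weekly.2009$Direction)   # same result as logistic regression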