HW3
1. Using a little bit of algebra, prove that (4.2) is equivalent to (4.3). In other words, the logistic function representation and logit representation for the logistic regression model are equivalent.

(4.2): $p(X) = \dfrac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}$

(4.3): $\dfrac{p(X)}{1 - p(X)} = e^{\beta_0 + \beta_1 X}$

To prove that (4.2) is equivalent to (4.3), start from (4.2) and rearrange:

$p(X)\left(1 + e^{\beta_0 + \beta_1 X}\right) = e^{\beta_0 + \beta_1 X}$

$p(X) + p(X)\,e^{\beta_0 + \beta_1 X} = e^{\beta_0 + \beta_1 X}$

$p(X) = e^{\beta_0 + \beta_1 X} - p(X)\,e^{\beta_0 + \beta_1 X} = e^{\beta_0 + \beta_1 X}\left(1 - p(X)\right)$

$\dfrac{p(X)}{1 - p(X)} = e^{\beta_0 + \beta_1 X}$

Therefore (4.2) is equivalent to (4.3).

2. It was stated in the text that classifying an observation to the class for which (4.12) is largest is equivalent to classifying an observation to the class for which (4.13) is largest. Prove that this is the case. In other words, under the assumption that the observations in the $k$th class are drawn from a $N(\mu_k, \sigma^2)$ distribution, the Bayes classifier assigns an observation to the class for which the discriminant function is maximized.

(4.12): $p_k(x) = \dfrac{\pi_k \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{1}{2\sigma^2}(x - \mu_k)^2\right)}{\sum_{l=1}^{K} \pi_l \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{1}{2\sigma^2}(x - \mu_l)^2\right)}$

(4.13): $\delta_k(x) = \dfrac{x\,\mu_k}{\sigma^2} - \dfrac{\mu_k^2}{2\sigma^2} + \log \pi_k$

Taking the log of (4.12) and discarding the denominator, which takes the same value for every class $k$, along with the factor $\frac{1}{\sqrt{2\pi}\,\sigma}$, which is also independent of $k$, we find

$\log p_k(x) = \log \pi_k - \dfrac{1}{2\sigma^2}(x - \mu_k)^2 + C = \log \pi_k - \left(\dfrac{x^2}{2\sigma^2} - \dfrac{x\,\mu_k}{\sigma^2} + \dfrac{\mu_k^2}{2\sigma^2}\right) + C$

The term $-x^2/2\sigma^2$ does not depend on $k$ either, and removing it leaves exactly the discriminant function in (4.13). The discriminant function therefore contains all of the terms of the Bayes classifier that depend on $k$, so the class that maximizes (4.12) is also the class that maximizes (4.13).

5. We now examine the differences between LDA and QDA.

#(a) If the Bayes decision boundary is linear, do we expect LDA or QDA to perform better on the training set? On the test set?

On the training set, QDA is expected to fit at least as well as LDA, simply because it is the more flexible model. On the test set, LDA is expected to perform better: when the true decision boundary is linear, LDA's assumption of a common variance-covariance matrix matches the problem, while QDA's extra flexibility tends to overfit the training data.

#(b) If the Bayes decision boundary is non-linear, do we expect LDA or QDA to perform better on the training set? On the test set?

QDA is expected to perform better than LDA on both the training set and the test set. In QDA the class-conditional densities are allowed unequal covariance matrices, which captures non-linear (quadratic) decision boundaries that LDA's linear boundary cannot. A small simulation of the linear-boundary case from part (a) is sketched below.
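As an illustration of part (a) (not part of the original assignment), here is a minimal R sketch: two Gaussian classes with a shared covariance matrix, so the Bayes boundary is linear. The sample sizes, means, and covariance are arbitrary choices for illustration.

library(MASS)  # for mvrnorm(), lda(), qda()
set.seed(1)

# Two classes with equal covariance, so the true Bayes boundary is linear.
gen <- function(n) {
  x <- rbind(mvrnorm(n, mu = c(0, 0), Sigma = diag(2)),
             mvrnorm(n, mu = c(1, 1), Sigma = diag(2)))
  data.frame(x1 = x[, 1], x2 = x[, 2], y = factor(rep(c(0, 1), each = n)))
}
train <- gen(100)
test <- gen(1000)

lda_fit <- lda(y ~ x1 + x2, data = train)
qda_fit <- qda(y ~ x1 + x2, data = train)
mean(predict(lda_fit, test)$class != test$y)  # LDA test error
mean(predict(qda_fit, test)$class != test$y)  # QDA test error, typically no lower here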
#(c) In general, as the sample size n increases, do we expect the test prediction accuracy of QDA relative to LDA to improve, decline, or be unchanged? Why?

As $n$ increases, the test accuracy of QDA relative to LDA is expected to improve. QDA must estimate a separate covariance matrix for each class, which involves far more parameters than LDA's single shared covariance matrix; with more data, those estimates become accurate, the variance cost of QDA's flexibility shrinks, and its ability to fit more general boundaries becomes an advantage.

(d) True or False: Even if the Bayes decision boundary for a given problem is linear, we will probably achieve a superior test error rate using QDA rather than LDA because QDA is flexible enough to model a linear decision boundary. Justify your answer.

False. If the Bayes decision boundary for a problem is linear, QDA is not likely to achieve a better test error rate than LDA. Although QDA is flexible enough to represent a linear boundary, that flexibility comes at the cost of estimating many more parameters, which inflates variance without reducing bias; this generally leads to worse test performance than LDA, particularly when $n$ is small.

6. Suppose we collect data for a group of students in a statistics class with variables $X_1$ = hours studied, $X_2$ = undergrad GPA, and $Y$ = receive an A. We fit a logistic regression and produce estimated coefficients $\hat\beta_0 = -6$, $\hat\beta_1 = 0.05$, $\hat\beta_2 = 1$.

(a) Estimate the probability that a student who studies for 40 h and has an undergrad GPA of 3.5 gets an A in the class.

$P(Y = 1) = \dfrac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \beta_2 X_2)}}$

With $\beta_0 = -6$, $\beta_1 = 0.05$, $\beta_2 = 1$, $X_1 = 40$, and $X_2 = 3.5$:

$P(Y = 1) = \dfrac{1}{1 + e^{-(-6 + 0.05 \cdot 40 + 1 \cdot 3.5)}} = \dfrac{1}{1 + e^{0.5}} \approx 0.3775$

Therefore, there is about a 37.75% chance that the student in question gets an A.

(b) How many hours would the student in part (a) need to study to have a 50 % chance of getting an A in the class?

Set $P(Y = 1) = 0.5$ and write $z = \beta_0 + \beta_1 X_1 + \beta_2 X_2$:

$\dfrac{1}{1 + e^{-z}} = 0.5 \;\Rightarrow\; 1 + e^{-z} = 2 \;\Rightarrow\; e^{-z} = 1 \;\Rightarrow\; z = 0$

$-6 + 0.05\,X_1 + 1 \cdot 3.5 = 0 \;\Rightarrow\; 0.05\,X_1 = 2.5 \;\Rightarrow\; X_1 = 50$

Solving for $X_1$ (time in hours), the student in question would need to study 50 hours to have a 50% chance of getting an A.
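As a quick numerical check of problem 6 (not part of the original write-up), the same arithmetic in R via the built-in logistic CDF plogis():

# Sanity check for problem 6; b0, b1, b2 are the given coefficient estimates.
b0 <- -6; b1 <- 0.05; b2 <- 1

# (a) P(A | hours studied = 40, GPA = 3.5)
plogis(b0 + b1 * 40 + b2 * 3.5)  # 0.3775407

# (b) plogis(z) = 0.5 exactly when z = 0, so solve b0 + b1*h + b2*3.5 = 0 for h.
(-b0 - b2 * 3.5) / b1            # 50 hours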
7. Suppose that we wish to predict whether a given stock will issue a dividend this year ("Yes" or "No") based on X, last year's percent profit. We examine a large number of companies and discover that the mean value of X for companies that issued a dividend was $\bar{X} = 10$, while the mean for those that didn't was $\bar{X} = 0$. In addition, the variance of X for these two sets of companies was $\sigma^2 = 36$. Finally, 80 % of companies issued dividends. Assuming that X follows a normal distribution, predict the probability that a company will issue a dividend this year given that its percentage profit was X = 4 last year.

#Hint: Recall that the density function for a normal random variable is $f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-(x-\mu)^2/2\sigma^2}$. You will need to use Bayes' theorem.

With $\mu_1 = 10$, $\mu_2 = 0$, $\sigma^2 = 36$, $\pi_1 = 0.8$, and $\pi_2 = 0.2$, the class densities at $x = 4$ are

$f_1(4) \approx 0.0403 \qquad f_2(4) \approx 0.0532$

and Bayes' theorem gives

$p_1(x = 4) = \dfrac{\pi_1 f_1(4)}{\pi_1 f_1(4) + \pi_2 f_2(4)} = \dfrac{0.8 \cdot 0.0403}{0.8 \cdot 0.0403 + 0.2 \cdot 0.0532} \approx 0.7519$

So if the percentage profit was 4%, the probability that a dividend is issued is about 0.752, or 75.2%.

10. This question should be answered using the Weekly data set, which is part of the ISLR package. This data is similar in nature to the Smarket data from this chapter's lab, except that it contains 1,089 weekly returns for 21 years, from the beginning of 1990 to the end of 2010.

library(ISLR)

## Warning: package 'ISLR' was built under R version 4.2.3

data("Weekly")
library(corrplot)

## corrplot 0.92 loaded

#(a) Produce some numerical and graphical summaries of the Weekly data. Do there appear to be any patterns?

Aside from a strong positive association between Year and Volume (trading volume rises steadily over the period, as the plot below shows), there are no obvious patterns: the lag variables show essentially no correlation with each other or with Today.

summary(Weekly)
##       Year           Lag1               Lag2               Lag3         
##  Min.   :1990   Min.   :-18.1950   Min.   :-18.1950   Min.   :-18.1950  
##  1st Qu.:1995   1st Qu.: -1.1540   1st Qu.: -1.1540   1st Qu.: -1.1580  
##  Median :2000   Median :  0.2410   Median :  0.2410   Median :  0.2410  
##  Mean   :2000   Mean   :  0.1506   Mean   :  0.1511   Mean   :  0.1472  
##  3rd Qu.:2005   3rd Qu.:  1.4050   3rd Qu.:  1.4090   3rd Qu.:  1.4090  
##  Max.   :2010   Max.   : 12.0260   Max.   : 12.0260   Max.   : 12.0260  
##       Lag4               Lag5              Volume            Today         
##  Min.   :-18.1950   Min.   :-18.1950   Min.   :0.08747   Min.   :-18.1950  
##  1st Qu.: -1.1580   1st Qu.: -1.1660   1st Qu.:0.33202   1st Qu.: -1.1540  
##  Median :  0.2380   Median :  0.2340   Median :1.00268   Median :  0.2410  
##  Mean   :  0.1458   Mean   :  0.1399   Mean   :1.57462   Mean   :  0.1499  
##  3rd Qu.:  1.4090   3rd Qu.:  1.4050   3rd Qu.:2.05373   3rd Qu.:  1.4050  
##  Max.   : 12.0260   Max.   : 12.0260   Max.   :9.32821   Max.   : 12.0260  
##  Direction 
##  Down:484  
##  Up  :605  

pairs(Weekly)

[Figure: pairs(Weekly) scatterplot matrix of Year, Lag1-Lag5, Volume, Today, and Direction]
plot(Weekly$Year, Weekly$Volume, type = "l", xlab = "Year", ylab = "Volume")

[Figure: line plot of Volume against Year, 1990-2010]

corrplot(cor(Weekly[, -9]), method = "square")
[Figure: correlation plot (corrplot) of Year, Lag1-Lag5, Volume, and Today]

#(b) Use the full data set to perform a logistic regression with Direction as the response and the five lag variables plus Volume as predictors. Use the summary function to print the results. Do any of the predictors appear to be statistically significant? If so, which ones?

Lag2 appears to be significant at the 0.05 significance level.

model <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, data = Weekly, family = binomial)
summary(model)

## 
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + 
##     Volume, family = binomial, data = Weekly)
## 
## Deviance Residuals: 
##     Min      1Q  Median      3Q     Max  
## -1.6949 -1.2565  0.9913  1.0849  1.4579  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.26686    0.08593   3.106   0.0019 **
## Lag1        -0.04127    0.02641  -1.563   0.1181   
## Lag2         0.05844    0.02686   2.175   0.0296 * 
## Lag3        -0.01606    0.02666  -0.602   0.5469   
## Lag4        -0.02779    0.02646  -1.050   0.2937   
## Lag5        -0.01447    0.02638  -0.549   0.5833   
## Volume      -0.02274    0.03690  -0.616   0.5377   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1496.2  on 1088  degrees of freedom
## Residual deviance: 1486.4  on 1082  degrees of freedom
## AIC: 1500.4
## 
## Number of Fisher Scoring iterations: 4

#(c) Compute the confusion matrix and overall fraction of correct predictions. Explain what the confusion matrix is telling you about the types of mistakes made by logistic regression.

Up weeks are predicted far more accurately than Down weeks: the model correctly classifies 557 / (48 + 557) = 0.9207 of the actual Up weeks but only 54 / (430 + 54) = 0.1116 of the actual Down weeks. In other words, the model predicts "Up" for almost every week, so nearly all of its mistakes are false "Up" predictions on weeks that actually went down.

predicted_direction <- ifelse(predict(model, type = "response") > 0.5, "Up", "Down")
confusion_matrix <- table(Actual = Weekly$Direction, Predicted = predicted_direction)
correct_predictions <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
confusion_matrix

##       Predicted
## Actual Down  Up
##   Down   54 430
##   Up     48 557

correct_predictions

## [1] 0.5610652

#(d) Now fit the logistic regression model using a training data period from 1990 to 2008, with Lag2 as the only predictor. Compute the confusion matrix and the overall fraction of correct predictions for the held out data (that is, the data from 2009 and 2010).

training_data <- Weekly[Weekly$Year < 2009, ]
testing_data <- Weekly[Weekly$Year >= 2009, ]
model2 <- glm(Direction ~ Lag2, data = training_data, family = binomial)
predicted_direction2 <- ifelse(predict(model2, newdata = testing_data, type = "response") > 0.5, "Up", "Down")
confusion_matrix2 <- table(Actual = testing_data$Direction, Predicted = predicted_direction2)
correct_predictions2 <- sum(diag(confusion_matrix2)) / sum(confusion_matrix2)
confusion_matrix2
##       Predicted
## Actual Down Up
##   Down    9 34
##   Up      5 56

correct_predictions2

## [1] 0.625

#(e) Repeat (d) using LDA.

library(MASS)
training_data <- Weekly[Weekly$Year < 2009, ]
testing_data <- Weekly[Weekly$Year >= 2009, ]
lda_model <- lda(Direction ~ Lag2, data = training_data)
lda_predictions <- predict(lda_model, testing_data)
predicted_direction_lda <- lda_predictions$class
confusion_matrix_lda <- table(Actual = testing_data$Direction, Predicted = predicted_direction_lda)
correct_predictions_lda <- sum(diag(confusion_matrix_lda)) / sum(confusion_matrix_lda)
confusion_matrix_lda

##       Predicted
## Actual Down Up
##   Down    9 34
##   Up      5 56

correct_predictions_lda

## [1] 0.625

#(f) Repeat (d) using QDA.

training_data <- Weekly[Weekly$Year < 2009, ]
testing_data <- Weekly[Weekly$Year >= 2009, ]
qda_model <- qda(Direction ~ Lag2, data = training_data)
qda_predictions <- predict(qda_model, testing_data)
predicted_direction_qda <- qda_predictions$class
confusion_matrix_qda <- table(Actual = testing_data$Direction, Predicted = predicted_direction_qda)
correct_predictions_qda <- sum(diag(confusion_matrix_qda)) / sum(confusion_matrix_qda)
confusion_matrix_qda

##       Predicted
## Actual Down Up
##   Down    0 43
##   Up      0 61
correct_predictions_qda

## [1] 0.5865385

#(g) Repeat (d) using KNN with K = 1.

library(class)
training_data <- Weekly[Weekly$Year < 2009, ]
testing_data <- Weekly[Weekly$Year >= 2009, ]
training_data_predictor <- data.frame(Lag2 = training_data$Lag2)
testing_data_predictor <- data.frame(Lag2 = testing_data$Lag2)
knn_model <- knn(train = training_data_predictor, test = testing_data_predictor,
                 cl = training_data$Direction, k = 1)
confusion_matrix_knn <- table(Actual = testing_data$Direction, Predicted = knn_model)
correct_predictions_knn <- sum(diag(confusion_matrix_knn)) / sum(confusion_matrix_knn)
confusion_matrix_knn

##       Predicted
## Actual Down Up
##   Down   21 22
##   Up     30 31

correct_predictions_knn

## [1] 0.5

#(h) Which of these methods appears to provide the best results on this data?

Logistic regression and LDA tie for the best held-out accuracy, each scoring 0.625 (62.5%); QDA (58.7%) and KNN with K = 1 (50%) do worse.

13. Using the Boston data set, fit classification models in order to predict whether a given suburb has a crime rate above or below the median. Explore logistic regression, LDA, and KNN models using various subsets of the predictors. Describe your findings.

library(MASS)
library(ISLR)
data("Boston")
Boston$crim01 <- ifelse(Boston$crim > median(Boston$crim), 1, 0)
set.seed(1)
n <- nrow(Boston)
train_indices <- sample(1:n, n / 2)
train_data <- Boston[train_indices, ]
test_data <- Boston[-train_indices, ]
glm.fit <- glm(crim01 ~ . - crim, data = train_data, family = binomial)
summary(glm.fit)

## 
## Call:
## glm(formula = crim01 ~ . - crim, family = binomial, data = train_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6596  -0.1582  -0.0002   0.0022   3.7586  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -40.760938   9.177609  -4.441 8.94e-06 ***
## zn           -0.119686   0.054901  -2.180 0.029255 *  
## indus        -0.001286   0.061838  -0.021 0.983414    
## chas          0.799622   1.100037   0.727 0.467284    
## nox          46.304574  10.423302   4.442 8.90e-06 ***
## rm           -0.870265   1.082365  -0.804 0.421374    
## age           0.041053   0.017749   2.313 0.020725 *  
## dis           1.206786   0.349609   3.452 0.000557 ***
## rad           0.650511   0.225016   2.891 0.003841 ** 
## tax          -0.003316   0.003640  -0.911 0.362317    
## ptratio       0.476177   0.193469   2.461 0.013845 *  
## black        -0.007975   0.005461  -1.460 0.144249    
## lstat         0.043148   0.074055   0.583 0.560135    
## medv          0.242090   0.109941   2.202 0.027665 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 350.73  on 252  degrees of freedom
## Residual deviance: 105.48  on 239  degrees of freedom
## AIC: 133.48
## 
## Number of Fisher Scoring iterations: 9

glm.probs <- predict(glm.fit, newdata = test_data, type = "response")
glm.pred <- ifelse(glm.probs > 0.5, 1, 0)
table(glm.pred, test_data$crim01)

## 
## glm.pred   0   1
##        0 116  15
##        1  10 112

mean(glm.pred != test_data$crim01)

## [1] 0.09881423
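The problem asks for various subsets of the predictors, while the fit above uses all of them. As a sketch (not in the original write-up), one natural subset is the predictors that were significant in the full model above; the particular subset chosen here is an illustrative assumption, not the only reasonable one:

# Sketch: reduced logistic model using only predictors significant in the full fit.
glm.red <- glm(crim01 ~ zn + nox + age + dis + rad + ptratio + medv,
               data = train_data, family = binomial)
red.probs <- predict(glm.red, newdata = test_data, type = "response")
red.pred <- ifelse(red.probs > 0.5, 1, 0)
mean(red.pred != test_data$crim01)  # test error of the reduced model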
lda.fit <- lda(crim01 ~ . - crim, data = train_data)
lda.pred <- predict(lda.fit, newdata = test_data)
table(lda.pred$class, test_data$crim01)

## 
##       0   1
##   0 118  33
##   1   8  94

mean(lda.pred$class != test_data$crim01)

## [1] 0.1620553

library(class)
# Column 14 (medv) is dropped here; note that crim (column 1) and the crim01
# label itself (column 15) remain in the KNN feature matrix, although their
# influence is small relative to large-scale predictors such as tax and black.
train_data_X <- train_data[, -14]
test_data_X <- test_data[, -14]
train_data_crim01 <- train_data$crim01
knn.pred1 <- knn(train_data_X, test_data_X, train_data_crim01, k = 1)
table(knn.pred1, test_data$crim01)

## 
## knn.pred1   0   1
##         0 110  13
##         1  16 114

knn.pred10 <- knn(train_data_X, test_data_X, train_data_crim01, k = 10)
table(knn.pred10, test_data$crim01)

## 
## knn.pred10   0   1
##          0 105  18
##          1  21 109

knn.pred100 <- knn(train_data_X, test_data_X, train_data_crim01, k = 100)
table(knn.pred100, test_data$crim01)

## 
## knn.pred100   0   1
##           0 123  60
##           1   3  67

Findings: logistic regression with all predictors gives the lowest test error (about 9.9%), followed by KNN with k = 1 ((16 + 13)/253, about 11.5%); LDA does noticeably worse (about 16.2%), and KNN degrades as k grows (about 15.4% at k = 10 and 24.9% at k = 100). The strongest individual predictors in the logistic fit were nox, dis, and rad.
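KNN distances are dominated by predictors measured on large scales, so standardizing the features is usually worthwhile. A minimal sketch (not in the original write-up), reusing the train/test objects above:

# Standardize predictors with the training means and SDs before running KNN.
train_scaled <- scale(train_data_X)
test_scaled <- scale(test_data_X,
                     center = attr(train_scaled, "scaled:center"),
                     scale = attr(train_scaled, "scaled:scale"))
knn.scaled <- knn(train_scaled, test_scaled, train_data_crim01, k = 1)
mean(knn.scaled != test_data$crim01)  # test error with standardized predictors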
Consider the data set provided with this homework assignment. Implement LDA and QDA classifiers on this data and compare the two classifiers using a ROC curve.

library(MASS)
library(ROCR)

## Warning: package 'ROCR' was built under R version 4.2.3

data <- read.csv("Hw3data-1.csv")
set.seed(123)
train_indices <- sample(1:nrow(data), 0.7 * nrow(data))
train_data <- data[train_indices, ]
test_data <- data[-train_indices, ]
lda_model <- lda(response ~ predictor.1 + predictor.2 + predictor.3 + predictor.4 + predictor.5,
                 data = train_data)
qda_model <- qda(response ~ predictor.1 + predictor.2 + predictor.3 + predictor.4 + predictor.5,
                 data = train_data)
lda_predictions <- predict(lda_model, newdata = test_data)
qda_predictions <- predict(qda_model, newdata = test_data)
prediction_lda <- prediction(lda_predictions$posterior[, 2], test_data$response)
prediction_qda <- prediction(qda_predictions$posterior[, 2], test_data$response)
roc_lda <- performance(prediction_lda, "tpr", "fpr")
roc_qda <- performance(prediction_qda, "tpr", "fpr")
plot(roc_lda, col = "blue", main = "ROC Curve")
plot(roc_qda, col = "red", add = TRUE)
legend("bottomright", legend = c("LDA", "QDA"), col = c("blue", "red"), lty = 1)
[Figure: ROC curves for LDA (blue) and QDA (red) on the held-out test data]
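To attach a single number to the comparison, the area under each ROC curve can also be computed with ROCR; a short sketch (not in the original write-up), reusing the prediction objects above:

# AUC for each classifier from the ROCR prediction objects; higher is better.
auc_lda <- performance(prediction_lda, "auc")@y.values[[1]]
auc_qda <- performance(prediction_qda, "auc")@y.values[[1]]
c(LDA = auc_lda, QDA = auc_qda)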