Homework 5 - MATH 4322

Instructions

1. Due date: November 2, 2023
2. Answer the questions fully for full credit.
3. Scan or type your answers and submit only one file. (If you submit several files, only the most recently uploaded one will be graded.)
4. Preferably save your file as a PDF before uploading.
5. Submit in Canvas.
6. These questions are from An Introduction to Statistical Learning with Applications in R by James et al., chapter 8.

Problem 1

We will consider the Boston housing data set from the MASS library.

a. Based on this data set, provide an estimate for the population mean of medv. Call this estimate $\hat{\mu}$.

```r
library(MASS)
hat_mu = mean(Boston$medv)
hat_mu
```

```
[1] 22.53281
```

b. Provide an estimate of the standard error of $\hat{\mu}$. Interpret this result. Hint: we can compute the standard error of the sample mean by dividing the sample standard deviation by the square root of the number of observations.

```r
sd_hat_mu = sd(Boston$medv) / sqrt(nrow(Boston))
sd_hat_mu
```

```
[1] 0.4088611
```

The estimated standard error is 0.4089: across repeated samples of this size, the sample mean of medv would typically differ from the population mean by about 0.41 (medv is measured in $1000s).

c. Now estimate the standard error of $\hat{\mu}$ using the bootstrap. How does this compare to your answer from (b)?
```r
library(boot)
set.seed(10)
boot.fn = function(data, index) {
  mu = mean(data[index])
  return(mu)
}
boot_sd_hat_mu = boot(Boston$medv, boot.fn, 1000)
boot_sd_hat_mu
```

```
ORDINARY NONPARAMETRIC BOOTSTRAP

Call:
boot(data = Boston$medv, statistic = boot.fn, R = 1000)

Bootstrap Statistics :
    original        bias    std. error
t1* 22.53281 -0.008041502   0.4017124
```

The bootstrap estimate of the standard error of $\hat{\mu}$ is 0.4017124, which is very close to the formula-based answer from (b), 0.4088611.

d. Based on your bootstrap estimate from (c), provide a 95% confidence interval for the mean of medv. Compare it to the results obtained using t.test(Boston$medv). Hint: you can approximate a 95% confidence interval using the formula $[\hat{\mu} - 2\,\mathrm{SE}(\hat{\mu}),\ \hat{\mu} + 2\,\mathrm{SE}(\hat{\mu})]$.

```r
boot_sd_hat_mu$t0 - 2 * 0.4017124
boot_sd_hat_mu$t0 + 2 * 0.4017124
t.test(Boston$medv)
```

```
[1] 21.72938
[1] 23.33623

	One Sample t-test

data:  Boston$medv
t = 55.111, df = 505, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 21.72953 23.33608
sample estimates:
mean of x 
 22.53281 
```

The 95% confidence interval from t.test(Boston$medv) and the one from the bootstrap estimate are very close.
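As an aside, the boot package can also compute a bootstrap percentile interval directly from the boot object fit above; a minimal sketch (its output is not reproduced in the original document):

```r
# Percentile bootstrap 95% CI from the same boot object as in part (c)
boot.ci(boot_sd_hat_mu, conf = 0.95, type = "perc")
```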
e. Based on this data set, provide an estimate, $\hat{\mu}_{med}$, for the median value of medv in the population.

```r
med_hat_mu = median(Boston$medv)
med_hat_mu
```

```
[1] 21.2
```

f. We now would like to estimate the standard error of $\hat{\mu}_{med}$. Unfortunately, there is no simple formula for computing the standard error of the median. Instead, estimate the standard error of the median using the bootstrap. Comment on your findings.

```r
set.seed(10)
boot.fn = function(data, index) {
  med = median(data[index])
  return(med)
}
se_boot_med_hat_mu = boot(Boston$medv, boot.fn, 1000)
se_boot_med_hat_mu
```

```
ORDINARY NONPARAMETRIC BOOTSTRAP

Call:
boot(data = Boston$medv, statistic = boot.fn, R = 1000)

Bootstrap Statistics :
    original   bias    std. error
t1*     21.2 -0.00665   0.3745779
```

The estimated standard error of the median is 0.3745779, which is relatively small compared to the estimate of 21.2.

g. Based on this data set, provide an estimate for the tenth percentile of medv in Boston suburbs. Call this quantity $\hat{\mu}_{0.1}$. (You can use the quantile() function.)

```r
quantity_hat_mu_0.1 = quantile(Boston$medv, 0.1)
quantity_hat_mu_0.1
```

```
  10% 
12.75 
```

h. Use the bootstrap to estimate the standard error of $\hat{\mu}_{0.1}$. Comment on your findings.

```r
set.seed(10)
boot.fn = function(data, index) {
  hat_mu_0.1 = quantile(data[index], 0.1)
  return(hat_mu_0.1)
}
se_hat_mu_0.1 = boot(Boston$medv, boot.fn, 1000)
se_hat_mu_0.1
```
```
ORDINARY NONPARAMETRIC BOOTSTRAP

Call:
boot(data = Boston$medv, statistic = boot.fn, R = 1000)

Bootstrap Statistics :
    original   bias    std. error
t1*    12.75 -0.0101   0.5064874
```

The standard error of the tenth-percentile estimate is 0.5064874, which is relatively small compared to the estimate of 12.75.

Problem 2

The questions relate to the following plots. [The two plots, a partition of a two-dimensional predictor space on the left and a tree on the right, are not reproduced in this extract.]

a. Sketch the tree corresponding to the partition of the predictor space illustrated in the left-hand plot. The numbers inside the boxes indicate the mean of Y within each region.

b. Create a diagram similar to the left-hand plot using the tree illustrated in the right-hand plot. You should divide the predictor space into the correct regions and indicate the mean for each region.
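Since the plots themselves are not included, here is a generic, hypothetical illustration of the partition-to-tree correspondence in base R; the splits and region means below are invented for illustration and are not the homework's actual values:

```r
# Hypothetical partition of [0,1]^2: split first at X1 = 0.5; the right half
# is split again at X2 = 0.5. The number in each region is the mean of Y there.
plot(NULL, xlim = c(0, 1), ylim = c(0, 1), xlab = "X1", ylab = "X2")
abline(v = 0.5)                 # first split: X1 = 0.5
segments(0.5, 0.5, 1, 0.5)      # second split: X2 = 0.5, right side only
text(0.25, 0.50, "5")           # mean of Y for X1 < 0.5
text(0.75, 0.25, "10")          # mean of Y for X1 >= 0.5, X2 < 0.5
text(0.75, 0.75, "15")          # mean of Y for X1 >= 0.5, X2 >= 0.5
```

The corresponding tree has a root split on X1 < 0.5 whose left leaf predicts 5, and a second split on X2 < 0.5 whose leaves predict 10 and 15.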
Problem 3

Suppose we produce ten bootstrapped samples from a data set containing red and green classes. We then apply a classification tree to each bootstrapped sample and, for a specific value of X, produce 10 estimates of P(Class is Red | X): 0.1, 0.15, 0.2, 0.2, 0.55, 0.6, 0.6, 0.65, 0.7, and 0.75. There are two common ways to combine these results into a single class prediction. One is the majority vote approach discussed in this chapter. The second approach is to classify based on the average probability. In this example, what is the final classification under each of these two approaches?

Using the majority vote method, we classify X as "Red", since Red is the most frequently predicted class among the 10 trees: 6 of the probabilities exceed 0.5 (votes for Red) versus 4 below 0.5 (votes for Green). Using the average-probability approach, X is classified as "Green", because the 10 probabilities sum to 4.5, so their average is 4.5/10 = 0.45 < 0.5.

Problem 4

Provide a detailed explanation of the algorithm that is used to fit a regression tree.

The algorithm used to fit a regression tree is recursive binary splitting: a greedy, top-down partitioning of the data set used to predict a continuous target variable. At each step it considers every predictor and every candidate cutpoint, and selects the pair that splits the current region into two subsets with the smallest resulting residual sum of squares (equivalently, the smallest mean squared error). The data are divided by this split, and the process repeats recursively within each subset. A stopping criterion, such as a maximum tree depth or a minimum number of observations in a leaf node, prevents the tree from growing until it overfits; after the tree is grown, optional pruning (e.g., cost-complexity pruning) can simplify it and further reduce overfitting. Each leaf node stores a predicted value, namely the mean of the response for the training observations that fall in it. To make a prediction, a data point traverses the tree from the root according to the splits, and its predicted value is the mean stored in the leaf it reaches. The resulting regression tree is a simple, interpretable model for predicting continuous values; a minimal sketch of the split-selection step follows.
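To make the split-selection step concrete, here is a minimal, self-contained R sketch (not from the homework) of one greedy search for the best single split by RSS; best_split and the toy data set are illustrative names, and a full tree would apply this search recursively to each resulting subset:

```r
# Hypothetical helper: find the single split (predictor, cutpoint) minimizing
# RSS. x is a data frame of numeric predictors; y is the numeric response.
best_split <- function(x, y) {
  best <- list(rss = Inf, var = NA, cut = NA)
  for (j in names(x)) {
    for (s in unique(x[[j]])) {
      left  <- y[x[[j]] <  s]
      right <- y[x[[j]] >= s]
      if (length(left) == 0 || length(right) == 0) next
      # RSS of the two candidate regions, each predicted by its own mean
      rss <- sum((left - mean(left))^2) + sum((right - mean(right))^2)
      if (rss < best$rss) best <- list(rss = rss, var = j, cut = s)
    }
  }
  best
}

# Toy example: y jumps at x1 = 0.5, so the search should recover a cutpoint
# near 0.5 on x1 as the best first split.
set.seed(1)
toy <- data.frame(x1 = runif(50), x2 = runif(50))
y <- ifelse(toy$x1 < 0.5, 5, 10) + rnorm(50)
best_split(toy, y)
```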
Problem 5

This problem involves the OJ data set, which is part of the ISLR2 package.

a. Create a training set containing a random sample of 800 observations, and a test set containing the remaining observations.

```r
library(ISLR2)
library(tree)
set.seed(10)
train = sample(1:nrow(OJ), 800)
OJ.train = OJ[train, ]
OJ.test = OJ[-train, ]
```

```
Warning: package 'ISLR2' was built under R version 4.2.3

Attaching package: 'ISLR2'

The following object is masked from 'package:MASS':

    Boston

Warning: package 'tree' was built under R version 4.2.3
```

b. Fit a tree to the training data, with Purchase as the response and the other variables as predictors. Use the summary() function to produce summary statistics about the tree, and describe the results obtained. What is the training error rate? How many terminal nodes does the tree have?

```r
tree.oj = tree(Purchase ~ ., data = OJ.train)
summary(tree.oj)
```

```
Classification tree:
tree(formula = Purchase ~ ., data = OJ.train)
Variables actually used in tree construction:
[1] "LoyalCH"   "DiscMM"    "PriceDiff"
Number of terminal nodes:  7 
Residual mean deviance:  0.7983 = 633 / 793 
Misclassification error rate: 0.1775 = 142 / 800 
```

The training error rate is 17.75%, and the tree has 7 terminal nodes. Only three variables (LoyalCH, DiscMM, and PriceDiff) are used in the splits.

c. Type in the name of the tree object in order to get a detailed text output. Pick one of the terminal nodes, and interpret the information displayed.

```r
tree.oj
```
```
node), split, n, deviance, yval, (yprob)
      * denotes terminal node

 1) root 800 1067.000 CH ( 0.61375 0.38625 )  
   2) LoyalCH < 0.48285 290  315.900 MM ( 0.23448 0.76552 )  
     4) LoyalCH < 0.035047 51    9.844 MM ( 0.01961 0.98039 ) *
     5) LoyalCH > 0.035047 239  283.600 MM ( 0.28033 0.71967 )  
      10) DiscMM < 0.47 220  270.500 MM ( 0.30455 0.69545 ) *
      11) DiscMM > 0.47 19     0.000 MM ( 0.00000 1.00000 ) *
   3) LoyalCH > 0.48285 510  466.000 CH ( 0.82941 0.17059 )  
     6) LoyalCH < 0.764572 245  300.200 CH ( 0.69796 0.30204 )  
      12) PriceDiff < 0.145 99  137.000 MM ( 0.47475 0.52525 )  
        24) DiscMM < 0.47 82  112.900 CH ( 0.54878 0.45122 ) *
        25) DiscMM > 0.47 17   12.320 MM ( 0.11765 0.88235 ) *
      13) PriceDiff > 0.145 146  123.800 CH ( 0.84932 0.15068 ) *
   7) LoyalCH > 0.764572 265  103.700 CH ( 0.95094 0.04906 ) *
```

Interpreting terminal node 7 (terminal nodes are marked with *): its split criterion is LoyalCH > 0.764572, it contains 265 observations with a deviance of 103.7, its overall prediction is CH, and the class proportions are (CH, MM) = (0.95094, 0.04906). In other words, about 95% of highly CH-loyal customers in this node bought Citrus Hill.

d. Create a plot of the tree, and interpret the results.

```r
plot(tree.oj)
text(tree.oj, pretty = 0)
```

Interpretation: the most important indicator of Purchase appears to be LoyalCH, which drives the first and several subsequent splits. All observations with LoyalCH < 0.48285 are classified as MM.
e. Predict the response on the test data, and produce a confusion matrix comparing the test labels to the predicted test labels. What is the test error rate?

```r
pred = predict(tree.oj, OJ.test, type = "class")
table(pred, OJ.test$Purchase)
tree.test.error = round(mean(pred != OJ.test$Purchase) * 100, 2)
tree.test.error
```

```
pred  CH  MM
  CH 135  20
  MM  27  88

[1] 17.41
```

The test error rate is 17.41%.

f. Apply the cv.tree() function to the training set in order to determine the optimal tree size.

```r
set.seed(10)
cv.oj = cv.tree(tree.oj, FUN = prune.misclass)
cv.oj
```

```
$size
[1] 7 5 2 1

$dev
[1] 149 149 164 309

$k
[1]       -Inf   0.000000   4.333333 154.000000

$method
[1] "misclass"

attr(,"class")
[1] "prune"         "tree.sequence"
```

g. Produce a plot with tree size on the x-axis and cross-validated classification error rate on the y-axis.

```r
plot(cv.oj$size, cv.oj$dev, type = "b", xlab = "Tree size", ylab = "Deviance")
```
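Rather than reading the optimal size off the printed list, it can also be extracted programmatically from the cv.oj object above; a small sketch (best.size is an illustrative name):

```r
# Smallest tree size attaining the minimum cross-validated error; with ties
# (sizes 7 and 5 both have dev = 149 here), min() picks the simpler tree.
best.size <- min(cv.oj$size[cv.oj$dev == min(cv.oj$dev)])
best.size  # 5
```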
h. Which tree size corresponds to the lowest cross-validated classification error rate?

Tree sizes 5 and 7 both correspond to the lowest cross-validated classification error rate (deviance 149).

i. Produce a pruned tree corresponding to the optimal tree size obtained using cross-validation. If cross-validation does not lead to selection of a pruned tree, then create a pruned tree with five terminal nodes.

```r
prune.oj = prune.misclass(tree.oj, best = 2)
summary(prune.oj)
plot(prune.oj)
text(prune.oj, pretty = 0)
```

```
Classification tree:
snip.tree(tree = tree.oj, nodes = 2:3)
Variables actually used in tree construction:
[1] "LoyalCH"
Number of terminal nodes:  2 
Residual mean deviance:  0.9798 = 781.8 / 798 
Misclassification error rate: 0.1938 = 155 / 800 
```
The pruned tree's training error rate is 19.38%.

j. Compare the training error rates between the pruned and unpruned trees. Which is higher?

```r
summary(tree.oj)
summary(prune.oj)
```

```
Classification tree:
tree(formula = Purchase ~ ., data = OJ.train)
Variables actually used in tree construction:
[1] "LoyalCH"   "DiscMM"    "PriceDiff"
Number of terminal nodes:  7 
Residual mean deviance:  0.7983 = 633 / 793 
Misclassification error rate: 0.1775 = 142 / 800 

Classification tree:
snip.tree(tree = tree.oj, nodes = 2:3)
Variables actually used in tree construction:
[1] "LoyalCH"
Number of terminal nodes:  2 
Residual mean deviance:  0.9798 = 781.8 / 798 
Misclassification error rate: 0.1938 = 155 / 800 
```

The unpruned tree's training error rate is 17.75%, while the pruned tree's is 19.38%; the pruned tree's training error rate is higher.
k. Compare the test error rates between the pruned and unpruned trees. Which is higher?

```r
prune.pred = predict(prune.oj, OJ.test, type = "class")
table(prune.pred, OJ.test$Purchase)
prune_test_error = round(mean(prune.pred != OJ.test$Purchase) * 100, 2)
prune_test_error
tree.test.error
```

```
prune.pred  CH  MM
        CH 136  23
        MM  26  85

[1] 18.15
[1] 17.41
```

The unpruned tree's test error rate is 17.41%, while the pruned tree's is 18.15%; the pruned tree's test error rate is a little higher.

Problem 6

We will use the Carseats data set from the ISLR package to predict Sales using regression trees and related approaches.

a. Split the data set into a training set and a test set.
```r
library(ISLR)
library(tree)
set.seed(10)
train = sample(1:nrow(Carseats), nrow(Carseats) / 3)
carseats.train = Carseats[train, ]
carseats.test = Carseats[-train, ]
```

```
Warning: package 'ISLR' was built under R version 4.2.3

Attaching package: 'ISLR'

The following objects are masked from 'package:ISLR2':

    Auto, Credit
```

b. Fit a regression tree to the training set. Plot the tree, and interpret the results. What test MSE do you obtain?

```r
tree.mod = tree(Sales ~ ., data = carseats.train)
plot(tree.mod)
text(tree.mod, pretty = 0)
tree.pred = predict(tree.mod, newdata = carseats.test)
test.mse = mean((tree.pred - carseats.test$Sales)^2)
test.mse
```

```
[1] 4.712178
```

The test MSE of the regression tree is 4.712178.

c. Use cross-validation in order to determine the optimal level of tree complexity. Does pruning the tree improve the test MSE?

```r
set.seed(10)
cv.carseats = cv.tree(tree.mod)
plot(cv.carseats$size, cv.carseats$dev, type = 'b')
```
```r
prune.tree.mod = prune.tree(tree.mod, best = 6)
pred = predict(prune.tree.mod, newdata = carseats.test)
test.mse = mean((pred - carseats.test$Sales)^2)  # squared errors; the original run omitted the ^2
test.mse
```

```
[1] -0.2688159
```

Note: the displayed value -0.2688159 was produced without squaring the residuals, so it is the mean residual rather than a test MSE (an MSE cannot be negative), and it cannot be compared directly with the unpruned tree's test MSE of 4.712178. Re-running with the corrected formula above would give the pruned tree's actual test MSE.

d. Use the bagging approach in order to analyze this data. What test MSE do you obtain? Use the importance() function to determine which variables are most important.

```r
## install.packages("randomForest")
library(randomForest)
```

```
Warning: package 'randomForest' was built under R version 4.2.3
randomForest 4.7-1.1
Type rfNews() to see new features/changes/bug fixes.
```
```r
# Note: randomForest() defaults to mtry = p/3 for regression; true bagging
# would set mtry = 10 (all predictors in Carseats). The output below comes
# from the default call as originally run.
bag.mod = randomForest(Sales ~ ., data = carseats.train, importance = TRUE, ntree = 500)
bag.pred = predict(bag.mod, newdata = carseats.test)
bag.mse = mean((bag.pred - carseats.test$Sales)^2)
bag.mse
var.import = importance(bag.mod)
var.import
```

```
[1] 3.049626
```

```
              %IncMSE IncNodePurity
CompPrice   6.6922722     102.50503
Income      0.4555865      80.88747
Advertising 4.2274957      58.41494
Population  2.7570513      82.94427
Price      28.9763275     283.95598
ShelveLoc  32.9346930     288.05188
Age        14.5134007     145.10358
Education   1.7750247      60.09735
Urban      -1.0629503      11.25107
US         -1.0544127      10.48187
```

The test MSE is 3.049626, lower than the 4.712178 obtained by the very first regression tree that was fit. By both %IncMSE and IncNodePurity, ShelveLoc and Price are the most important variables.

e. Use random forests to analyze this data. What test MSE do you obtain? Use the importance() function to determine which variables are most important. Describe the effect of m, the number of variables considered at each split, on the error rate obtained.

```r
rf.3 <- randomForest(Sales ~ ., data = carseats.train, mtry = 3, importance = TRUE)
rf.5 <- randomForest(Sales ~ ., data = carseats.train, mtry = 5, importance = TRUE)
importance(rf.3)
importance(rf.5)
```

```
              %IncMSE IncNodePurity
CompPrice   6.9149537      96.76824
Income      0.5994820      82.77402
Advertising 0.3013074      59.02568
Population  2.1145327      81.72381
Price      27.4648083     293.14237
ShelveLoc  32.5074590     288.28943
Age        12.5549058     142.19720
Education  -0.9138709      57.44404
Urban      -1.0527757      12.95036
US         -1.9190993      10.62627
```
```
               %IncMSE IncNodePurity
CompPrice   9.80620729     99.467996
Income     -0.20504120     67.064586
Advertising 2.09977476     53.953454
Population -0.32824533     64.900113
Price      34.60331639    348.849571
ShelveLoc  40.08566026    338.951640
Age        16.23689797    128.835190
Education  -0.08444231     41.509218
Urban      -0.17444958      8.265974
US          1.42968841      7.354699
```

```r
pred <- predict(rf.3, newdata = carseats.test)
test.mse <- mean((pred - carseats.test$Sales)^2)
test.mse
pred <- predict(rf.5, newdata = carseats.test)
test.mse <- mean((pred - carseats.test$Sales)^2)
test.mse
```

```
[1] 3.004786
[1] 2.679892
```

In both forests, ShelveLoc and Price are by far the most important variables. The forest with mtry = 5 gives a slightly lower test MSE (2.679892) than the one with mtry = 3 (3.004786), so in this case increasing m modestly improves the test error rate.
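To describe the effect of m more systematically, one could sweep mtry over its full range on the same train/test split; a minimal sketch (mse.by.m is an illustrative name, and the results are not run here):

```r
# Fit a random forest for each m from 1 to all 10 predictors and record the
# test MSE, then plot test MSE against m.
mse.by.m <- sapply(1:10, function(m) {
  rf <- randomForest(Sales ~ ., data = carseats.train, mtry = m)
  mean((predict(rf, newdata = carseats.test) - carseats.test$Sales)^2)
})
plot(1:10, mse.by.m, type = "b", xlab = "mtry (m)", ylab = "Test MSE")
```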