Practice Final Exam – KEY
36-202 – Methods for Statistics & Data Science
Spring 2022

This document is available on: http://www.cmu.edu/canvas/ ‘files’

The final exam is Monday May 2, from 1:00PM to 4:00PM, in the following locations:
Sections A, C, E: Hall of Arts (HOA) 160 (Cooper-Simon Lecture Hall)
Section B: Posner 152
Section D: Posner 153
Section F: Posner 151

Calculator required; two pages (front and back) of notes allowed.

Coverage on the final exam: approximately 1/3 from exam 1 material (linear regression models), approximately 1/3 from exam 2 material (ANOVA and logistic regression models), and approximately 1/3 from machine learning (middle of lecture 26 onwards / labs 10-11).
PRACTICE FOR EXAM 1 AND EXAM 2 MATERIAL: Look at exams 1 and 2, and practice exams 1 and 2.

PRACTICE MACHINE LEARNING SCENARIO: Classifying/Clustering New England Softshell Crabs

Seafood is a multimillion-dollar business and a vital part of the economy of various regions on the Eastern Seaboard of the U.S. With climate change altering the marine environment, it is of interest to better understand how to model and predict various types of seafood. One particular type of softshell crab is known to come in two color varieties (Blue and Orange). We are interested in developing statistical models to classify or cluster the crabs by color.

A sample of 100 crabs is available, with various possible predictors:
FL: Frontal Lobe size (millimeters)
RW: Rear Width (millimeters)
CL: Carapace Length (millimeters) [the carapace is the shell of the crab]
CW: Carapace Width (millimeters)
BD: Body Depth (millimeters)
Gender (M/F)

(a) We begin with three different classification schemes to classify the crabs by color: LDA, QDA, and Logistic Regression. The resulting confusion matrices (i.e., error tables) are as shown:

Logistic: Overall error rate = (31+32)/100 = 63% error
LDA: Overall error rate = (32+33)/100 = 65% error
QDA: Overall error rate = (36+39)/100 = 75% error

In terms of overall error rate, which classifier (Logistic, LDA, or QDA) did best? Include your values.

Error rates are shown above. Logistic did best (although they all did terribly, each misclassifying more than half of the crabs).
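The arithmetic in part (a) can be checked with a short Python sketch (Python is not part of the original key). The overall error rate is the number of misclassified crabs divided by the sample size. The function name `overall_error_rate` is made up for illustration, and only the misclassified counts quoted in the key are used, since the full confusion matrices are not reproduced here.

```python
def overall_error_rate(misclassified_counts, n_total):
    """Fraction of observations that were misclassified."""
    return sum(misclassified_counts) / n_total


# Misclassified (off-diagonal) counts quoted in the key, out of 100 crabs.
off_diagonal = {
    "Logistic": (31, 32),
    "LDA": (32, 33),
    "QDA": (36, 39),
}

error_rates = {
    name: overall_error_rate(counts, 100) for name, counts in off_diagonal.items()
}

# The best classifier by this criterion has the LOWEST overall error rate.
best = min(error_rates, key=error_rates.get)

for name, rate in error_rates.items():
    print(f"{name}: {rate:.0%}")
print("Lowest overall error rate:", best)
```

The comparison picks the minimum because a lower error rate is better.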
Next, we turn to a classification tree to classify the crabs as Blue (B) or as Orange (O); the results are as follows:

(b) Based on the tree classifier, what is the “most important” variable for classifying the crab species?

The top variable on the tree is what the classifier deemed “most important”, so RW (i.e., Rear Width).

(c) Describe a tree path (in terms of its decision rules involving the variables) that would predict Orange (O) as the most likely group.

There are many such paths; two are indicated on the image above (in red and blue), as follows:

RW < 13.65 and BD ≥ 13.55 and BD ≥ 16.05 [the path indicated in red above]
or
RW ≥ 13.65 and FL < 17.55 and CW < 36.75 [the path indicated in blue above]

There are many others. Remember that on a classification tree, “true” is to the left.
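The two paths in part (c) are chains of yes/no rules, so they can be written as plain conditionals. The Python sketch below encodes only those two quoted paths; the rest of the tree is not reproduced here, so the function (whose name is hypothetical) returns `None` for any crab that follows neither of them. The measurements in the calls are invented purely to exercise each rule.

```python
def listed_orange_path(rw, bd, fl, cw):
    """Return 'red' or 'blue' if the crab follows one of the two quoted
    Orange paths, or None if it follows neither.

    None does NOT mean the tree predicts Blue: other Orange paths exist
    in the full tree, which is not reproduced here.
    """
    # Red path: RW < 13.65, then BD >= 13.55, then BD >= 16.05.
    if rw < 13.65 and bd >= 13.55 and bd >= 16.05:
        return "red"
    # Blue path: RW >= 13.65, then FL < 17.55, then CW < 36.75.
    if rw >= 13.65 and fl < 17.55 and cw < 36.75:
        return "blue"
    return None


# Hypothetical measurements in millimeters (not taken from the crab data).
print(listed_orange_path(rw=12.0, bd=17.0, fl=15.0, cw=30.0))  # red path
print(listed_orange_path(rw=14.0, bd=12.0, fl=16.0, cw=35.0))  # blue path
print(listed_orange_path(rw=14.0, bd=12.0, fl=18.0, cw=35.0))  # neither path
```

On the red path, the condition BD ≥ 13.55 is implied by the later split BD ≥ 16.05; both are kept so the code mirrors the sequence of splits as written in the key.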
We then turn to clustering algorithms to seek clusters of observations. First, we consider hierarchical clustering. The resulting dendrogram is as follows:

(d) The hierarchical clustering routine used to produce the dendrogram was “complete linkage.” Is this better at finding spherical/compact clusters of observations in variable space, or finding thin “rope-like” chains of observations in variable space?

Complete linkage is better at finding spherical/compact clusters.

(e) If we cut the dendrogram at height 20, how many clusters does it predict?

Four clusters.
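Parts (d) and (e) can be illustrated with a small pure-Python sketch. It assumes Euclidean distance and uses invented toy points, not the crab data, and the function names are hypothetical. Complete linkage measures the distance between two clusters by their farthest pair of points, which penalizes long chains; single linkage uses the closest pair, which lets chains grow. Cutting a complete-linkage dendrogram at a given height is equivalent to merging the closest clusters until the next merge would occur above that height.

```python
import math


def single_linkage_distance(a, b):
    """Single linkage: the SMALLEST pairwise distance between two clusters."""
    return min(math.dist(p, q) for p in a for q in b)


def complete_linkage_distance(a, b):
    """Complete linkage: the LARGEST pairwise distance between two clusters."""
    return max(math.dist(p, q) for p in a for q in b)


def cut_complete_linkage(points, height):
    """Naive agglomerative clustering, stopped at the cut height."""
    clusters = [[p] for p in points]
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest complete-linkage distance.
        d, i, j = min(
            (complete_linkage_distance(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if d > height:
            break  # every remaining merge would happen above the cut
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Part (d): two halves of a "rope-like" chain of evenly spaced points.
left = [(0, 0), (1, 0), (2, 0)]
right = [(3, 0), (4, 0), (5, 0)]
single = single_linkage_distance(left, right)      # 1.0: the chain looks joined
complete = complete_linkage_distance(left, right)  # 5.0: the chain looks far apart

# Part (e): toy points forming four compact, well-separated groups.
toy_points = [
    (0, 0), (1, 0), (0, 1),
    (50, 0), (51, 1),
    (0, 50), (1, 51),
    (50, 50), (51, 51), (50, 51),
]
clusters = cut_complete_linkage(toy_points, height=20)
print(len(clusters))  # 4 for this toy data
```

The cut height of 20 echoes part (e), but the count of four here comes from how the toy points were laid out, not from the exam's dendrogram.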
Next, we consider spherical K-means clustering. To select the number K of clusters, we consider an elbow graph, as follows:

(f) Based on the elbow graph, what value of K might be reasonable to choose, and why?

We want a K where the curve becomes shallower (i.e., where we start to get less gain in compactness, and hence start to worry about overfitting). I might think K=5 is a reasonable choice; maybe K=4 is a good second choice if I’m more worried about overfitting. K=6 is perhaps a third-best choice (the change from K=5 to K=6 doesn’t help that much, so I might want to avoid it because of the risk of overfitting).

(g) We decide to compare the hierarchical clustering cut at height 20 to the K-means clustering with K=4. The resulting ARI values are as follows:

For hierarchical clustering: ARI = 0.018
For K-means clustering: ARI = 0.794

Which one did better?

Higher ARI is better, so K-means did much better.
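For part (g), the adjusted Rand index (ARI) can be computed from the cross-tabulation of two labelings. The Python sketch below assumes the ARI compares cluster labels against a reference labeling; the key does not say how its values were computed, and the labels here are invented toy data. An ARI of 1 means the two labelings agree perfectly (up to renaming the clusters), and values near 0 mean agreement no better than chance.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same observations.

    Note: undefined (division by zero) in the degenerate case where both
    labelings put every observation in a single cluster.
    """
    n = len(labels_a)
    # Pairs of observations placed together under BOTH labelings.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs placed together under each labeling separately.
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = together_a * together_b / comb(n, 2)  # agreement expected by chance
    maximum = (together_a + together_b) / 2          # largest possible agreement
    return (together_both - expected) / (maximum - expected)


# Toy labels (hypothetical, not the crab data).
reference = [0, 0, 0, 1, 1, 1]
same_groups_renamed = [1, 1, 1, 0, 0, 0]  # identical grouping, different names
scrambled = [0, 1, 0, 1, 0, 1]            # grouping unrelated to the reference

ari_good = adjusted_rand_index(reference, same_groups_renamed)
ari_poor = adjusted_rand_index(reference, scrambled)
print(ari_good, ari_poor)
```

Because the index is corrected for chance, a value such as 0.018 indicates almost no agreement beyond chance, while 0.794 indicates substantial agreement.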
(h) Having run the K-means clustering with K=4, we then compare the results to the “true” groupings of crabs by type (Blue/Orange) and by Gender (Male/Female), and we obtain the following table:

For each cluster, say which group it best matches.

Cluster 1 best matches Blue/Female; Cluster 2 best matches Orange/Male; Cluster 3 best matches Blue/Male; Cluster 4 best matches Orange/Female.

[end of practice final]