CS M146 Midterm Exam Solutions
University of California, Los Angeles
2. [8 points total] k-Nearest Neighbors

In the following questions, you will consider k-nearest neighbor classifiers with two different distance metrics on a binary classification task. Below is the figure for the dataset, which is divided into positive and negative classes. Note that the annotated data points (1 through 5) belong to the test set, and the rest belong to the training set. All points are positioned on grid intersections.

Figure 1: Dataset for the kNN binary classification task.

(a) [3 points] Given that the kNN classifier uses Euclidean distance as the distance metric, calculate the test error for k = 1 for the data points marked 1, 2, 3, 4, 5. The test error is defined as the fraction of test points that are misclassified. What class would each point in the test set be assigned? Fill in the table below. Note that a test point cannot be its own nearest neighbor, as neighbors must belong to the training set.

Test Point   1-NN   Truth
    1                 +
    2                 -
    3                 -
    4                 +
    5                 +
Answer: Test error = 3/5 = 60%.

Test Point   1-NN   Truth
    1          +      +
    2          +      -
    3          -      -
    4          -      +
    5          -      +

(b) [5 points] Previously we used Euclidean distance as our distance metric. Now, we will switch to using the Manhattan distance, also known as the "city block" distance. This metric conceptualizes walking between two points in a city laid out in a grid: you can only walk along the streets, not diagonally through buildings, and you add up all the blocks you'd have to walk. Formally, the Manhattan distance between points $x^{(1)} = (x^{(1)}_1, x^{(1)}_2)$ and $x^{(2)} = (x^{(2)}_1, x^{(2)}_2)$ is defined as:

$$d(x^{(1)}, x^{(2)}) = |x^{(1)}_1 - x^{(2)}_1| + |x^{(1)}_2 - x^{(2)}_2|$$

Calculate the test error using k = 3 neighbors. Identify and clearly indicate which test points are misclassified. Fill in the table below.

Test Point   Coordinates   Nearest Neighbors (Coordinates)   Manhattan Distances (to Nearest Neighbors)   Predicted Class   Actual Class
    1          (2, 6)                                                                                                             +
    2          (4, 1)                                                                                                             -
    3          (4, 7)                                                                                                             -
    4          (8, 2)                                                                                                             +
    5          (9, 7)                                                                                                             +

Table 1: Classification results using kNN with Manhattan distance

Answer: Test error = 2/5 = 40%; test points 2 and 4 are misclassified.

Test Point   Coordinates   Nearest Neighbors (Coordinates)   Manhattan Distances   Predicted Class   Actual Class
    1          (2, 6)      (1, 5), (2, 3), (4, 8)            2, 3, 4                      +                +
    2          (4, 1)      (4, 2), (6, 1), (5, 3)            1, 2, 3                      +                -
    3          (4, 7)      (4, 8), (7, 8), (6, 9)            1, 4, 4                      -                -
    4          (8, 2)      (8, 3), (7, 4), (6, 1)            1, 3, 3                      -                +
    5          (9, 7)      (9, 6), (7, 8), (7, 5)            1, 3, 4                      +                +

Table 2: Classification results using kNN with Manhattan distance
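As a quick sanity check, the Manhattan distances in Table 2 can be reproduced in a few lines of Python. The per-neighbor class labels are not listed individually in the table, so the `majority_vote` helper below is a generic sketch of the final voting step rather than data taken from the exam:

```python
from collections import Counter

def manhattan(p, q):
    """Manhattan (city-block) distance between two 2-D points."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

# Each test point and its 3 nearest training neighbors, from Table 2.
nearest = {
    (2, 6): [(1, 5), (2, 3), (4, 8)],
    (4, 1): [(4, 2), (6, 1), (5, 3)],
    (4, 7): [(4, 8), (7, 8), (6, 9)],
    (8, 2): [(8, 3), (7, 4), (6, 1)],
    (9, 7): [(9, 6), (7, 8), (7, 5)],
}

for test_pt, neighbors in nearest.items():
    dists = [manhattan(test_pt, nb) for nb in neighbors]
    print(test_pt, sorted(dists))  # matches the distance column of Table 2

def majority_vote(labels):
    """Predict the most common label ('+' or '-') among the k neighbors."""
    return Counter(labels).most_common(1)[0][0]
```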
3. [10 points total] Decision Trees

Suppose you have the following data set of 10 instances to determine whether Alice rates a dish as a favorite.

 #   Size   Taste   Temperature   Favorite
 1    S      Sa         H           Yes
 2    S      Sa         C           Yes
 3    S      Sa         C           Yes
 4    L      Sw         C           Yes
 5    L      Sw         C           No
 6    L      So         H           No
 7    S      So         H           No
 8    S      So         C           No
 9    L      Sw         H           No
10    L      Sw         H           No

Table 3: Abbreviations: S: Small, L: Large, Sa: Salty, Sw: Sweet, So: Sour, H: Hot, C: Cold.

The data set consists of 3 features (Size, Taste, Temperature) and a corresponding label (Favorite). For all calculations in this problem, use base 2 for the logarithms.

(a) [6 points] For the above table, compute the information gains: I(Favorite, Size), I(Favorite, Taste), I(Favorite, Temperature).
Answer: We compute the information gains as follows:

$$H(\text{Favorite}) = -0.4 \log 0.4 - 0.6 \log 0.6 \approx 0.9710$$

$$H(\text{Favorite} \mid \text{Size}) = 0.5(-0.4 \log 0.4 - 0.6 \log 0.6) + 0.5(-0.2 \log 0.2 - 0.8 \log 0.8) \approx 0.8464$$

$$H(\text{Favorite} \mid \text{Taste}) = 0.4(-0.25 \log 0.25 - 0.75 \log 0.75) \approx 0.3245$$

(The Salty and Sour subsets are pure, so only the Sweet subset, with weight 0.4, contributes to the conditional entropy.)

$$H(\text{Favorite} \mid \text{Temperature}) = 0.5(-0.2 \log 0.2 - 0.8 \log 0.8) + 0.5(-0.4 \log 0.4 - 0.6 \log 0.6) \approx 0.8464$$

$$I(\text{Favorite}, \text{Size}) = 0.9710 - 0.8464 = 0.1245$$
$$I(\text{Favorite}, \text{Taste}) = 0.9710 - 0.3245 = 0.6464$$
$$I(\text{Favorite}, \text{Temperature}) = 0.9710 - 0.8464 = 0.1245$$

(b) [3 points] What attribute should be selected as the root node for the decision tree? Justify your answer. Using the ID3 algorithm, construct a decision tree using the selected root node. What is the training accuracy?

Answer: Comparing the information gains of the various attributes, Taste has the largest gain and should be selected as the root node. Running the ID3 algorithm with Taste as the root, we obtain a decision tree with 3 decision nodes:

Taste
  Sa -> Yes
  So -> No
  Sw -> Temperature
          H -> No
          C -> Size (Yes or No; tie)

The tree obtains a training accuracy of 0.9, since one point is misclassified: instances 4 and 5 have identical attributes (L, Sw, C) but different labels, so the Sweet/Cold leaf cannot be split further and exactly one of the two is misclassified.

(c) [1 point] Suppose you are allowed only a tree of depth 1 (i.e., a tree with only the root node). Redraw the tree from part (b), pruning nodes at greater depths and replacing them with leaf nodes that maximize overall training accuracy.

Answer:

Taste
  Sa -> Yes
  So -> No
  Sw -> No
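The information gains above are easy to verify programmatically. Here is a minimal Python sketch using only the data in Table 3:

```python
import math
from collections import Counter

# (Size, Taste, Temperature, Favorite) for the 10 instances in Table 3.
data = [
    ("S", "Sa", "H", "Yes"), ("S", "Sa", "C", "Yes"), ("S", "Sa", "C", "Yes"),
    ("L", "Sw", "C", "Yes"), ("L", "Sw", "C", "No"),  ("L", "So", "H", "No"),
    ("S", "So", "H", "No"),  ("S", "So", "C", "No"),  ("L", "Sw", "H", "No"),
    ("L", "Sw", "H", "No"),
]

def entropy(labels):
    """Base-2 entropy of a list of labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(attr_idx):
    """I(Favorite, attribute) = H(Favorite) - H(Favorite | attribute)."""
    labels = [row[-1] for row in data]
    cond = 0.0
    for value in set(row[attr_idx] for row in data):
        subset = [row[-1] for row in data if row[attr_idx] == value]
        cond += len(subset) / len(data) * entropy(subset)
    return entropy(labels) - cond

for name, idx in [("Size", 0), ("Taste", 1), ("Temperature", 2)]:
    print(f"I(Favorite, {name}) = {info_gain(idx):.4f}")
# Prints ~0.1245, 0.6464, 0.1245, confirming Taste as the root.
```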
4. [12 points total] Linear Models

Consider the setting of regression with $x^{(i)} \in \mathbb{R}^d$, $y^{(i)} \in \mathbb{R}_+$. Note we are assuming labels can only take on positive scalars. We have access to a training dataset of $n$ examples, $\{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(n)}, y^{(n)})\}$. To learn a regression model, we will consider a linear hypothesis class consisting of the following hypothesis:

$$h_\theta(x) = |\theta^T x| \tag{1}$$

Here, we use the augmented representation $x \in \mathbb{R}^{d+1}$ and $\theta \in \mathbb{R}^{d+1}$. We set $x_0 = 1$ such that $\theta_0$ is the bias parameter and $\theta_{1:d}$ are the weight parameters. In the lectures, you have learned the MSE loss for a training sample $(x^{(i)}, y^{(i)})$:

$$\ell_{MSE}(x^{(i)}, y^{(i)}, \theta) = \frac{1}{2}\left[y^{(i)} - h_\theta(x^{(i)})\right]^2. \tag{2}$$

In this problem, we will consider the Mean Squared Logarithmic Error (MSLE) loss. Given a sample $(x^{(i)}, y^{(i)})$, the MSLE loss is given as:

$$\ell_{MSLE}(x^{(i)}, y^{(i)}, \theta) = \frac{1}{2}\left[\log(1 + y^{(i)}) - \log(1 + h_\theta(x^{(i)}))\right]^2. \tag{3}$$

(a) [6 points] The MSLE loss is preferred over MSE when the target outcomes span several orders of magnitude and potentially have outliers. In these cases, we only want to penalize a model for relative errors. Examples include stock market prices, web traffic at different times of the day, etc. For the dataset below, complete the table by computing the MSE and MSLE loss. [You do not need to show your work.]

Table 4: Linear regression dataset

i                                  1       2       3
h_theta(x^(i))                     3      30     300
y^(i)                              2      20     200
l_MSE(x^(i), y^(i), theta)
l_MSLE(x^(i), y^(i), theta)

Answer:

Table 5: Linear regression dataset

i                                  1         2         3
h_theta(x^(i))                     3        30       300
y^(i)                              2        20       200
l_MSE(x^(i), y^(i), theta)         0.5      50      5000
l_MSLE(x^(i), y^(i), theta)        0.04138  0.07584  0.08153

Note that the MSE grows by a factor of 100 from column to column, while the MSLE stays nearly constant: every prediction overshoots its target by the same 50% relative error.
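The table entries can be reproduced directly from Equations (2) and (3). One detail worth noting: the MSLE values in the answer match only when $\log$ is taken as the natural logarithm, which the short Python sketch below assumes:

```python
import math

preds = [3, 30, 300]    # h_theta(x^(i)) from Table 4
targets = [2, 20, 200]  # y^(i) from Table 4

for h, y in zip(preds, targets):
    mse = 0.5 * (y - h) ** 2
    # Natural log reproduces the answer key's values (0.04138, 0.07584, 0.08153).
    msle = 0.5 * (math.log(1 + y) - math.log(1 + h)) ** 2
    print(f"h={h:3d}  y={y:3d}  MSE={mse:7.1f}  MSLE={msle:.5f}")
```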
(b) [4 points] Derive the partial derivative $\frac{\partial \ell_{MSLE}(x^{(i)}, y^{(i)}, \theta)}{\partial \theta_j}$. [Hint: You can consider using the sign function, or derive separately for the two cases $\theta^T x \le 0$ and otherwise.]

Answer:

Case 1: $\theta^T x^{(i)} \le 0$. Then $h_\theta(x^{(i)}) = -\theta^T x^{(i)}$ and

$$\frac{\partial \ell_{MSLE}(x^{(i)}, y^{(i)}, \theta)}{\partial \theta_j} = \frac{\log(1 + y^{(i)}) - \log(1 - \theta^T x^{(i)})}{1 - \theta^T x^{(i)}} \, x^{(i)}_j$$

Case 2: $\theta^T x^{(i)} > 0$. Then $h_\theta(x^{(i)}) = \theta^T x^{(i)}$ and

$$\frac{\partial \ell_{MSLE}(x^{(i)}, y^{(i)}, \theta)}{\partial \theta_j} = \frac{\log(1 + \theta^T x^{(i)}) - \log(1 + y^{(i)})}{1 + \theta^T x^{(i)}} \, x^{(i)}_j$$

(c) [2 points] Write the gradient $\nabla_\theta \ell_{MSLE}(x^{(i)}, y^{(i)}, \theta)$ using vectorized notation.

Answer:

$$\nabla_\theta \ell_{MSLE}(x^{(i)}, y^{(i)}, \theta) = \begin{cases} \dfrac{\log(1 + y^{(i)}) - \log(1 - \theta^T x^{(i)})}{1 - \theta^T x^{(i)}} \, x^{(i)} & \theta^T x^{(i)} \le 0, \\[2ex] \dfrac{\log(1 + \theta^T x^{(i)}) - \log(1 + y^{(i)})}{1 + \theta^T x^{(i)}} \, x^{(i)} & \text{otherwise.} \end{cases}$$
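A finite-difference check is a standard way to validate a derivation like this. The sketch below compares the closed-form gradient from part (c) against central differences at a random point; the dimension d = 3 and the label value are arbitrary choices for illustration, not values from the exam:

```python
import numpy as np

rng = np.random.default_rng(0)

def msle_loss(theta, x, y):
    """MSLE loss from Eq. (3) with h(x) = |theta^T x|."""
    h = abs(theta @ x)
    return 0.5 * (np.log(1 + y) - np.log(1 + h)) ** 2

def msle_grad(theta, x, y):
    """Closed-form gradient from part (c)."""
    z = theta @ x
    if z <= 0:
        return (np.log(1 + y) - np.log(1 - z)) / (1 - z) * x
    return (np.log(1 + z) - np.log(1 + y)) / (1 + z) * x

# Arbitrary d = 3 example, augmented to 4 dimensions, with a positive label.
theta = rng.normal(size=4)
x = np.concatenate(([1.0], rng.normal(size=3)))  # x_0 = 1 (bias term)
y = 2.5

eps = 1e-6
numeric = np.array([
    (msle_loss(theta + eps * e, x, y) - msle_loss(theta - eps * e, x, y)) / (2 * eps)
    for e in np.eye(4)
])
print(np.allclose(numeric, msle_grad(theta, x, y), atol=1e-6))  # True
```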