CS M146, Fall 2023 Practice Midterm Exam

Instructions

1. This exam is worth 60 points. You have 1 hour 50 minutes to complete it.
2. You are allowed to consult 1 US Letter/A4 sized page (2 sides) of reference notes.
3. You can quote results from the lecture slides, HWs, discussion problems, and practice problems as part of your solutions to the exam problems.
4. No network access is allowed.
5. All answers must be justified. A correct answer, unsupported by calculations and/or explanation, will receive no credit; an incorrect answer supported by substantially correct calculations and explanations might still receive partial credit. Blank answers will receive zero credit.

Good luck!

I attest that I have not given or received aid in this examination, and that I have done my share and taken an active part in seeing to it that others as well as myself uphold the spirit and letter of the UCLA Honor Code.

Name:
UID:
Signature:

For instructor use

    Question       Score
    1              / 30
    2              / 10
    3              / 10
    4              / 10
    Total score:   / 60
1. [30 points total] General Conceptual Questions

For each of the statements below, state True or False. Explain your answer in 1-2 sentences. For every part below, 1 point for a correct assertion and 1 point for a correct explanation.

(a) [2 points] The perceptron training algorithm is guaranteed to converge on any training dataset.
Answer: False. The perceptron convergence theorem only guarantees convergence if there exists a set of parameters under which the training data is linearly separable.

(b) [2 points] Logistic regression directly optimizes for the 0/1 loss.
Answer: False. Logistic regression optimizes a surrogate to the 0/1 loss.

(c) [2 points] Assume a common learning rate that guarantees that both batch gradient descent (GD) and stochastic gradient descent (SGD) converge to the global minimum of a loss function. For any dataset with n > 1 instances, SGD is guaranteed to converge faster than GD.
Answer: False. The updates in SGD are faster to compute but noisier, so SGD is not always guaranteed to converge faster than GD.

(d) [2 points] Logistic regression cannot be applied to classification problems with 10 categories.
Answer: False. We can use multi-class logistic regression with the softmax hypothesis.

(e) [2 points] When training a linear regression model, the learning rate for gradient descent should be set to a very high value to guarantee fast convergence.
Answer: False. A very high learning rate can lead to divergence.

(f) [2 points] Excessive regularization could potentially increase the bias of a model and lead to underfitting.
Answer: True. Regularization decreases the complexity of the learned model, which can increase its bias and lead to underfitting.

(g) [2 points] If the model's validation error is much higher than its training error, then the model is overfitting.
Answer: True. The model is not generalizing well to new data and hence is overfitting, as shown by the validation error being much higher than the training error.

(h) [2 points] For hyperparameter selection, cross-validation requires training only one model for each hyperparameter value.
Answer: False. Cross-validation requires training multiple models for each value of a hyperparameter, one per fold.

(i) [2 points] The derivative of sigmoid(z) at z = ln 2 is 2/3.
Answer: False. The derivative of sigmoid(z) is sigmoid(z)(1 - sigmoid(z)). Since sigmoid(ln 2) = 2/3, the derivative at z = ln 2 is (2/3)(1/3) = 2/9.

(j) [2 points] ℓ2 regularization penalizes ||θ||_2^2 = sum_{j=0}^{d} θ_j^2, where θ is the parameter vector of a model.
Answer: False. ℓ2 regularization penalizes ||θ_{1:d}||_2^2 = sum_{j=1}^{d} θ_j^2; note that it does not penalize the magnitude of the bias θ_0.

(k) [2 points] The ridge regression objective function has a unique minimum for λ > 0.
Answer: True. The objective function is strictly convex and hence has a unique minimum.

(l) [2 points] The ID3 algorithm for constructing decision trees chooses to split on attributes with the lowest conditional entropy.
Answer: True. ID3 chooses the attribute with the highest information gain, which is equivalent to choosing the attribute with the lowest conditional entropy.

(m) [2 points] Consider a binary classification problem where we are given a training set with n training points. kNN with k = n will assign the same label to each test point.
Answer: True. When k = n, the neighborhood of every test point is the entire training set, so every test point is assigned the same (majority) label.
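As a quick numerical check of part (i), here is a minimal Python sketch (not part of the exam; the helper names are our own) that evaluates the sigmoid and its derivative at z = ln 2:

```python
import math

def sigmoid(z):
    """Logistic sigmoid 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_derivative(z):
    """sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

z = math.log(2)
print(sigmoid(z))             # 0.666... = 2/3 (the value of the sigmoid, not its derivative)
print(sigmoid_derivative(z))  # 0.222... = 2/9, so the claimed 2/3 is False
```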
2. [10 points total] Linear Models

Consider the setting of regression with x^(i) ∈ R^d, y^(i) ∈ R. We have access to a training dataset of n examples, {(x^(1), y^(1)), (x^(2), y^(2)), ..., (x^(n), y^(n))}. To learn a regression model, we will consider a linear hypothesis class consisting of the following hypothesis:

    h_θ(x) = θ^T x    (1)

Here, x ∈ R^(d+1) and θ ∈ R^(d+1). We set x_0 = 1 such that θ_0 is the bias parameter and θ_{1:d} are the weight parameters. In the lectures, you have learned the MSE loss for a sample (x^(i), y^(i)):

    ℓ_MSE(x^(i), y^(i), θ) = (1/2) (h_θ(x^(i)) - y^(i))^2.    (2)

In your homework, you have worked on the MAE loss for a sample (x^(i), y^(i)):

    ℓ_MAE(x^(i), y^(i), θ) = |h_θ(x^(i)) - y^(i)|.    (3)

In this problem, we will consider the Huber loss. Given a sample (x^(i), y^(i)), the Huber loss is given as:

    ℓ_Huber(x^(i), y^(i), θ) = (1/2) (h_θ(x^(i)) - y^(i))^2    if |h_θ(x^(i)) - y^(i)| ≤ 1,
                               |h_θ(x^(i)) - y^(i)| - 1/2      otherwise.    (4)

(a) [3 points] Derive the partial derivative ∂ℓ_Huber(x^(i), y^(i), θ)/∂θ_j. [Hint: Derive separately for both cases: |h_θ(x^(i)) - y^(i)| ≤ 1 and otherwise]

Answer:

    ∂ℓ_Huber(x^(i), y^(i), θ)/∂θ_j = (θ^T x^(i) - y^(i)) x_j^(i)        if |h_θ(x^(i)) - y^(i)| ≤ 1,
                                     sign(θ^T x^(i) - y^(i)) x_j^(i)    otherwise.

(b) [3 points] Derive the vectorized form of the gradient ∇_θ ℓ_Huber(x^(i), y^(i), θ).

Answer:

    ∇_θ ℓ_Huber(x^(i), y^(i), θ) = (θ^T x^(i) - y^(i)) x^(i)        if |h_θ(x^(i)) - y^(i)| ≤ 1,
                                   sign(θ^T x^(i) - y^(i)) x^(i)    otherwise.

(c) [4 points] Generally speaking, Huber loss will be less sensitive to outliers (i.e., noisy data points) and is hence often used as an alternative to MSE. Consider θ = [1, 1, 1] and the training dataset in Table 1 (note that x_0 = 1 has been included for each datapoint) with n = 5 points. Compute and report the average Huber and MSE loss on the training dataset. For convenience, the expressions for the average losses are as follows:

    J_Huber(θ) = (1/n) sum_{i=1}^{n} ℓ_Huber(x^(i), y^(i), θ)
    J_MSE(θ)   = (1/n) sum_{i=1}^{n} ℓ_MSE(x^(i), y^(i), θ)
Table 1: Linear regression dataset

    i        1          2          3           4          5
    x^(i)    [1, 0, 1]  [1, 1, 2]  [1, 3, -1]  [1, 4, 2]  [1, 5, 1]
    y^(i)    2          4          3           7          20

Answer: With θ = [1, 1, 1], the residuals h_θ(x^(i)) - y^(i) are 0 for the first four points and 7 - 20 = -13 for the fifth, so J_Huber(θ) = (13 - 1/2)/5 = 2.5 and J_MSE(θ) = (1/2)(13)^2/5 = 16.9. The Huber loss is much lower than MSE because it penalizes the single large residual linearly rather than quadratically.
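The gradient cases from parts (a)-(b) and the averages reported above can be reproduced with a short sketch like the following (illustrative code, not part of the exam; the function and variable names are our own):

```python
import numpy as np

# Table 1, with x_0 = 1 already included in each datapoint.
X = np.array([[1, 0, 1], [1, 1, 2], [1, 3, -1], [1, 4, 2], [1, 5, 1]], dtype=float)
y = np.array([2, 4, 3, 7, 20], dtype=float)
theta = np.array([1.0, 1.0, 1.0])

def huber_loss(x, y_i, theta):
    """Per-sample Huber loss, Eq. (4)."""
    r = theta @ x - y_i
    return 0.5 * r**2 if abs(r) <= 1 else abs(r) - 0.5

def huber_grad(x, y_i, theta):
    """Per-sample Huber gradient w.r.t. theta, part (b)."""
    r = theta @ x - y_i
    return r * x if abs(r) <= 1 else np.sign(r) * x

def mse_loss(x, y_i, theta):
    """Per-sample MSE loss, Eq. (2)."""
    return 0.5 * (theta @ x - y_i)**2

print(np.mean([huber_loss(x, t, theta) for x, t in zip(X, y)]))  # 2.5
print(np.mean([mse_loss(x, t, theta) for x, t in zip(X, y)]))    # 16.9
```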
3. [10 points total] Decision Trees

Suppose you have the following data set of 10 instances to determine whether Bob decides to buy a car.

    #    Electric  Price  Condition  Color  |  Buy
    1    Yes       L      N          R      |  Yes
    2    No        M      N          G      |  No
    3    No        M      U          G      |  No
    4    Yes       H      U          R      |  Yes
    5    Yes       H      N          G      |  Yes
    6    No        L      N          G      |  No
    7    No        H      N          R      |  Yes
    8    Yes       M      U          G      |  Yes
    9    Yes       L      U          G      |  No
    10   No        L      N          G      |  No

Table 2: Abbreviations: L: Low, M: Medium, H: High, N: New, U: Used, R: Red, G: Green.

The data set consists of 4 features: Electric, Price, Condition, Color, and a corresponding label Buy. For all calculations in this problem, use base 2 for the logarithms. For any logarithms in this question, you can use log2 3 = 1.6 as an approximation.

(a) [1 point] Compute the information gain for the Buy label and the Condition attribute.

Answer: No computation is needed: I(Buy, Condition) = 0. Apply HW 2 Problem 2 with S the set of all datapoints, k = 2, S_1 the set of datapoints with Condition = N, and S_2 the set of datapoints with Condition = U; both subsets have the same 50/50 label distribution as S, so H(Buy | Condition) = H(Buy) and the gain is 0.

(b) [3 points] Compute the information gain for the Buy label and Price attribute.

Answer:

    H(Buy) = -(5/10) log2(5/10) - (5/10) log2(5/10) = 1
    H(Buy | Price = Low)    = -(1/4) log2(1/4) - (3/4) log2(3/4) = 0.8
    H(Buy | Price = Medium) = -(1/3) log2(1/3) - (2/3) log2(2/3) = 1.6 - 2/3 ≈ 0.93
    H(Buy | Price = High)   = 0
    H(Buy | Price) = (4/10)(0.8) + (3/10)(0.93) + (3/10)(0) = 0.6
    I(Buy, Price) = 1 - 0.6 = 0.4

(c) [4 points] Suppose we picked Price as the root of the tree. Construct the full decision tree using the ID3 algorithm, and briefly explain the attributes you choose.

Answer:
i. For Price = High, all examples have Buy = Yes, so this branch is a leaf predicting Yes.
ii. For Price = Low, we need to decide the next splitting attribute among Electric, Condition, Color. H(Buy | Color, Price = Low) = 0 while the conditional entropies of the other attributes are non-zero, so we choose Color as the next attribute.
iii. For Price = Medium, H(Buy | Electric, Price = Medium) = 0 while the conditional entropies of the other attributes are non-zero, so we choose Electric as the next attribute.

The resulting tree:

    Price = Low    -> Color:    Red -> Yes, Green -> No
    Price = Medium -> Electric: Yes -> Yes, No -> No
    Price = High   -> Yes

(d) [2 points] Write the predicted labels using the constructed tree for the following test set. Does the constructed tree give zero test error?

    (Electric, Price, Condition, Color) = (Yes, High, Used, Green), with true label Yes
    (Electric, Price, Condition, Color) = (No, Low, Used, Green), with true label Yes

Answer: The predicted labels are Yes and No, respectively. The second prediction is wrong, so the test error is 50% and the constructed tree does not give zero test error.
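The entropy bookkeeping in parts (a) and (b) can also be checked programmatically. The sketch below (illustrative code with our own helper names) computes the information gain of each attribute on the Table 2 data using exact base-2 logarithms instead of the log2 3 ≈ 1.6 approximation; it reproduces I(Buy, Condition) = 0 and I(Buy, Price) = 0.4.

```python
import math
from collections import Counter

# Table 2 rows as (Electric, Price, Condition, Color, Buy).
data = [
    ("Yes", "L", "N", "R", "Yes"), ("No", "M", "N", "G", "No"),
    ("No", "M", "U", "G", "No"),   ("Yes", "H", "U", "R", "Yes"),
    ("Yes", "H", "N", "G", "Yes"), ("No", "L", "N", "G", "No"),
    ("No", "H", "N", "R", "Yes"),  ("Yes", "M", "U", "G", "Yes"),
    ("Yes", "L", "U", "G", "No"),  ("No", "L", "N", "G", "No"),
]
attributes = ["Electric", "Price", "Condition", "Color"]

def entropy(labels):
    """Base-2 Shannon entropy of a list of labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attr_idx):
    """I(Buy, attr) = H(Buy) - H(Buy | attr)."""
    labels = [row[-1] for row in rows]
    n = len(rows)
    cond_entropy = 0.0
    for value in set(row[attr_idx] for row in rows):
        subset = [row[-1] for row in rows if row[attr_idx] == value]
        cond_entropy += (len(subset) / n) * entropy(subset)
    return entropy(labels) - cond_entropy

for idx, name in enumerate(attributes):
    print(name, round(information_gain(data, idx), 3))
# Condition -> 0.0 (part (a)); Price -> 0.4 (part (b)).
```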
4. [10 points total] Regularized Linear Classification

Consider the setting of binary classification with +1/-1 labels. We have access to a training dataset of n examples, {(x^(1), y^(1)), (x^(2), y^(2)), ..., (x^(n), y^(n))}. To learn a binary classifier, we will consider a linear hypothesis class consisting of the following hypothesis:

    h_θ(x) = sign(θ^T x)    (5)

Here, x ∈ R^(d+1) and θ ∈ R^(d+1). We set x_0 = 1 such that θ_0 is the bias parameter and θ_{1:d} are the weight parameters. In this problem, we consider the following empirical risk:

    R(θ) = (1/n) sum_{i=1}^{n} log(1 + exp(-y^(i) θ^T x^(i)))    (6)

To learn this classifier, we propose to further regularize the above risk and minimize the following regularized loss:

    J(θ) = (1/n) sum_{i=1}^{n} log(1 + exp(-y^(i) θ^T x^(i))) + λ ||θ_{1:d}||_2^2    (7)

where λ > 0 is the regularization coefficient.

(a) [3 points] Derive a gradient update rule for minimizing J(θ) w.r.t. the weight parameters. Assume the learning rate is α.

Answer: Gradient update rule: for all j = 1, 2, ..., d:

    θ_j ← θ_j - α ∂J(θ)/∂θ_j

where:

    ∂J(θ)/∂θ_j = -(1/n) sum_{i=1}^{n} [ y^(i) x_j^(i) exp(-y^(i) θ^T x^(i)) / (1 + exp(-y^(i) θ^T x^(i))) ] + 2λ θ_j

Alternative correct answers will use vectorized notation,

    θ_{1:d} ← θ_{1:d} - α ∇_{θ_{1:d}} J(θ)

where:

    ∇_{θ_{1:d}} J(θ) = -(1/n) sum_{i=1}^{n} [ y^(i) x_{1:d}^(i) exp(-y^(i) θ^T x^(i)) / (1 + exp(-y^(i) θ^T x^(i))) ] + 2λ θ_{1:d}

(b) [3 points] Derive a gradient update rule for minimizing J(θ) w.r.t. the bias parameter. Assume the learning rate is α.

Answer: Gradient update rule:

    θ_0 ← θ_0 - α ∂J(θ)/∂θ_0

where:

    ∂J(θ)/∂θ_0 = -(1/n) sum_{i=1}^{n} [ y^(i) exp(-y^(i) θ^T x^(i)) / (1 + exp(-y^(i) θ^T x^(i))) ]
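A vectorized gradient step implementing the updates from parts (a) and (b) could look like the sketch below (a minimal illustration under the problem's notation, not official solution code; the function and variable names are our own, and X is assumed to already contain the x_0 = 1 column):

```python
import numpy as np

def gradient_step(theta, X, y, lam, alpha):
    """One gradient descent step on J(theta) = R(theta) + lam * ||theta_{1:d}||_2^2.

    theta: (d+1,) parameters, with theta[0] the bias.
    X:     (n, d+1) design matrix whose first column is all ones (x_0 = 1).
    y:     (n,) labels in {+1, -1}.
    """
    n = X.shape[0]
    margins = y * (X @ theta)                              # y^(i) * theta^T x^(i)
    weights = np.exp(-margins) / (1.0 + np.exp(-margins))  # exp(-m) / (1 + exp(-m))
    grad = -(1.0 / n) * (X.T @ (y * weights))              # gradient of the empirical risk R(theta)
    grad[1:] += 2.0 * lam * theta[1:]                      # regularize the weights only, not the bias
    return theta - alpha * grad
```

Repeating this step implements batch gradient descent on J(θ); setting lam = 0 recovers the update for the unregularized risk R(θ).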
(c) [2 points] Assume we fix θ_0 = 0 and choose a very large λ → ∞. Let θ = θ̃ denote the optimal solution obtained by minimizing J(θ) under these assumptions. What is the corresponding empirical risk R(θ̃)? Explain your answer.

Answer: As λ → ∞, the penalty forces θ̃_{1:d} → 0; with θ_0 fixed at 0, θ̃ → 0, so every term becomes log(1 + exp(0)) = log 2 and R(θ̃) = log 2.

(d) [2 points] Recall, in class, we studied the perceptron model with the same hypothesis as Eq. (5). However, we defined the empirical risk for the perceptron differently, as:

    R_perceptron(θ) = (1/n) sum_{i=1}^{n} max(0, -y^(i) θ^T x^(i))    (8)

Consider a hypothesis with parameters θ* that achieves full accuracy on the training dataset, i.e., h_θ*(x^(i)) = y^(i) for all i = 1, 2, ..., n. Is R_perceptron(θ*) > R(θ*)? Explain your answer.

Answer: No. R_perceptron(θ*) = 0 since y^(i) θ*^T x^(i) ≥ 0 for all i, whereas R(θ) > 0 for all values of θ (including θ*).
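As a final sanity check on parts (c) and (d), the snippet below (purely illustrative; the margin value is a made-up example of a correctly classified point) confirms that the per-sample logistic loss at θ = 0 is log 2, and that a positive margin gives zero perceptron loss but strictly positive logistic loss:

```python
import math

# Part (c): with theta_tilde = 0, every per-sample term of R is log(1 + exp(0)).
print(math.log(1 + math.exp(0)), math.log(2))  # both ~0.693

# Part (d): for a correctly classified point, y^(i) * theta^T x^(i) > 0.
margin = 3.0                            # hypothetical positive margin
print(max(0.0, -margin))                # perceptron loss: 0.0
print(math.log(1 + math.exp(-margin)))  # logistic loss: ~0.049 > 0
```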