CS4780 Final Exam Fall 2021

NAME:
Net ID:
Email:

Please write your netID on the top of each page.

I promise to abide by Cornell's Code of Academic Integrity.

Signature:
1 [??] General Machine Learning

Please identify whether each of these statements is True or False. Please justify your answer if False. Statements that evaluate to "True" yield 1 point for correctly identifying them as True. Statements that evaluate to "False" yield 2 points: 1 for correctly identifying the statement as "False" and 1 for a justification of why it is false.

1. (T/F) A kernel function computes the inner product between two feature vectors after a mapping to some (potentially infinite-dimensional) feature space.
2. (T/F) Classification trees learn maximum-margin decision boundaries.
3. (T/F) Ridge Regression is equivalent to Ordinary Least Squares Regression with an $\ell_2$-regularization term.
4. (T/F) A ReLU network with one hidden layer and a linear last layer is a piece-wise linear function.
5. (T/F) A ReLU network with two hidden layers and a linear last layer is a piece-wise linear function.
6. (T/F) Each step of stochastic gradient descent is guaranteed to reduce the loss.
7. (T/F) Newton's method needs exactly two iterations to find the minimum of a quadratic objective.
8. (T/F) If a classifier obtains 0% training error, it cannot have 100% testing error.
9. (T/F) In bagging, each classifier in the ensemble is trained on a data set that is sampled identically and independently.
10. (T/F) The $\ell_2$ regularizer encourages sparse solutions.
11. (T/F) In contrast to the absolute loss, the squared loss is differentiable everywhere and therefore always better.
12. (T/F) Did you fill out your course evaluations? Answer "true" for yes and "false" for no. (This will actually be graded based on the information we receive letting us know which students completed their evaluations; the evaluations themselves are always anonymous. Thank you for your feedback and suggestions.)
2 [25] Large Margin Classification

Assume your training data consists of feature vectors $\mathbf{x}_i \in \mathbb{R}^d$ and labels $y_i \in \{-1, +1\}$ for $i = 1 \ldots n$. You may assume that $\mathbf{x}_i \neq \mathbf{x}_j$ for all $i \neq j$.

1. (6 pts) You are training the following model to classify data points:
$$\min_{\mathbf{w}, b} \; \mathbf{w}^\top \mathbf{w} + C \sum_{i=1}^{n} \max\left[1 - y_i(\mathbf{w}^\top \mathbf{x}_i + b), \, 0\right] \qquad (1)$$
   (a) (1 pt) What algorithm does this objective function correspond to?
   (b) (1 pt) How many parameters are you fitting?
   (c) (1 pt) What are the hyper-parameter(s) of the model?
   (d) (1 pt) Which part is the regularizer, and which part is the loss function?
   (e) (1 pt) If you incorporate the bias term $b$ into the weight vector by adding a constant dimension to the training data, will you get identical results? Why or why not?

2. (6 pts) You train the classifier on this binary classification data set (shown in the picture).

   [Figure: scatter plot of the training points with labels +1 and -1; the x-axis ranges from 1 to 7 and the y-axis from 1 to 5.]

   (a) (1 pt) Draw the decision boundary the classifier will find when $C$ is very large but finite.
   (b) (1 pt) Circle the support vectors.
   (c) (2 pts) If you were to add a point with a positive label at location (1,2) and a point with a negative label at (4,2), would the decision boundary change? Justify your answer.
   (d) (2 pts) If you were to add a point with a positive label at location (3,2) and a point with a negative label at (2,2), would that change the decision boundary? Justify your answer.

3. (13 pts) Robin uses the classifier from part 1 of this question (i.e., as described by equation (1)) on a binary classification data set. After training, her training error is too high. Therefore, she defines a kernel function
$$k(\mathbf{x}_i, \mathbf{x}_j) = e^{-\frac{\|\mathbf{x}_i - \mathbf{x}_j\|_2^2}{2\sigma^2}}.$$
Because she wants to re-use her old code, she creates new feature vectors $\tilde{\mathbf{x}}_i$ whose $j$th dimension is defined as $[\tilde{\mathbf{x}}_i]_j = k(\mathbf{x}_i, \mathbf{x}_j)$, and she trains her algorithm on these new feature vectors $\tilde{\mathbf{x}}_i$.
   (a) (2 pts) Her friend Batman claims that she is just training a "fancy nearest neighbor classifier". Does he have a point? Explain your answer.
   (b) (2 pts) What are the trade-offs between choosing a large or a small $\sigma^2$? Describe how the feature vectors $\tilde{\mathbf{x}}_i$ change as $\sigma^2$ becomes very large and very small.
   (c) (3 pts) In order to set the hyper-parameters, she tries a wide range of choices and for each evaluates the training error. Give a hyper-parameter regime in which she is guaranteed to have zero training error. Justify your answer. Explain why this setting may be problematic.
   (d) (2 pts) Suggest a better set-up to choose her hyper-parameters.
   (e) (2 pts) If she does hyper-parameter search, why could a random search be more effective than grid search?
   (f) (2 pts) She decided to use an $\ell_1$ regularizer, and discovered that with heavily weighted regularization she can find a weight vector that contains only a few non-zero parameters and still performs well on the test set. Why could this be highly beneficial at test time?
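Not part of the exam, but as a reference for the objects used in this section: below is a minimal NumPy sketch of evaluating the objective in equation (1) and of constructing Robin's kernel features $\tilde{\mathbf{x}}_i$ with $[\tilde{\mathbf{x}}_i]_j = k(\mathbf{x}_i, \mathbf{x}_j)$. The function names and the tiny dataset are invented for illustration only.

```python
import numpy as np

def soft_margin_objective(w, b, X, y, C):
    """Equation (1): w^T w + C * sum_i max(1 - y_i (w^T x_i + b), 0)."""
    margins = y * (X @ w + b)               # y_i (w^T x_i + b) for every point
    hinge = np.maximum(1.0 - margins, 0.0)  # hinge loss per point
    return w @ w + C * hinge.sum()

def rbf_kernel_features(X, sigma2):
    """Robin's features: [x~_i]_j = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma2))

# Tiny made-up example (not the exam's data set).
X = np.array([[1.0, 2.0], [2.0, 1.0], [5.0, 4.0], [6.0, 3.0]])
y = np.array([+1.0, +1.0, -1.0, -1.0])
w, b = np.zeros(2), 0.0
print(soft_margin_objective(w, b, X, y, C=10.0))  # objective at the all-zero model
print(rbf_kernel_features(X, sigma2=1.0).shape)   # (n, n); row i is x~_i
```

Each row of the returned matrix is one feature vector $\tilde{\mathbf{x}}_i$, with one coordinate per training point.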
3 [12] ML Jungle

[Figure 1: A visualization of six different regression models after training on the same dataset; six panels labeled (a) through (f).]

Match the plots to their corresponding models and justify each assignment with one sentence. In no particular order, the models are

1. (2 pts) Linear regression
2. (2 pts) Kernel regression with a cubic polynomial kernel $k(\mathbf{x}_i, \mathbf{x}_j) = (\mathbf{x}_i^\top \mathbf{x}_j)^3$
3. (2 pts) Kernel regression with an RBF kernel $k(\mathbf{x}_i, \mathbf{x}_j) = \exp(-\|\mathbf{x}_i - \mathbf{x}_j\|_2^2)$
4. (2 pts) Regression tree
5. (2 pts) Neural network with ReLU non-linearity, one hidden layer with three units, and a linear last layer
6. (2 pts) Neural network with ReLU non-linearity, one hidden layer with five units, and a linear last layer
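Not part of the exam: if you want to reproduce a comparison like Figure 1 yourself, a rough scikit-learn sketch on a made-up 1-D dataset could look like the following. KernelRidge is used here as a stand-in for the kernel regression models, and all data and settings are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.kernel_ridge import KernelRidge
from sklearn.tree import DecisionTreeRegressor
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(-3, 3, size=(40, 1)), axis=0)
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(40)  # invented noisy target

models = {
    "linear regression": LinearRegression(),
    "cubic polynomial kernel": KernelRidge(kernel="poly", degree=3, gamma=1.0, coef0=0.0),
    "RBF kernel": KernelRidge(kernel="rbf", gamma=1.0),
    "regression tree": DecisionTreeRegressor(),
    "ReLU net, 3 hidden units": MLPRegressor(hidden_layer_sizes=(3,), activation="relu", max_iter=5000),
    "ReLU net, 5 hidden units": MLPRegressor(hidden_layer_sizes=(5,), activation="relu", max_iter=5000),
}
for name, model in models.items():
    model.fit(X, y)   # plotting model.predict over a grid would mimic the panels of Figure 1
    print(f"{name}: train R^2 = {model.score(X, y):.2f}")
```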
4 [9] Neural Networks

Suppose $f$ is a one-hidden-layer neural network that can be written as
$$f(\mathbf{x}) = \mathbf{w}_2^\top \sigma(W_1 \mathbf{x} + \mathbf{b}_1) + b_2,$$
where $W_1, \mathbf{w}_2$ are the weight matrix/vector, $\mathbf{b}_1, b_2$ are the bias terms, $\sigma$ is the activation function that is applied element-wise to its input, and $\mathbf{x}$ is the input to the network.

1. (3 pts) Assume $\sigma(x) = x$. Show that $f$ reduces to a linear function, i.e. $f(\mathbf{x}) = \mathbf{w}^\top \mathbf{x} + b$, and derive explicit formulas for its weight vector $\mathbf{w}$ and bias $b$ in terms of $W_1$, $\mathbf{w}_2$, $\mathbf{b}_1$, and $b_2$. Note: This result generalizes and shows that, without non-linearities, even deep neural networks with many layers are equivalent to a one-layer linear network, i.e. a linear function.

2. (3 pts) Say we want to find a neural network that is equivalent to the absolute value function for scalar inputs $x \in \mathbb{R}$. Assume $\sigma(x) = \mathrm{ReLU}(x)$ and let $\mathbf{w}_2 = [1, 1]^\top$. Find weights $W_1 \in \mathbb{R}^{2 \times 1}$ and biases $\mathbf{b} \in \mathbb{R}^2$ such that $\mathbf{w}_2^\top \sigma(W_1 x + \mathbf{b}) = \mathrm{abs}(x)$. Hint: Consider the three inputs $x = -1$, $0$, and $1$.

3. (3 pts) Suppose you have a fully connected neural network with three hidden layers with 3, 12, and 6 neurons, respectively. Let the input vectors $\mathbf{x}$ be two-dimensional, and the output $y$ be scalar. What is the total number of parameters of the network?
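As a side note (not part of the exam), here is a minimal NumPy sketch of the network form $f(\mathbf{x}) = \mathbf{w}_2^\top \sigma(W_1 \mathbf{x} + \mathbf{b}_1) + b_2$ used above, together with a small helper that counts the parameters of a fully connected network. All concrete numbers are arbitrary illustrations, not answers to the questions above.

```python
import numpy as np

def one_hidden_layer(x, W1, b1, w2, b2, sigma):
    """f(x) = w2^T sigma(W1 x + b1) + b2, the network form from section 4."""
    return w2 @ sigma(W1 @ x + b1) + b2

def num_parameters(layer_sizes):
    """Weights plus biases of a fully connected net, e.g. layer_sizes = [d, h1, ..., 1]."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

relu = lambda z: np.maximum(z, 0.0)
W1 = np.array([[1.0], [2.0]])     # arbitrary 2x1 weights (not an exam answer)
b1 = np.array([0.5, -0.5])
w2 = np.array([1.0, 1.0])
print(one_hidden_layer(np.array([2.0]), W1, b1, w2, 0.0, relu))
print(num_parameters([4, 3, 1]))  # arbitrary example architecture 4 -> 3 -> 1: 19 parameters
```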
5 [16] CART

You are given a dataset $D$ of coordinate points with labels $y_i \in \{-1, 1\}$ as defined in the table and graph below.

    x_i        y_i
    (2, 3)      1
    (4, 0)     -1
    (2, -3)    -1
    (-2, -3)    1
    (-4, 0)    -1
    (-2, 3)     1

Recall that the Gini impurity function for a set of points $S$ is defined as
$$G(S) = \sum_{k \in \text{labels}} p_k (1 - p_k),$$
where $p_k$ is the fraction of points in $S$ with label $k$.

1. (2 pts) What is the Gini impurity of the split $x_2 > 1$?

2. (3 pts) Use the ID3 algorithm to build a decision tree from the dataset that minimizes Gini impurity. For each split, give the Gini impurity of the split.

3. (5 pts) For each of the following situations, state (yes or no) whether it is a valid stopping criterion for the ID3 algorithm:
   (Y/N) All splits lead to exactly the same impurity.
   (Y/N) No split has a lower impurity than the current set (i.e. no split).
   (Y/N) All features are identical, but the labels differ.
   (Y/N) Features vary, but all labels are identical.
   (Y/N) There is only a single point left in the set.
4. (2 pts) Give a data point and label such that, if you added it to the dataset $D$, you would not be able to create a decision tree with 0% training error.

5. (4 pts) Determine whether the following statements about CART trees are true or false. You do not need to provide a justification for your response.
   (a) (T/F) CART trees with full depth are good classifiers for bagging, because they have low variance.
   (b) (T/F) CART trees with limited depth are good for boosting because they are weak learners.
   (c) (T/F) If you normalize each feature value to range within [0,1] before you build a CART tree, the resulting model will be unchanged.
   (d) (T/F) If you apply PCA on the input data and project onto the leading $k < d$ dimensions, the resulting CART tree will be unchanged.
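Not part of the exam, but for reference: a small Python sketch of the Gini impurity $G(S)$ defined above and of the weighted impurity of a threshold split. The example data and split are made up; they are not the exam's dataset $D$ or the split asked about in question 1.

```python
from collections import Counter

def gini(labels):
    """G(S) = sum_k p_k (1 - p_k), as defined in section 5."""
    n = len(labels)
    return sum((c / n) * (1 - c / n) for c in Counter(labels).values())

def split_gini(points, labels, feature, threshold):
    """Weighted Gini impurity of the split points[feature] > threshold."""
    left = [y for x, y in zip(points, labels) if x[feature] > threshold]
    right = [y for x, y in zip(points, labels) if x[feature] <= threshold]
    n = len(labels)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)

# Made-up example (not the exam's data):
pts = [(0, 2), (1, -1), (3, 4), (5, 0)]
ys = [1, -1, 1, -1]
print(gini(ys))                                     # 0.5 for a balanced two-class set
print(split_gini(pts, ys, feature=1, threshold=1))  # 0.0: this split separates the labels
```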
6 [20] Bias Variance Debugging

1. (3 pts) The expected test error decomposes into three terms; can you name them? (Hint: their names are not Ron, Harry, Hermione.)
$$\underbrace{\mathbb{E}_{\mathbf{x},y,D}\left[\left(h_D(\mathbf{x}) - y\right)^2\right]}_{\text{Expected Test Error}} = \underbrace{\mathbb{E}_{\mathbf{x},y}\left[\left(\bar{y}(\mathbf{x}) - y\right)^2\right]}_{\hspace{2cm}} + \underbrace{\mathbb{E}_{\mathbf{x},D}\left[\left(h_D(\mathbf{x}) - \bar{h}(\mathbf{x})\right)^2\right]}_{\hspace{2cm}} + \underbrace{\mathbb{E}_{\mathbf{x}}\left[\left(\bar{h}(\mathbf{x}) - \bar{y}(\mathbf{x})\right)^2\right]}_{\hspace{2cm}}$$

2. (9 pts) For each of the following approaches, indicate its effect on bias, variance, and noise. Draw ↓ for a reduction, ↑ for an increase, and — for a neutral effect.

                                           | effect on variance | effect on bias | effect on noise
   increase the regularization             |                    |                |
   increase number of training samples     |                    |                |
   add more (useful) features              |                    |                |
   apply kernelization                     |                    |                |
   apply bagging                           |                    |                |
   apply boosting                          |                    |                |
   use a model with more expressivity      |                    |                |
   add noise to the features               |                    |                |

3. (8 pts) For each of the following undesirable model symptoms, state whether the cause of poor performance is high bias, high variance, or both. Then, list one possible remedy to resolve it.
   (a) (2 pts) One of your friends wishes to classify some data using a decision tree. They're pleased to see that they can specify a tree that has very low training error. However, when they test it, they are dismayed to see it has very high test error.
   (b) (2 pts) You are trying to train a linear regression model on your data. However, no matter how long you train it, the training error remains high. Likewise, the test error is high as well. When you visualize your data, you notice that it is non-linear.
   (c) (2 pts) Your friend wants to classify whether their images contain cars or not. They have a very small number of labeled samples (about 100). You take the latest and greatest deep neural network and try to train it on their dataset. However, while the training error approaches 0%, you are unable to lower the test error.
   (d) (2 pts) Your classmate is trying to use a k-nearest neighbors classifier with $k = 1$. You observe that the converged model has very high test error.
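Not part of the exam: the quantities in the decomposition of question 1 can be estimated by simulation. The rough sketch below draws many training sets $D$ from an invented data-generating process, fits a model $h_D$ on each, and estimates the average predictor $\bar{h}(\mathbf{x})$ and the spread of $h_D(\mathbf{x})$ around it.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
x_test = np.linspace(0, 1, 50).reshape(-1, 1)

def sample_dataset(n=30):
    """Invented data-generating process, purely for illustration."""
    X = rng.uniform(0, 1, size=(n, 1))
    y = np.sin(2 * np.pi * X[:, 0]) + 0.3 * rng.standard_normal(n)
    return X, y

preds = []
for _ in range(200):                     # 200 independently sampled training sets D
    X, y = sample_dataset()
    h_D = DecisionTreeRegressor(max_depth=3).fit(X, y)
    preds.append(h_D.predict(x_test))
preds = np.array(preds)                  # shape (200, 50): h_D(x) for each D and each x

h_bar = preds.mean(axis=0)               # estimate of the average predictor h_bar(x)
spread = ((preds - h_bar) ** 2).mean()   # estimate of E_{x,D}[(h_D(x) - h_bar(x))^2]
print(f"estimated E_(x,D)[(h_D(x) - h_bar(x))^2] = {spread:.3f}")
```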
7 [12] Kernels

7.1 Kernel Properties

Suppose you are given a function $k$,
$$k(\mathbf{x}, \mathbf{x}') = \phi(\mathbf{x})^\top \phi(\mathbf{x}'),$$
where $\mathbf{x}, \mathbf{x}'$ are data points in $\mathbb{R}^d$ and $\phi(\mathbf{x}) \in \mathbb{R}^D$ is a vector-valued function. Through the following two questions, you will prove that $k$ is a valid kernel function.

1. (3 pts) Show that $k$ is symmetric, i.e. $k(\mathbf{x}, \mathbf{x}') = k(\mathbf{x}', \mathbf{x})$.

2. (3 pts) Let $K \in \mathbb{R}^{n \times n}$ be the kernel matrix with entries $K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$, where $\{\mathbf{x}_i\}_{i=1}^n$ is an arbitrary set of input vectors. Prove that $K$ is positive semi-definite. Hint: $K \in \mathbb{R}^{n \times n}$ is positive semi-definite if for any vector $\mathbf{z} \in \mathbb{R}^n$, $\mathbf{z}^\top K \mathbf{z} \geq 0$.

7.2 Kernelizing Logistic Regression

Suppose we have a dataset $\{\mathbf{x}_i, y_i\}$, where $\mathbf{x}_i \in \mathbb{R}^d$ and the labels $y_i \in \{-1, 1\}$ are binary. For logistic regression, we model the conditional class probability as
$$P(y \mid \mathbf{x}_i, \mathbf{w}) = \sigma\left(y(\mathbf{w}^\top \mathbf{x}_i)\right), \quad \text{where } \sigma(s) = \frac{1}{1 + e^{-s}}, \text{ and } \mathbf{w} \in \mathbb{R}^d.$$
(Note: We dropped the bias term $b$ since we can always absorb it into $\mathbf{w}$ by augmenting the data points.) On the homework, you showed that the gradient of $P$ with respect to $\mathbf{w}$ is in the span of the data vectors. If we set $\mathbf{w}_0 = \mathbf{0}$, then the weights $\mathbf{w}_t$ after $t$ iterations of gradient descent are given by
$$\mathbf{w}_t = \sum_{j=1}^{n} c_j \mathbf{x}_j,$$
for some coefficients $c_j \in \mathbb{R}$.

1. (3 pts) Give an expression for $P(y \mid \mathbf{x}_i, \mathbf{w}_t)$, so that the algorithm only needs to keep track of the $c_i$, not $\mathbf{w}_t$ explicitly.
$$P(y \mid \mathbf{x}_i, \mathbf{w}_t) = P(y \mid \{c_i, \mathbf{x}_i\}) = \ldots$$

2. (3 pts) Given your solution to subquestion 1, how would you kernelize logistic regression? Provide an expression for $P(y \mid \{c_i, \mathbf{x}_i\})$ that employs an arbitrary kernel function $k$.
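Not part of the exam: a quick numerical sanity check (not a proof) of the two properties in section 7.1, symmetry of $K$ and $\mathbf{z}^\top K \mathbf{z} \geq 0$, when $K_{ij} = \phi(\mathbf{x}_i)^\top \phi(\mathbf{x}_j)$. The feature map $\phi$ below is an arbitrary invention for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def phi(x):
    """An arbitrary illustrative feature map phi: R^2 -> R^3."""
    return np.array([x[0], x[0] * x[1], x[1] ** 2])

X = rng.standard_normal((5, 2))        # n = 5 points in R^2
Phi = np.stack([phi(x) for x in X])    # row i is phi(x_i)
K = Phi @ Phi.T                        # K_ij = phi(x_i)^T phi(x_j)

print(np.allclose(K, K.T))             # symmetry: K_ij == K_ji
for _ in range(1000):                  # spot-check z^T K z >= 0 for random z
    z = rng.standard_normal(5)
    assert z @ K @ z >= -1e-10         # small tolerance for floating-point error
print("all sampled z satisfied z^T K z >= 0")
```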
8 [8] Boosting

1. (4 pts) Determine whether the following statements regarding AdaBoost are true or false. You do not need to justify your answer.
   (a) (T/F) In AdaBoost, weights of the misclassified examples go up by the same multiplicative factor per iteration.
   (b) (T/F) In AdaGrad, we can think of each feature as having its own learning rate.
   (c) (T/F) The step size $\alpha_t$ in AdaBoost is constant for all $t$.
   (d) (T/F) The exponential loss of AdaBoost can always decrease to an arbitrarily small value.

2. (4 pts) In this problem, we are going to derive LogitBoost: We seek a model $H_T(\mathbf{x}) = \sum_{t=1}^{T} \alpha_t h_t(\mathbf{x})$ to minimize the logistic loss:
$$\ell(H) = \sum_{i=1}^{n} \log\left(1 + e^{-y_i \cdot H(\mathbf{x}_i)}\right),$$
where $y_i \in \{1, -1\}$ for all $i$ and $h_t(\mathbf{x}) \in \mathbb{R}$ for all $t$. For simplicity, we assume $\alpha_t = 1$ in this problem. After we have finished $t$ iterations and have an ensemble classifier $H_t(\mathbf{x})$, at iteration $t+1$ we want to add one more weak learner $h_{t+1}$ to the ensemble that minimizes the loss as much as possible:
$$h_{t+1} = \operatorname*{arg\,min}_{h \in \mathbb{H}} \ell(H_t + h).$$
   (a) (4 pts) In the framework of Anyboost, we select $h_{t+1}$ by
$$\operatorname*{arg\,min}_{h \in \mathbb{H}} \sum_{i=1}^{n} r_i h(\mathbf{x}_i),$$
for some $r_i$. Write the expression for $r_i$ in terms of $H_t(\mathbf{x}_i)$ and $y_i$. (Hint: in Anyboost, we make the first-order approximation to the loss $\ell(H_t + h) \approx \ell(H_t) + \langle \nabla \ell(H_t), h \rangle$, where $\langle \nabla \ell(H_t), h \rangle = \sum_{i=1}^{n} \frac{\partial \ell(H_t)}{\partial H_t(\mathbf{x}_i)} \cdot h(\mathbf{x}_i)$.)
   (b) [OPTIONAL, complete if you want to impress Margin the Dino and earn a few extra points] Anyboost makes a first-order approximation of the loss. We can also make the second-order approximation:
$$\ell(H_t + h) \approx \ell(H_t) + \langle \nabla \ell(H_t), h \rangle + \frac{1}{2} h^\top H_{H_t} h, \qquad (2)$$
where $(H_{H_t})_{ij} = \frac{\partial^2 \ell(H_t)}{\partial H_t(\mathbf{x}_i)\, \partial H_t(\mathbf{x}_j)}$ is the Hessian matrix and $h^\top H_{H_t} h = \sum_{i=1}^{n} \sum_{j=1}^{n} h(\mathbf{x}_i) \cdot h(\mathbf{x}_j) \cdot (H_{H_t})_{ij}$.
      i. Show that $(H_{H_t})_{ij} = 0$ when $j \neq i$.
      ii. Show that $(H_{H_t})_{ii} = \frac{e^{y_i \cdot H_t(\mathbf{x}_i)}}{\left(1 + e^{y_i \cdot H_t(\mathbf{x}_i)}\right)^2}$.
      iii. Show that minimizing the second-order approximation in Equation (2) is equivalent to solving the weighted least-squares regression problem below:
$$h_{t+1} = \operatorname*{arg\,min}_{h \in \mathbb{H}} \sum_{i=1}^{n} w_i \left(h(\mathbf{x}_i) - s_i\right)^2, \qquad (3)$$
where $w_i = \frac{e^{y_i \cdot H_t(\mathbf{x}_i)}}{\left(1 + e^{y_i \cdot H_t(\mathbf{x}_i)}\right)^2}$ and $s_i = y_i \left(1 + e^{-y_i \cdot H_t(\mathbf{x}_i)}\right)$. (Hint: (i) and (ii) would be helpful.)
Comments: Equation (3) is the update step of LogitBoost introduced in the original paper.
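Not part of the exam: a rough sketch of the LogitBoost update in Equation (3). At each iteration a regression stump is fit to the targets $s_i$ with weights $w_i$, both computed from the current ensemble scores $H_t(\mathbf{x}_i)$ exactly as stated above (with $\alpha_t = 1$). The dataset and the choice of DecisionTreeRegressor as the weak learner are illustrative assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)            # synthetic labels in {-1, +1}

H = np.zeros(len(y))                                  # H_0(x_i) = 0
for t in range(20):
    margins = y * H                                   # y_i * H_t(x_i)
    w = np.exp(margins) / (1 + np.exp(margins)) ** 2  # w_i as in Equation (3)
    s = y * (1 + np.exp(-margins))                    # s_i as in Equation (3)
    stump = DecisionTreeRegressor(max_depth=1)
    stump.fit(X, s, sample_weight=w)                  # weighted least squares, Equation (3)
    H += stump.predict(X)                             # add h_{t+1} with alpha = 1

loss = np.log1p(np.exp(-y * H)).sum()                 # logistic loss of the final ensemble
print(f"training logistic loss: {loss:.2f}, training error: {(np.sign(H) != y).mean():.2%}")
```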
How did it go? Anything you want to share with us (exam-related or not)? Feedback on the course?