Boston University EC503 – Introduction to Learning from Data – Spring 2023
Practice second midterm exam – Solutions

1. Personal Info
• Name:
• ID:

2. Overview
• You have 1 hour and 45 minutes to complete this exam.
• Work efficiently. Some questions are easier, some more difficult. Be sure to give yourself time to answer all of the easy ones, and avoid getting bogged down in the more difficult ones before you have answered the easier ones.

Question | Topic                                 | Max score | Your Score
---------|---------------------------------------|-----------|-----------
1        | True/False                            | 18        |
2        | Short Questions                       | 9         |
3        | MLE                                   | 3         |
4        | Boosting                              | 6         |
5        | Dimensionality Reduction and k-means  | 3         |
6        | Proof                                 | 6         |
7        | Neural Networks                       | 3         |
Total    |                                       | 48        |
1 [18 points, 2 points each] True/False (Write one (1) sentence justification)

1. true/false: The k-means algorithm is guaranteed to converge to the global optimum.
Solution: False. k-means is guaranteed to converge, but it can get stuck at a local minimum.

2. true/false: The EM algorithm is guaranteed to converge because it corresponds to a coordinate descent algorithm.
Solution: True. The EM algorithm is equivalent to coordinate descent on the G function.

3. true/false: The multiclass hinge loss is convex.
Solution: True. It is a max of convex functions, hence it is convex.

4. true/false: AdaBoost with stump classifiers as weak learners always achieves training error 0 with respect to the 0/1 loss.
Solution: False. To achieve training error equal to 0 we need the weak learner to have weighted error different from 0.5 on each round of boosting, and in general this is not guaranteed.

5. true/false: Decision Trees have a maximal depth of 3.
Solution: False. Any depth can be used.

6. true/false: When we use PCA as a pre-processing step for a linear binary classifier, we could discard useful information for the classification task.
Solution: True. PCA aims at minimizing the reconstruction error of the samples in the reduced space and does not use the information from the labels at all.

7. true/false: The 0/1 error on the training set of k-Nearest Neighbors is always 0, independent of k.
Solution: False. For k > 1 it is easy to construct an example where the k closest points to a training point have the opposite label.

8. true/false: Backpropagation computes an approximate gradient of a loss function for a neural network.
Solution: False. Backpropagation is an efficient algorithm to calculate the exact gradient of a composite function.
9. true/false: The use of ReLU activation functions completely removes the issue of vanishing gradients.
Solution: False. ReLU activations alleviate the problem compared to, e.g., a sigmoidal activation function, but the problem can still occur because it is also due to the repeated multiplication of small numbers.
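As a rough numerical illustration of the "repeated multiplication of small numbers" point (not part of the original solution), the sketch below compares a chain of sigmoid derivatives with the corresponding ReLU derivatives; the layer count of 20 and the zero pre-activations are arbitrary assumptions.

```python
import numpy as np

# The sigmoid derivative is at most 0.25, so chaining many layers
# multiplies many numbers <= 0.25 and the gradient factor collapses.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

depth = 20                    # assumed number of layers
z = np.zeros(depth)           # pre-activations at the "best case" z = 0

sigmoid_factor = np.prod(sigmoid(z) * (1 - sigmoid(z)))  # = 0.25 ** 20
relu_factor = np.prod(np.where(z >= 0, 1.0, 0.0))        # = 1.0 for active units

print(f"product of sigmoid derivatives over {depth} layers: {sigmoid_factor:.3e}")
print(f"product of ReLU derivatives over {depth} layers:    {relu_factor:.3e}")
```

Even in this best case the sigmoid factor is on the order of 1e-12, while active ReLU units pass the gradient through unchanged; in a real network the weight matrices also enter the product, which is why ReLU alone does not eliminate the issue.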
2 [9 Points] Short Questions

The following short questions should be answered with at most two sentences, and/or a picture. For yes/no questions, make sure to provide a short justification.

1. [3 points] You are given a set S of labeled training points. Explain how you would use them to decide the best k in k-Nearest Neighbors.
Solution: Assuming the dataset is big enough, we can divide it into a training set and a validation set. Then we pick the k that gives the smallest error on the validation set (see the sketch after this list).

2. [3 points] Explain the behavior of the approximation error in AdaBoost in relation to the rounds of boosting: does it decrease? Increase? Why?
Solution: The approximation error decreases with the rounds of boosting because the hypothesis class grows monotonically.

3. [3 points] Consider the following strategy for selecting k in k-means: run k-means with different values of k, and choose the k that minimizes the sum of squared differences between each point and the closest center on the training set. What is the potential problem of this strategy?
Solution: k-means achieves the minimum error of 0 when k equals the number of training samples. Selecting k to minimize the training error will therefore always give this solution, which is clearly useless.
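A minimal sketch of the validation-set selection described in question 1, assuming scikit-learn is available; the 80/20 split and the candidate grid are arbitrary choices not specified in the original solution.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def select_k(X, y, candidate_ks=(1, 3, 5, 7, 9, 15)):
    # Hold out 20% of the labeled set S as a validation set.
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

    best_k, best_err = None, np.inf
    for k in candidate_ks:
        clf = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
        err = 1.0 - clf.score(X_val, y_val)  # validation 0/1 error
        if err < best_err:
            best_k, best_err = k, err
    return best_k, best_err
```

With a small dataset, k-fold cross-validation would be a natural variant of the same idea.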
3 [3 points] Maximum Likelihood Estimation

1. [3 points] We have a regression training set S = {(x_1, y_1), ..., (x_m, y_m)}, where x_i ∈ R^d and y_i ∈ R for i = 1, ..., m. We want to use a generative model, assuming that y_i = ⟨w, x_i⟩ + ε_i, where ε_1, ..., ε_m are random variables i.i.d. from the Laplace distribution with mean 0 and unknown variance. Write the log likelihood and specify with respect to which parameters we have to maximize it.

Solution: The pdf of the zero-mean Laplace distribution parametrized by b is (1/(2b)) exp(−|x|/b). So, the y_i − ⟨w, x_i⟩ are i.i.d. Laplace random variables. Hence, the log likelihood is

L(S; w, b) = log ∏_{i=1}^m (1/(2b)) exp(−|y_i − ⟨w, x_i⟩|/b) = ∑_{i=1}^m ( −ln(2b) − |y_i − ⟨w, x_i⟩|/b ).

To use MLE, we have to maximize the log likelihood with respect to w and b.
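Not part of the original solution, but as a sketch of where this leads: for fixed b, maximizing over w amounts to minimizing the sum of absolute residuals, and setting the derivative with respect to b to zero gives b equal to the mean absolute residual. The helpers below (assumed names, using NumPy) just evaluate the log likelihood and that closed-form b.

```python
import numpy as np

def laplace_log_likelihood(X, y, w, b):
    # L(S; w, b) = sum_i [ -ln(2b) - |y_i - <w, x_i>| / b ]
    residuals = y - X @ w
    return np.sum(-np.log(2 * b) - np.abs(residuals) / b)

def mle_scale_given_w(X, y, w):
    # Setting dL/db = 0 gives b = mean absolute residual.
    return np.mean(np.abs(y - X @ w))
```

Because the w-term is ∑_i |y_i − ⟨w, x_i⟩|, the MLE for w is a least-absolute-deviations (L1) regression rather than least squares.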
4 [6 points] Boosting

1. [3 points] It is possible to show that AdaBoost minimizes the exponential loss on the training set:

L^exp_S(f) = (1/m) ∑_{i=1}^m exp(−y_i f(x_i)),

where y = +1 or −1 is the class label, x is the data and f(x) is the weighted sum of the weak learners. Show that the loss function L^exp_S(f) is an upper bound on the 0/1 loss function:

L^{0/1}_S(f) = (1/m) ∑_{i=1}^m 1[y_i ≠ sign(f(x_i))],

where 1[x] is 1 if x is true and 0 otherwise.

Solution: Note that 1[y_i ≠ sign(f(x_i))] is 0 if y_i f(x_i) > 0 and 1 otherwise. So, it is enough to observe that exp(−y_i f(x_i)) is non-negative when y_i f(x_i) > 0 and greater than or equal to 1 otherwise. (A numerical check of this bound is sketched at the end of this section.)

2. [3 points] Consider the training set in the figure, where the 'x' points are from class +1 and the 'o' points are from class −1. The figure illustrates the decision boundary (the black line) after the first iteration of an AdaBoost classifier with decision stumps as weak learners, where the points in the left half-plane are classified as positive and the ones in the right half-plane as negative. Draw (roughly) in a solid line the decision boundary at the second iteration. State your reasoning in 2-3 sentences.

[Figure: training points with the decision boundary after the first iteration.]

Solution:

[Figure: training points with the decision boundary after the second iteration.]

After the first round of boosting, there are two misclassified samples. So, in the second round these two points will have a bigger weight.
Hence, the algorithm will look for a stump classifier that, if possible, will correctly classify these two samples and make as few mistakes as possible on the other samples.
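A quick numerical sanity check of the bound from part 1 (not required by the exam); the random labels and scores below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.choice([-1, 1], size=1000)        # labels in {-1, +1}
f = rng.normal(size=1000)                 # arbitrary real-valued scores f(x_i)

zero_one_loss = np.mean(y != np.sign(f))  # L^{0/1}_S(f)
exp_loss = np.mean(np.exp(-y * f))        # L^{exp}_S(f)

# Per sample, exp(-y_i f(x_i)) >= 1[y_i != sign(f(x_i))], so the averages obey the same bound.
assert exp_loss >= zero_one_loss
print(zero_one_loss, exp_loss)
```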
5 [3 points] Dimensionality Reduction and k-means

1. [3 points] Consider the 2-dimensional datasets in the figures. Which of the following unit vectors expressed in coordinates (X1, X2) correspond to the theoretically correct directions of the 1st (p) and 2nd (q) principal components (via centered PCA) respectively for the data shown in the figures? Choose all correct options if there are multiple ones and justify your answer(s). (Points in Figure i) should be considered to be on an exact circle.)

(a) Figure i): p = [1, 0], q = [0, 1]; Figure ii): p = [1/√2, 1/√2], q = [−1/√2, 1/√2]
(b) Figure i): p = [1, 0], q = [0, 1]; Figure ii): p = [−1/√2, 1/√2], q = [1/√2, 1/√2]
(c) Figure i): p = [0, 1], q = [1, 0]; Figure ii): p = [1/√2, 1/√2], q = [−1/√2, 1/√2]
(d) Figure i): p = [0, 1], q = [1, 0]; Figure ii): p = [−1/√2, 1/√2], q = [1/√2, 1/√2]
(e) All of the above are correct

Solution: (a) and (c) are correct. Since the points in Figure i) lie on an exact circle, the covariance is isotropic and either ordering of the two axes is a valid choice of principal directions; for Figure ii), p must point along the direction of maximum variance of the centered data.
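As a sketch of how the principal directions discussed above are actually computed (not part of the exam solution; the helper name is an assumption, using NumPy):

```python
import numpy as np

def centered_pca(X):
    """Return the principal directions of X (rows are samples), largest variance first."""
    Xc = X - X.mean(axis=0)                  # center the data
    cov = Xc.T @ Xc / (Xc.shape[0] - 1)      # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh: symmetric matrix, ascending eigenvalues
    order = np.argsort(eigvals)[::-1]        # sort by decreasing variance
    return eigvecs[:, order], eigvals[order] # columns are p, q, ...
```

For data spread evenly on a circle, the two eigenvalues coincide and any orthonormal basis is a valid output, which is why both orderings in options (a) and (c) are acceptable.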
6 [6 points] Proof

In the lecture on k-means, we used an alternating minimization approach to minimize the objective function. In particular, we rewrote the objective function as

∑_{i=1}^m ∑_{j=1}^k ϕ_{i,j} ‖x_i − c_j‖²_2,

where ϕ_{i,j} = 1{x_i is assigned to cluster j} and each point is assigned to one cluster. We said that the algorithm uses two steps:

1. Minimize with respect to ϕ_{i,j} for i = 1, ..., m and j = 1, ..., k, with c_1, ..., c_k fixed;
2. Minimize with respect to c_1, ..., c_k, with ϕ_{i,j} fixed for i = 1, ..., m and j = 1, ..., k.

[3 Points] We said that the first step is obtained by assigning each point to its closest center. Why?

Solution: First of all, we observe that the objective function is written as a sum over the samples, so we can minimize over ϕ_{i,j} for each sample separately. That is, for each i we have to minimize ∑_{j=1}^k ϕ_{i,j} ‖x_i − c_j‖²_2, where ϕ_{i,1}, ..., ϕ_{i,k} are all zeros except for a single '1'. From this expression, we immediately see that for sample i we should put the '1' in the position j such that j = argmin_{n=1,...,k} ‖x_i − c_n‖²_2.

[3 Points] We said that the second step is obtained by setting c_j equal to the mean of the training points assigned to cluster j. Why?

Solution: Again, we observe that the objective function is written as a sum, this time over the clusters, because we can swap the order of the two sums. So, we can minimize with respect to each c_j separately, minimizing ∑_{i=1}^m ϕ_{i,j} ‖x_i − c_j‖²_2 with ϕ_{i,j} fixed. We can compute the gradient with respect to c_j, set it equal to 0 and solve, obtaining that c_j is the average of the points with ϕ_{i,j} = 1, that is, all the points assigned to cluster j.
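A minimal NumPy sketch of the two alternating steps described above (a sketch, not the course's reference implementation; initialization, stopping criteria, and empty-cluster handling are omitted):

```python
import numpy as np

def assign_step(X, centers):
    # Step 1: assign each point to its closest center (argmin over squared distances).
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)  # shape (m, k)
    return d2.argmin(axis=1)                                       # cluster index per point

def update_step(X, assignments, k):
    # Step 2: each center becomes the mean of the points assigned to it
    # (assumes every cluster receives at least one point).
    return np.stack([X[assignments == j].mean(axis=0) for j in range(k)])
```

Each step can only decrease the objective, which is why the alternating scheme converges, though possibly to a local minimum, as noted in question 1 of the True/False section.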
7 [3 points] Neural Networks

1. [3 points] Show that if the activation function of the hidden units is linear, a 2-layer neural network (1 input layer x, 1 hidden layer h and 1 output layer y) is equivalent to a 1-layer one (1 input layer, 1 output layer). Use your result to explain why a 2-layer network with linear hidden units and a sign function in the output layer cannot obtain training error zero using the 0/1 loss on the following training set

S = {((1, 1), 0), ((0, 0), 0), ((0, 1), 1), ((1, 0), 1)},

where the first two numbers of each element are the training sample in R² and the third one is the label.

Solution: Suppose the weights for the first layer form a matrix W_1 with biases b_1 and the weights for the second layer form a matrix W_2 with biases b_2. A 2-layer neural network with a linear activation function is:

y = W_2 h + b_2,    (1)
h = W_1 x + b_1.    (2)

This is equivalent to

y = W_2 (W_1 x + b_1) + b_2.

Let W = W_2 W_1 and b = W_2 b_1 + b_2. This is equivalent to

y = W x + b,

that is, a network without a hidden layer. So, the network with the sign function in the output layer is equivalent to sign(⟨w, x⟩ + b), and the training set S (the XOR problem) is not linearly separable. Hence, we cannot obtain training error 0 with respect to the 0/1 loss.
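A small numerical check of the collapse argument (purely illustrative; the layer widths and random weights are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 2)), rng.normal(size=4)  # hidden layer: R^2 -> R^4
W2, b2 = rng.normal(size=(1, 4)), rng.normal(size=1)  # output layer: R^4 -> R^1
x = rng.normal(size=2)

two_layer = W2 @ (W1 @ x + b1) + b2                   # linear hidden activation
W, b = W2 @ W1, W2 @ b1 + b2                          # collapsed single layer
one_layer = W @ x + b

assert np.allclose(two_layer, one_layer)              # the two networks agree exactly
```

Since the collapsed network is the sign of an affine function of x, its decision boundary is a single line in R², and no line separates {(1, 1), (0, 0)} from {(0, 1), (1, 0)}.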