final_exam_solutions

pdf

School

New York University *

*We aren’t endorsed by this school

Course

473

Subject

Computer Science

Date

Jan 9, 2024

Type

pdf

Pages

15

Uploaded by DeanFox3968

Report
Introduction to Machine Learning (CSCI-UA.473): Final Exam Solutions Instructor: Sumit Chopra December 21 st , 2021 1 Probability and Basics (20 Points) 1. Let X and Y be discrete random variables. You are given their joint distribution p(X, Y), and let D denote the training set and θ denote the parameters of your model. Answer the following questions: (a) [2 Points] Write the expression of the marginal distribution over X provided p ( X, Y ). [Sol:] p ( X ) = X y p ( X, y ) (1) (b) [2 Points] Write the conditional distribution of X given Y provided p ( X, Y ), p ( X ), and p ( Y ). [Sol:] p ( X | Y ) = p ( X, Y ) p ( Y ) (2) (c) [2 Points] Write the posterior distribution of Y given p ( X | Y ), p ( Y ), and p ( X ). [Sol:] p ( Y | X ) = p ( X, Y ) p ( X ) = p ( X | Y ) · p ( Y ) p ( X ) (3) (d) [2 Points] Write the expression of the posterior distribution of the parameters θ , given the prior p ( θ ) and likelihood of the data. [Sol:] p ( θ |D ) = p ( D , θ ) p ( D ) = p ( D| θ ) p ( θ ) p ( D ) (4) 2. [4 Points] Show that the Tanh function (tanh) and the Logistic Sigmoid function ( σ ) are related by tanh( a ) = 2 σ (2 a ) - 1 . 1
[Sol:] 2 σ (2 a ) - 1 = 2 1 + e - 2 a - 1 = 2 1 + e - 2 a - 1 + e - 2 a 1 + e - 2 a = 1 - e - 2 a 1 + e - 2 a = e a - e - a e a + e - a = tanh( a ) 3. [8 Points] Show that a general linear combination of Logistic Sigmoid functions of the form y ( x, w ) = w 0 + M X j =1 w j · σ x - μ j s is equivalent to a linear combination of tanh functions of the form y ( x, u ) = u 0 + M X j =1 u j · tanh x - μ j s and find the expression that relates the new parameters { u 1 , . . . , u M } to the original parameters { w 1 , . . . , w M } . [Sol:] If we take a j = ( x - μ j ) / 2 s , we can re-write the first equation as y ( x, w ) = w 0 + M X j =1 w j σ (2 a j ) = w 0 + M X j =1 w j 2 (2 σ (2 a j ) - 1 + 1) = u 0 + M X j =1 u j tanh( a j ) , where u j = w j / 2 for j = 1 , . . . , M and u 0 = w 0 + M j =1 w j / 2 2
2 Parametric Models (20 Points) 1. Let D = { ( x 1 , y 1 ) , . . . , ( x N , y N ) } be the training set where each training sample ( x i , y i ) is independently and identically distributed. The model is given by ˆ y i = w T x i + i , where i ∼ N (0 , σ 2 ). Answer the following questions: (a) [4 Points] Let Y = [ y 1 , . . . , y N ] be a vector of all the labels and X = [ x 1 , . . . , x N ] be a matrix of all the inputs. Write the expression for conditional likelihood p ( Y | X ). [Sol:] Y | X N ( W T X, σ 2 ) p ( Y | X ) = N Y i =1 1 2 πσ 2 e - ( y i - w T x i ) 2 2 σ 2 (b) [8 Points] Assume that the prior distribution of the parameters θ is gaussian: θ ∼ N (0 , β 2 I ). Show that computing the MAP estimate of the parameters is equivalent to minimizing a loss function composed of the mean squared error and an L 2 regularizer. [Sol:] θ MAP = arg max θ [ p ( Y | X, θ ) · p ( θ )] We already have shown the expression for p ( Y | X, θ ) in the previous part and p ( θ ) is provided as the gaussian N (0 , β 2 I ). θ MAP = arg min θ N Y i 1 2 πσ 2 e - ( y i - w T x i ) 2 2 σ 2 · 1 p 2 πβ 2 e - θ T θ 2 β 2 Now we can convert the argmax operation to an argmin by taking the negative log and to further simplify remove the constants, θ MAP = arg min θ N X i ( w T x i - y i ) 2 2 σ 2 + 1 2 β θ T θ Let λ = σ 2 β , thus we have : θ MAP = arg min θ 1 2 n X i ( w T x i - y i ) 2 + λθ T θ 3
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
2. [8 Points] In the following data set A , B , C are input binary random variables, and y is a binary output whose value we want to predict: A B C y 0 0 1 0 0 1 0 0 1 1 0 0 0 0 1 1 1 1 1 1 1 0 0 1 1 1 0 1 How will the Naive Bayes Classifier predict y given the input: A = 0, B = 0, C = 1? [Sol:] From the table above we have: P ( y = 0) = 3 / 7 P ( y = 1) = 4 / 7 P ( A = 0 | y = 0) = 2 / 3 P ( B = 0 | y = 0) = 1 / 3 P ( C = 1 | y = 0) = 1 / 3 P ( A = 0 | y = 1) = 1 / 4 P ( B = 0 | y = 1) = 1 / 2 P ( C = 1 | y = 1) = 1 / 2 Predicted y maximizes P ( A = 0 | y ) P ( B = 0 | y ) P ( C = 1 | y ) P ( y ). For y = 0 we have: P ( A = 0 | y = 0) P ( B = 0 | y = 0) P ( C = 1 | y = 0) P ( y = 0) = 0 . 0317 . For y = 1 we have: P ( A = 0 | y = 1) P ( B = 0 | y = 1) P ( C = 1 | y = 1) P ( y = 0) = 0 . 0357 . Thus the Naive Bayes Classifier will predict y = 1. 4
3 Support Vector Machines (20 Points) 1. [10 Points] Kernel functions implicitly define a mapping function φ ( · ) that transforms an input instance x ∈ < d to a high dimensional feature space Q by giving the form of a dot product in Q : K ( x i , x j ) = φ ( x i ) · φ ( x j ). Assume we use a kernel function of the form K ( x i , x j ) = exp( - 1 2 k x i - x j k 2 ). Thus we assume that there is some implicit unknown function φ ( x ) such that φ ( x i ) · φ ( x j ) = K ( x i , x j ) = exp( - 1 2 k x i - x j k 2 ) . Prove that for any two input instances x i and x j , the squared Euclidean distance of their corresponding points in the feature space Q is less than 2. That is prove that k φ ( x i ) - φ ( x j ) k 2 < 2. [Sol:] k φ ( x i ) - φ ( x j ) k 2 = ( φ ( x i ) - φ ( x j )) · ( φ ( x i ) - φ ( x j )) = φ ( x i ) · φ ( x i ) + φ ( x j ) · φ ( x j ) - 2 · φ ( x i ) φ ( x j ) = 2 - 2 exp( - 1 2 || x i - x j || 2 ) < 2 since exp( - 1 2 || x i - x j || 2 ) > 0. 2. [4 Points] Let M SV M be an SVM model which you’ve trained using some training set. Consider a new point that is correctly classified and distant from the decision boundary. Why would SVMs decision boundary be unaffected by this point, but the one learnt by logistic regression would be affected. [Sol:] This is because the Hinge loss used by the SVM will assign a zero weight to this distant point and hence it’ll not have any effect of the learning of the decision boundary. In contrast the negative log likelihood loss function optimized by the logistic regression will allocate some non-zero weight to this point. Hence the decision boundary for logistic regression will be affected. 3. Answer the following True/False questions and explain your answer in no more than two lines. (a) [3 Points] If we switch from a linear kernel to a higher order polynomial kernel the support vectors will not change. [Sol:] False. There is no guarantee that the support vectors would remain the same because the transformed feature vectors implicitly generated by the polynomial kernel are non-linear functions of the original input vectors and thus the support points for maximum margin separation in the feature space can be quite different. (b) [3 Points] The maximum margin decision boundary that the SVMs learn provide the best generalization error among all linear classifiers. 5
[Sol:] False. The maximum margin separating hyperplane is often a reasonable choice to pick among all the possible separating hyper-planes however there is no guarantee that this hyper-plane will provide the best generalization error among all possible hyper-planes. 6
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
4 Deep Learning and Neural Network (20 Points) 1. For the binary classification task, the cross-entropy loss is defined by E ( w ) = - N X n =1 { y n log ˆ y n + (1 - y n ) log(1 - ˆ y n ) } , where y n is the true label and ˆ y n is the predicted probability of the correct label given by ˆ y n = f ( x n , w ), and the function f is the neural network with parameters w . The above loss function assumes that the output of the function f is between 0 and 1: 0 f ( x n , w ) 1 (typically obtained by using a Logistic Sigmoid activation on top of the last layer of the neural network), and the data have target values y n ∈ { 0 , 1 } . (a) [6 Points] Derive the corresponding error function if we consider a network having an output - 1 f ( x n , w ) 1 and target values y 1 = 1 for class C 1 and y 2 = - 1 for class C 2 . [Sol:] This simply corresponds to a scaling and shifting of the binary outputs, which directly gives the activation function in the form y = 2 σ ( a ) - 1 , (5) where σ ( a ) = 1 / 1 + exp( - a ) the Logistic Sigmoid function. The corresponding error functions can then be constructed using the above equation by applying the inverse transform to y n and ˆ y n , yielding E ( w ) = - N X n =1 { y n log ˆ y n + (1 - y n ) log(1 - ˆ y n ) } (6) = - N X n =1 1 + y n 2 log 1 + ˆ y n 2 + (1 - 1 + y n 2 ) log(1 - 1 + ˆ y n 2 ) (7) = - 1 2 N X n =1 [(1 + y n ) log(1 + ˆ y n ) + (1 - y n ) log(1 - ˆ y n )] + N log 2 , (8) where the last term can be dropped since it is independent of w . (b) [2 Points] What would be the appropriate choice of output unit activation func- tion? [Sol:] To find the corresponding activation function we simply apply the linear transfor- 7
mation of the logistic sigmoid, which gives us y ( a ) = 2 σ ( a ) - 1 = 2 1 + e - a - 1 = 1 - e - a 1 + e - a = e a/ 2 - e - a/ 2 e a/ 2 + e - a/ 2 = tanh( a/ 2) . 2. [3 Points] List any three methods one can use to regularize neural networks? You DO NOT need to describe the methods in detail. A one line description should suffice. [Sol:] The three potential ways in which one can regularize neural networks are (a) Weight decay (b) Early stopping the training (c) Using dropout connections 3. Answer the following True/False questions and explain your answer in no more than two lines. (a) [3 Points] Convolutional neural networks are more flexible than fully connected neural networks because they can express a wider array of functions corresponding to different settings of their parameters. [Sol:] False. The neuron connection pattern of a convolutional network has a lot more structure (and hence restricted) compared to a fully connected neural network which has very flexible connections. (b) [3 Points] Consider a 1000 layer fully connected network of the form X n = W n , X n - 1 for n = 1 , . . . , 1000, where X 0 is the input to the network. Let us denote this network by N 1000 . Consider another network (call it N 2 ) which has two hidden layers and is of the following functional form: X 2 = f ( W 2 , f ( W 1 , X 0 )), where the function f () is the tanh function. The network N 1000 is significantly more expressive than network N 2 . [Sol:] False. The network N 2 is more expressive because it is truly a non-linear network (it is essentially a single hidden layer neural network which can also be considered as a universal approximator), compared to N 1 000 which is essentially a linear network (composition of linear operations is linear). (c) [3 Points] Batch gradient descent is less prone to getting stuck in a local minima because it computes the gradients of the loss function with all the examples in the training set before making the gradient update. [Sol:] False. Since in a batch gradient descent one always takes a step in the direction of steepest descent, which is computed in a greedy manner, there is a high chance 8
of getting stuck in a local minima if your initial weights are close to it. Whereas in stochastic gradient descent because the gradients are noisy (computed using a single example) they need not always point in the direction of steepest descent, which in-turn enables it to escape the nearby local minima. 9
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
5 Decision Trees, Ensembles, and Boosting (20 Points) 1. Let D = { ( x 1 , y 1 ) , . . . , ( x N , y N ) } are independently and identically drawn training sam- ples from the underlying unknown distribution P . Let D * denote the bootstrap dataset also obtained from the distribution P and consisting of observations x * i , y * i : i = 1 , . . . N . (Note that in practice we cannot do this since we do not have access to the distribution P . We can only simulate it by bootstrap sampling from D .) Let f ag ( x ) denote the ideal aggregate estimate of the model and ˆ f * ( x ) denote the prediction of the model trained on the bootstrapped sample. (a) [6 Points] Show that f ag ( x ) is a better estimate than ˆ f * ( x ). [Sol:] The expected value of the error of ˆ f * ( x ) is given by E P [ Y - ˆ f * ( x )] 2 = E P [ Y - f ag ( x ) + f ag ( x ) - ˆ f * ( x )] 2 (9) = E P [ Y - f ag ( x )] 2 + E P [ ˆ f * ( x ) - f ag ( x )] 2 (10) E P [ Y - f ag ( x )] 2 (11) (b) [2 Points] Using the above result as a basis, argue (in no more than a few lines) that Bagging will often lead to a reduction in the mean squared error. [Sol:] The outcome of Bagging is an aggregate estimate ˆ f ag ( x ) which is estimated by simulating the bootstrapping process described above. The difference between the two processes is that in Bagging the bootstrap samples are created from 2. Answer the following True/False questions and explain your answer in no more than two lines. (a) [3 Points] A two layer neural network with linear activation functions is essentially a weighted combination of linear separators trained on a given data set. The boosting algorithm built on linear separators also finds a linear combination of the linear separators. Therefore these two algorithms are equivalent and will give the same result. [Sol:] False. While in both cases multiple linear classifiers are combined to form a single model, in the case of neural networks the classifiers are trained independent of each other while in the case of boosting every subsequent linear classifier is trained on a weighted training set whose weights are dependent on the performance of the previous classifiers. Thus in case of boosting the linear classifiers are not independent resulting in a different aggregate model. (b) [3 Points] Decision trees are learnt by minimizing the information gain. [Sol:] False. Decision trees are learnt by maximizing the information gain. (c) [3 Points] The coefficients α assigned to the classifiers assembled by AdaBoost are always non-negative. [Sol:] True. The coefficients are always positive. (d) [3 Points] One way to prevent a Decision Tree model from over fitting is combin- ing it with other Decision Trees trained on bootstrapped samples of the training set. 10
[Sol:] True. This process is called Bagging. 11
6 Dimensionality Reduction and Non-Parametric Meth- ods (20 Points) 1. Consider the following 1-dimensional data set X = [ - 3 , - 2 , - 1 , 1 , 2 , 99] . (12) Assume that you want to cluster this data set into two clusters and you start with the initial means: μ 0 = - 3 and μ 1 = 3. Answer the following questions. (a) [3 Points] What will be the clusters chosen by the k-means algorithm? [Sol:] The initial cluster assignments will be C 0 = [ - 3 , - 2 , - 1] and C 1 = [1 , 2 , 99]. The means will be μ 0 = - 2 and μ 1 = 34. At the end of first iteration the cluster assignments will be C 0 = [ - 3 , - 2 , - 1 , 1 , 2] and C 1 = [99], and the new means will be μ 0 = - 3 / 5 and μ 1 = 99. After this the clusters will not change. (b) [3 Points] Suppose that instead of computing the cluster means at each step, we compute the cluster medians. This algorithm is called K-medians. Starting with the same initial cluster centers as with K-means, what will be the resulting cluster centers? [Sol:] The initial cluster assignments will be the same as above: C 0 = [ - 3 , - 2 , - 1] and C 1 = [1 , 2 , 99]. The medians will be τ 0 = - 2 and τ 1 = 2. For the second cluster assignment the clusters will be the same and hence the medians. Therefore, K- medians has converged. K-medians will ignore outliers since the median is robust to the data where as K-means will result in a separate cluster for the outlier. 2. In a univariate local linear regression, the estimator is of the form: ˆ f ( x ) = β 1 + β 2 x, (13) and the solution for the parameter vector β = [ β 1 , β 2 ] is obtained by minimizing the weighted least squares error: L ( β 1 , β 2 ) = N X i =1 W i ( x )( y i - β 1 - β 2 x i ) 2 , (14) where W i ( x ) = K (( x i - x ) /h ) N j =1 K (( x i - x ) /h ) , (15) and K is a kernel with bandwidth h . Observe that the weighted least squares error can be expressed in the matrix form as L ( β 1 , β 2 ) = ( Y - ) T W ( Y - ) , (16) where Y is the vector of N labels in the training set, W is an N × N diagonal matrix with weight of each training example on the diagonal, and the matrix A is given by A = 1 x 1 . . . . . . 1 x N . (17) 12
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
(a) [6 Points] Derive an expression in matrix form for the solution vector ˆ β that minimizes the weighted least squares. [Sol:] We have L ( β ) = ( Y - ) T W ( Y - ) . Differentiating this loss with respect to the parameters β , we get ∂L ( β ) ∂β = 2 A T WAβ - 2 A T W T Y. Therefore the solution ˆ β satisfies the following normal equations: A T WA ˆ β = A T W T Y. And if A T WA is invertible then the solution is given by ˆ β = ( A T WA ) - 1 A T W T Y. Note that W = W T , so the solution can be written in terms of either. (b) [2 Points] Why is this a non-parametric model? Give one advantage and one disadvantage of a non-parametric method over a parametric method. [Sol:] This is a non-parametric model because it performs locally linear fits on the data, and is essentially training one (local) linear model at every point in the training set. Notice that W i ( x ) and hence the solution ˆ β depends on x . Thus we are fitting a different set of linear parameters at every point x . Therefore the total number of parameters can be larger than n. 3. Answer the following True/False questions and explain your answer in no more than two lines. (a) [3 Points] The solution obtained by the K -NN algorithm using large values of K will have high bias and low variance? [Sol:] True. A large value of K will lead to smooth decision boundaries which will not change with a small change in the training set. Hence the solution learnt will have low variance and high bias. (b) [3 Points] When training a Gaussian Mixture Model, the gradient descent algo- rithm has a tendency of getting stuck in a local minima, whereas the EM algorithm does not. [Sol:] False. The EM algorithm is essentially doing a coordinate descent optimization over a collection of model parameters and latent variables and the step taken in each iteration is greedy which tries to maximize the likelihood. It stops when one reaches the point of zero gradient with respect to the parameters. There is no guarantee that this point will be a global minima. algorithm will not 13
7 General ML Principles (30 Points) 1. We are provided with a training set D train , a validation set D val , and a test set D test . We are also given a cost function R which takes as input a dataset and a model and returns the cost, e.g., R ( D train , M ), where M is the model. We have at our disposal 100 hypothesis sets H 1 , H 2 , . . . , H 100 . We use gradient descent based iterative optimization algorithm to learn the models. Answer the following questions: (a) [4 Points] Within a hypothesis set H i , the optimization algorithm was run start- ing from a random model M 0 for 1000 steps, resulting in a sequence of 1001 models ( M 0 , M 1 , . . . , M 1000 ). Which of these models should I pick as the final model. [Sol:] The model M i which gives the lowest cost on the validation set D val should be picked as the final model. This cost is given by R ( D val , M i ). (b) [4 Points] Based on the criterion in your answer above, what should I report as its generalization error? [Sol:] The generalization error can be reported as the cost on the test set R ( D test , M i ) (c) [4 Points] I have now selected 100 models ( ˆ M 1 , ˆ M 2 , . . . , ˆ M 100 ), one corresponding to each hypothesis set. How do I decide which one of these 100 models to use? [Sol:] The model ˆ M i which minimizes the cost function R ( D val , ˆ M i ) should be chosen. (d) [4 Points] I need to report how well this selected model will work. What do I report? [Sol:] The performance of a model is reported on the test set. Thus, R ( D test , ˆ M i ) should be reported. 2. [5 Points] Even though Gaussian Process (GP) are modeled as multivariate Gaussian distribution, they are considered as non-parametric models. Explain why (in a few lines). [Sol:] A Gaussian process can be thought of as a function over the data distribution. As the number of data points increase, the parameters of this function increase as well. This function can be hence thought of as having infinite parameters, which is why GPs are considered non-parametric. 3. Answer the following True/False questions and explain your answer in no more than two lines. (a) [3 Points] Overfitting is more likely when the hypothesis space is small. [Sol:] False. A small hypothesis space has a high bias and low variance. Thus we are less likely to find a hypothesis from a small hypothesis space to fit the data perfectly and hence overfit. (b) [3 Points] Give N data points, the training error converges to the true error as N → ∞ . [Sol:] True. If we assume the data to be i.i.d then as N → ∞ , the training error will converge to the true error as we would have captured the entire data distribution. 14
(c) [3 Points] In AdaBoost the weighted training error t of the t th weak classifier on training data with weights D t tends to increase as a function of t . [Sol:] True. Over the course of boosting iterations the weak classifiers are forced to try to classify more difficult examples. The weights of the examples which are repeatedly misclassified by the weak classifiers will increase. The weighted training error t of the t th weak classifier on the training data therefore tends to increase. 15
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help