Introduction to Machine Learning (CSCI-UA.473): Final Exam
Instructor: Sumit Chopra
December 21st, 2021

General Instructions

There are 7 sections to the exam. Each section has multiple questions. Answering some questions may take longer than others. The score associated with each question is mentioned alongside it. For the True/False questions a 1-2 line explanation backing your choice is essential. No points will be allotted if the explanation is missing or wrong, regardless of whether the binary answer is correct.

The exam starts at 2:00 PM and will finish at 3:50 PM. An additional 15 minutes will be provided for you to compile your work and upload it to Gradescope. Please note that after 4:05 PM you will not be allowed to submit your answers. Please manage your time judiciously: do not spend too much time on any particular question, and leave ample time at the end for compiling and uploading your work.

Feel free to use as many scratch sheets as you like. You will be required to upload all the scratch sheets as part of your solution. Please submit the PDF of your solution along with the scratch sheets within the allotted time. The solutions can be either handwritten notes or typeset in LaTeX. If you are submitting handwritten notes, please compile them into a single PDF file and submit it as part of the solution.

When writing your solutions, please clearly indicate the question to which you are responding. For example, if you are addressing Part b of Question 3 of Section 1, then label your answer as 1.3.b. That is, the format should be <Section>.<Question>.<Part>.
There is a strict honor code for this test. The answers should be entirely your own work. Consulting with anyone about this exam in any way while taking it is strictly prohibited. Calculators, books, or any other online or offline reference material are not allowed. Violations of this policy will be treated very seriously. By printing your name and Net ID below you pledge to abide by these rules. Please upload this sheet along with your solutions to Gradescope.

Name:

NYU Net ID:
1 Probability and Basics (20 Points)

1. Let X and Y be discrete random variables. You are given their joint distribution p(X, Y). Let D denote the training set and θ denote the parameters of your model. Answer the following questions:

(a) [2 Points] Write the expression for the marginal distribution over X, given p(X, Y).

(b) [2 Points] Write the conditional distribution of X given Y, in terms of p(X, Y), p(X), and p(Y).

(c) [2 Points] Write the posterior distribution of Y, given p(X | Y), p(Y), and p(X).

(d) [2 Points] Write the expression for the posterior distribution of the parameters θ, given the prior p(θ) and the likelihood of the data.

2. [4 Points] Show that the Tanh function (tanh) and the Logistic Sigmoid function (σ) are related by

    tanh(a) = 2σ(2a) − 1.

3. [8 Points] Show that a general linear combination of Logistic Sigmoid functions of the form

    y(x, w) = w_0 + Σ_{j=1}^{M} w_j · σ((x − μ_j)/s)

is equivalent to a linear combination of tanh functions of the form

    y(x, u) = u_0 + Σ_{j=1}^{M} u_j · tanh((x − μ_j)/(2s)),

and find the expression that relates the new parameters {u_1, ..., u_M} to the original parameters {w_1, ..., w_M}.
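For illustration only (not part of the exam), here is a minimal numerical sketch that checks the identity in Question 2, tanh(a) = 2σ(2a) − 1, on a grid of points; the helper name sigmoid is our own.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    a = np.linspace(-5.0, 5.0, 101)
    # tanh(a) and 2*sigmoid(2a) - 1 should agree up to floating-point error.
    print(np.allclose(np.tanh(a), 2.0 * sigmoid(2.0 * a) - 1.0))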
2 Parametric Models (20 Points)

1. Let D = {(x_1, y_1), ..., (x_N, y_N)} be the training set, where each training sample (x_i, y_i) is independently and identically distributed. The model is given by ŷ_i = w^T x_i + ε_i, where ε_i ∼ N(0, σ²). Answer the following questions:

(a) [4 Points] Let Y = [y_1, ..., y_N] be a vector of all the labels and X = [x_1, ..., x_N] be a matrix of all the inputs. Write the expression for the conditional likelihood p(Y | X).

(b) [8 Points] Assume that the prior distribution of the parameters θ is Gaussian: θ ∼ N(0, β²I). Show that computing the MAP estimate of the parameters is equivalent to minimizing a loss function composed of the mean squared error and an L2 regularizer.

2. [8 Points] In the following data set A, B, C are binary input random variables, and y is a binary output whose value we want to predict:

    A  B  C  y
    0  0  1  0
    0  1  0  0
    1  1  0  0
    0  0  1  1
    1  1  1  1
    1  0  0  1
    1  1  0  1

How will the Naive Bayes Classifier predict y given the input A = 0, B = 0, C = 1?
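As an aside (not part of the exam), here is a minimal sketch of how a Naive Bayes prediction is computed on a small binary table such as the one above, assuming maximum-likelihood estimates of the class prior and per-feature conditionals with no smoothing; the variable names are our own.

    # Rows of the table above, as (A, B, C, y).
    data = [(0, 0, 1, 0), (0, 1, 0, 0), (1, 1, 0, 0), (0, 0, 1, 1),
            (1, 1, 1, 1), (1, 0, 0, 1), (1, 1, 0, 1)]
    query = (0, 0, 1)  # A = 0, B = 0, C = 1

    def naive_bayes_score(label):
        rows = [r for r in data if r[3] == label]
        prior = len(rows) / len(data)                     # p(y = label)
        likelihood = 1.0
        for j, value in enumerate(query):                 # product of p(feature_j = value | y)
            likelihood *= sum(r[j] == value for r in rows) / len(rows)
        return prior * likelihood

    scores = {label: naive_bayes_score(label) for label in (0, 1)}
    print(scores, max(scores, key=scores.get))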
3 Support Vector Machines (20 Points)

1. [10 Points] Kernel functions implicitly define a mapping function φ(·) that transforms an input instance x ∈ ℝ^d to a high-dimensional feature space Q by giving the form of a dot product in Q: K(x_i, x_j) = φ(x_i) · φ(x_j). Assume we use a kernel function of the form K(x_i, x_j) = exp(−(1/2) ‖x_i − x_j‖²). Thus we assume that there is some implicit unknown function φ(x) such that

    φ(x_i) · φ(x_j) = K(x_i, x_j) = exp(−(1/2) ‖x_i − x_j‖²).

Prove that for any two input instances x_i and x_j, the squared Euclidean distance of their corresponding points in the feature space Q is less than 2. That is, prove that ‖φ(x_i) − φ(x_j)‖² < 2.

2. [4 Points] Let M_SVM be an SVM model which you have trained using some training set. Consider a new point that is correctly classified and distant from the decision boundary. Why would the SVM's decision boundary be unaffected by this point, while the one learnt by logistic regression would be affected?

3. Answer the following True/False questions and explain your answer in no more than two lines.

(a) [3 Points] If we switch from a linear kernel to a higher-order polynomial kernel, the support vectors will not change.

(b) [3 Points] The maximum-margin decision boundary that SVMs learn provides the best generalization error among all linear classifiers.
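Purely as an illustration (not part of the exam), here is a small numerical sketch related to Question 1. It expands the feature-space distance through the kernel using the standard identity ‖φ(x_i) − φ(x_j)‖² = K(x_i, x_i) + K(x_j, x_j) − 2K(x_i, x_j) and checks the bound on random inputs; the helper names are our own.

    import numpy as np

    def rbf_kernel(xi, xj):
        # K(x_i, x_j) = exp(-0.5 * ||x_i - x_j||^2)
        return np.exp(-0.5 * np.sum((xi - xj) ** 2))

    def feature_space_sq_distance(xi, xj):
        # ||phi(x_i) - phi(x_j)||^2 written purely in terms of kernel evaluations.
        return rbf_kernel(xi, xi) + rbf_kernel(xj, xj) - 2.0 * rbf_kernel(xi, xj)

    rng = np.random.default_rng(0)
    for _ in range(5):
        xi, xj = rng.normal(size=3), rng.normal(size=3)
        d2 = feature_space_sq_distance(xi, xj)
        print(f"{d2:.4f} < 2: {d2 < 2}")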
4 Deep Learning and Neural Networks (20 Points)

1. For the binary classification task, the cross-entropy loss is defined by

    E(w) = − Σ_{n=1}^{N} { y_n log ŷ_n + (1 − y_n) log(1 − ŷ_n) },

where y_n is the true label and ŷ_n is the predicted probability of the correct label, given by ŷ_n = f(x_n, w), where the function f is the neural network with parameters w. The above loss function assumes that the output of the function f is between 0 and 1, i.e., 0 ≤ f(x_n, w) ≤ 1 (typically obtained by using a Logistic Sigmoid activation on top of the last layer of the neural network), and that the data have target values y_n ∈ {0, 1}.

(a) [6 Points] Derive the corresponding error function if we instead consider a network whose output satisfies −1 ≤ f(x_n, w) ≤ 1 and target values y_n = 1 for class C_1 and y_n = −1 for class C_2.

(b) [2 Points] What would be the appropriate choice of output unit activation function?

2. [3 Points] List any three methods one can use to regularize neural networks. You DO NOT need to describe the methods in detail; a one-line description should suffice.

3. Answer the following True/False questions and explain your answer in no more than two lines.

(a) [3 Points] Convolutional neural networks are more flexible than fully connected neural networks because they can express a wider array of functions corresponding to different settings of their parameters.

(b) [3 Points] Consider a 1000-layer fully connected network of the form X_n = W_n X_{n−1} for n = 1, ..., 1000, where X_0 is the input to the network. Let us denote this network by N_1000. Consider another network (call it N_2) which has two hidden layers and is of the following functional form: X_2 = f(W_2, f(W_1, X_0)), where the function f() is the tanh function. The network N_1000 is significantly more expressive than network N_2.

(c) [3 Points] Batch gradient descent is less prone to getting stuck in a local minimum because it computes the gradients of the loss function using all the examples in the training set before making the gradient update.
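For illustration only (not part of the exam), here is a minimal sketch of the {0, 1}-target cross-entropy loss defined in Question 1, assuming a single sigmoid output unit; the function names and toy numbers are our own.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def binary_cross_entropy(y, y_hat, eps=1e-12):
        # E(w) = -sum_n [ y_n log y_hat_n + (1 - y_n) log(1 - y_hat_n) ]
        y_hat = np.clip(y_hat, eps, 1.0 - eps)  # guard against log(0)
        return -np.sum(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))

    logits = np.array([2.0, -1.0, 0.5])   # toy network pre-activations
    y = np.array([1.0, 0.0, 1.0])         # {0, 1} targets
    print(binary_cross_entropy(y, sigmoid(logits)))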
5 Decision Trees, Ensembles, and Boosting (20 Points)

1. Let D = {(x_1, y_1), ..., (x_N, y_N)} be independently and identically drawn training samples from the underlying unknown distribution P. Let D* denote a bootstrap dataset also obtained from the distribution P and consisting of observations (x*_i, y*_i), i = 1, ..., N. (Note that in practice we cannot do this since we do not have access to the distribution P; we can only simulate it by bootstrap sampling from D.) Let f_ag(x) denote the ideal aggregate estimate of the model and f̂*(x) denote the prediction of the model trained on the bootstrapped sample.

(a) [6 Points] Show that f_ag(x) is a better estimate than f̂*(x).

(b) [2 Points] Using the above result as a basis, argue (in no more than a few lines) that Bagging will often lead to a reduction in the mean squared error.

2. Answer the following True/False questions and explain your answer in no more than two lines.

(a) [3 Points] A two-layer neural network with linear activation functions is essentially a weighted combination of linear separators trained on a given data set. The boosting algorithm built on linear separators also finds a linear combination of the linear separators. Therefore these two algorithms are equivalent and will give the same result.

(b) [3 Points] Decision trees are learnt by minimizing the information gain.

(c) [3 Points] The coefficients α assigned to the classifiers assembled by AdaBoost are always non-negative.

(d) [3 Points] One way to prevent a Decision Tree model from overfitting is combining it with other Decision Trees trained on bootstrapped samples of the training set.
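As an aside (not part of the exam), here is a minimal sketch of the bootstrap-aggregation (Bagging) procedure that Question 1 refers to, assuming scikit-learn is available for the base decision trees; the synthetic data and hyperparameters are our own.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(200, 1))
    y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)

    def bagged_predict(X_train, y_train, X_query, n_bootstrap=25):
        # Average the predictions of base models trained on bootstrap resamples of D.
        preds = []
        for _ in range(n_bootstrap):
            idx = rng.integers(0, len(X_train), size=len(X_train))  # sample with replacement
            tree = DecisionTreeRegressor(max_depth=4).fit(X_train[idx], y_train[idx])
            preds.append(tree.predict(X_query))
        return np.mean(preds, axis=0)

    X_query = np.linspace(-3, 3, 5).reshape(-1, 1)
    print(bagged_predict(X, y, X_query))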
6 Dimensionality Reduction and Non-Parametric Methods (20 Points)

1. Consider the following 1-dimensional data set:

    X = [−3, −2, −1, 1, 2, 99].   (1)

Assume that you want to cluster this dataset into two clusters and you start with the initial means μ_0 = −3 and μ_1 = 3. Answer the following questions.

(a) [3 Points] What will be the clusters chosen by the k-means algorithm?

(b) [3 Points] Suppose that instead of computing the cluster means at each step, we compute the cluster medians. This algorithm is called K-medians. Starting with the same initial cluster centers as with K-means, what will be the resulting cluster centers?

2. In a univariate local linear regression, the estimator is of the form

    f̂(x) = β_1 + β_2 x,   (2)

and the solution for the parameter vector β = [β_1, β_2] is obtained by minimizing the weighted least squares error

    L(β_1, β_2) = Σ_{i=1}^{N} W_i(x) (y_i − β_1 − β_2 x_i)²,   (3)

where

    W_i(x) = K((x_i − x)/h) / Σ_{j=1}^{N} K((x_j − x)/h),   (4)

and K is a kernel with bandwidth h. Observe that the weighted least squares error can be expressed in matrix form as

    L(β_1, β_2) = (Y − Aβ)^T W (Y − Aβ),   (5)

where Y is the vector of the N labels in the training set, W is an N × N diagonal matrix with the weight of each training example on the diagonal, and A is the N × 2 matrix whose i-th row is [1, x_i]:

    A = [ 1  x_1
          ⋮   ⋮
          1  x_N ].   (6)

(a) [6 Points] Derive an expression in matrix form for the solution vector β̂ that minimizes the weighted least squares error.

(b) [2 Points] Why is this a non-parametric model? Give one advantage and one disadvantage of a non-parametric method over a parametric method.
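For illustration only (not part of the exam), here is a minimal numerical sketch of the local linear regression estimator described in Question 2, assuming a Gaussian kernel and the standard weighted least squares solution β̂ = (AᵀWA)⁻¹AᵀWY; the toy data and bandwidth are our own.

    import numpy as np

    def local_linear_predict(x, x_train, y_train, h=0.5):
        # Kernel weights W_i(x) with a Gaussian kernel K(u) = exp(-u^2 / 2).
        k = np.exp(-0.5 * ((x_train - x) / h) ** 2)
        W = np.diag(k / k.sum())
        # Design matrix A with rows [1, x_i].
        A = np.column_stack([np.ones_like(x_train), x_train])
        # Weighted least squares: beta = (A^T W A)^{-1} A^T W Y.
        beta = np.linalg.solve(A.T @ W @ A, A.T @ W @ y_train)
        return beta[0] + beta[1] * x

    x_train = np.linspace(0.0, 2.0 * np.pi, 50)
    y_train = np.sin(x_train) + np.random.default_rng(0).normal(scale=0.1, size=50)
    print(local_linear_predict(1.0, x_train, y_train))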
3. Answer the following True/False questions and explain your answer in no more than two lines.

(a) [3 Points] The solution obtained using large values of K will have high bias and low variance.

(b) [3 Points] When training a Gaussian Mixture Model, the gradient descent algorithm has a tendency to get stuck in a local minimum, whereas the EM algorithm does not.
7 General ML Principles (30 Points)

1. We are provided with a training set D_train, a validation set D_val, and a test set D_test. We are also given a cost function R which takes as input a dataset and a model and returns the cost, e.g., R(D_train, M), where M is the model. We have at our disposal 100 hypothesis sets H_1, H_2, ..., H_100. We use a gradient-descent-based iterative optimization algorithm to learn the models. Answer the following questions:

(a) [4 Points] Within a hypothesis set H_i, the optimization algorithm was run starting from a random model M_0 for 1000 steps, resulting in a sequence of 1001 models (M_0, M_1, ..., M_1000). Which of these models should I pick as the final model?

(b) [4 Points] Based on the criterion in your answer above, what should I report as its generalization error?

(c) [4 Points] I have now selected 100 models (M̂_1, M̂_2, ..., M̂_100), one corresponding to each hypothesis set. How do I decide which one of these 100 models to use?

(d) [4 Points] I need to report how well this selected model will work. What do I report?

2. [5 Points] Even though Gaussian Processes (GPs) are modeled as multivariate Gaussian distributions, they are considered non-parametric models. Explain why (in a few lines).

3. Answer the following True/False questions and explain your answer in no more than two lines.

(a) [3 Points] Overfitting is more likely when the hypothesis space is small.

(b) [3 Points] Given N data points, the training error converges to the true error as N → ∞.

(c) [3 Points] In AdaBoost, the weighted training error ε_t of the t-th weak classifier on training data with weights D_t tends to increase as a function of t.
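Purely as an illustration (not part of the exam, and not necessarily the intended answer), here is a minimal sketch of one common held-out-data selection protocol using a cost function R(dataset, model) of the kind described in Question 1; the model objects and cost function below are hypothetical stand-ins.

    def select_model(candidates, R, D_val):
        # Evaluate every candidate on the held-out validation set and keep the cheapest.
        costs = [R(D_val, M) for M in candidates]
        best = min(range(len(candidates)), key=lambda i: costs[i])
        return candidates[best], costs[best]

    def report_performance(selected_model, R, D_test):
        # Quality is reported on data never used for training or selection.
        return R(D_test, selected_model)

    # Toy usage with hypothetical numeric "models" and a squared-error-style cost.
    models = [0.1 * k for k in range(5)]
    R = lambda dataset, M: sum((y - M) ** 2 for y in dataset)
    best_model, val_cost = select_model(models, R, D_val=[0.2, 0.3])
    print(best_model, val_cost, report_performance(best_model, R, D_test=[0.25]))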