Machine Learning Final Exam Solutions: Comprehensive Guide

Introduction to Machine Learning (CSCI-UA.473): Final Exam Solutions
Instructor: Sumit Chopra
December 21st, 2021

1 Probability and Basics (20 Points)

1. Let $X$ and $Y$ be discrete random variables. You are given their joint distribution $p(X, Y)$. Let $\mathcal{D}$ denote the training set and $\theta$ denote the parameters of your model. Answer the following questions:

(a) [2 Points] Write the expression of the marginal distribution over $X$ given $p(X, Y)$.

[Sol:]
\[ p(X) = \sum_{y} p(X, y) \tag{1} \]

(b) [2 Points] Write the conditional distribution of $X$ given $Y$ provided $p(X, Y)$, $p(X)$, and $p(Y)$.

[Sol:]
\[ p(X \mid Y) = \frac{p(X, Y)}{p(Y)} \tag{2} \]

(c) [2 Points] Write the posterior distribution of $Y$ given $p(X \mid Y)$, $p(Y)$, and $p(X)$.

[Sol:]
\[ p(Y \mid X) = \frac{p(X, Y)}{p(X)} = \frac{p(X \mid Y) \cdot p(Y)}{p(X)} \tag{3} \]

(d) [2 Points] Write the expression of the posterior distribution of the parameters $\theta$, given the prior $p(\theta)$ and the likelihood of the data.

[Sol:]
\[ p(\theta \mid \mathcal{D}) = \frac{p(\mathcal{D}, \theta)}{p(\mathcal{D})} = \frac{p(\mathcal{D} \mid \theta)\, p(\theta)}{p(\mathcal{D})} \tag{4} \]

2. [4 Points] Show that the Tanh function ($\tanh$) and the Logistic Sigmoid function ($\sigma$) are related by $\tanh(a) = 2\sigma(2a) - 1$.

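[Sol:]
\[
2\sigma(2a) - 1
= \frac{2}{1 + e^{-2a}} - 1
= \frac{2}{1 + e^{-2a}} - \frac{1 + e^{-2a}}{1 + e^{-2a}}
= \frac{1 - e^{-2a}}{1 + e^{-2a}}
= \frac{e^{a} - e^{-a}}{e^{a} + e^{-a}}
= \tanh(a)
\]
The second-to-last step multiplies the numerator and denominator by $e^{a}$.

3. [8 Points] Show that a general linear combination of Logistic Sigmoid functions of the form
\[ y(x, w) = w_0 + \sum_{j=1}^{M} w_j \cdot \sigma\!\left(\frac{x - \mu_j}{s}\right) \]
is equivalent to a linear combination of tanh functions of the form
\[ y(x, u) = u_0 + \sum_{j=1}^{M} u_j \cdot \tanh\!\left(\frac{x - \mu_j}{s}\right) \]
and find the expression that relates the new parameters $\{u_1, \ldots, u_M\}$ to the original parameters $\{w_1, \ldots, w_M\}$.

[Sol:] If we take $a_j = (x - \mu_j) / 2s$, we can rewrite the first equation using $\sigma(2a_j) = \tfrac{1}{2}\left(\tanh(a_j) + 1\right)$ from the previous question:
\[
y(x, w) = w_0 + \sum_{j=1}^{M} w_j\, \sigma(2a_j)
= w_0 + \sum_{j=1}^{M} \frac{w_j}{2}\left(2\sigma(2a_j) - 1 + 1\right)
= u_0 + \sum_{j=1}^{M} u_j \tanh(a_j),
\]
where $u_j = w_j / 2$ for $j = 1, \ldots, M$ and $u_0 = w_0 + \sum_{j=1}^{M} w_j / 2$.

Note that the tanh functions are evaluated at $a_j = (x - \mu_j)/(2s)$, so the equivalence holds with the scale of the tanh basis functions equal to $2s$ rather than $s$.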
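The identity from Question 2 and the reparameterisation from Question 3 can be checked numerically. The sketch below is only an illustration: the helper names (`sigmoid`, `y_sigmoid`, `y_tanh`, `to_tanh_params`) and the example values for the weights, centres, and scale are assumptions made for this check and do not come from the exam.

```python
import math

def sigmoid(a):
    """Logistic sigmoid function."""
    return 1.0 / (1.0 + math.exp(-a))

def y_sigmoid(x, w0, w, mu, s):
    """Linear combination of sigmoids evaluated at (x - mu_j) / s."""
    return w0 + sum(wj * sigmoid((x - mj) / s) for wj, mj in zip(w, mu))

def y_tanh(x, u0, u, mu, s):
    """Linear combination of tanh functions evaluated at a_j = (x - mu_j) / (2 s)."""
    return u0 + sum(uj * math.tanh((x - mj) / (2.0 * s)) for uj, mj in zip(u, mu))

def to_tanh_params(w0, w):
    """Map sigmoid weights to tanh weights: u_j = w_j / 2, u_0 = w_0 + sum_j w_j / 2."""
    u = [wj / 2.0 for wj in w]
    u0 = w0 + sum(w) / 2.0
    return u0, u

# Arbitrary example values (assumed, for illustration only).
w0, w = 0.5, [1.0, -2.0, 0.3]
mu, s = [-1.0, 0.0, 2.0], 0.7
u0, u = to_tanh_params(w0, w)

# Largest deviation from tanh(a) = 2*sigmoid(2a) - 1 over a few test points.
max_gap_identity = max(
    abs(math.tanh(a) - (2.0 * sigmoid(2.0 * a) - 1.0))
    for a in [-3.0, -0.5, 0.0, 1.2, 4.0]
)

# Largest deviation between the two model forms over a few inputs.
max_gap_models = max(
    abs(y_sigmoid(x, w0, w, mu, s) - y_tanh(x, u0, u, mu, s))
    for x in [-2.0, -0.3, 0.0, 1.5, 3.0]
)

print(max_gap_identity, max_gap_models)
```

Both gaps should be at the level of floating-point rounding error, which is consistent with the two forms describing the same function once the tanh argument uses $2s$.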

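2 Parametric Models (20 Points)

1. Let $\mathcal{D} = \{(x_1, y_1), \ldots, (x_N, y_N)\}$ be the training set, where each training sample $(x_i, y_i)$ is independently and identically distributed. The model is given by $\hat{y}_i = w^T x_i + \epsilon_i$, where $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$. Answer the following questions:

(a) [4 Points] Let $Y = [y_1, \ldots, y_N]$ be a vector of all the labels and $X = [x_1, \ldots, x_N]$ be a matrix of all the inputs. Write the expression for the conditional likelihood $p(Y \mid X)$.

[Sol:] Since $Y \mid X \sim \mathcal{N}(w^T X, \sigma^2 I)$ and the samples are independent,
\[ p(Y \mid X) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(y_i - w^T x_i)^2}{2\sigma^2}\right) \]

(b) [8 Points] Assume that the prior distribution of the parameters $\theta$ is Gaussian: $\theta \sim \mathcal{N}(0, \beta^2 I)$. Show that computing the MAP estimate of the parameters is equivalent to minimizing a loss function composed of the mean squared error and an $L_2$ regularizer.

[Sol:] Here the parameters are $\theta = w$, and
\[ \theta_{MAP} = \arg\max_{\theta} \left[ p(Y \mid X, \theta) \cdot p(\theta) \right] \]
We have already shown the expression for $p(Y \mid X, \theta)$ in the previous part, and $p(\theta)$ is given as the Gaussian $\mathcal{N}(0, \beta^2 I)$:
\[
\theta_{MAP} = \arg\max_{\theta} \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(y_i - w^T x_i)^2}{2\sigma^2}\right) \cdot \frac{1}{(2\pi\beta^2)^{d/2}} \exp\!\left(-\frac{\theta^T \theta}{2\beta^2}\right),
\]
where $d$ is the dimension of $\theta$. Taking the negative log converts the argmax into an argmin, and dropping the terms that do not depend on $\theta$ gives
\[ \theta_{MAP} = \arg\min_{\theta} \sum_{i=1}^{N} \frac{(w^T x_i - y_i)^2}{2\sigma^2} + \frac{1}{2\beta^2} \theta^T \theta \]
Multiplying the objective by $\sigma^2$ (which does not change the minimizer) and letting $\lambda = \sigma^2 / \beta^2$, we have:
\[ \theta_{MAP} = \arg\min_{\theta} \frac{1}{2} \sum_{i=1}^{N} (w^T x_i - y_i)^2 + \frac{\lambda}{2} \theta^T \theta \]
The first term is the squared error summed over the training set (proportional to the mean squared error), and the second term is an $L_2$ regularizer.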
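The equivalence between the MAP estimate and $L_2$-regularized least squares can be illustrated numerically. The following is a minimal sketch under stated assumptions: it uses a scalar weight and scalar inputs so that the regularized least-squares minimizer has the closed form $w = \sum_i x_i y_i / (\sum_i x_i^2 + \lambda)$ with $\lambda = \sigma^2 / \beta^2$. The synthetic data, the values of `sigma2`, `beta2`, and `true_w`, and the function names are all hypothetical and chosen only for this check.

```python
import random

def neg_log_posterior(w, xs, ys, sigma2, beta2):
    """Negative log posterior for a scalar weight w, up to additive constants."""
    sse = sum((w * x - y) ** 2 for x, y in zip(xs, ys))
    return sse / (2.0 * sigma2) + w * w / (2.0 * beta2)

def ridge_closed_form(xs, ys, lam):
    """Minimizer of 0.5 * sum((w*x - y)^2) + 0.5 * lam * w^2 for scalar w."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

# Synthetic data (assumed values, for illustration only).
random.seed(0)
sigma2, beta2 = 0.25, 4.0   # noise variance and prior variance
true_w = 1.5
xs = [random.uniform(-2.0, 2.0) for _ in range(50)]
ys = [true_w * x + random.gauss(0.0, sigma2 ** 0.5) for x in xs]

# Regularization strength implied by the Gaussian prior.
lam = sigma2 / beta2
w_map = ridge_closed_form(xs, ys, lam)

# The regularized least-squares solution should also minimize the negative
# log posterior: nearby weights must give a strictly larger value.
f0 = neg_log_posterior(w_map, xs, ys, sigma2, beta2)
is_minimum = all(
    neg_log_posterior(w_map + d, xs, ys, sigma2, beta2) > f0
    for d in (-1e-2, -1e-3, 1e-3, 1e-2)
)

print(w_map, is_minimum)
```

A smaller prior variance `beta2` gives a larger `lam` and therefore shrinks `w_map` more strongly toward zero, which matches the role of the $L_2$ term in the derivation.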
