HW0_solutions (1)

pdf

School

University of California, Los Angeles *

*We aren’t endorsed by this school

Course

M146

Subject

Computer Science

Date

Jan 9, 2024

Type

pdf

Pages

7

Uploaded by phanta_ch

Report
CS M146: Introduction to Machine Learning UCLA Fall Quarter 2023 Prof. Aditya Grover Release: 28 Sep 2023 Homework 0 Due: 5 Oct 2023, Thursday, before 11:59 pm Except for the programming questions, you should be able to solve these problems by hand. Problem 1 (Multivariate Calculus) Consider y = x sin( z ) e x . What is the partial derivative of y with respect to x ? Solution: ∂y ∂x = sin( z ) ( xe x + e x ) = (1 x ) sin( z ) e x Problem 2 (Linear Algebra) (a) Consider the matrix X and the vectors y and z below: X = 2 4 1 3 y = 1 3 z = 2 3 i. What is the inner product y T z ? Solution: y T z = 1 · 2 + 3 · 3 = 11 ii. What is the product X y ? Solution: X y = 2 · 1 + 4 · 3 1 · 1 + 3 · 3 = 14 10 iii. Is X invertible? If so, give the inverse; if not, explain why not. Solution: det( X ) = 2 · 3 4 · 1 = 2 ̸ = 0 so X is invertible. For a 2 × 2 matrix, A 1 = a b c d 1 = 1 det( A ) d b c a = 1 ad bc d b c a . Thus, X 1 = 1 2 3 4 1 2 . 1
iv. What is the rank of X ? Solution: The rows (or equivalently, columns) of X are linearly independent, so rank( X ) = 2. (b) Vector Norms [4 pts] Draw the regions corresponding to vectors x R 2 with following norms (you can hand draw or use software for this question): Solution: Each l p -norm corresponds to a unit ball. In R 2 , this corresponds to a disc of diameter 2 centered at the origin. i. || x || 2 1 (Recall || x || 2 = q i x 2 i .) Solution: the unit circle (origin-centered circle with diameter of 2), and its interior ii. || x || 0 1 (Recall || x || 0 = i : x i ̸ =0 1.) Solution: the axes iii. || x || 1 1 (Recall || x || 1 = i | x i | .) Solution: origin-centered diamond with diameter of 2, and its interior iv. || x || 1 (Recall || x || = max i | x i | .) Solution: origin-centered square with side length of 2, and its interior (c) Matrix Decompositions [6 pts] i. Give the definition of the eigenvalues and the eigenvectors of a square matrix. Solution: An eigenvector of a square matrix is a vector that points in a direction that is invariant under the associated linear transformation. In other words—, if v is a vector that is not zero, then it is an eigenvector of a square matrix A if A v is a scalar multiple of v . This condition could be written as the equation A v = λ v , where λ is a number (scalar) known as the eigenvalue associated with the eigenvector v . ii. Find the eigenvalues and eigenvectors of A = 2 1 1 2 . Solution: The eigenvectors v of this transformation satisfy the equation A v = λ v . Rearrange this equation to obtain ( A λ I ) v = 0, which has a solution only when its determinant | A λ I | equals zero. Set the determinant to zero to obtain the polynomial equation, p ( λ ) = | A λ I | = (2 λ ) 2 1 = ( λ 2 4 λ + 4) 1 = λ 2 4 λ + 3 = ( λ 3)( λ 1) = 0 , known as the characteristic polynomial of the matrix A . In this case, it has the roots λ = 1 and λ = 3. For λ = 1, the equation becomes, ( A I ) v = 1 1 1 1 v 1 v 2 = 0 0 , which has the solution, v = 1 1 . 2
For λ = 3, the equation becomes, ( A 3 I ) w = 1 1 1 1 w 1 w 2 = 0 0 , which has the solution, w = 1 1 . Thus, the vectors v and w are eigenvectors of A associated with the eigenvalues λ = 1 and λ = 3, respectively. iii. For any positive integer k , show that the eigenvalues of A k are λ k 1 , λ k 2 , . . . , λ k n , the k th powers of the eigenvalues of matrix A , and that each eigenvector of A is still an eigen- vector of A k . Solution: We will prove by induction. Base case (given): A v = λ v Inductive step: Prove that if A k v = λ k v , then A k +1 v = λ k +1 v . This holds because A k +1 v = A ( A k v ) = A ( λ k v ) = λ k ( A v ) = λ k +1 v . Since both the basis and the inductive step have been performed, by mathematical induction, the statement A k v = λ k v holds for all positive integer k . Q.E.D. (d) Vector and Matrix Calculus [5 pts] Consider the vectors x and a and the symmetric matrix A . i. What is the first derivative of a T x with respect to x ? Solution: Let a i and x i denote the elements of a and x , respectively. Then f ( x ) = a T x = n i =1 a i x i so ∂f ( x ) ∂x k = ∂x k n i =1 a i x i = a k and x a T x = a . ii. What is the first derivative of x T A x with respect to x ? What is the second derivative? Solution: Let a ij denote the element in row i , column j of A . Then f ( x ) = x T A x = n i =1 n j =1 x i a ij x j so ∂f ( x ) ∂x k = ∂x k n X i =1 n X j =1 a ij x i x j = ∂x k X i ̸ = k X j ̸ = k a ij x i x j + X i ̸ = k a ik x i x k + X j ̸ = k a kj x k x j + a kk x 2 k = 0 + X i ̸ = k a ik x i + X j ̸ = k a kj x j + 2 a kk x k = n X i =1 a ik x i + n X j =1 a kj x j = 2 n X i =1 a ki x i where the last equality follows from a ij = a ji . Note that the k th entry of x f ( x ) is just the inner product of the k th row of A and x ; therefore, x x T A x = 2 A x . To find the second derivative, 2 f ( x ) ∂x k x = ∂x k ∂f ( x ) ∂x = ∂x k " 2 n X i =1 a ℓi x i # = 2 a ℓk = 2 a kℓ Therefore, 2 x x T A x = 2 A . 3
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
(e) Geometry [5 pts] i. Show that the vector w is orthogonal to the line w T x + b = 0. (Hint: Consider two points x 1 , x 2 that lie on the line. What is the inner product w T ( x 1 x 2 )?) Solution: This line is all x such that w T x + b = 0. Consider two such points, called x 1 and x 2 . Note that x 1 x 2 is a vector parallel to our line. Also note that w T x 1 + b = 0 = w T x 2 + b = w T x 1 = w T x 2 = w T ( x 1 x 2 ) = 0 , which shows that the vector w is orthogonal to our line. ii. Argue that the distance from the origin to the line w T x + b = 0 is b || w || 2 . Solution: We can show this by first finding the closest point to the origin that lies on this line, and then finding the distance to this point. Let a be the closest point to the origin the lies on the line. We can write a as a = min a T a s.t. w T a + b = 0 So we first solve this constrained optimization problem and find a . We start by taking the derivative of the objective, setting it to zero, and using Lagrange multipliers (i.e. setting the derivative of the Lagrangian a T a λ ( w T a + b ) to zero). We can write a a T a λ ( w T a + b ) = 2 a λ w = 0 to find that a = λ 2 w . Hence, plugging this value for a into the constraint, we can write w T a + b = w T λ 2 w + b = λ 2 w T w + b = 0 Thus, λ = 2 b w T w a = b w T w w Once we have a , we can compute the distance between a and the origin to get distance = || a || = q ( a ) T a = s b w T w 2 w T w = b w T w w T w = b w T w = b || w || Problem 3 (Probability and Statistics) (a) Consider a sample of data S obtained by flipping a coin five times. X i , i ∈ { 1 , . . . , 5 } is a random variable that takes a value 0 when the outcome of coin flip i turned up heads, and 1 when it turned up tails. Assume that the outcome of each of the flips does not depend on the outcomes of any of the other flips. The sample obtained S = ( X 1 , X 2 , X 3 , X 4 , X 5 ) = (1 , 1 , 0 , 1 , 0). 4
i. What is the sample mean for this data? Solution: µ = 1 n n i =1 X i = 3 5 ii. What is the unbiased sample variance ? Solution: σ 2 = 1 n 1 n i =1 ( X i µ ) 2 = 1 4 3 ( 1 3 5 ) 2 + 2 ( 0 3 5 ) 2 = 1 4 ( 3 · 4 25 + 2 · 9 25 ) = 3 10 Give partial credit if student uses the biased estimator – σ 2 = 1 n n i =1 ( X i µ ) 2 iii. What is the probability of observing this data assuming that a coin with an equal probability of heads and tails was used? ( i.e. , The probability distribution of X i is P ( X i = 1) = 0 . 5, P ( X i = 0) = 0 . 5.) Solution: P ( S ) = ( 1 2 ) 5 = 1 32 iv. Note the probability of this data sample would be greater if the value of the probability of heads P ( X i = 1) was not 0 . 5 but some other value. What is the value that maximizes the probability of the sample S ? [Optional: Can you prove your answer is correct?] Solution: To maximize the probability of sample S , p should take on the sample mean: p = 3 5 . Optional: Let p = P ( X = 1). We wish to find the value of p , 0 p 1 that maximizes the likelihood L ( p ) = p 3 (1 p ) 2 . To find the critical points, find p such that dL dp p = 0. dL dp = 2 p 3 (1 p ) + 3 p 2 (1 p ) 2 = p 2 (1 p )( 2 p + 3(1 p )) = p 2 (1 p )(3 5 p ) p = 0 , 3 5 , 1 Thus, L attains a minimum of 0 at p = 0 , 1, and L attains a maximum of ( 3 5 ) 3 ( 2 5 ) 2 = 108 3125 at p = 3 5 . (We could also have computed d 2 L dp 2 to determine the minimum and maximum.) v. Given the following joint distribution between X and Y , what is P ( X = T | Y = b )? P ( X, Y ) Y a b c X T 0 . 2 0 . 1 0 . 2 F 0 . 05 0 . 15 0 . 3 Solution: P ( X = T | Y = b ) = P ( X = T,Y = b ) P ( Y = b ) = 0 . 1 0 . 1+0 . 15 = 0 . 4. (b) Match the distribution name to its formula. (a) Gaussian (i) p x (1 p ) 1 x , when x ∈ { 0 , 1 } ; 0 otherwise (b) Exponential (ii) 1 b a when a x b ; 0 otherwise (c) Uniform (iii) ( n x ) p x (1 p ) n x (d) Bernoulli (iv) λe λx when x 0; 0 otherwise (e) Binomial (v) 1 (2 π ) σ 2 exp ( 1 2 σ 2 ( x µ ) 2 ) Solution: a-v, b-iv, c-ii, d-i, e-iii (c) What is the mean and variance of a Bernoulli ( p ) random variable? Solution: The PMF for the Bernouilli distribution is f ( k ; p ) = ( p if k = 1 1 p if k = 0 = p k (1 p ) 1 k for k ∈ { 0 , 1 } 5
Let q = 1 p . Then µ = E [ X ] = X k ∈{ 0 , 1 } kf ( k ; p ) = (0)( q ) + (1)( p ) = p σ 2 = Var( X ) = E ( X µ ) 2 = E X 2 ( E [ X ]) 2 = (0) 2 ( q ) + (1) 2 ( p ) ( p ) 2 = p (1 p ) = pq H = E [ ln( f ( k ; p ))] = ( ln q )( q ) + ( ln p )( p ) = q ln q p ln p (d) If the variance of a zero-mean random variable X is σ 2 , what is the variance of 2 X ? What about the variance of X + 2? Solution: If all values are scaled by a constant, the variance is scaled by the square of that constant: Var( aX ) = a 2 Var( X ). Thus, Var(2 X ) = 4Var( X ) = 4 σ 2 . Variance is invariant with respect to changes in a location parameter; that is, if a constant is added to all values of the variable, the variance is unchanged: Var( X + a ) = Var( X ). Thus, Var( X + 2) = Var( X ) = σ 2 . Problem 4 (Algorithms) Big-O notation For each pair ( f, g ) of functions below, list which of the following are true: f ( n ) = O ( g ( n )), g ( n ) = O ( f ( n )), or both. Briefly justify your answers. (a) f ( n ) = ln ( n ) , g ( n ) = lg ( n ). Note that ln denotes log to the base e and lg denotes log to the base 2. Solution: Both since both functions are equivalent upto a multiplicative constant (b) f ( n ) = 3 n , g ( n ) = n 10 Solution: g ( n ) = O ( f ( n )) since f ( n ) grows much more rapidly as n becomes large. (c) f ( n ) = 3 n , g ( n ) = 2 n Solution: g ( n ) = O ( f ( n )) since f ( n ) grows much more rapidly as n becomes large. Problem 5 (Programming Skills) Start familiarizing yourself with the Python libraries numpy and matplotlib by completing the following exercises. You may find the following references helpful: https://numpy.org/doc/stable/reference/random/generated/numpy.random.multivariate_ normal.html http://docs.scipy.org/doc/numpy/reference/generated/numpy.linalg.eig.html You do not have to submit code, simply paste screenshots of the code and the plots in your pdf. (a) Sampling from multivariate probability distributions i. Draw 1000 samples x = ( x 1 x 2 ) from a 2-dimensional Gaussian distribution with mean ( 0 0 ) and identity covariance matrix, i.e. p ( x ) = 1 (2 π ) 2 exp || x || 2 2 , and make a scatter plot ( x 1 vs x 2 ). Solution: See attached code. 6
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
ii. How does the scatter plot change if the mean is ( 1 1 ) ? Solution: See solution code. The data moves up and to the left. Namely, the center of the data moves from roughly [0 , 0] to roughly [ 1 , 1]. iii. How does the (original) scatter plot change if you double the variance of each component? Solution: See solution code. The data become more “spread out”. iv. How does the (original) scatter plot change if the covariance matrix is changed to ( 1 0 . 5 0 . 5 1 ) ? Solution: See solution code. The data become skewed so that they stretch from the lower left to the upper right. v. How does the (original) scatter plot change if the covariance matrix is changed to ( 1 0 . 5 0 . 5 1 ) ? Solution: See solution code. The data become skewed so that they stretch from the upper left to the bottom right. (b) There are now lots of really interesting data sets publicly available to play with. They range in size, quality and the type of features and have resulted in many new machine learning techniques being developed. Find a public, free, supervised (i.e. it must have features and labels), machine learning dataset. For example, you can use one of the open datasets released by the US government here: https://catalog.data.gov/dataset Once you have found the data set, provide the following information: i. The name of the data set and its link. ii. A brief (i.e. 2-3 sentences) description of the data set including what the features are and what is being predicted. iii. The number of examples in the data set. iv. The number of features for each example. For this question, do not just copy and paste the description from the website or the paper; reference it, but use your own words. Your goal here is to understand the data set, where it came from, and potential issues involved. 7