CS M146, Fall 2023 Practice Final Exam

Instructions

1. This exam is worth 80 points. You have 3 hours to complete it.
2. You are allowed to consult 2 US Letter/A4-sized pages (4 sides) of reference notes.
3. You can quote results from the lecture slides, HWs, discussion problems, and practice problems as part of your solutions to the exam problems.
4. No network access is allowed.
5. All answers must be justified. Unsupported answers will not receive credit. An incorrect answer supported by substantially correct calculations and explanations might still receive partial credit. Blank answers will receive zero credit.

Good luck!

I attest that I have not given or received aid in this examination, and that I have done my share and taken an active part in seeing to it that others as well as myself uphold the spirit and letter of the UCLA Honor Code.

Name:
UID:
Signature:

For instructor use

  Question       Score
  1              / 42
  2              / 15
  3              / 11
  4              / 12
  Total score:   / 80
1. [42 points total] Short questions

There are 5 sub-questions covering a few themes of the course. Each sub-question has 4-5 statements. For each of the 5×1 + 4×4 = 21 statements below, state True or False and provide a 1-2 line justification for your choice. 1 point for the correct assertion and 1 point for the explanation.

(a) [10 points] General Machine Learning

i. The complexity of a hypothesis class is equal to the number of unknown parameters.
Answer: False. Two hypothesis classes can have the same representation power but a different number of parameters.

ii. Statistical generative models are trained to learn probability distributions that approximate the data distribution as closely as possible.
Answer: True, by definition.

iii. Pruning is used to mitigate underfitting in decision trees.
Answer: False; pruning helps mitigate overfitting.

iv. Both kNN and k-means do not need access to any labels for the input datapoints.
Answer: False; kNN needs access to the labels for the training dataset.

v. SGD with momentum helps reduce the noise fluctuations in the gradient estimates.
Answer: True; SGD with momentum uses moving-average estimates to reduce the gradient noise.
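The claim in statement (v) can be illustrated with a short simulation. The sketch below is not part of the exam; the setup (a noisy gradient of f(θ) = θ²/2 and the helper names) is an assumption made for illustration. It shows that the momentum buffer, an exponential moving average of past stochastic gradients, fluctuates far less than the raw gradient estimates do.

```python
# Illustrative sketch only (not exam material): compare the spread of raw
# stochastic gradients with the spread of the momentum (moving-average) buffer.
import random
import statistics

def noisy_grad(theta):
    # True gradient of f(theta) = theta^2 / 2 is theta; zero-mean noise mimics
    # a stochastic mini-batch estimate.
    return theta + random.gauss(0.0, 1.0)

random.seed(0)
theta, beta, velocity = 2.0, 0.9, 0.0
raw, smoothed = [], []
for t in range(5100):
    g = noisy_grad(theta)                          # raw stochastic gradient at a fixed theta
    velocity = beta * velocity + (1 - beta) * g    # momentum buffer: moving average of gradients
    if t >= 100:                                   # discard burn-in while the average warms up
        raw.append(g)
        smoothed.append(velocity)

print("std of raw gradients      :", round(statistics.stdev(raw), 2))       # about 1.0
print("std of momentum estimates :", round(statistics.stdev(smoothed), 2))  # about 0.2
```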
(b) [8 points] Support Vector Machines

i. Consider 2 hyperplanes with parameter vectors θ and 5θ respectively. The margin of any datapoint x w.r.t. the second hyperplane is larger than its margin w.r.t. the first hyperplane.
Answer: False. Margin is invariant to scale and is the same for both hyperplanes.

ii. The soft-margin SVM optimization problem has n constraints.
Answer: False. Soft-margin SVM has 2n constraints.

iii. For a classifier f_θ(x) = sign(θ^T x) with fixed parameter vector θ, the hinge loss at any training instance is always greater than or equal to the perceptron loss.
Answer: True. Hinge loss is greater than perceptron loss when y θ^T x < 1 and equal otherwise.

iv. The hard-margin SVM and the soft-margin SVM with C = 0 will always have the same solution no matter how the data is distributed.
Answer: False. Hard-margin SVM is infeasible for linearly inseparable data, but soft-margin SVM is feasible.
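Statement (iii) can be checked numerically. The sketch below is illustrative, not exam material; it assumes numpy and sweeps the margin z = y θ^T x to confirm that the hinge loss max(0, 1 - z) never falls below the perceptron loss max(0, -z), with equality exactly when z ≥ 1.

```python
# Illustrative check of (b)(iii): hinge loss dominates perceptron loss pointwise.
import numpy as np

def hinge_loss(z):
    return np.maximum(0.0, 1.0 - z)      # hinge loss as a function of the margin z = y * theta^T x

def perceptron_loss(z):
    return np.maximum(0.0, -z)           # perceptron loss as a function of the same margin

z = np.linspace(-3.0, 3.0, 601)          # sweep of margins
assert np.all(hinge_loss(z) >= perceptron_loss(z))                            # hinge >= perceptron everywhere
assert np.all(np.isclose(hinge_loss(z[z >= 1]), perceptron_loss(z[z >= 1])))  # equal once the margin is >= 1
print("hinge loss >= perceptron loss for every margin; equal for margins >= 1")
```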
(c) [8 points] Neural Networks and Deep Learning

i. Dropout is a regularization technique for deep learning in which we randomly deactivate hidden neurons during training and testing.
Answer: False. We only randomly deactivate hidden neurons in the training stage. In the testing stage, we have the full network.

ii. A single filter in a convolutional layer applied over an input image always has the same number of weight parameters as a fully connected layer applied to the same input image (flattened to a vector).
Answer: False. Filters are applied to small windows of the input and have fewer parameters.

iii. Computations during the forward pass are never used for computing gradients during backpropagation.
Answer: False. We often need to cache the computations for efficient computation of the gradients.

iv. In an autoregressive generative model for an object x ∈ R^3, the joint distribution is factorized as p(x) = p(x_1 | x_2, x_3) p(x_2 | x_1, x_3) p(x_3 | x_1, x_2).
Answer: False. The joint distribution is factorized as p(x) = p(x_1) p(x_2 | x_1) p(x_3 | x_1, x_2).
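The train/test asymmetry in statement (i) is easy to see in code. The sketch below is an illustrative inverted-dropout layer (an assumption made for this note, not the exam's code): units are zeroed out and rescaled only when training is True; at test time the activations pass through unchanged.

```python
# Illustrative inverted-dropout sketch for (c)(i): masking happens only in training.
import numpy as np

def dropout(h, p_drop=0.5, training=True, rng=None):
    """Apply inverted dropout to an activation vector h."""
    if not training or p_drop == 0.0:
        return h                                   # test time: use the full network, no masking
    rng = np.random.default_rng() if rng is None else rng
    mask = rng.random(h.shape) >= p_drop           # keep each unit independently with prob 1 - p_drop
    return h * mask / (1.0 - p_drop)               # rescale so the expected activation matches test time

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    h = np.ones(8)
    print("train:", dropout(h, 0.5, training=True, rng=rng))   # some units zeroed, survivors scaled by 2
    print("test :", dropout(h, 0.5, training=False))           # unchanged
```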
(d) [8 points] Ensemble Models

i. In both bagging and boosting, the base models that are part of the ensemble have to be trained sequentially.
Answer: False. Bagging can be parallelized: the different models are fitted independently of each other. Boosting cannot be parallelized, and it can become too expensive to fit several complex models sequentially.

ii. Bootstrap replicates create multiple copies of the training dataset by sampling training instances without replacement.
Answer: False. The training instances are sampled with replacement.

iii. Ensembles will yield bad results if there is significant diversity in the predictions of the base models.
Answer: False. As long as each individual model makes meaningful predictions, diversity among the base models is desirable; achieving diversity is the goal of an ensemble.

iv. Suppose you are working on a binary classification problem and there are three base models, each with 70% accuracy. Suppose you want to ensemble these models using the majority voting method. The maximum possible accuracy you can get is 100%.
Answer: True. It is possible that every example is predicted correctly by at least two of the three base models (i.e., no example is misclassified by more than one model) while each base model still has 70% accuracy; in that case the majority vote is correct on every example.
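Statements (ii) and (iv) can both be demonstrated with a few lines of numpy. The sketch below is illustrative only; the toy labels and predictions are made up so that every example is misclassified by at most one of the three base models, which is the situation the answer to (iv) describes.

```python
# Illustrative sketch for (d)(ii) and (d)(iv): sampling with replacement and majority voting.
import numpy as np

rng = np.random.default_rng(0)

# (ii) A bootstrap replicate draws indices *with* replacement, so duplicates appear
# and some original points are left out of the replicate.
n = 10
boot_idx = rng.integers(0, n, size=n)
print("bootstrap indices:", boot_idx, "| distinct points used:", len(set(boot_idx.tolist())))

# (iv) Majority vote: each example below is misclassified by at most one base model,
# so the 2-out-of-3 vote recovers the correct label on every example.
y_true = np.array([1, 1, 0, 0, 1, 0])
preds = np.array([
    [1, 1, 0, 0, 1, 1],   # base model 1: wrong only on the last example
    [0, 1, 0, 0, 1, 0],   # base model 2: wrong only on the first example
    [1, 0, 0, 0, 1, 0],   # base model 3: wrong only on the second example
])
majority = (preds.sum(axis=0) >= 2).astype(int)
print("majority-vote accuracy:", (majority == y_true).mean())   # 1.0
```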
(e) [8 points] General Unsupervised Learning

i. For k-means clustering, we should set the number of clusters k = n so that we minimize the k-means inertia objective function on the training dataset. Here n is the number of datapoints in the training dataset.
Answer: False. We should select an optimal k using a downstream performance measure or by selecting the k at which the drop in inertia slows down.

ii. k-means clustering always converges to the same clustering regardless of the initialization of the cluster centers.
Answer: False. k-means clustering is sensitive to the initialization of the cluster centers.

iii. In PCA, the smaller the eigenvalue λ_i is, the less important the corresponding principal component v_i.
Answer: True. The relative importance of each principal component v_i increases/decreases as its corresponding eigenvalue λ_i increases/decreases.

iv. In PCA, if we use d principal components to reconstruct a data point x ∈ R^d, then we will get a perfect reconstruction x̃, i.e. x̃ = x.
Answer: True. Since the data point is d-dimensional and we use m = d principal components for the reconstruction, the reconstruction error will be 0.
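Statement (iv) is easy to verify on a random dataset. The following sketch (illustrative, not exam material) projects centered data onto all d principal components and maps it back; because the components form an orthonormal basis of R^d, the reconstruction error is zero up to floating-point precision.

```python
# Illustrative check of (e)(iv): using all d principal components reconstructs the data exactly.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))                 # 50 points in d = 4 dimensions
mu = X.mean(axis=0)
Xc = X - mu                                  # center the data

cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)       # columns of eigvecs are the principal components

Y = Xc @ eigvecs                             # principal representation using all d components
X_rec = Y @ eigvecs.T + mu                   # map back to the original space and undo the centering
print("max reconstruction error:", np.max(np.abs(X_rec - X)))   # on the order of 1e-15
```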
2. [15 points total] AdaBoost

Suppose we have a dataset with the following examples (x_1, x_2) ∈ R^2 in Table 1 (i is the example index). The dataset is also visualized for convenience in Figure 1.

Table 1: Dataset and results for AdaBoost

                              1st iteration    2nd iteration    final
   i    x_1   x_2   Label      w_1     ŷ_1      w_2     ŷ_2      ŷ
  (1)   (2)   (3)    (4)       (5)     (6)      (7)     (8)     (9)
   1     1     2      -
   2     2     3      +
   3     2     2      +
   4     3     3      +
   5     3     4      +
   6     3     2      -
   7     3     1      +
   8     4     1      -

[Figure 1: AdaBoost dataset (scatter plot of the eight examples); figure not reproduced in this text version.]

We would like to separate the true and false classes with AdaBoost. Specifically, we will use two rounds of AdaBoost to learn a classifier, which chooses a weak learner in each round that minimizes the weighted error ε. The weak learners are linear separators with bias, i.e.,

  h_θ(x_1, x_2) = sign(θ^T x) = sign(θ_0 + θ_1 x_1 + θ_2 x_2).

Training error is the 0-1 classification error with weights w_i for iteration i. When using log, use base e.

Answer:
Table 2: Dataset and results for AdaBoost (solution)

                              1st iteration    2nd iteration    final
   i    x_1   x_2   Label      w_1     ŷ_1      w_2     ŷ_2      ŷ
  (1)   (2)   (3)    (4)       (5)     (6)      (7)     (8)     (9)
   1     1     2      -       0.125     +      0.250     -       -
   2     2     3      +       0.125     +      0.083     +       +
   3     2     2      +       0.125     +      0.083     +       +
   4     3     3      +       0.125     +      0.083     +       +
   5     3     4      +       0.125     +      0.083     +       +
   6     3     2      -       0.125     -      0.083     +       +
   7     3     1      +       0.125     -      0.250     +       +
   8     4     1      -       0.125     -      0.083     +       +

(a) [1 point] Start the first round with a uniform distribution w_1; place the value of w_1 for each example in the 5th column of Table 1.

(b) [3 points] Pick the base model h_θ1 from the following candidates that minimizes the training error ε_1 given the data in Table 1:
i. h_θ1(x_1, x_2) = sign(x_1 - 2.5)
ii. h_θ1(x_1, x_2) = sign(x_2 - 1.5)
iii. h_θ1(x_1, x_2) = sign(x_1 + 2 x_2 + 0.5)
iv. h_θ1(x_1, x_2) = sign(-3 x_1 + 2 x_2 + 4)
Write down your formula for h_θ1 below. Then write down the predicted value of h_θ1(x_1, x_2) = sign(θ^T x) for each example in the 6th column.
Answer: h_θ1(x_1, x_2) = sign(-3 x_1 + 2 x_2 + 4)

(c) [2 points] Compute the training error ε_1 and the classifier weight β_1 for the best hypothesis h_θ1(x_1, x_2) in the 1st iteration below.
Answer:
  ε_1 = 2/8 = 0.25
  β_1 = (1/2) ln((1 - ε_1)/ε_1) = (1/2) ln 3 ≈ 0.549

(d) [2 points] Based on the predictions from the first iteration, derive w_2 for each example below and fill in the computed values in the 7th column.
Answer: Using β_1 (base e) to compute the new distribution, we get
  w_{2,i} = (1/Z_1) w_{1,i} e^{-β_1}  if h_θ1(x^(i)) = y^(i),
  w_{2,i} = (1/Z_1) w_{1,i} e^{+β_1}  if h_θ1(x^(i)) ≠ y^(i),
where Z_1 is the normalization constant. The new distribution D_2 is thus
  w_{2,i} = 1/12 ≈ 0.083  if h_θ1(x^(i)) = y^(i),
  w_{2,i} = 1/4 = 0.25    if h_θ1(x^(i)) ≠ y^(i).
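For reference, the short Python sketch below (not part of the original exam; numpy and the variable names are assumptions made here) reproduces the numbers in parts (b)-(d): the weighted error of the chosen weak learner, the classifier weight β_1, and the re-normalized weights w_2.

```python
# Illustrative reproduction of AdaBoost round 1 for the dataset in Table 1.
import numpy as np

X = np.array([[1, 2], [2, 3], [2, 2], [3, 3], [3, 4], [3, 2], [3, 1], [4, 1]], dtype=float)
y = np.array([-1, +1, +1, +1, +1, -1, +1, -1])    # labels from Table 1 (+ / -)

h1 = np.sign(-3 * X[:, 0] + 2 * X[:, 1] + 4)      # chosen weak learner h_theta1

w1 = np.full(8, 1 / 8)                            # uniform initial weights, part (a)
eps1 = w1[h1 != y].sum()                          # weighted training error, part (c)
beta1 = 0.5 * np.log((1 - eps1) / eps1)           # classifier weight (natural log)
print("eps1 =", eps1, " beta1 =", round(beta1, 3))    # 0.25 and 0.549

w2 = w1 * np.exp(-beta1 * y * h1)                 # down-weight correct, up-weight incorrect examples
w2 /= w2.sum()                                    # normalize by Z_1, part (d)
print("w2 =", np.round(w2, 3))                    # 0.25 for examples 1 and 7, 0.083 for the rest
```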
(e) [3 points] Pick the base model h_θ2 from the following candidates that minimizes the weighted training error (using weights w_2):
i. h_θ2(x_1, x_2) = sign(x_1 - 3.5)
ii. h_θ2(x_1, x_2) = sign(x_2 - 2.5)
iii. h_θ2(x_1, x_2) = sign(x_1 - 1.5)
iv. h_θ2(x_1, x_2) = sign(x_2 - 1.5)
Write down your formula for h_θ2 below. Compute the predicted value of h_θ2(x_1, x_2) for each example and fill in the 8th column.
Answer: The hypothesis is h_θ2(x_1, x_2) = sign(x_1 - 1.5).

(f) [2 points] Compute the weighted training error ε_2 (using weights w_2) and the classifier weight β_2 for the best hypothesis h_θ2(x_1, x_2) in the 2nd iteration below.
Answer:
  ε_2 = 2 × (1/12) = 1/6 ≈ 0.167
  β_2 = (1/2) ln((1 - ε_2)/ε_2) = (1/2) ln 5 ≈ 0.805

(g) [2 points] What is the final hypothesis H(x_1, x_2) produced by AdaBoost? Write down the final predictions in the 9th column.
Answer: The final hypothesis is
  H(x_1, x_2) = sign(0.549 sign(-3 x_1 + 2 x_2 + 4) + 0.805 sign(x_1 - 1.5)).
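A companion sketch for round 2 (again illustrative, not exam code) checks parts (e)-(g): the weighted error of h_θ2 under the weights w_2, the resulting β_2, and the final weighted vote, which matches the last column of Table 2.

```python
# Illustrative reproduction of AdaBoost round 2 and the final hypothesis.
import numpy as np

X = np.array([[1, 2], [2, 3], [2, 2], [3, 3], [3, 4], [3, 2], [3, 1], [4, 1]], dtype=float)
y = np.array([-1, +1, +1, +1, +1, -1, +1, -1])    # labels from Table 1

h1 = np.sign(-3 * X[:, 0] + 2 * X[:, 1] + 4)      # round-1 weak learner
h2 = np.sign(X[:, 0] - 1.5)                       # round-2 weak learner, part (e)

beta1 = 0.5 * np.log(3)                           # from part (c)
w2 = np.where(h1 == y, 1 / 12, 1 / 4)             # weights from part (d)

eps2 = w2[h2 != y].sum()                          # weighted error under w_2, part (f)
beta2 = 0.5 * np.log((1 - eps2) / eps2)
print("eps2 =", round(eps2, 3), " beta2 =", round(beta2, 3))   # 0.167 and 0.805

H = np.sign(beta1 * h1 + beta2 * h2)              # final AdaBoost prediction, part (g)
print("final predictions:", H.astype(int))        # [-1, 1, 1, 1, 1, 1, 1, 1], as in Table 2
```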
3. [11 points total] PCA

You have the following 3-dimensional data:

  Data point #    1      2      3      4      5
  x_1            0.4   -2.6   -0.6    1.4    1.4
  x_2            0.2   -2.8   -0.8    2.2    1.2
  x_3           -1.8    3.2    0.2   -3.8    2.2

Our goal is to reduce the dimensionality of the data from 3 to 2 using the principal component analysis (PCA) algorithm. The unsorted principal components are given as:

  p_1 = [-0.7173, 0.6930, 0.0722]^T,  p_2 = [0.5690, 0.5228, 0.6347]^T,  p_3 = [-0.4021, -0.4964, 0.7694]^T.

[Tip: Please provide formulas/workings for each answer. In case your numerical workings are incorrect, it can help us provide partial credit.]

(a) [1 point] What is the mean vector µ for the above dataset?
Answer: µ = [0, 0, 0]^T

(b) [4 points] Which are the leading two principal components based on the PCA algorithm? [Hint: Recall from linear algebra that, for any square matrix M, an eigenvector v is any non-zero vector such that M v = λ v. Here, λ denotes the eigenvalue corresponding to the eigenvector v.]
Answer: To select the two leading principal components, we need to sort them according to their corresponding eigenvalues. To do that, we first compute the covariance matrix

  A = [  2.80   3.15  -2.85 ]
      [  3.15   3.70  -3.80 ]
      [ -2.85  -3.80   8.20 ]

The given principal components are eigenvectors of A, so we can find their corresponding eigenvalues by comparing A p_i with p_i:

  A p_1 = [-0.03, 0.03, 0.00]^T ≈ 0.04 p_1,    so λ_1 ≈ 0.04
  A p_2 = [1.43, 1.31, 1.60]^T ≈ 2.52 p_2,     so λ_2 ≈ 2.52
  A p_3 = [-4.88, -6.03, 9.34]^T ≈ 12.14 p_3,  so λ_3 ≈ 12.14

So we should select p_3 and p_2.

(c) [2 points] What are the explained variances for the leading two principal components?
Answer: The explained variance of p_3 is λ_3 / (λ_1 + λ_2 + λ_3) = 12.14 / (0.04 + 2.52 + 12.14) ≈ 0.83, and the explained variance of p_2 is λ_2 / (λ_1 + λ_2 + λ_3) = 2.52 / (0.04 + 2.52 + 12.14) ≈ 0.17.

(d) [2 points] What is the reduced principal representation y^(1) ∈ R^2 of the data point x^(1)?
Answer:

  y^(1) = [p_3, p_2]^T x^(1)
        = [ -0.4021  -0.4964   0.7694 ] [  0.4 ]
          [  0.5690   0.5228   0.6347 ] [  0.2 ]
                                        [ -1.8 ]
        = [-1.65, -0.81]^T
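The covariance matrix, the eigenvalue matching, and the explained-variance ratios in parts (a)-(c) can be reproduced with numpy. The sketch below is illustrative (not the exam's own code); the eigenvalues it prints differ from the exam's rounded values only in the last decimal because the given components are themselves rounded to four decimals.

```python
# Illustrative reproduction of the PCA workings in parts (a)-(c).
import numpy as np

X = np.array([[ 0.4,  0.2, -1.8],
              [-2.6, -2.8,  3.2],
              [-0.6, -0.8,  0.2],
              [ 1.4,  2.2, -3.8],
              [ 1.4,  1.2,  2.2]])
P = np.array([[-0.7173,  0.6930, 0.0722],     # p_1
              [ 0.5690,  0.5228, 0.6347],     # p_2
              [-0.4021, -0.4964, 0.7694]])    # p_3

mu = X.mean(axis=0)                           # part (a): numerically the zero vector
A = np.cov(X, rowvar=False)                   # sample covariance (1/(n-1) convention)
print("mean:", np.round(mu, 6))
print("covariance matrix:\n", np.round(A, 2))

# Part (b): each p_i is a unit eigenvector of A, so lambda_i = p_i^T A p_i.
lams = np.array([p @ A @ p for p in P])
print("eigenvalues:", np.round(lams, 2))      # roughly [0.04, 2.5, 12.14] -> keep p_3, then p_2

# Part (c): explained-variance ratios of the components.
print("explained variance:", np.round(lams / lams.sum(), 2))   # p_3 about 0.83, p_2 about 0.17
```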
(e) [2 points] What is the reconstruction x̃^(1) ∈ R^3 given its principal representation y^(1)?
Answer:

  x̃^(1) = [p_3, p_2] y^(1)
         = [ -0.4021   0.5690 ] [ -1.65 ]
           [ -0.4964   0.5228 ] [ -0.81 ]
           [  0.7694   0.6347 ]
         = [0.20, 0.40, -1.78]^T
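The projection and reconstruction in parts (d)-(e) follow the same pattern. The sketch below (illustrative only) stacks p_3 and p_2 into a 3x2 matrix U, projects x^(1) with U^T, and reconstructs with U; the printed values match the exam's answers up to rounding of the intermediate y^(1).

```python
# Illustrative reproduction of parts (d)-(e): projection onto the leading components and reconstruction.
import numpy as np

x1 = np.array([0.4, 0.2, -1.8])               # the first data point (the mean is the zero vector)
U = np.array([[-0.4021, -0.4964, 0.7694],     # p_3 (leading component)
              [ 0.5690,  0.5228, 0.6347]]).T  # p_2 (second component); U has shape 3 x 2

y1 = U.T @ x1                                 # reduced representation y^(1), part (d)
x1_rec = U @ y1                               # reconstruction from y^(1), part (e)
print("y^(1)  =", np.round(y1, 2))            # approximately [-1.65, -0.81]
print("x~^(1) =", np.round(x1_rec, 2))        # approximately [0.20, 0.39, -1.78]; the exam, which rounds
                                              # y^(1) first, reports the middle entry as 0.40
```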
4. [12 points total] Neural Networks

Suppose you have the following neural network f_θ(x) described by the equations below:

  h_1 = ReLU(W_{1,(1,1)} x_1 + W_{1,(1,2)} x_2 + b_{1,1})
  h_2 = ReLU(W_{1,(2,1)} x_1 + W_{1,(2,2)} x_2 + b_{1,2})
  f_θ(x) = σ(w_{2,1} h_1 + w_{2,2} h_2 + b_2),

where ReLU(z) = max(0, z) is the ReLU activation function, σ(z) = 1/(1 + e^{-z}) is the sigmoid activation function, and
- x = (x_1, x_2)^T ∈ R^2 is the input to the network,
- W_1 ∈ R^{2×2} are the weights of the first layer; W_{1,(i,j)} is the element of W_1 at row i, column j, where i, j ∈ {1, 2},
- w_2 ∈ R^2 are the weights of the second layer; w_{2,i} is the i-th element of w_2, where i ∈ {1, 2},
- b_1 ∈ R^2 is the bias vector for the first layer; b_{1,i} is the i-th element of b_1, where i ∈ {1, 2},
- b_2 ∈ R is the bias for the second layer, and
- θ refers to the set of all parameters (biases b_1, b_2 and weights W_1, w_2) in the network.

The computation graph for the network is shown in Figure 3. For this problem, use the following values: x_1 = 1, x_2 = 2, W_{1,(1,1)} = 1, W_{1,(1,2)} = -2, W_{1,(2,1)} = 3, W_{1,(2,2)} = 1, w_{2,1} = -3, w_{2,2} = -1, b_{1,1} = 2, b_{1,2} = -2, b_2 = 3. For the following questions, you do not need to show any working beyond filling in all the blanks in Figure 3.

(a) [4 points] Compute the forward pass for f_θ(x) by filling in the first blank for each outgoing edge in the computation graph (values that are given above have been filled in as examples).
Answer: See Figure 2 for the computation graph with forward-pass and backward-pass values.

(b) [8 points] Compute the backward pass for f_θ(x) by filling in the second blank for each outgoing edge in the computation graph.
Answer: See Figure 2 for the computation graph with forward-pass and backward-pass values. Note that the derivative of σ(z) is ∂σ(z)/∂z = σ(z)(1 - σ(z)), and the derivative of ReLU(z) = max(0, z) is ∂ReLU(z)/∂z = 1 if z ≥ 0 and 0 if z < 0.
Figure 2: Computation graph for the neural network with forward-pass (in black) and backward-pass (in red) values on each edge. Nodes labeled ReLU apply the ReLU activation ReLU(z) = max(0, z); the node labeled σ applies the sigmoid activation σ(z) = 1/(1 + e^{-z}). [Figure not reproduced in this text version; the sketch after Figure 3 recomputes the forward and backward values.]
Figure 3: Computation graph for the neural network. Nodes labeled ReLU apply the ReLU activation ReLU(z) = max(0, z); the node labeled σ applies the sigmoid activation σ(z) = 1/(1 + e^{-z}). Fill in the blanks on each edge with values f, b, where f and b are the forward-pass (part a) and backward-pass (part b) values for that edge. Values given in the problem have been indicated on the computation graph. [Figure not reproduced in this text version.]
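Since the two computation-graph figures are not reproduced here, the sketch below (illustrative, not the exam's figure) recomputes the answers to parts (a) and (b): the forward values of the hidden units and the output, and the gradient of f with respect to every parameter and input, using the sigmoid and ReLU derivative rules quoted in the answer (with ∂ReLU/∂z taken to be 1 at z = 0).

```python
# Illustrative recomputation of the forward and backward passes for Question 4.
import numpy as np

# Parameter and input values from the problem statement
x = np.array([1.0, 2.0])                  # (x_1, x_2)
W1 = np.array([[1.0, -2.0],
               [3.0,  1.0]])              # first-layer weights
b1 = np.array([2.0, -2.0])                # first-layer biases
w2 = np.array([-3.0, -1.0])               # second-layer weights
b2 = 3.0                                  # second-layer bias

relu = lambda z: np.maximum(0.0, z)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Forward pass (part a)
z1 = W1 @ x + b1                          # pre-activations: [-1, 3]
h = relu(z1)                              # hidden activations: [0, 3]
z2 = w2 @ h + b2                          # pre-sigmoid value: 0
f = sigmoid(z2)                           # network output: 0.5

# Backward pass (part b), using sigma' = sigma(1 - sigma) and ReLU'(z) = 1[z >= 0]
df_dz2 = f * (1.0 - f)                    # 1/4
df_db2 = df_dz2                           # 1/4
df_dw2 = df_dz2 * h                       # [0, 3/4]
df_dh = df_dz2 * w2                       # [-3/4, -1/4]
df_dz1 = df_dh * (z1 >= 0)                # ReLU gate: [0, -1/4]
df_dW1 = np.outer(df_dz1, x)              # [[0, 0], [-1/4, -1/2]]
df_db1 = df_dz1                           # [0, -1/4]
df_dx = W1.T @ df_dz1                     # [-3/4, -1/4]

print("f(x) =", f)
print("dW1 =\n", df_dW1, "\ndb1 =", df_db1)
print("dw2 =", df_dw2, " db2 =", df_db2, " dx =", df_dx)
```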