combinepdf (13)

School: Arizona State University
Course: 92203
Subject: Computer Science
Date: Feb 20, 2024
Type: pdf
Pages: 23
Uploaded by: ProfessorWildcat3060
Name:

Our problems multiply

2. (10 points) We will consider a neural network with a slightly unusual structure. Let the input x be d × 1 and let the weights be represented as k 1 × d vectors, W^(1), ..., W^(k). Then the final output is

    ŷ = ∏_{i=1}^{k} σ(W^(i) x) = σ(W^(1) x) ··· σ(W^(k) x).

Define a^(j) = σ(W^(j) x).

(a) What is ∂L(ŷ, y)/∂a^(j) for some j? Since we have not specified the loss function, you can express your answer in terms of ∂L(ŷ, y)/∂ŷ.

Solution:

    ∂L(ŷ, y)/∂ŷ · ∏_{i ≠ j} σ(W^(i) x)

(b) What are the dimensions of ∂a^(j)/∂W^(j)?

Solution: Because a^(j) is a scalar, they are the same as for W^(j), which is 1 × d.

(c) What is ∂a^(j)/∂W^(j)? (Recall that dσ(v)/dv = σ(v)(1 − σ(v)).)

Solution:

    a^(j) (1 − a^(j)) x^T

Page 3
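As a sanity check on parts (a) and (c), the product-of-sigmoids gradient can be compared against a finite-difference estimate. This is an illustrative sketch: the dimensions d and k, the random x and W^(i), and the choice of index j are arbitrary test values, not part of the problem.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

rng = np.random.default_rng(0)
d, k = 4, 3                                       # arbitrary sizes for the check
x = rng.normal(size=(d, 1))                       # input, d x 1
W = [rng.normal(size=(1, d)) for _ in range(k)]   # k weight vectors, each 1 x d

a = [sigmoid(Wi @ x).item() for Wi in W]          # a^(j) = sigma(W^(j) x), scalars
y_hat = np.prod(a)

# Analytic gradient of y_hat w.r.t. W^(j), combining parts (a) and (c):
# (prod_{i != j} a^(i)) * a^(j) (1 - a^(j)) * x^T
j = 1
analytic = np.prod([a[i] for i in range(k) if i != j]) * a[j] * (1 - a[j]) * x.T

# Forward finite-difference estimate of the same gradient
eps = 1e-6
numeric = np.zeros_like(W[j])
for idx in range(d):
    Wp = [Wi.copy() for Wi in W]
    Wp[j][0, idx] += eps
    y_p = np.prod([sigmoid(Wi @ x).item() for Wi in Wp])
    numeric[0, idx] = (y_p - y_hat) / eps

assert analytic.shape == (1, d)
assert np.allclose(analytic, numeric, atol=1e-4)
```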
Name:

(d) What would the form of a stochastic gradient descent update rule be for W^(j)? Express your answer in terms of ∂L(ŷ, y)/∂a^(j) and ∂a^(j)/∂W^(j). Use η for the step size.

Solution:

    W^(j) := W^(j) − η · ∂L(ŷ, y)/∂a^(j) · ∂a^(j)/∂W^(j)

Page 4
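The update rule from part (d) can be exercised on a toy instance. Everything concrete here is an assumption chosen for illustration: the squared-error loss, the target y = 1.0, the step size η = 0.1, and the random initialization. One full update over all j should reduce the loss.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

rng = np.random.default_rng(1)
d, k = 4, 3
x = rng.normal(size=(d, 1))
y = 1.0                                   # hypothetical target
W = [rng.normal(size=(1, d)) for _ in range(k)]
eta = 0.1                                 # assumed step size

def loss(W):
    y_hat = np.prod([sigmoid(Wi @ x).item() for Wi in W])
    return 0.5 * (y_hat - y) ** 2         # assumed squared-error loss

before = loss(W)
a = [sigmoid(Wi @ x).item() for Wi in W]
y_hat = np.prod(a)
dL_dyhat = y_hat - y
for j in range(k):
    dL_daj = dL_dyhat * np.prod([a[i] for i in range(k) if i != j])  # part (a)
    daj_dWj = a[j] * (1 - a[j]) * x.T                                # part (c)
    W[j] = W[j] - eta * dL_daj * daj_dWj                             # part (d) update
after = loss(W)

assert after < before
```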
Name:

Autoencoder

4. (14 points) Otto N. Coder is exploring different autoencoder architectures. Consider the following autoencoder with input x ∈ R^d and output y_pred ∈ R^d. The autoencoder has one hidden layer with m hidden units: z^(1), a^(1) ∈ R^m.

    z^(1) = W^(1) x + b^(1)
    a^(1) = f^(1)(z^(1))          element-wise
    z^(2) = W^(2) a^(1) + b^(2)
    y_pred = f^(2)(z^(2))         element-wise

(a) Assume x, z^(2), and y_pred have dimensions d × 1. Also let z^(1) and a^(1) have dimensions m × 1. What are the dimensions of the following matrices?

    W^(1): m × d    b^(1): m × 1    W^(2): d × m    b^(2): d × 1

Page 10
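A quick NumPy sketch can confirm the dimension bookkeeping in part (a); the sigmoid stand-in for f^(1) and f^(2) and the sizes d = 5, m = 3 are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 5, 3
x = rng.normal(size=(d, 1))

W1 = rng.normal(size=(m, d)); b1 = np.zeros((m, 1))   # W^(1): m x d, b^(1): m x 1
W2 = rng.normal(size=(d, m)); b2 = np.zeros((d, 1))   # W^(2): d x m, b^(2): d x 1

sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))          # assumed stand-in for f^(1), f^(2)
z1 = W1 @ x + b1
a1 = sigmoid(z1)                                      # element-wise
z2 = W2 @ a1 + b2
y_pred = sigmoid(z2)                                  # element-wise

assert z1.shape == a1.shape == (m, 1)
assert z2.shape == y_pred.shape == (d, 1)
```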
Name:

Otto trains the autoencoder with back-propagation. The loss for a given datapoint x, y is:

    J(x, y) = (1/2) ||y_pred − y||² = (1/2) (y_pred − y)^T (y_pred − y).

Compute the following intermediate partial derivatives. For the following questions, write your answer in terms of x, y, y_pred, W^(1), b^(1), W^(2), b^(2), f^(1), f^(2), and any previously computed or provided partial derivative. Also note that:

1. Let ∂f^(1)/∂z^(1) be an m × 1 matrix, provided to you.
2. Let ∂f^(2)/∂z^(2) be a d × 1 matrix, provided to you.
3. If Ax = y, where A is an m × n matrix, x is n × 1, and y is m × 1, then let ∂y/∂A = x.
4. In your answers below, we will assume multiplications are matrix multiplication; to indicate element-wise multiplication, use the symbol ⊙.

(b) Find ∂J/∂y_pred, a d × 1 matrix.

Solution:

    ∂J/∂y_pred = (y_pred − y)

(c) Find ∂J/∂z^(2), a d × 1 matrix. You may use ∂J/∂y_pred and ⊙ for element-wise multiplication.

Solution:

    ∂J/∂z^(2) = ∂J/∂y_pred ⊙ ∂y_pred/∂z^(2) = ∂J/∂y_pred ⊙ ∂f^(2)/∂z^(2)

Check: (d × 1) = (d × 1) ⊙ (d × 1) dimensioned arrays; element-wise multiplication.

(d) Find ∂J/∂W^(2), a d × m matrix. You may use ∂J/∂z^(2).

Solution:

    ∂J/∂W^(2) = ∂J/∂z^(2) (∂z^(2)/∂W^(2))^T = ∂J/∂z^(2) (a^(1))^T = ∂J/∂z^(2) f^(1)(W^(1) x + b^(1))^T

Check: (d × m) = (d × 1)(1 × m) dimensioned arrays. Note that we also accept a^(1) even though it was not explicitly provided.

Page 11
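The chain-rule results in parts (b)–(d) can be verified numerically. In this sketch, sigmoid is assumed for f^(1) and f^(2) (so ∂f^(2)/∂z^(2) = y_pred ⊙ (1 − y_pred)); all sizes and random values are illustrative.

```python
import numpy as np

sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))
rng = np.random.default_rng(0)
d, m = 4, 3
x = rng.normal(size=(d, 1)); y = rng.normal(size=(d, 1))
W1 = rng.normal(size=(m, d)); b1 = rng.normal(size=(m, 1))
W2 = rng.normal(size=(d, m)); b2 = rng.normal(size=(d, 1))

def forward(W2_):
    a1 = sigmoid(W1 @ x + b1)
    z2 = W2_ @ a1 + b2
    y_pred = sigmoid(z2)
    return a1, z2, y_pred

a1, z2, y_pred = forward(W2)
J = 0.5 * float((y_pred - y).T @ (y_pred - y))

# Chain rule from parts (b)-(d): element-wise through f^(2), then outer product
dJ_dypred = y_pred - y                         # part (b)
dJ_dz2 = dJ_dypred * (y_pred * (1 - y_pred))   # part (c), with dsigma/dz = sigma(1 - sigma)
dJ_dW2 = dJ_dz2 @ a1.T                         # part (d), a d x m matrix

# Finite-difference check, entry by entry
eps = 1e-6
numeric = np.zeros_like(W2)
for i in range(d):
    for j in range(m):
        Wp = W2.copy(); Wp[i, j] += eps
        _, _, yp = forward(Wp)
        Jp = 0.5 * float((yp - y).T @ (yp - y))
        numeric[i, j] = (Jp - J) / eps

assert dJ_dW2.shape == (d, m)
assert np.allclose(dJ_dW2, numeric, atol=1e-4)
```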
Name:

(e) Write the gradient descent update step for just W^(2) for one datapoint (x, y) given learning rate η and ∂J/∂W^(2).

Solution:

    W^(2) := W^(2) − η · ∂J(x, y)/∂W^(2)

(f) Otto's friend Bigsby believes that bigger is better. He takes a look at Otto's neural network and tells Otto that he should make the number of hidden units m in the hidden layer very large: m = 10d. (Recall that z^(1) has dimensions m × 1.) Is Bigsby correct? What would you expect to see with training and test accuracy using Bigsby's approach?

Solution: No; training accuracy might be high, but this would likely be due to overfitting and lead to worse test accuracy.

(g) Otto's other friend Leila says having more layers is better. Let m be much smaller than d. Leila adds 10 more hidden layers, all with linear activation, before Otto's current hidden layer (which has sigmoid activation function f^(1)), such that each hidden layer has m units. What would you expect to see with your training and test accuracy, compared to just having one hidden layer with activation f^(1)?

Solution: The intermediary hidden layers do not add any expressivity to the network, and we would expect similar training and test accuracy as compared to the single f^(1) hidden-layer network. This may, however, require a different number of training iterations with the same available data in order to achieve similar accuracy.

Page 12
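Part (g)'s claim that stacked linear layers add no expressivity can be demonstrated directly: composing 10 affine maps yields a single affine map. The size m and the random weights below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 3
# 10 hidden layers with linear activation: each is just v -> W v + b
Ws = [rng.normal(size=(m, m)) for _ in range(10)]
bs = [rng.normal(size=(m, 1)) for _ in range(10)]

def stacked(v):
    for W, b in zip(Ws, bs):
        v = W @ v + b
    return v

# Collapse the stack into one affine map v -> A v + c
A = np.eye(m); c = np.zeros((m, 1))
for W, b in zip(Ws, bs):
    A = W @ A
    c = W @ c + b

v = rng.normal(size=(m, 1))
assert np.allclose(stacked(v), A @ v + c)
```

Since the composition is itself affine, the 10 extra layers represent exactly the same family of functions as a single linear layer, which is why training and test accuracy should be similar.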
Name:

(h) Another friend, Neil, suggests having several layers with non-linear activation functions. He says Otto should regularize the number of active hidden units. Loosely speaking, we consider the average activation of a hidden unit j in hidden layer 1 (which has sigmoid activation function f^(1)) to be the average of the activation a^(1)_j over the points x_i in our training dataset of size N:

    p̂_j = (1/N) Σ_{i=1}^{N} a^(1)_j(x_i).

Assume we would like to enforce the constraint that the average activation for each hidden unit p̂_j is close to some hyperparameter p. Usually, p is very small (say p < 0.05). What is the best format for a regularization penalty given hyperparameter p and the average activations of all our hidden units p̂_j? Select one of the following:

    Hinge loss:   Σ_j max(0, (1 − p̂_j) p)
    NLL:          Σ_j [ p log(p/p̂_j) + (1 − p) log((1 − p)/(1 − p̂_j)) ]
    Squared loss: Σ_j (p̂_j − p)²
    l2 norm:      Σ_j (p̂_j)²

Solution: Either NLL or squared loss should work, encouraging p and p̂_j to be close. NLL loss might better handle a wide range in the magnitudes of p̂_j.

(i) Which pass should Otto compute p̂_j on? Select one of the following:

    Forward pass
    Backward pass
    Gradient descent step (weight update) pass

Solution: The forward pass (the activations a^(1)_j are computed during the forward pass).

Page 13
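The NLL/KL-style penalty from part (h) can be evaluated on hypothetical average activations to see that each term is zero exactly when p̂_j = p and grows as p̂_j drifts away; the values in p_hat are made up for illustration.

```python
import numpy as np

p = 0.05                              # target average activation (hyperparameter)
p_hat = np.array([0.02, 0.05, 0.30])  # hypothetical average activations p_hat_j

# NLL / KL-divergence penalty: sum_j p log(p / p_hat_j) + (1 - p) log((1-p)/(1-p_hat_j))
kl = np.sum(p * np.log(p / p_hat) + (1 - p) * np.log((1 - p) / (1 - p_hat)))
# Squared-loss alternative: sum_j (p_hat_j - p)^2
sq = np.sum((p_hat - p) ** 2)

# A single KL term is zero at p_hat_j == p and increases with the mismatch
unit = lambda q: p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))
assert abs(unit(p)) < 1e-12
assert unit(0.30) > unit(0.06) > 0
assert kl > 0 and sq > 0
```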
Name:

Training Neural Networks with Regularization

8. (8 points) In this problem we will investigate regularization for neural networks. Kim constructs a fully connected neural network with L = 2 layers using mean squared error (MSE) loss and ReLU activation functions for the hidden layer, and a linear activation for the output layer. The network is trained with a gradient descent algorithm on a data set of n points {(x^(1), y^(1)), ..., (x^(n), y^(n))}. Recall that the update rule for weights W_1 can be specified in terms of step size η and the gradient of the loss function with respect to weights W_1. This gradient can be expressed in terms of the activations A_l, weights W_l, pre-activations Z_l, and partials ∂L/∂A_2 and ∂A_l/∂Z_l for l = 1, 2:

    W_1 := W_1 − η Σ_{i=1}^{n} ∂L(h(x^(i); W), y^(i))/∂W_1,

where h(·) is the input-output mapping implemented by the entire neural network, and

    ∂L/∂W_1 = ∂Z_1/∂W_1 · ∂A_1/∂Z_1 · W_2 · ∂A_2/∂Z_2 · ∂L/∂A_2.

(a) Derive a new update rule for weights W_1 which also penalizes the sum of squared values of all individual weights in the network:

    L_new = L(h(x^(i); W), y^(i)) + λ ||W||²,

where λ denotes the regularization trade-off parameter. You can express the new update rule as follows:

    W_1 := α W_1 − η Σ_{i=1}^{n} ∂L(h(x^(i); W), y^(i))/∂W_1,

where L(·) represents the previous prediction-error loss. What is the value of α in terms of λ and η?

Solution:

    L_new = L + λ Σ_{i,j,l} (W^l_{i,j})²
    ∂L_new/∂W_1 = ∂L/∂W_1 + 2λ W_1
    W_1 := W_1 − η Σ ∂L_new/∂W_1
    W_1 := (1 − 2λη) W_1 − η Σ ∂L/∂W_1

Thus α = 1 − 2λη.

Page 23
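The algebra behind α = 1 − 2λη can be checked numerically: subtracting the gradient of the penalized loss is the same as first scaling the weights by α and then subtracting the data-loss gradient. The matrix shapes, η, λ, and the stand-in gradient below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 4))
grad = rng.normal(size=(3, 4))   # hypothetical summed data-loss gradient dL/dW1
eta, lam = 0.1, 0.01             # assumed step size and regularization weight

# Penalized update: subtract eta * (grad + 2*lambda*W1)
penalized = W1 - eta * (grad + 2 * lam * W1)

# Equivalent "weight decay" form with alpha = 1 - 2*lambda*eta
alpha = 1 - 2 * lam * eta
decayed = alpha * W1 - eta * grad

assert np.allclose(penalized, decayed)
```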
Name: (b) Explain how this new update rule helps the neural network reduce overfitting to the data. Solution: For reasonable λ and η , the weights are scaled by a factor less than 1 at each iteration. (If 1 - 2 λη > 1, the weights will rapidly grow and diverge.) A value of | α | < 1 pushes the weights toward zero in general, except those weights that are needed to fit substantial subsets of the data (i.e., those weights that are needed to keep the data loss term L low). (c) Given that we are training a neural network with gradient descent, what happens when we increase the regularization trade-off parameter λ too much, while holding the step size η fixed? Solution: With too large a λ , α may approach zero and the weights would be too strongly penalized and thus tend to zero, preventing the neural network from fitting the available training data. That is to say, the network is pushed towards an overly “generalized” constant output based on zero or near-zero weights. With even larger values of λ , α may become negative causing oscillations in weights. With | α | larger than 1, the weights will grow in magnitude without bound. Page 24
change from CHARGE because, even though both CHARGE and EXPLORE appear to cost the same -10 in the table, performing a CHARGE leads to higher-reward states.

c) Recall that V(s) = max_a Q(s, a). The possible s' in this model for s = EMPTY are EMPTY and HIGH. We determine Q_2(EMPTY, EXPLORE) and Q_2(EMPTY, CHARGE) and take the max of the two to be V_2(EMPTY).

    Q_2(EMPTY, EXPLORE) = -10 + 0.5 · V_1(EMPTY) = -15
    Q_2(EMPTY, CHARGE) = -10 + 0.5 · V_1(HIGH) = -9

So V_2(EMPTY) = -9.

d) Solutions are fine.

7) a) Recall that the normal vector to a hyperplane is defined as the θ parameter. The hidden sign units in this network perform the same kind of linear combination with an offset that our signed hyperplanes use.

b) Using the fact that the hidden units act as linear classifiers, we can draw decision boundaries that separate the signed points (orientation is important). Multiple solutions.

c) Solutions are fine.

d) Solutions are fine.

8) a) Solutions are fine.

b) We get the optimal policy from the largest Q_1(s, a) values.

c) We can get the V_1(s) values from the largest Q(s, a) values.

9) a) Recall that the margin of a dataset with respect to a separator is the minimum margin of all points in the dataset with respect to the separator. Removing a single point can result in the margin of the dataset increasing (if the removed point had the minimum margin) or staying the same (if it did not).

b) If there were a separator for the dataset with some of the coordinates omitted, then the same separator (with 0 weights for the omitted coordinates) would still separate the original dataset.

c) The setup of this problem is unusual because we are given a fixed set of classifiers H = {h_1, h_2, h_3} and select one based on training set error. The parameters of the classifiers do not actually depend on the training set, and the test set error is fixed for a particular choice of classifier. We are also told the relative test errors are E(h_1) < E(h_2) < E(h_3). The point of the problem is to see how the function we choose given a training set changes with the number of training points.
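The value-iteration arithmetic in part c) above can be replayed in a few lines. The V_1 values used are the ones implied by the equations given there (V_1(EMPTY) = -10 and V_1(HIGH) = 2); the discount factor 0.5 and reward -10 also come from those equations.

```python
# Recomputing Q_2 and V_2 for s = EMPTY, with values inferred from the solution:
# R(EMPTY, a) = -10 for both actions, gamma = 0.5,
# V_1(EMPTY) = -10 and V_1(HIGH) = 2.
gamma = 0.5
V1 = {"EMPTY": -10, "HIGH": 2}
Q2 = {
    "EXPLORE": -10 + gamma * V1["EMPTY"],  # lands in EMPTY
    "CHARGE": -10 + gamma * V1["HIGH"],    # lands in HIGH
}
V2_EMPTY = max(Q2.values())                # V(s) = max_a Q(s, a)

assert Q2["EXPLORE"] == -15
assert Q2["CHARGE"] == -9
assert V2_EMPTY == -9
```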