combinepdf (13)

School: Arizona State University
Course: 92203
Subject: Computer Science
Date: Feb 20, 2024
Type: pdf
Pages: 23
Uploaded by: ProfessorWildcat3060
Name:

Our problems multiply

2. (10 points) We will consider a neural network with a slightly unusual structure. Let the input x be d × 1 and let the weights be represented as k 1 × d vectors, W^(1), ..., W^(k). Then the final output is

    ŷ = ∏_{i=1}^{k} σ(W^(i) x) = σ(W^(1) x) ··· σ(W^(k) x).

Define a^(j) = σ(W^(j) x).

(a) What is ∂L(ŷ, y)/∂a^(j) for some j? Since we have not specified the loss function, you can express your answer in terms of ∂L(ŷ, y)/∂ŷ.

Solution:

    ∂L(ŷ, y)/∂ŷ · ∏_{i ≠ j} σ(W^(i) x)

(b) What are the dimensions of ∂a^(j)/∂W^(j)?

Solution: Because a^(j) is a scalar, they are the same as for W^(j), which is 1 × d.

(c) What is ∂a^(j)/∂W^(j)? (Recall that dσ(v)/dv = σ(v)(1 − σ(v)).)

Solution:

    a^(j) (1 − a^(j)) x^T

Page 3
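As a sanity check on parts (a) and (c), the product-of-sigmoids gradient can be compared against a finite-difference estimate. This is an illustrative sketch: the dimensions d and k, the random x and W^(i), and the choice of index j are arbitrary test values, not part of the problem.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

rng = np.random.default_rng(0)
d, k = 4, 3                                       # arbitrary sizes for the check
x = rng.normal(size=(d, 1))                       # input, d x 1
W = [rng.normal(size=(1, d)) for _ in range(k)]   # k weight vectors, each 1 x d

a = [sigmoid(Wi @ x).item() for Wi in W]          # a^(j) = sigma(W^(j) x), scalars
y_hat = np.prod(a)

# Analytic gradient of y_hat w.r.t. W^(j), combining parts (a) and (c):
# (prod_{i != j} a^(i)) * a^(j) (1 - a^(j)) * x^T
j = 1
analytic = np.prod([a[i] for i in range(k) if i != j]) * a[j] * (1 - a[j]) * x.T

# Forward finite-difference estimate of the same gradient
eps = 1e-6
numeric = np.zeros_like(W[j])
for idx in range(d):
    Wp = [Wi.copy() for Wi in W]
    Wp[j][0, idx] += eps
    y_p = np.prod([sigmoid(Wi @ x).item() for Wi in Wp])
    numeric[0, idx] = (y_p - y_hat) / eps

assert analytic.shape == (1, d)
assert np.allclose(analytic, numeric, atol=1e-4)
```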
Name:

(d) What would the form of a stochastic gradient descent update rule be for W^(j)? Express your answer in terms of ∂L(ŷ, y)/∂a^(j) and ∂a^(j)/∂W^(j). Use η for the step size.

Solution:

    W^(j) := W^(j) − η · ∂L(ŷ, y)/∂a^(j) · ∂a^(j)/∂W^(j)

Page 4
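The update rule from part (d) can be exercised on a toy instance. Everything concrete here is an assumption chosen for illustration: the squared-error loss, the target y = 1.0, the step size η = 0.1, and the random initialization. One full update over all j should reduce the loss.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

rng = np.random.default_rng(1)
d, k = 4, 3
x = rng.normal(size=(d, 1))
y = 1.0                                   # hypothetical target
W = [rng.normal(size=(1, d)) for _ in range(k)]
eta = 0.1                                 # assumed step size

def loss(W):
    y_hat = np.prod([sigmoid(Wi @ x).item() for Wi in W])
    return 0.5 * (y_hat - y) ** 2         # assumed squared-error loss

before = loss(W)
a = [sigmoid(Wi @ x).item() for Wi in W]
y_hat = np.prod(a)
dL_dyhat = y_hat - y
for j in range(k):
    dL_daj = dL_dyhat * np.prod([a[i] for i in range(k) if i != j])  # part (a)
    daj_dWj = a[j] * (1 - a[j]) * x.T                                # part (c)
    W[j] = W[j] - eta * dL_daj * daj_dWj                             # part (d) update
after = loss(W)

assert after < before
```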
Name:

Autoencoder

4. (14 points) Otto N. Coder is exploring different autoencoder architectures. Consider the following autoencoder with input x ∈ R^d and output y_pred ∈ R^d. The autoencoder has one hidden layer with m hidden units: z^(1), a^(1) ∈ R^m.

    z^(1) = W^(1) x + b^(1)
    a^(1) = f^(1)(z^(1))          element-wise
    z^(2) = W^(2) a^(1) + b^(2)
    y_pred = f^(2)(z^(2))         element-wise

(a) Assume x, z^(2), and y_pred have dimensions d × 1. Also let z^(1) and a^(1) have dimensions m × 1. What are the dimensions of the following matrices?

    W^(1): m × d    b^(1): m × 1    W^(2): d × m    b^(2): d × 1

Page 10
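A quick NumPy sketch can confirm the dimension bookkeeping in part (a); the sigmoid stand-in for f^(1) and f^(2) and the sizes d = 5, m = 3 are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 5, 3
x = rng.normal(size=(d, 1))

W1 = rng.normal(size=(m, d)); b1 = np.zeros((m, 1))   # W^(1): m x d, b^(1): m x 1
W2 = rng.normal(size=(d, m)); b2 = np.zeros((d, 1))   # W^(2): d x m, b^(2): d x 1

sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))          # assumed stand-in for f^(1), f^(2)
z1 = W1 @ x + b1
a1 = sigmoid(z1)                                      # element-wise
z2 = W2 @ a1 + b2
y_pred = sigmoid(z2)                                  # element-wise

assert z1.shape == a1.shape == (m, 1)
assert z2.shape == y_pred.shape == (d, 1)
```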
Name:

Otto trains the autoencoder with back-propagation. The loss for a given datapoint x, y is:

    J(x, y) = (1/2) ||y_pred − y||² = (1/2) (y_pred − y)^T (y_pred − y).

Compute the following intermediate partial derivatives. For the following questions, write your answer in terms of x, y, y_pred, W^(1), b^(1), W^(2), b^(2), f^(1), f^(2), and any previously computed or provided partial derivative. Also note that:

1. Let ∂f^(1)/∂z^(1) be an m × 1 matrix, provided to you.
2. Let ∂f^(2)/∂z^(2) be a d × 1 matrix, provided to you.
3. If Ax = y, where A is an m × n matrix, x is n × 1, and y is m × 1, then let ∂y/∂A = x.
4. In your answers below, we will assume multiplications are matrix multiplication; to indicate element-wise multiplication, use the symbol ⊙.

(b) Find ∂J/∂y_pred, a d × 1 matrix.

Solution:

    ∂J/∂y_pred = (y_pred − y)

(c) Find ∂J/∂z^(2), a d × 1 matrix. You may use ∂J/∂y_pred and ⊙ for element-wise multiplication.

Solution:

    ∂J/∂z^(2) = ∂J/∂y_pred ⊙ ∂y_pred/∂z^(2) = ∂J/∂y_pred ⊙ ∂f^(2)/∂z^(2)

Check: (d × 1) = (d × 1) ⊙ (d × 1) dimensioned arrays; element-wise multiplication.

(d) Find ∂J/∂W^(2), a d × m matrix. You may use ∂J/∂z^(2).

Solution:

    ∂J/∂W^(2) = ∂J/∂z^(2) (∂z^(2)/∂W^(2))^T = ∂J/∂z^(2) (a^(1))^T = ∂J/∂z^(2) f^(1)(W^(1) x + b^(1))^T

Check: (d × m) = (d × 1)(1 × m) dimensioned arrays. Note that we also accept a^(1) even though it was not explicitly provided.

Page 11
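The chain-rule results in parts (b)–(d) can be verified numerically. In this sketch, sigmoid is assumed for f^(1) and f^(2) (so ∂f^(2)/∂z^(2) = y_pred ⊙ (1 − y_pred)); all sizes and random values are illustrative.

```python
import numpy as np

sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))
rng = np.random.default_rng(0)
d, m = 4, 3
x = rng.normal(size=(d, 1)); y = rng.normal(size=(d, 1))
W1 = rng.normal(size=(m, d)); b1 = rng.normal(size=(m, 1))
W2 = rng.normal(size=(d, m)); b2 = rng.normal(size=(d, 1))

def forward(W2_):
    a1 = sigmoid(W1 @ x + b1)
    z2 = W2_ @ a1 + b2
    y_pred = sigmoid(z2)
    return a1, z2, y_pred

a1, z2, y_pred = forward(W2)
J = 0.5 * float((y_pred - y).T @ (y_pred - y))

# Chain rule from parts (b)-(d): element-wise through f^(2), then outer product
dJ_dypred = y_pred - y                         # part (b)
dJ_dz2 = dJ_dypred * (y_pred * (1 - y_pred))   # part (c), with dsigma/dz = sigma(1 - sigma)
dJ_dW2 = dJ_dz2 @ a1.T                         # part (d), a d x m matrix

# Finite-difference check, entry by entry
eps = 1e-6
numeric = np.zeros_like(W2)
for i in range(d):
    for j in range(m):
        Wp = W2.copy(); Wp[i, j] += eps
        _, _, yp = forward(Wp)
        Jp = 0.5 * float((yp - y).T @ (yp - y))
        numeric[i, j] = (Jp - J) / eps

assert dJ_dW2.shape == (d, m)
assert np.allclose(dJ_dW2, numeric, atol=1e-4)
```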
Name:

(e) Write the gradient descent update step for just W^(2) for one datapoint (x, y) given learning rate η and ∂J/∂W^(2).

Solution:

    W^(2) := W^(2) − η · ∂J(x, y)/∂W^(2)

(f) Otto's friend Bigsby believes that bigger is better. He takes a look at Otto's neural network and tells Otto that he should make the number of hidden units m in the hidden layer very large: m = 10d. (Recall that z^(1) has dimensions m × 1.) Is Bigsby correct? What would you expect to see with training and test accuracy using Bigsby's approach?

Solution: No; training accuracy might be high, but this would likely be due to overfitting and lead to worse test accuracy.

(g) Otto's other friend Leila says having more layers is better. Let m be much smaller than d. Leila adds 10 more hidden layers, all with linear activation, before Otto's current hidden layer (which has sigmoid activation function f^(1)), such that each hidden layer has m units. What would you expect to see with your training and test accuracy, compared to just having one hidden layer with activation f^(1)?

Solution: The intermediary hidden layers do not add any expressivity to the network, and we would expect similar training and test accuracy as compared to the single f^(1) hidden-layer network. This may, however, require a different number of training iterations with the same available data in order to achieve similar accuracy.

Page 12
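Part (g)'s claim that stacked linear layers add no expressivity can be demonstrated directly: composing 10 affine maps yields a single affine map. The size m and the random weights below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 3
# 10 hidden layers with linear activation: each is just v -> W v + b
Ws = [rng.normal(size=(m, m)) for _ in range(10)]
bs = [rng.normal(size=(m, 1)) for _ in range(10)]

def stacked(v):
    for W, b in zip(Ws, bs):
        v = W @ v + b
    return v

# Collapse the stack into one affine map v -> A v + c
A = np.eye(m); c = np.zeros((m, 1))
for W, b in zip(Ws, bs):
    A = W @ A
    c = W @ c + b

v = rng.normal(size=(m, 1))
assert np.allclose(stacked(v), A @ v + c)
```

Since the composition is itself affine, the 10 extra layers represent exactly the same family of functions as a single linear layer, which is why training and test accuracy should be similar.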
Name:

(h) Another friend, Neil, suggests having several layers with non-linear activation functions. He says Otto should regularize the number of active hidden units. Loosely speaking, we consider the average activation of a hidden unit j in hidden layer 1 (which has sigmoid activation function f^(1)) to be the average of the activation a^(1)_j over the points x_i in our training dataset of size N:

    p̂_j = (1/N) Σ_{i=1}^{N} a^(1)_j(x_i).

Assume we would like to enforce the constraint that the average activation for each hidden unit p̂_j is close to some hyperparameter p. Usually, p is very small (say p < 0.05). What is the best format for a regularization penalty given hyperparameter p and the average activations of all our hidden units p̂_j? Select one of the following:

    Hinge loss:   Σ_j max(0, (1 − p̂_j) p)
    NLL:          Σ_j [ p log(p/p̂_j) + (1 − p) log((1 − p)/(1 − p̂_j)) ]
    Squared loss: Σ_j (p̂_j − p)²
    l2 norm:      Σ_j (p̂_j)²

Solution: Either NLL or squared loss should work, encouraging p and p̂_j to be close. NLL loss might better handle a wide range in the magnitudes of p̂_j.

(i) Which pass should Otto compute p̂_j on? Select one of the following:

    Forward pass
    Backward pass
    Gradient descent step (weight update) pass

Solution: The forward pass (the activations a^(1)_j are computed during the forward pass).

Page 13
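The NLL/KL-style penalty from part (h) can be evaluated on hypothetical average activations to see that each term is zero exactly when p̂_j = p and grows as p̂_j drifts away; the values in p_hat are made up for illustration.

```python
import numpy as np

p = 0.05                              # target average activation (hyperparameter)
p_hat = np.array([0.02, 0.05, 0.30])  # hypothetical average activations p_hat_j

# NLL / KL-divergence penalty: sum_j p log(p / p_hat_j) + (1 - p) log((1-p)/(1-p_hat_j))
kl = np.sum(p * np.log(p / p_hat) + (1 - p) * np.log((1 - p) / (1 - p_hat)))
# Squared-loss alternative: sum_j (p_hat_j - p)^2
sq = np.sum((p_hat - p) ** 2)

# A single KL term is zero at p_hat_j == p and increases with the mismatch
unit = lambda q: p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))
assert abs(unit(p)) < 1e-12
assert unit(0.30) > unit(0.06) > 0
assert kl > 0 and sq > 0
```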
Name:

Training Neural Networks with Regularization

8. (8 points) In this problem we will investigate regularization for neural networks. Kim constructs a fully connected neural network with L = 2 layers using mean squared error (MSE) loss and ReLU activation functions for the hidden layer, and a linear activation for the output layer. The network is trained with a gradient descent algorithm on a data set of n points {(x^(1), y^(1)), ..., (x^(n), y^(n))}. Recall that the update rule for weights W_1 can be specified in terms of step size η and the gradient of the loss function with respect to weights W_1. This gradient can be expressed in terms of the activations A_l, weights W_l, pre-activations Z_l, and partials ∂L/∂A_2 and ∂A_l/∂Z_l for l = 1, 2:

    W_1 := W_1 − η Σ_{i=1}^{n} ∂L(h(x^(i); W), y^(i))/∂W_1,

where h(·) is the input-output mapping implemented by the entire neural network, and

    ∂L/∂W_1 = ∂Z_1/∂W_1 · ∂A_1/∂Z_1 · W_2 · ∂A_2/∂Z_2 · ∂L/∂A_2.

(a) Derive a new update rule for weights W_1 which also penalizes the sum of squared values of all individual weights in the network:

    L_new = L(h(x^(i); W), y^(i)) + λ ||W||²,

where λ denotes the regularization trade-off parameter. You can express the new update rule as follows:

    W_1 := α W_1 − η Σ_{i=1}^{n} ∂L(h(x^(i); W), y^(i))/∂W_1,

where L(·) represents the previous prediction-error loss. What is the value of α in terms of λ and η?

Solution:

    L_new = L + λ Σ_{i,j,l} (W^l_{i,j})²
    ∂L_new/∂W_1 = ∂L/∂W_1 + 2λ W_1
    W_1 := W_1 − η Σ ∂L_new/∂W_1
    W_1 := (1 − 2λη) W_1 − η Σ ∂L/∂W_1

Thus α = 1 − 2λη.

Page 23
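The algebra behind α = 1 − 2λη can be checked numerically: subtracting the gradient of the penalized loss is the same as first scaling the weights by α and then subtracting the data-loss gradient. The matrix shapes, η, λ, and the stand-in gradient below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 4))
grad = rng.normal(size=(3, 4))   # hypothetical summed data-loss gradient dL/dW1
eta, lam = 0.1, 0.01             # assumed step size and regularization weight

# Penalized update: subtract eta * (grad + 2*lambda*W1)
penalized = W1 - eta * (grad + 2 * lam * W1)

# Equivalent "weight decay" form with alpha = 1 - 2*lambda*eta
alpha = 1 - 2 * lam * eta
decayed = alpha * W1 - eta * grad

assert np.allclose(penalized, decayed)
```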
Name: (b) Explain how this new update rule helps the neural network reduce overfitting to the data. Solution: For reasonable λ and η , the weights are scaled by a factor less than 1 at each iteration. (If 1 - 2 λη > 1, the weights will rapidly grow and diverge.) A value of | α | < 1 pushes the weights toward zero in general, except those weights that are needed to fit substantial subsets of the data (i.e., those weights that are needed to keep the data loss term L low). (c) Given that we are training a neural network with gradient descent, what happens when we increase the regularization trade-off parameter λ too much, while holding the step size η fixed? Solution: With too large a λ , α may approach zero and the weights would be too strongly penalized and thus tend to zero, preventing the neural network from fitting the available training data. That is to say, the network is pushed towards an overly “generalized” constant output based on zero or near-zero weights. With even larger values of λ , α may become negative causing oscillations in weights. With | α | larger than 1, the weights will grow in magnitude without bound. Page 24
change from CHARGE because, even though both CHARGE and EXPLORE appear to cost the same -10 in the table, performing a CHARGE leads to higher-reward states.

c) Recall that V(s) = max_a Q(s, a). The possible s' in this model for s = EMPTY are EMPTY and HIGH. We determine Q_2(EMPTY, EXPLORE) and Q_2(EMPTY, CHARGE) and take the max of the two to be V_2(EMPTY).

    Q_2(EMPTY, EXPLORE) = -10 + 0.5 · V_1(EMPTY) = -15
    Q_2(EMPTY, CHARGE) = -10 + 0.5 · V_1(HIGH) = -9

So V_2(EMPTY) = -9.

d) Solutions are fine.

7) a) Recall that the normal vector to a hyperplane is defined as the θ parameter. The hidden sign units in this network perform the same kind of linear combination with an offset that our signed hyperplanes use.

b) Using the fact that the hidden units act as linear classifiers, we can draw decision boundaries that separate the signed points (orientation is important). Multiple solutions.

c) Solutions are fine.

d) Solutions are fine.

8) a) Solutions are fine.

b) We get the optimal policy from the largest Q_1(s, a) values.

c) We can get the V_1(s) values from the largest Q(s, a) values.

9) a) Recall that the margin of a dataset with respect to a separator is the minimum margin of all points in the dataset with respect to the separator. Removing a single point can result in the margin of the dataset increasing (if the removed point had the minimum margin) or staying the same (if it did not).

b) If there were a separator for the dataset with some of the coordinates omitted, then the same separator (with 0 weights for the omitted coordinates) would still separate the original dataset.

c) The setup of this problem is unusual because we are given a fixed set of classifiers H = {h_1, h_2, h_3} and select one based on training set error. The parameters of the classifiers do not actually depend on the training set, and the test set error is fixed for a particular choice of classifier. We are also told the relative test errors are E(h_1) < E(h_2) < E(h_3). The point of the problem is to see how the function we choose given a training set changes with the number of training points.
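The value-iteration arithmetic in part c) above can be replayed in a few lines. The V_1 values used are the ones implied by the equations given there (V_1(EMPTY) = -10 and V_1(HIGH) = 2); the discount factor 0.5 and reward -10 also come from those equations.

```python
# Recomputing Q_2 and V_2 for s = EMPTY, with values inferred from the solution:
# R(EMPTY, a) = -10 for both actions, gamma = 0.5,
# V_1(EMPTY) = -10 and V_1(HIGH) = 2.
gamma = 0.5
V1 = {"EMPTY": -10, "HIGH": 2}
Q2 = {
    "EXPLORE": -10 + gamma * V1["EMPTY"],  # lands in EMPTY
    "CHARGE": -10 + gamma * V1["HIGH"],    # lands in HIGH
}
V2_EMPTY = max(Q2.values())                # V(s) = max_a Q(s, a)

assert Q2["EXPLORE"] == -15
assert Q2["CHARGE"] == -9
assert V2_EMPTY == -9
```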