CS 188 Summer 2023 Final Review: Neural Networks Solutions

Q1. Neural Networks: Representation

[Figure: computation graphs for ten candidate networks. The networks G1-G5 use 1-dimensional (scalar) intermediate values with weights w1, w2: G1 is purely linear (y = w2 (w1 x)); G2 adds a first-layer bias b1; G3 applies relu to w1 x with no biases; G4 applies relu to w1 x + b1; G5 additionally adds an output bias b2. The networks H1-H5 have the same structure but use a 2-dimensional hidden layer, with weights w11, w12, w21, w22 and hidden biases b11, b12 (plus an output bias b2 in H5).]

For each of the piecewise-linear functions below, mark all networks from the list above that can represent the function exactly on the range x ∈ (−∞, ∞). In the networks above, relu denotes the element-wise ReLU nonlinearity: relu(z) = max(0, z). The networks Gi use 1-dimensional layers, while the networks Hi have some 2-dimensional intermediate layers.

(a) [Plots omitted. Both target functions are non-horizontal straight lines; the first passes through the origin, the second does not.]

Networks that can represent the first function: G1, G2, H1, H2, H3, H4, H5. Second function: G2, H2, H4, H5.

The networks G3, G4, G5 include a ReLU nonlinearity applied to a scalar quantity, so it is impossible for their output to be a non-horizontal straight line. On the other hand, H3, H4, H5 have a 2-dimensional hidden layer, which allows two ReLU elements facing in opposite directions to be added together to form a straight line. The second subpart additionally requires a bias term, because the line does not pass through the origin.
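As a quick numerical sanity check of the "two ReLUs facing in opposite directions" argument (a sketch, not part of the original solutions; the bias value b = 3 is arbitrary), the snippet below verifies that relu(x + b) - relu(-(x + b)) reproduces the straight line y = x + b on a grid of points. Setting b = 0 gives a line through the origin, which is why H3 suffices for the first subpart while the second needs the biased networks.

    import numpy as np

    def relu(z):
        return np.maximum(0.0, z)

    # Two hidden ReLU units facing in opposite directions, as in the H-networks'
    # 2-dimensional hidden layer, summed with output weights +1 and -1.
    x = np.linspace(-10.0, 10.0, 41)
    b = 3.0  # hidden-layer bias (arbitrary illustrative value)
    y = relu(x + b) - relu(-(x + b))

    print(np.allclose(y, x + b))  # True: the sum is exactly the line y = x + b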
(b) [Plots omitted. The first function consists of two non-horizontal linear pieces meeting at the origin; the second has three points where the slope changes.]

Networks that can represent the first function: H3, H4, H5. Second function: none of the above.

These functions include multiple non-horizontal linear regions, so they cannot be represented by any of the networks Gi, which apply ReLU at most once to a scalar quantity. The first subpart can be represented by any of the networks with 2-dimensional ReLU nodes (H3, H4, H5); its point of nonlinearity occurs at the origin, so nonzero bias terms are not required. The second subpart has 3 points where the slope changes, but the networks Hi have only a single 2-dimensional ReLU node, and each application of ReLU to one element can introduce a change of slope at only a single value of x, so at most two slope changes are possible.

(c) [Plots omitted. Both functions have two points where the slope changes; in the first, neither flat region lies at y = 0.]

Networks that can represent the first function: H5. Second function: H4, H5.

Both functions have two points where the slope changes, so none of the networks Gi, nor H1 or H2, can represent them. An output bias term is required for the first subpart because one of the flat regions must be generated by the flat part of a ReLU function, but neither of the flat regions is at y = 0. The second subpart does not require a bias term at the output: it can be represented as relu((x + 1)/2) - relu(-x + 1). Note that if the segment at x > 1 were extended to cross the x-axis, it would cross exactly at x = -1, the location of the other slope change; a similar statement holds for the segment at x < -1.
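Since the plots are not reproduced here, the short check below only exercises the representation reconstructed above, relu((x + 1)/2) - relu(-x + 1) (a sketch under that assumption, not part of the original solutions): it confirms the function has exactly two slope changes, at x = -1 and x = 1, and that each outer segment, extended, crosses the x-axis at the other kink, which is why no output bias is needed.

    import numpy as np

    def relu(z):
        return np.maximum(0.0, z)

    def f(x):
        # Reconstructed representation for the last function in (c): no output bias.
        return relu((x + 1.0) / 2.0) - relu(-x + 1.0)

    # Slopes of the three linear pieces (kinks at x = -1 and x = 1): 1.0, 1.5, 0.5.
    for lo, hi in [(-5.0, -1.0), (-1.0, 1.0), (1.0, 5.0)]:
        print((f(hi) - f(lo)) / (hi - lo))

    # Extend each outer segment as a full line and find its x-axis crossing.
    right_slope = (f(5.0) - f(1.0)) / 4.0
    left_slope = (f(-1.0) - f(-5.0)) / 4.0
    print(1.0 - f(1.0) / right_slope)   # -1.0: the location of the other kink
    print(-1.0 - f(-1.0) / left_slope)  #  1.0: the location of the other kink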
Q2. Deep "Blackjack"

To celebrate the end of the semester, you visit Las Vegas and decide to play a good, old-fashioned game of "Blackjack"! Recall that the game has states 0, 1, ..., 8, corresponding to dollar amounts, and a Done state where the game ends. The player starts with $2, i.e. at state 2. The player has two actions: Stop (a = 0) and Roll (a = 1), and is forced to take the Stop action at states 0, 1, and 8. When the player takes the Stop action (a = 0), they transition to the Done state and receive reward equal to the dollar amount of the state they transitioned from: e.g. taking the Stop action at state 3 gives the player $3. The game ends when the player transitions to Done.

The Roll action (a = 1) is available from states 2-7. The player rolls a biased 6-sided die. If the player Rolls from state s and the die lands on outcome o, the player transitions to state s + o - 2, as long as s + o - 2 ≤ 8 (s is the dollar amount of the current state, o is the amount rolled, and the -2 is the price to roll). If s + o - 2 > 8, the player busts, i.e. transitions to Done and does NOT receive reward.

As the bias of the die is unknown, you decided to perform some good old-fashioned reinforcement learning (RL) to solve the game. However, unlike in the midterm, you have decided to flex and solve the game using approximate Q-learning. Not only that, you decided not to design any features: the features for the Q-value at (s, a) will simply be the vector [s, a], where s is the state and a is the action.

(a) First, we will investigate how your choice of features impacts whether or not you can learn the optimal policy. Suppose the unique optimal policy in the MDP is the following:

    State   2     3     4     5     6     7
    π*(s)   Roll  Roll  Roll  Stop  Stop  Stop

For each of the cases below, select "Possible with large neural net" if the policy can be expressed by using a large neural net to represent the Q-function using the specified features as input. (That is, the greedy policy with respect to some Q-function representable with a large neural network is the optimal policy: Q(s, π*(s)) > Q(s, a) for all states s and actions a ≠ π*(s).) Select "Possible with weighted sum" if the policy can be expressed by using a weighted linear sum to represent the Q-function. Select "Not possible" if expressing the policy with the given features is impossible no matter the function.

(i) Suppose we decide to use the state s and action a as the features for Q(s, a).

    [x] Possible with large neural network
    [ ] Possible with linear weighted sum of features
    [ ] Not possible

A sufficiently large neural network could represent the true optimal Q-function using this feature representation, and the optimal Q-function satisfies the desired property (there are no ties, as the optimal policy is unique). Alternatively, a sufficiently large neural network could represent a function that is 1 for the optimal action in each state and 0 otherwise, which also suffices.

No linear weighted sum of features can represent this optimal policy. To see this, let our linear weighted sum be w0 s + w1 a + w2. We need Q(4, 1) > Q(4, 0) and Q(5, 0) > Q(5, 1). Plugging in the expression for Q, the former inequality requires that w1 > 0, while the latter requires that w1 < 0, a contradiction. So we cannot represent the policy with a weighted sum of features.
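As a small symbolic check of the contradiction above (a sketch, not part of the original solutions; it assumes the sympy package is available), the two margins that the optimal policy requires to be positive work out to w1 and -w1, which cannot both hold:

    import sympy as sp

    w0, w1, w2, s, a = sp.symbols("w0 w1 w2 s a")
    Q = w0 * s + w1 * a + w2  # linear Q-function over the features [s, a], plus a bias

    # The optimal policy needs Q(4, 1) > Q(4, 0) and Q(5, 0) > Q(5, 1).
    roll_margin = sp.expand(Q.subs({s: 4, a: 1}) - Q.subs({s: 4, a: 0}))
    stop_margin = sp.expand(Q.subs({s: 5, a: 0}) - Q.subs({s: 5, a: 1}))
    print(roll_margin, stop_margin)  # prints "w1 -w1": both cannot be positive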
(ii) Now suppose we decide to use s + a as the feature for Q(s, a).

    [x] Possible with large neural network
    [ ] Possible with linear weighted sum of features
    [ ] Not possible

Indeed, it is possible that no neural network can represent the optimal Q-function with this feature representation: any function of the feature s + a forces Q(4, 1) = Q(5, 0), even though the optimal Q-values at (4, 1) and (5, 0) need not be equal. However, the question is not asking about representing the optimal Q-function, but the optimal policy, which merely requires that Q(s, 1) > Q(s, 0) for s ≤ 4 and Q(s, 0) > Q(s, 1) for s ≥ 5. This can be done with this feature representation. For example, the neural network in part (d) can represent the following function (using w0 = -5 and w1 = -2), which represents the optimal policy:

[Figure: a Q-function of s + a that represents the optimal policy.]

Again, no linear weighted sum of features can represent this optimal policy. To see this, let our linear weighted sum be w0 (s + a) + w1 = w0 s + w0 a + w1. This is a special case of the linear weighted sums in part (i), which we know cannot represent the optimal policy.

(iii) Now suppose we decide to use a as the feature for Q(s, a).

    [ ] Possible with large neural network
    [ ] Possible with linear weighted sum of features
    [x] Not possible

This isn't possible, regardless of what function you use. With this representation, we will have Q(4, 1) = Q(5, 1) and Q(4, 0) = Q(5, 0), so we cannot have both Q(4, 1) > Q(4, 0) and Q(5, 0) > Q(5, 1).

(iv) Now suppose we decide to use sign(s - 4.5) · a as the feature for Q(s, a), where sign(x) is -1 if x < 0, 1 if x > 0, and 0 if x = 0.

    [x] Possible with large neural network
    [x] Possible with linear weighted sum of features
    [ ] Not possible

Q(s, a) = -sign(s - 4.5) · a is sufficient to represent the optimal policy, as we have Q(s, 0) = 0 for all s, Q(s, 1) = 1 > Q(s, 0) for s ≤ 4, and Q(s, 1) = -1 < Q(s, 0) for s ≥ 5. This is a linear function of the input feature, and so can also be represented using a neural network.
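The reconstructed weights above (w0 = -5 and w1 = -2 in the part (d) architecture; the signs did not survive extraction, so these values are inferred) can be checked directly: the resulting Q-function depends on (s, a) only through s + a, yet its greedy policy matches the optimal policy. A minimal sketch, not part of the original solutions:

    def relu(z):
        return max(0.0, z)

    def Q(s, a, w0=-5.0, w1=-2.0):
        # Part (d) network shape: Q_w(s, a) = (s + a) + w1 * relu(w0 + s + a),
        # which is a function of the single feature x = s + a.
        x = s + a
        return x + w1 * relu(w0 + x)

    optimal = {2: 1, 3: 1, 4: 1, 5: 0, 6: 0, 7: 0}  # 1 = Roll, 0 = Stop
    greedy = {s: int(Q(s, 1) > Q(s, 0)) for s in range(2, 8)}
    print(greedy == optimal)  # True: the greedy policy is the optimal policy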
(b) Next, we investigate the effect of different neural network architectures on your ability to learn the optimal policy. Recall that our features for the Q-value at (s, a) will simply be the vector [s, a], where s is the state and a is the action. In addition, suppose that the unique optimal policy is the following:

    State   2     3     4     5     6     7
    π*(s)   Roll  Roll  Roll  Stop  Stop  Stop

Which of the following neural network architectures can express Q-values that represent the optimal policy? That is, the greedy policy with respect to some Q-function representable with the given neural network is the optimal policy: Q(s, π*(s)) > Q(s, a) for all states s and actions a ≠ π*(s).

Hint: Recall that ReLU(x) = x if x > 0, and 0 if x ≤ 0.

[Figure: diagrams of Neural Network 1 through Neural Network 5, omitted here.]
Selected: Neural Network 5 only (not "None of the above").

Recall from the previous question that no linear function of the features [s, a] can represent the optimal policy. So Network 1, which is linear (it has no activation function), cannot represent the optimal policy.

Network 2 cannot represent the optimal policy, as it does not take the action as input, so Q(s, 0) = Q(s, 1) for all states s.

Network 3 cannot simultaneously satisfy Q(4, 0) < Q(4, 1) and Q(5, 0) > Q(5, 1). This is because the rectified linear unit is a monotonic function: if x ≤ y, then ReLU(x) ≤ ReLU(y). Since we cannot represent the optimal policy using a linear function of s and a, we cannot represent it with a ReLU of a linear function of s and a.

Network 4 is always 0, so it cannot represent the (unique) optimal policy.

Network 5 can represent the optimal policy; for example, w0 = 4, w1 = 2 represents the optimal policy.

(c) As with linear approximate Q-learning, you decide to minimize the squared error of the Bellman residual. Let Q_w(s, a) be the approximate Q-value of (s, a). After taking action a in state s and transitioning to state s' with reward r, you first compute the target

    target = r + γ max_{a'} Q_w(s', a').

Then your loss is

    loss(w) = (1/2) (Q_w(s, a) - target)^2.

You then perform gradient descent to minimize this loss. Note that we will not take the gradient through the target: we treat it as a fixed value. Which of the following updates represents one step of gradient descent on the weight parameter w_i with learning rate α ∈ (0, 1) after taking action a in state s and transitioning to state s' with reward r? [Hint: which of these is equivalent to the normal approximate Q-learning update when Q_w(s, a) = w · f(s, a)?]

    [ ] w_i ← w_i + α (Q_w(s, a) - (r + γ max_{a'} Q_w(s', a'))) ∂Q_w(s, a)/∂w_i
    [x] w_i ← w_i - α (Q_w(s, a) - (r + γ max_{a'} Q_w(s', a'))) ∂Q_w(s, a)/∂w_i
    [ ] w_i ← w_i + α (Q_w(s, a) - (r + γ max_{a'} Q_w(s', a'))) s
    [ ] w_i ← w_i - α (Q_w(s, a) - (r + γ max_{a'} Q_w(s', a'))) s
    [ ] None of the above

Note that the gradient of the loss with respect to the parameter, via the chain rule, is

    (Q_w(s, a) - (r + γ max_{a'} Q_w(s', a'))) ∂Q_w(s, a)/∂w_i.

The second option performs gradient descent; the first option is gradient ascent, and the other options compute the gradient incorrectly.

(d) While programming the neural network, you're getting some bizarre errors. To debug these, you decide to calculate the gradients by hand and compare them to the result of your code. Suppose your neural network is the following (Neural Network 6, figure omitted):

    Q_w(s, a) = s + a + w1 ReLU(w0 + s + a).
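Putting the update selected in part (c) together with the Neural Network 6 architecture, here is a minimal sketch of one gradient-descent step (not part of the original solutions; the transition, reward, learning rate, and initial weights are made-up illustrative values):

    def relu(z):
        return max(0.0, z)

    def drelu(z):
        # Derivative convention used in part (d): 1 for z >= 0, 0 for z < 0.
        return 1.0 if z >= 0 else 0.0

    def Q(w, s, a):
        # Neural Network 6: Q_w(s, a) = s + a + w1 * relu(w0 + s + a)
        w0, w1 = w
        return s + a + w1 * relu(w0 + s + a)

    def grad_Q(w, s, a):
        # dQ/dw0 = w1 * relu'(w0 + s + a),  dQ/dw1 = relu(w0 + s + a)
        w0, w1 = w
        z = w0 + s + a
        return [w1 * drelu(z), relu(z)]

    def q_step(w, s, a, r, s_next, alpha=0.1, gamma=1.0, actions=(0, 1)):
        # One gradient-descent step on 0.5 * (Q_w(s,a) - target)^2, target held fixed.
        target = r + gamma * max(Q(w, s_next, a2) for a2 in actions)
        delta = Q(w, s, a) - target
        return [wi - alpha * delta * gi for wi, gi in zip(w, grad_Q(w, s, a))]

    # Illustrative transition: Roll (a=1) at s=5, land in s=6, reward 0.
    print(q_step([-4.0, 1.0], s=5, a=1, r=0.0, s_next=6))  # approximately [-3.8, 1.4]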
You are able to recall that d/dx ReLU(x) = 1 if x ≥ 0, and 0 if x < 0.

(i) Suppose w0 = -4 and w1 = -1. What is Q_w(5, 0)?

Q_w(5, 0) = 4. Plugging in values, we get Q_w(5, 0) = 5 + 0 - ReLU(-4 + 5) = 5 - 1 = 4.

(ii) Suppose w0 = -4 and w1 = -1. What is the gradient with respect to w0, evaluated at s = 5, a = 0?

∂Q_w(5, 0)/∂w0 = -1. Since the input to the ReLU is positive, the gradient of the ReLU is 1. Applying the chain rule, we get ∂Q_w(5, 0)/∂w0 = w1 · 1 = -1.

(iii) Suppose w0 = -4 and w1 = -1. What is the gradient with respect to w0, evaluated at s = 3, a = 0?

∂Q_w(3, 0)/∂w0 = 0. Since the input to the ReLU is negative, the gradient of the ReLU is 0. Applying the chain rule, the gradient with respect to w0 must be 0.

(e) After picking a feature representation, neural network architecture, and update rule, as well as calculating the gradients, it's time to turn to the age-old question... will this even work?

(i) Without any other assumptions, is it guaranteed that your approximate Q-values will converge to the optimal policy, if each (s, a) pair is observed an infinite number of times?

    [ ] Yes
    [x] No

(ii) Without any other assumptions, is it guaranteed that your approximate Q-values will converge to the optimal policy, if each (s, a) pair is observed an infinite number of times and there exists some w such that Q_w(s, a) equals the optimal Q-value Q*(s, a) for all (s, a)?

    [ ] Yes
    [x] No

Note that there is no guarantee that your neural network will converge in this case. For example, the learning rate can be too large! (As in the RL "Blackjack" question on the midterm.)
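To mirror the debugging workflow described in part (d), here is a short script (a sketch, not part of the original solutions) that recomputes the three hand-calculated values for Neural Network 6:

    def relu(z):
        return max(0.0, z)

    def drelu(z):
        return 1.0 if z >= 0 else 0.0  # d/dz ReLU(z), as recalled above

    def Q(w0, w1, s, a):
        return s + a + w1 * relu(w0 + s + a)

    def dQ_dw0(w0, w1, s, a):
        return w1 * drelu(w0 + s + a)  # chain rule through the ReLU

    w0, w1 = -4.0, -1.0
    print(Q(w0, w1, 5, 0))       # 4.0   (part d.i)
    print(dQ_dw0(w0, w1, 5, 0))  # -1.0  (part d.ii: ReLU input is +1, so gradient is w1)
    print(dQ_dw0(w0, w1, 3, 0))  # 0.0   (part d.iii: ReLU input is -1, so gradient is 0)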