CS 188 Summer 2023                                             Discussion 6A

Q1. RL: Amusement Park

After the disastrous waterslide experience, you decide to go to an amusement park instead. In the previous questions the MDP was based on a single ride (a water slide). Here our MDP is about choosing a ride from a set of many rides.

You start off feeling well, getting positive rewards from rides, some larger than others. However, there is some chance of each ride making you sick. If you continue going on rides while sick, there is some chance of becoming well again, but you don't enjoy the rides as much, receiving lower rewards (possibly negative).

You have never been to an amusement park before, so you don't know how much reward you will get from each ride (while well or sick). You also don't know how likely you are to get sick on each ride, or how likely you are to become well again. What you do know about the rides is:

    Actions / Rides    Type            Wait    Speed
    Big Dipper         Rollercoaster   Long    Fast
    Wild Mouse         Rollercoaster   Short   Slow
    Hair Raiser        Drop tower      Short   Fast
    Moon Ranger        Pendulum        Short   Slow
    Leave the Park     Leave           Short   Slow

We will formulate this as an MDP with two states, Well and Sick. Each ride corresponds to an action. The 'Leave the Park' action ends the current run through the MDP. Taking a ride will lead back to the same state with some probability, or take you to the other state.

We will use a feature-based approximation to the Q-values, defined by the following four features and associated weights:

    Features                                                                Initial Weights
    f_0(state, action) = 1  (a bias feature that is always 1)               w_0 = 1
    f_1(state, action) = 1 if action type is Rollercoaster, 0 otherwise     w_1 = 2
    f_2(state, action) = 1 if action wait is Short, 0 otherwise             w_2 = 1
    f_3(state, action) = 1 if action speed is Fast, 0 otherwise             w_3 = 0.5

(a) Calculate Q('Well', 'Big Dipper'):
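The approximate Q-function above is just a weighted sum of the features, Q(s, a) = Σ_i w_i · f_i(s, a). The following Python sketch (not part of the worksheet; the ride dictionary and helper names are only illustrative transcriptions of the tables) computes that sum for any state-action pair:

    # A minimal sketch of the linear Q-value approximation
    # Q(s, a) = sum_i w_i * f_i(s, a) described above.

    RIDES = {
        "Big Dipper":     {"type": "Rollercoaster", "wait": "Long",  "speed": "Fast"},
        "Wild Mouse":     {"type": "Rollercoaster", "wait": "Short", "speed": "Slow"},
        "Hair Raiser":    {"type": "Drop tower",    "wait": "Short", "speed": "Fast"},
        "Moon Ranger":    {"type": "Pendulum",      "wait": "Short", "speed": "Slow"},
        "Leave the Park": {"type": "Leave",         "wait": "Short", "speed": "Slow"},
    }

    def features(state, action):
        """f_0..f_3 from the table; note that none of them depend on the state."""
        ride = RIDES[action]
        return [
            1.0,                                                # f_0: bias
            1.0 if ride["type"] == "Rollercoaster" else 0.0,    # f_1: rollercoaster?
            1.0 if ride["wait"] == "Short" else 0.0,            # f_2: short wait?
            1.0 if ride["speed"] == "Fast" else 0.0,            # f_3: fast?
        ]

    def q_value(weights, state, action):
        return sum(w * f for w, f in zip(weights, features(state, action)))

    weights = [1.0, 2.0, 1.0, 0.5]                  # initial weights from the table
    print(q_value(weights, "Well", "Big Dipper"))   # 1 + 2 + 0 + 0.5 = 3.5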
(b) Apply a Q-learning update based on the sample ('Well', 'Big Dipper', 'Sick', 10.5), using a learning rate of α = 0.5 and a discount of γ = 0.5. What are the new weights?

    w_0 =        w_1 =        w_2 =        w_3 =
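For reference, the standard approximate Q-learning update moves each weight along the TD error: difference = [r + γ max_a' Q(s', a')] - Q(s, a), then w_i ← w_i + α · difference · f_i(s, a). A minimal continuation of the sketch above applies that update to the given sample (reusing the hypothetical helpers defined there):

    # One approximate Q-learning weight update for the sample
    # (s='Well', a='Big Dipper', s'='Sick', r=10.5), alpha = 0.5, gamma = 0.5.
    alpha, gamma = 0.5, 0.5
    s, a, s_next, r = "Well", "Big Dipper", "Sick", 10.5

    # difference = [r + gamma * max_a' Q(s', a')] - Q(s, a)
    max_q_next = max(q_value(weights, s_next, act) for act in RIDES)
    difference = (r + gamma * max_q_next) - q_value(weights, s, a)

    # w_i <- w_i + alpha * difference * f_i(s, a)
    f = features(s, a)
    weights = [w + alpha * difference * f_i for w, f_i in zip(weights, f)]
    print(weights)   # the updated w_0..w_3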
(c) Using our approximation, are the Q-values that involve the Sick state the same as, or different from, the corresponding Q-values that involve the Well state? In other words, is Q('Well', action) = Q('Sick', action) for each possible action? Why / why not? (in just one sentence)

    Same        Different

Now we will consider the exploration / exploitation tradeoff in this amusement park.

(d) Assume we have the original weights from the feature table above. What action will an ε-greedy approach choose from the Well state? If multiple actions could be chosen, give each action and its probability.

(e) When running Q-learning, another approach to dealing with this tradeoff is using an exploration function:

    f(u, n) = u + k/n

(i) How is this function used in the Q-learning equations? (a single sentence)

What are each of the following? (a single sentence each)

(ii) u:

(iii) n:

(iv) k:
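As background (not the worksheet's answer key): ε-greedy picks a uniformly random action with probability ε and otherwise the highest-Q action, breaking ties uniformly; an exploration function such as f(u, n) = u + k/n is typically substituted into the update target in place of the raw Q-value, using a visit count for n. Below is a sketch of the ε-greedy part only, reusing the hypothetical helpers from the earlier sketch; the value of ε is illustrative, since the worksheet does not specify one.

    import random

    def epsilon_greedy(weights, state, epsilon=0.1):
        # With probability epsilon, explore: pick a uniformly random action.
        if random.random() < epsilon:
            return random.choice(list(RIDES))
        # Otherwise exploit: pick the highest-Q action, breaking ties uniformly.
        best_q = max(q_value(weights, state, act) for act in RIDES)
        best = [act for act in RIDES if q_value(weights, state, act) == best_q]
        return random.choice(best)

    # Part (d) asks about the original weights, so pass them in explicitly.
    print(epsilon_greedy([1.0, 2.0, 1.0, 0.5], "Well"))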
Q2. Deep inside Q-learning

Consider the grid-world given below and an agent who is trying to learn the optimal policy. Rewards are only awarded for taking the Exit action from one of the shaded states. Taking this action moves the agent to the Done state, and the MDP terminates. Assume γ = 1 and α = 0.5 for all calculations. All equations need to explicitly mention γ and α if necessary.

(a) The agent starts from the top left corner and you are given the following episodes from runs of the agent through this grid-world. Each line in an episode is a tuple (s, a, s', r).

    Episode 1              Episode 2               Episode 3              Episode 4              Episode 5
    (1,3), S, (1,2), 0     (1,3), S, (1,2), 0      (1,3), S, (1,2), 0     (1,3), S, (1,2), 0     (1,3), S, (1,2), 0
    (1,2), E, (2,2), 0     (1,2), E, (2,2), 0      (1,2), E, (2,2), 0     (1,2), E, (2,2), 0     (1,2), E, (2,2), 0
    (2,2), E, (3,2), 0     (2,2), S, (2,1), 0      (2,2), E, (3,2), 0     (2,2), E, (3,2), 0     (2,2), E, (3,2), 0
    (3,2), N, (3,3), 0     (2,1), Exit, D, -100    (3,2), S, (3,1), 0     (3,2), N, (3,3), 0     (3,2), S, (3,1), 0
    (3,3), Exit, D, +50                            (3,1), Exit, D, +30    (3,3), Exit, D, +50    (3,1), Exit, D, +30

Fill in the following Q-values obtained from direct evaluation from the samples:

    Q((3,2), N) =
    Q((3,2), S) =
    Q((2,2), E) =

(b) Q-learning is an online algorithm to learn optimal Q-values in an MDP with unknown rewards and transition function. The update equation is:

    Q(s_t, a_t) = (1 - α) Q(s_t, a_t) + α (r_t + γ max_a Q(s_{t+1}, a))

where γ is the discount factor, α is the learning rate, and the sequence of observations is (..., s_t, a_t, s_{t+1}, r_t, ...).

Given the episodes in (a), fill in the time at which the following Q-values first become non-zero. When updating the Q-values, you should only go through each transition once, and the order in which you are to go through them is: the transitions in Episode 1, then the transitions in Episode 2, and so on. Your answer should be of the form (episode#, iter#), where iter# is the Q-learning update iteration in that episode. If the specified Q-value never becomes non-zero, write never.

    Q((1,2), E) =
    Q((2,2), E) =
    Q((3,2), S) =

(c) In Q-learning, we look at a window of (s_t, a_t, s_{t+1}, r_t) to update our Q-values. One can think of using an update rule that uses a larger window to update these values. Give an update rule for Q(s_t, a_t) given the window (s_t, a_t, r_t, s_{t+1}, a_{t+1}, r_{t+1}, s_{t+2}).

    Q(s_t, a_t) =
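To make part (b) concrete, here is a minimal Python sketch (not part of the worksheet; the data structures and helper names are illustrative) that replays the five episodes above in order, applies the tabular update Q(s_t, a_t) = (1 - α) Q(s_t, a_t) + α (r_t + γ max_a Q(s_{t+1}, a)) with γ = 1 and α = 0.5, and records when each Q-value first becomes non-zero:

    from collections import defaultdict

    GAMMA, ALPHA = 1.0, 0.5

    # Transitions (s, a, s', r) transcribed from the episodes in part (a).
    # "D" is the terminal Done state; its Q-values stay 0.
    episodes = [
        [((1,3),"S",(1,2),0), ((1,2),"E",(2,2),0), ((2,2),"E",(3,2),0),
         ((3,2),"N",(3,3),0), ((3,3),"Exit","D",50)],
        [((1,3),"S",(1,2),0), ((1,2),"E",(2,2),0), ((2,2),"S",(2,1),0),
         ((2,1),"Exit","D",-100)],
        [((1,3),"S",(1,2),0), ((1,2),"E",(2,2),0), ((2,2),"E",(3,2),0),
         ((3,2),"S",(3,1),0), ((3,1),"Exit","D",30)],
        [((1,3),"S",(1,2),0), ((1,2),"E",(2,2),0), ((2,2),"E",(3,2),0),
         ((3,2),"N",(3,3),0), ((3,3),"Exit","D",50)],
        [((1,3),"S",(1,2),0), ((1,2),"E",(2,2),0), ((2,2),"E",(3,2),0),
         ((3,2),"S",(3,1),0), ((3,1),"Exit","D",30)],
    ]

    Q = defaultdict(float)   # Q[(s, a)], initialised to 0

    def max_q(state):
        # Max over actions seen so far from this state (0 if none, e.g. the Done state).
        vals = [v for (s, a), v in Q.items() if s == state]
        return max(vals) if vals else 0.0

    first_nonzero = {}
    for ep_num, ep in enumerate(episodes, start=1):
        for it, (s, a, s_next, r) in enumerate(ep, start=1):
            Q[(s, a)] = (1 - ALPHA) * Q[(s, a)] + ALPHA * (r + GAMMA * max_q(s_next))
            if Q[(s, a)] != 0 and (s, a) not in first_nonzero:
                first_nonzero[(s, a)] = (ep_num, it)   # (episode#, iter#)

    for query in [((1,2), "E"), ((2,2), "E"), ((3,2), "S")]:
        print(query, first_nonzero.get(query, "never"))

Running the loop transition by transition in this order is exactly the "go through each transition once" procedure that part (b) describes.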