CS 188 Spring 2023 Regular Discussion 8 Solutions

1 Pacman with Feature-Based Q-Learning

We would like to use a Q-learning agent for Pacman, but the size of the state space for a large grid is too massive to hold in memory. To solve this, we will switch to a feature-based representation of Pacman's state.

(a) We will have two features, F_g and F_p, defined as follows:

    F_g(s, a) = A(s) + B(s, a) + C(s, a)
    F_p(s, a) = D(s) + 2·E(s, a)

where

    A(s)    = number of ghosts within 1 step of state s
    B(s, a) = number of ghosts Pacman touches after taking action a from state s
    C(s, a) = number of ghosts within 1 step of the state Pacman ends up in after taking action a
    D(s)    = number of food pellets within 1 step of state s
    E(s, a) = number of food pellets eaten after taking action a from state s

For this Pacman board, the ghosts will always be stationary, and the action space is {left, right, up, down, stay}. Calculate the features for the actions in {left, right, up, stay}.

    F_p(s, up)    = 1 + 2(1) = 3
    F_p(s, left)  = 1 + 2(0) = 1
    F_p(s, right) = 1 + 2(0) = 1
    F_p(s, stay)  = 1 + 2(0) = 1

    F_g(s, up)    = 2 + 0 + 0 = 2
    F_g(s, left)  = 2 + 1 + 1 = 4
    F_g(s, right) = 2 + 1 + 1 = 4
    F_g(s, stay)  = 2 + 0 + 2 = 4
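Since the worksheet's board figure is not reproduced here, the following is only a rough Python sketch of how such count-based features could be computed from coordinate lists of ghosts and food. The board positions, helper names, and the reading of "within 1 step" as Manhattan distance ≤ 1 (ignoring walls) are all assumptions, not part of the original problem.

```python
# Hypothetical, simplified feature computation for a Pacman-style board.
# "Within 1 step" is read as Manhattan distance <= 1, and walls are ignored.

ACTIONS = {"left": (-1, 0), "right": (1, 0), "up": (0, 1),
           "down": (0, -1), "stay": (0, 0)}

def within_one_step(pos, targets):
    """Count targets at Manhattan distance <= 1 from pos."""
    return sum(1 for t in targets if abs(pos[0] - t[0]) + abs(pos[1] - t[1]) <= 1)

def features(pacman, ghosts, food, action):
    """Return F_g(s, a) and F_p(s, a) for Pacman at `pacman` taking `action`."""
    dx, dy = ACTIONS[action]
    nxt = (pacman[0] + dx, pacman[1] + dy)       # state Pacman ends up in
    A = within_one_step(pacman, ghosts)          # ghosts within 1 step of s
    B = sum(1 for g in ghosts if g == nxt)       # ghosts Pacman touches after taking a
    C = within_one_step(nxt, ghosts)             # ghosts within 1 step of the resulting state
    D = within_one_step(pacman, food)            # food pellets within 1 step of s
    E = sum(1 for f in food if f == nxt)         # food pellets eaten by taking a
    return {"F_g": A + B + C, "F_p": D + 2 * E}

# Example call with made-up positions (not the worksheet's board):
print(features(pacman=(1, 1), ghosts=[(0, 1), (2, 1)], food=[(1, 2)], action="up"))
# -> {'F_g': 2, 'F_p': 3}
```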
(b) After a few episodes of Q-learning, the weights are w_g = −10 and w_p = 100. Calculate the Q-value for each action in {left, right, up, stay} from the current state shown in the figure.

    Q(s, up)    = w_p·F_p(s, up)    + w_g·F_g(s, up)    = 100(3) + (−10)(2) = 280
    Q(s, left)  = w_p·F_p(s, left)  + w_g·F_g(s, left)  = 100(1) + (−10)(4) = 60
    Q(s, right) = w_p·F_p(s, right) + w_g·F_g(s, right) = 100(1) + (−10)(4) = 60
    Q(s, stay)  = w_p·F_p(s, stay)  + w_g·F_g(s, stay)  = 100(1) + (−10)(4) = 60

(c) We observe a transition that starts from the state above, s, takes action up, ends in state s′ (the state with the food pellet above), and receives a reward R(s, a, s′) = 250. The available actions from state s′ are down and stay. Assuming a discount of γ = 0.5, calculate the new estimate of the Q-value for s based on this episode.

    Q_new(s, a) = R(s, a, s′) + γ·max_a′ Q(s′, a′)
                = 250 + 0.5·max{Q(s′, down), Q(s′, stay)}
                = 250 + 0.5·0
                = 250

where

    Q(s′, down) = w_p·F_p(s′, down) + w_g·F_g(s′, down) = 100(0) + (−10)(2) = −20
    Q(s′, stay) = w_p·F_p(s′, stay) + w_g·F_g(s′, stay) = 100(0) + (−10)(0) = 0

(d) With this new estimate and a learning rate (α) of 0.5, update the weights for each feature.

    w_p = w_p + α·(Q_new(s, a) − Q(s, a))·F_p(s, a) = 100 + 0.5·(250 − 280)·3 = 55
    w_g = w_g + α·(Q_new(s, a) − Q(s, a))·F_g(s, a) = −10 + 0.5·(250 − 280)·2 = −40
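To make the arithmetic in parts (b)–(d) easy to check, here is a small Python sketch of the linear (approximate) Q-learning update, Q(s, a) = Σ_i w_i·f_i(s, a) with w_i ← w_i + α·(target − Q(s, a))·f_i(s, a), using the feature values and weights from this example. The structure and variable names are illustrative only.

```python
# Approximate Q-learning update for parts (b)-(d), using the example's numbers.

def q_value(weights, feats):
    """Linear Q-value: Q(s, a) = sum_i w_i * f_i(s, a)."""
    return sum(weights[k] * feats[k] for k in weights)

weights = {"p": 100.0, "g": -10.0}

# Features for (s, up) from part (a).
f_s_up = {"p": 3.0, "g": 2.0}
q_s_up = q_value(weights, f_s_up)                  # 100*3 + (-10)*2 = 280

# Part (c): observed transition (s, up, s', R = 250), gamma = 0.5.
# From s' the available actions are down and stay, with Q(s', down) = -20
# and Q(s', stay) = 0, so max_a' Q(s', a') = 0.
gamma = 0.5
target = 250 + gamma * 0.0                         # Q_new(s, up) = 250

# Part (d): weight update with learning rate alpha = 0.5.
alpha = 0.5
difference = target - q_s_up                       # 250 - 280 = -30
weights = {k: w + alpha * difference * f_s_up[k] for k, w in weights.items()}
print(weights)                                     # {'p': 55.0, 'g': -40.0}
```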
2 Q-learning

Consider the following gridworld (rewards shown on left, state names shown on right).

[Figure: gridworld drawn as two grids, with the rewards on the left and the state names on the right.]

From state A, the possible actions are right (→) and down (↓). From state B, the possible actions are left (←) and down (↓). For a numbered state (G1, G2), the only action is to exit. Upon exiting from a numbered square we collect the reward specified by the number on the square and enter the end-of-game absorbing state X. We also know that the discount factor is γ = 1, and in this MDP all actions are deterministic and always succeed.

Consider the following episodes:

Episode 1 (E1)
    s     a      s'    r
    A     down   G1    0
    G1    exit   X     10

Episode 2 (E2)
    s     a      s'    r
    B     down   G2    0
    G2    exit   X     1

Episode 3 (E3)
    s     a      s'    r
    A     right  B     0
    B     down   G2    0
    G2    exit   X     1

Episode 4 (E4)
    s     a      s'    r
    B     left   A     0
    A     down   G1    0
    G1    exit   X     10

(a) Consider using temporal-difference learning to learn V(s). When running TD-learning, all values are initialized to zero. For which sequences of episodes, if repeated infinitely often, does V(s) converge to V*(s) for all states s? (Assume appropriate learning rates such that all values converge.) Write the correct sequence under "Other" if no correct sequences of episodes are listed.

    E1, E2, E3, E4
    E1, E2, E1, E2
    E1, E2, E3, E1
    E4, E4, E4, E4
    E4, E3, E2, E1
    E3, E4, E3, E4
    E1, E2, E4, E1
    Other: see explanation below.

TD learning learns the value of the executed policy, which is V^π(s). Therefore, for V^π(s) to converge to V*(s), it is necessary that the executed policy satisfies π(s) = π*(s). Because there is no discounting (γ = 1), the optimal deterministic policy is π*(A) = down and π*(B) = left (π*(G1) and π*(G2) are trivially exit because that is the only available action). Therefore episodes E1 and E4 act according to π*(s), while episodes E2 and E3 are sampled from a suboptimal policy.

From the above, TD learning using episode E4 (and optionally E1) will converge to V^π(s) = V*(s) for states A, B, and G1. However, we then never visit G2, so V(G2) will never converge. If we add either episode E2 or E3 to ensure that V(G2) converges, then we are executing a suboptimal policy, which in turn causes V(B) not to converge. Therefore none of the listed sequences will learn a value function V^π(s) that converges to V*(s) for all states s. An example of a correct sequence would be E2, E4, E4, E4, ...: sampling
E2 first with the learning rate α = 1 ensures V^π(G2) = V*(G2), and then executing E4 infinitely often afterwards ensures that the values for states A, B, and G1 converge to the optimal values.

We also accepted answers under the interpretation that the value function V(s) only needs to converge to V*(s) for states A and B (ignoring G1 and G2). TD learning using only episode E4 (and optionally E1) will converge to V^π(s) = V*(s) for states A and B, so under that interpretation the only correct listed option is E4, E4, E4, E4.
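As a concrete check of this argument, here is a rough TD(0) sketch run over the episodes above. The (s, a, s′, r) tuples use the action names reconstructed from the grid description, and the default learning rate follows the α = 1 choice mentioned above; the function and variable names are illustrative.

```python
# TD(0) value learning over the fixed episodes E1-E4 of this problem.

E1 = [("A", "down", "G1", 0), ("G1", "exit", "X", 10)]
E2 = [("B", "down", "G2", 0), ("G2", "exit", "X", 1)]
E3 = [("A", "right", "B", 0), ("B", "down", "G2", 0), ("G2", "exit", "X", 1)]
E4 = [("B", "left", "A", 0), ("A", "down", "G1", 0), ("G1", "exit", "X", 10)]

def td0(episodes, alpha=1.0, gamma=1.0):
    """Run TD(0) updates over a fixed sequence of episodes."""
    V = {s: 0.0 for s in ["A", "B", "G1", "G2", "X"]}
    for episode in episodes:
        for s, a, s2, r in episode:
            V[s] += alpha * (r + gamma * V[s2] - V[s])   # TD(0) update
    return V

# E2 once (pinning V(G2) to 1), then E4 repeatedly: A, B, G1 all reach 10.
print(td0([E2] + [E4] * 10))
# -> {'A': 10.0, 'B': 10.0, 'G1': 10.0, 'G2': 1.0, 'X': 0.0}

# E4 alone converges for A, B, G1 but never visits G2, so V(G2) stays 0.
print(td0([E4] * 10))
```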
(b) Consider using Q-learning to learn Q(s, a). When running Q-learning, all values are initialized to zero. For which sequences of episodes, if repeated infinitely often, does Q(s, a) converge to Q*(s, a) for all state-action pairs (s, a)? (Assume appropriate learning rates such that all Q-values converge.) Write the correct sequence under "Other" if no correct sequences of episodes are listed.

    E1, E2, E3, E4
    E1, E2, E1, E2
    E1, E2, E3, E1
    E4, E4, E4, E4
    E4, E3, E2, E1
    E3, E4, E3, E4
    E1, E2, E4, E1
    Other

For Q(s, a) to converge, we must visit all state-action pairs with non-zero Q*(s, a) infinitely often. Therefore we must take the exit action in states G1 and G2, must take both the down and right actions in state A, and must take both the left and down actions in state B. Therefore the answers must include both E3 and E4 (i.e., E1, E2, E3, E4; E4, E3, E2, E1; and E3, E4, E3, E4).
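For comparison, a minimal tabular Q-learning sketch over the same transitions shows the Q-values converging once both E3 and E4 appear in the data. The episode encoding mirrors the TD sketch above; the names and the α = 0.5 choice are illustrative, not prescribed by the worksheet.

```python
from collections import defaultdict

# Episodes as (s, a, s', r), with actions reconstructed as in the TD sketch.
E3 = [("A", "right", "B", 0), ("B", "down", "G2", 0), ("G2", "exit", "X", 1)]
E4 = [("B", "left", "A", 0), ("A", "down", "G1", 0), ("G1", "exit", "X", 10)]

ACTIONS = {"A": ["right", "down"], "B": ["left", "down"],
           "G1": ["exit"], "G2": ["exit"], "X": []}

def q_learning(episodes, alpha=0.5, gamma=1.0):
    """Off-policy Q-learning updates over a fixed sequence of transitions."""
    Q = defaultdict(float)                                    # Q[(s, a)] starts at 0
    for episode in episodes:
        for s, a, s2, r in episode:
            best_next = max((Q[(s2, a2)] for a2 in ACTIONS[s2]), default=0.0)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    return dict(Q)

# Cycling E3 and E4 visits every state-action pair with non-zero Q*, so the
# Q-values converge: Q*(A, down) = Q*(A, right) = Q*(B, left) = Q*(G1, exit) = 10,
# and Q*(B, down) = Q*(G2, exit) = 1.
print(q_learning([E3, E4] * 500))
```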