CS 188 Spring 2023 Regular Discussion 8 Solutions

1 Pacman with Feature-Based Q-Learning

We would like to use a Q-learning agent for Pacman, but the size of the state space for a large grid is too massive to hold in memory. To solve this, we will switch to a feature-based representation of Pacman's state.

(a) We will have two features, F_g and F_p, defined as follows:

    F_g(s, a) = A(s) + B(s, a) + C(s, a)
    F_p(s, a) = D(s) + 2·E(s, a)

where

    A(s)    = number of ghosts within 1 step of state s
    B(s, a) = number of ghosts Pacman touches after taking action a from state s
    C(s, a) = number of ghosts within 1 step of the state Pacman ends up in after taking action a
    D(s)    = number of food pellets within 1 step of state s
    E(s, a) = number of food pellets eaten after taking action a from state s

For this Pacman board, the ghosts will always be stationary, and the action space is {left, right, up, down, stay}. Calculate the features for the actions in {left, right, up, stay}.

    F_p(s, up)    = 1 + 2(1) = 3
    F_p(s, left)  = 1 + 2(0) = 1
    F_p(s, right) = 1 + 2(0) = 1
    F_p(s, stay)  = 1 + 2(0) = 1

    F_g(s, up)    = 2 + 0 + 0 = 2
    F_g(s, left)  = 2 + 1 + 1 = 4
    F_g(s, right) = 2 + 1 + 1 = 4
    F_g(s, stay)  = 2 + 0 + 2 = 4
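Since the worksheet's board figure is not reproduced here, the following is only a rough Python sketch of how such count-based features could be computed from coordinate lists of ghosts and food. The board positions, helper names, and the reading of "within 1 step" as Manhattan distance ≤ 1 (ignoring walls) are all assumptions, not part of the original problem.

```python
# Hypothetical, simplified feature computation for a Pacman-style board.
# "Within 1 step" is read as Manhattan distance <= 1, and walls are ignored.

ACTIONS = {"left": (-1, 0), "right": (1, 0), "up": (0, 1),
           "down": (0, -1), "stay": (0, 0)}

def within_one_step(pos, targets):
    """Count targets at Manhattan distance <= 1 from pos."""
    return sum(1 for t in targets if abs(pos[0] - t[0]) + abs(pos[1] - t[1]) <= 1)

def features(pacman, ghosts, food, action):
    """Return F_g(s, a) and F_p(s, a) for Pacman at `pacman` taking `action`."""
    dx, dy = ACTIONS[action]
    nxt = (pacman[0] + dx, pacman[1] + dy)       # state Pacman ends up in
    A = within_one_step(pacman, ghosts)          # ghosts within 1 step of s
    B = sum(1 for g in ghosts if g == nxt)       # ghosts Pacman touches after taking a
    C = within_one_step(nxt, ghosts)             # ghosts within 1 step of the resulting state
    D = within_one_step(pacman, food)            # food pellets within 1 step of s
    E = sum(1 for f in food if f == nxt)         # food pellets eaten by taking a
    return {"F_g": A + B + C, "F_p": D + 2 * E}

# Example call with made-up positions (not the worksheet's board):
print(features(pacman=(1, 1), ghosts=[(0, 1), (2, 1)], food=[(1, 2)], action="up"))
# -> {'F_g': 2, 'F_p': 3}
```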
(b) After a few episodes of Q-learning, the weights are w_g = −10 and w_p = 100. Calculate the Q-value for each action in {left, right, up, stay} from the current state shown in the figure.

    Q(s, up)    = w_p·F_p(s, up)    + w_g·F_g(s, up)    = 100(3) + (−10)(2) = 280
    Q(s, left)  = w_p·F_p(s, left)  + w_g·F_g(s, left)  = 100(1) + (−10)(4) = 60
    Q(s, right) = w_p·F_p(s, right) + w_g·F_g(s, right) = 100(1) + (−10)(4) = 60
    Q(s, stay)  = w_p·F_p(s, stay)  + w_g·F_g(s, stay)  = 100(1) + (−10)(4) = 60

(c) We observe a transition that starts from the state above, s, takes action up, ends in state s′ (the state with the food pellet above), and receives a reward R(s, a, s′) = 250. The available actions from state s′ are down and stay. Assuming a discount of γ = 0.5, calculate the new estimate of the Q-value for s based on this episode.

    Q_new(s, a) = R(s, a, s′) + γ·max_a′ Q(s′, a′)
                = 250 + 0.5·max{Q(s′, down), Q(s′, stay)}
                = 250 + 0.5·0
                = 250

where

    Q(s′, down) = w_p·F_p(s′, down) + w_g·F_g(s′, down) = 100(0) + (−10)(2) = −20
    Q(s′, stay) = w_p·F_p(s′, stay) + w_g·F_g(s′, stay) = 100(0) + (−10)(0) = 0

(d) With this new estimate and a learning rate (α) of 0.5, update the weights for each feature.

    w_p = w_p + α·(Q_new(s, a) − Q(s, a))·F_p(s, a) = 100 + 0.5·(250 − 280)·3 = 55
    w_g = w_g + α·(Q_new(s, a) − Q(s, a))·F_g(s, a) = −10 + 0.5·(250 − 280)·2 = −40
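To make the arithmetic in parts (b)–(d) easy to check, here is a small Python sketch of the linear (approximate) Q-learning update, Q(s, a) = Σ_i w_i·f_i(s, a) with w_i ← w_i + α·(target − Q(s, a))·f_i(s, a), using the feature values and weights from this example. The structure and variable names are illustrative only.

```python
# Approximate Q-learning update for parts (b)-(d), using the example's numbers.

def q_value(weights, feats):
    """Linear Q-value: Q(s, a) = sum_i w_i * f_i(s, a)."""
    return sum(weights[k] * feats[k] for k in weights)

weights = {"p": 100.0, "g": -10.0}

# Features for (s, up) from part (a).
f_s_up = {"p": 3.0, "g": 2.0}
q_s_up = q_value(weights, f_s_up)                  # 100*3 + (-10)*2 = 280

# Part (c): observed transition (s, up, s', R = 250), gamma = 0.5.
# From s' the available actions are down and stay, with Q(s', down) = -20
# and Q(s', stay) = 0, so max_a' Q(s', a') = 0.
gamma = 0.5
target = 250 + gamma * 0.0                         # Q_new(s, up) = 250

# Part (d): weight update with learning rate alpha = 0.5.
alpha = 0.5
difference = target - q_s_up                       # 250 - 280 = -30
weights = {k: w + alpha * difference * f_s_up[k] for k, w in weights.items()}
print(weights)                                     # {'p': 55.0, 'g': -40.0}
```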
2 Q-learning

Consider the following gridworld (rewards shown on left, state names shown on right).

[Figure: gridworld drawn as two grids, with the rewards on the left and the state names on the right.]

From state A, the possible actions are right (→) and down (↓). From state B, the possible actions are left (←) and down (↓). For a numbered state (G1, G2), the only action is to exit. Upon exiting from a numbered square we collect the reward specified by the number on the square and enter the end-of-game absorbing state X. We also know that the discount factor is γ = 1, and in this MDP all actions are deterministic and always succeed.

Consider the following episodes:

Episode 1 (E1)
    s     a      s'    r
    A     down   G1    0
    G1    exit   X     10

Episode 2 (E2)
    s     a      s'    r
    B     down   G2    0
    G2    exit   X     1

Episode 3 (E3)
    s     a      s'    r
    A     right  B     0
    B     down   G2    0
    G2    exit   X     1

Episode 4 (E4)
    s     a      s'    r
    B     left   A     0
    A     down   G1    0
    G1    exit   X     10

(a) Consider using temporal-difference learning to learn V(s). When running TD-learning, all values are initialized to zero. For which sequences of episodes, if repeated infinitely often, does V(s) converge to V*(s) for all states s? (Assume appropriate learning rates such that all values converge.) Write the correct sequence under "Other" if no correct sequences of episodes are listed.

    E1, E2, E3, E4
    E1, E2, E1, E2
    E1, E2, E3, E1
    E4, E4, E4, E4
    E4, E3, E2, E1
    E3, E4, E3, E4
    E1, E2, E4, E1
    Other: see explanation below.

TD learning learns the value of the executed policy, which is V^π(s). Therefore, for V^π(s) to converge to V*(s), it is necessary that the executed policy satisfies π(s) = π*(s). Because there is no discounting (γ = 1), the optimal deterministic policy is π*(A) = down and π*(B) = left (π*(G1) and π*(G2) are trivially exit because that is the only available action). Therefore episodes E1 and E4 act according to π*(s), while episodes E2 and E3 are sampled from a suboptimal policy.

From the above, TD learning using episode E4 (and optionally E1) will converge to V^π(s) = V*(s) for states A, B, and G1. However, we then never visit G2, so V(G2) will never converge. If we add either episode E2 or E3 to ensure that V(G2) converges, then we are executing a suboptimal policy, which in turn causes V(B) not to converge. Therefore none of the listed sequences will learn a value function V^π(s) that converges to V*(s) for all states s. An example of a correct sequence would be E2, E4, E4, E4, ...: sampling
E2 first with the learning rate α = 1 ensures V^π(G2) = V*(G2), and then executing E4 infinitely often afterwards ensures that the values for states A, B, and G1 converge to the optimal values.

We also accepted answers under the interpretation that the value function V(s) only needs to converge to V*(s) for states A and B (ignoring G1 and G2). TD learning using only episode E4 (and optionally E1) will converge to V^π(s) = V*(s) for states A and B, so under that interpretation the only correct listed option is E4, E4, E4, E4.
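As a concrete check of this argument, here is a rough TD(0) sketch run over the episodes above. The (s, a, s′, r) tuples use the action names reconstructed from the grid description, and the default learning rate follows the α = 1 choice mentioned above; the function and variable names are illustrative.

```python
# TD(0) value learning over the fixed episodes E1-E4 of this problem.

E1 = [("A", "down", "G1", 0), ("G1", "exit", "X", 10)]
E2 = [("B", "down", "G2", 0), ("G2", "exit", "X", 1)]
E3 = [("A", "right", "B", 0), ("B", "down", "G2", 0), ("G2", "exit", "X", 1)]
E4 = [("B", "left", "A", 0), ("A", "down", "G1", 0), ("G1", "exit", "X", 10)]

def td0(episodes, alpha=1.0, gamma=1.0):
    """Run TD(0) updates over a fixed sequence of episodes."""
    V = {s: 0.0 for s in ["A", "B", "G1", "G2", "X"]}
    for episode in episodes:
        for s, a, s2, r in episode:
            V[s] += alpha * (r + gamma * V[s2] - V[s])   # TD(0) update
    return V

# E2 once (pinning V(G2) to 1), then E4 repeatedly: A, B, G1 all reach 10.
print(td0([E2] + [E4] * 10))
# -> {'A': 10.0, 'B': 10.0, 'G1': 10.0, 'G2': 1.0, 'X': 0.0}

# E4 alone converges for A, B, G1 but never visits G2, so V(G2) stays 0.
print(td0([E4] * 10))
```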
(b) Consider using Q-learning to learn Q(s, a). When running Q-learning, all values are initialized to zero. For which sequences of episodes, if repeated infinitely often, does Q(s, a) converge to Q*(s, a) for all state-action pairs (s, a)? (Assume appropriate learning rates such that all Q-values converge.) Write the correct sequence under "Other" if no correct sequences of episodes are listed.

    E1, E2, E3, E4
    E1, E2, E1, E2
    E1, E2, E3, E1
    E4, E4, E4, E4
    E4, E3, E2, E1
    E3, E4, E3, E4
    E1, E2, E4, E1
    Other

For Q(s, a) to converge, we must visit all state-action pairs with non-zero Q*(s, a) infinitely often. Therefore we must take the exit action in states G1 and G2, must take both the down and right actions in state A, and must take both the left and down actions in state B. Therefore the answers must include both E3 and E4 (i.e., E1, E2, E3, E4; E4, E3, E2, E1; and E3, E4, E3, E4).
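For comparison, a minimal tabular Q-learning sketch over the same transitions shows the Q-values converging once both E3 and E4 appear in the data. The episode encoding mirrors the TD sketch above; the names and the α = 0.5 choice are illustrative, not prescribed by the worksheet.

```python
from collections import defaultdict

# Episodes as (s, a, s', r), with actions reconstructed as in the TD sketch.
E3 = [("A", "right", "B", 0), ("B", "down", "G2", 0), ("G2", "exit", "X", 1)]
E4 = [("B", "left", "A", 0), ("A", "down", "G1", 0), ("G1", "exit", "X", 10)]

ACTIONS = {"A": ["right", "down"], "B": ["left", "down"],
           "G1": ["exit"], "G2": ["exit"], "X": []}

def q_learning(episodes, alpha=0.5, gamma=1.0):
    """Off-policy Q-learning updates over a fixed sequence of transitions."""
    Q = defaultdict(float)                                    # Q[(s, a)] starts at 0
    for episode in episodes:
        for s, a, s2, r in episode:
            best_next = max((Q[(s2, a2)] for a2 in ACTIONS[s2]), default=0.0)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    return dict(Q)

# Cycling E3 and E4 visits every state-action pair with non-zero Q*, so the
# Q-values converge: Q*(A, down) = Q*(A, right) = Q*(B, left) = Q*(G1, exit) = 10,
# and Q*(B, down) = Q*(G2, exit) = 1.
print(q_learning([E3, E4] * 500))
```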