Gradescope submission (PDF)
School: University of Virginia
Course: MISC
Subject: Mathematics
Date: Jan 9, 2024
Pages: 15
Uploaded by: ethankwok7
3/23/23, 10:12 PM — View Submission | Gradescope — https://www.gradescope.com/courses/477067/assignments/2755499/submissions/169766737

Q1 Notes in the Assignment and How to Answer (0 Points)

Many of the questions in this homework have answers that are decimal numbers. To ensure an exact match, please carefully follow this formatting for your numerical answers:
- Do not round decimals. None of the answers are infinite decimals, so include full precision (all answers have fewer than 5 places after the decimal).
- Do not include any leading or trailing 0s unless they are necessary to show the location of the decimal.
- If the number is an integer, do not include a decimal.

Examples: .1234  -.001  10.4  -10  0

Q2 Value Iteration (6 Points)

Consider the following 1-dimensional gridworld where actions (LEFT or RIGHT) are deterministic (they always work as expected). [Figure: 1-D gridworld with states a through e.]

For state a, there is an exit action available that results in going to a terminal state and collecting a reward of 10. For state e, the reward for the exit action is 1. Exit actions are successful 100% of the time. The discount factor for this problem is γ = 1.

Q2.1 (1 Point): V_0(d) = ? Answer: 0
Q2.2 (1 Point): V_1(d) = ? Answer: 0
Q2.3 (1 Point): V_2(d) = ? Answer: 1
Q2.4 (1 Point): V_3(d) = ? Answer: 1
Q2.5 (1 Point): V_4(d) = ? Answer: 10
Q2.6 (1 Point): V_5(d) = ? Answer: 10

Q3 Value Iteration (5 Points)

Consider the following 1-dimensional gridworld where actions (LEFT or RIGHT) are deterministic (they always work as expected). [Figure: 1-D gridworld with states a through e.] For state a, there is an exit action available that results in going to a terminal state and collecting a reward of 10. For state e, the reward for the exit action is 1. Exit actions are successful 100% of the time. All other rewards (the living penalty) are 0. The discount factor for this problem is γ = 0.2.

Q3.1 (1 Point): V(a) = ? Answer: 10
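The Q2 sequence V_0(d) … V_5(d) and the Q3 converged values can both be reproduced with a minimal value-iteration sketch. The 5-state layout a–e, the zero initialization, and the function name are my assumptions, not part of the assignment:

```python
# Value iteration on the 1-D gridworld with states a..e.
# Exits pay 10 at 'a' and 1 at 'e'; all moves pay 0 (per the problem text).
def value_iteration(gamma, iters):
    states = ['a', 'b', 'c', 'd', 'e']
    V = {s: 0.0 for s in states}          # V_0 = 0 everywhere (assumed)
    for _ in range(iters):
        newV = {}
        for i, s in enumerate(states):
            candidates = []
            if s == 'a':
                candidates.append(10.0)   # exit action at a
            if s == 'e':
                candidates.append(1.0)    # exit action at e
            if i > 0:
                candidates.append(gamma * V[states[i - 1]])  # move LEFT
            if i < len(states) - 1:
                candidates.append(gamma * V[states[i + 1]])  # move RIGHT
            newV[s] = max(candidates)
        V = newV
    return V
```

With γ = 1, V(d) sits at 1 (one step from e's exit) until the reward of 10 at a propagates the three moves down to d at iteration 4; with γ = 0.2, iterating to convergence reproduces the Q3 answers.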
Q3.2 (1 Point): V(b) = ? Answer: 2
Q3.3 (1 Point): V(c) = ? Answer: .4
Q3.4 (1 Point): V(d) = ? Answer: .2
Q3.5 (1 Point): V(e) = ? Answer: 1

Q4 Value Iteration with Nondeterministic Actions (5 Points)

This question is involved, so I advise working it out on paper. Consider the following transition diagram, transition function, and reward function for an MDP. CW means clockwise and CCW means counterclockwise. [Transition diagram omitted.]
s | action | s' | T(s, a, s') | R(s, a, s')
A | CW  | B | 1.0 | 0
A | CCW | C | 1.0 | -2.0
B | CW  | C | 0.4 | -1.0
B | CW  | A | 0.6 | 2.0
B | CCW | A | 0.6 | 2.0
B | CCW | C | 0.4 | -1.0
C | CW  | A | 0.6 | 2.0
C | CW  | B | 0.4 | 2.0
C | CCW | A | 0.4 | 2.0
C | CCW | B | 0.5 | 0

The discount factor is γ = 0.5.

Q4.1 (1 Point)
After k iterations of value iteration, the following values have been computed: V_k(A) = 0.4, V_k(B) = 1.4, V_k(C) = 2.16. What is V_{k+1}(A)? Answer: .7

Q4.2 (1 Point)
Assume that we have run value iteration to completion (convergence) and found the following value function: V(A) = 0.881, V(B) = 1.761, V(C) = 2.616. Compute Q(A, CW). Answer: .8805

Q4.3 (1 Point)
Compute Q(A, CCW). Answer: -.692
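Q4.2–Q4.4 are single applications of the Q-value backup Q(s, a) = Σ_{s'} T(s, a, s') · (R(s, a, s') + γ · V(s')). A sketch against the converged values; the dictionary encoding of the table and all names are mine, not the assignment's:

```python
# Q(s, a) = sum over s' of T(s, a, s') * (R(s, a, s') + gamma * V(s'))
GAMMA = 0.5

# (s, action) -> list of (s', T(s, a, s'), R(s, a, s')), transcribed from the table
TRANSITIONS = {
    ('A', 'CW'):  [('B', 1.0,  0.0)],
    ('A', 'CCW'): [('C', 1.0, -2.0)],
    ('B', 'CW'):  [('C', 0.4, -1.0), ('A', 0.6,  2.0)],
    ('B', 'CCW'): [('A', 0.6,  2.0), ('C', 0.4, -1.0)],
    ('C', 'CW'):  [('A', 0.6,  2.0), ('B', 0.4,  2.0)],
    ('C', 'CCW'): [('A', 0.4,  2.0), ('B', 0.5,  0.0)],
}

def q_value(s, a, V):
    return sum(t * (r + GAMMA * V[s2]) for s2, t, r in TRANSITIONS[(s, a)])

# Converged value function from Q4.2
V_star = {'A': 0.881, 'B': 1.761, 'C': 2.616}
```

Here q_value('A', 'CW', V_star) gives .8805 (Q4.2), q_value('A', 'CCW', V_star) gives -.692 (Q4.3), and q_value('B', 'CW', V_star) gives 1.5875 (Q4.4); since CW beats CCW at A, the Q4.5 answer follows.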
Q4.4 (1 Point)
Compute Q(B, CW). Answer: 1.5875

Q4.5 (1 Point)
What is the optimal action for state A? Choices: clockwise (CW) / counterclockwise (CCW) / cannot determine. Answer: clockwise (CW)

Q5 Policy Iteration (2 Points)

Consider the gridworld where left and right actions are successful 100% of the time. Specifically, the available actions in each state are to move to the neighboring grid squares. From state a, there is also an exit action available, which results in going to a terminal state and collecting a reward of 10. Similarly, in state e, the reward for the exit action is 1. Exit actions are successful 100% of the time. The discount factor (γ) is 1. Consider the policy π1 illustrated below. [Policy figure omitted.]

Hint: The 1 in V^π1 identifies the policy (do not think about timesteps for this problem). There is no living reward, so convergence is easily achieved.
Q5.1 (1 Point): Compute V^π1(c). Answer: 1

Q5.2 (1 Point): Compute V^π1(a). Answer: 1

Q6 Policy Iteration (5 Points)

Consider the gridworld where left and right actions are successful 100% of the time. Specifically, the available actions in each state are to move to the neighboring grid squares. From state a, there is also an exit action available, which results in going to a terminal state and collecting a reward of 10. Similarly, in state e, the reward for the exit action is 1. Exit actions are successful 100% of the time. The discount factor (γ) is 1. Consider the policy π2 illustrated below. [Policy figure omitted.]

Hint: The 2 in V^π2 identifies the policy, so for this problem do not think about the timestep. There is no living reward, so convergence is easily achieved.
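Q5 and Q6 both evaluate a fixed policy on the same gridworld. A minimal iterative policy-evaluation sketch; since the policy figures did not survive extraction, the arrow dictionaries below are my reading inferred from the graded answers (π1 sends every state RIGHT toward e's exit; π2 sends b and c LEFT toward a's exit and d RIGHT), and all names are mine:

```python
# Iterative evaluation of a fixed policy on the a..e gridworld.
# policy maps state -> 'LEFT', 'RIGHT', or 'EXIT'; exits pay 10 at a and 1 at e.
def evaluate_policy(policy, gamma=1.0, iters=20):
    states = ['a', 'b', 'c', 'd', 'e']
    exit_reward = {'a': 10.0, 'e': 1.0}
    V = {s: 0.0 for s in states}
    for _ in range(iters):
        newV = {}
        for i, s in enumerate(states):
            action = policy[s]
            if action == 'EXIT':
                newV[s] = exit_reward[s]
            elif action == 'LEFT':
                newV[s] = gamma * V[states[i - 1]]
            else:  # 'RIGHT'
                newV[s] = gamma * V[states[i + 1]]
        V = newV
    return V

# Assumed policies, inferred from the graded answers below
pi1 = {'a': 'RIGHT', 'b': 'RIGHT', 'c': 'RIGHT', 'd': 'RIGHT', 'e': 'EXIT'}
pi2 = {'a': 'EXIT', 'b': 'LEFT', 'c': 'LEFT', 'd': 'RIGHT', 'e': 'EXIT'}
```

Under these assumed arrows, evaluate_policy(pi1) reproduces the Q5 answers (everything drains into e's exit of 1) and evaluate_policy(pi2) reproduces the Q6 answers.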
Q6.1 (1 Point): Compute V^π2(a). Answer: 10

Q6.2 (1 Point): Compute V^π2(b). Answer: 10

Q6.3 (1 Point): Compute V^π2(c). Answer: 10

Q6.4 (1 Point): Compute V^π2(d). Answer: 1

Q6.5 (1 Point): Compute V^π2(e). Answer: 1

Q7 Policy Iteration (5 Points)
Consider the gridworld where left and right actions are successful 100% of the time. Specifically, the available actions in each state are to move to the neighboring grid squares. From state a, there is also an exit action available, which results in going to the terminal state and collecting a reward of 10. Similarly, in state e, the reward for the exit action is 1. Exit actions are successful 100% of the time. The discount factor (γ) is 0.9. Consider one round of policy iteration and the policy π_i shown below. [Policy figure omitted.]

Q7.1 (1 Point): Compute V^πi(a). Answer: 10

Q7.2 (1 Point): Compute V^πi(b). Answer: 9

Q7.3 (1 Point): Compute V^πi(c). Answer: 0

Q7.4 (1 Point): Compute V^πi(d). Answer: 0

Q7.5 (1 Point): Compute V^πi(e). Answer: 1

Q8 (1 Point)

Consider the following transition diagram, transition function, and reward function for an MDP. This model uses a discount factor (γ) = 0.5. [Transition diagram omitted.]
s | action | s' | T(s, a, s') | R(s, a, s')
A | CW  | B | 0.8 | 0
A | CW  | C | 0.2 | 2.0
A | CCW | B | 0.4 | 1.0
A | CCW | C | 0.6 | 0.0
B | CW  | C | 1.0 | -1.0
B | CCW | A | 0.6 | -2.0
B | CCW | C | 0.4 | 1.0
C | CW  | A | 1.0 | -2.0
C | CCW | A | 0.2 | 0.0
C | CCW | B | 0.8 | -1.0
The table below illustrates the current policy as part of a policy evaluation. The second table shows the current estimates (V^π_k) of the states when following the current policy.

state | action
A | CCW
B | CCW
C | CCW

        | V^π_k(A) | V^π_k(B) | V^π_k(C)
V value | 0.0      | -.84     | -1.08

Q8.1 Policy Iteration with Nondeterministic Actions (1 Point)

Compute V^π_{k+1}(A). Answer: -.092

[more to follow]
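Q8.1 is one synchronous backup of policy evaluation: V^π_{k+1}(s) = Σ_{s'} T(s, π(s), s') · (R(s, π(s), s') + γ · V^π_k(s')). A sketch with the Q8 tables transcribed into dictionaries (the names are mine, not the assignment's):

```python
# One policy-evaluation backup for the Q8 MDP (gamma = 0.5, all states follow CCW).
GAMMA_Q8 = 0.5

# (s, action) -> list of (s', T, R), transcribed from the Q8 table (CCW rows only)
TRANS_Q8 = {
    ('A', 'CCW'): [('B', 0.4,  1.0), ('C', 0.6,  0.0)],
    ('B', 'CCW'): [('A', 0.6, -2.0), ('C', 0.4,  1.0)],
    ('C', 'CCW'): [('A', 0.2,  0.0), ('B', 0.8, -1.0)],
}
POLICY_Q8 = {'A': 'CCW', 'B': 'CCW', 'C': 'CCW'}
V_k = {'A': 0.0, 'B': -0.84, 'C': -1.08}   # current estimates from the table

def backup(s):
    """V^pi_{k+1}(s): expected reward plus discounted V_k under the fixed policy."""
    return sum(t * (r + GAMMA_Q8 * V_k[s2])
               for s2, t, r in TRANS_Q8[(s, POLICY_Q8[s])])
```

backup('A') works out to 0.4·(1.0 + 0.5·(−0.84)) + 0.6·(0.0 + 0.5·(−1.08)) = 0.232 − 0.324 = −0.092, matching the graded answer.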
Graded HW 9 MDPs
Student: Matt Wong
Total Points: 29 / 29 pts

Question 1, Notes in the Assignment and How to Answer: 0 / 0 pts
Question 2, Value Iteration: 6 / 6 pts (2.1–2.6: 1 / 1 pt each)
Question 3, Value Iteration: 5 / 5 pts (3.1–3.5: 1 / 1 pt each)
Question 4, Value Iteration with Nondeterministic Actions: 5 / 5 pts (4.1–4.5: 1 / 1 pt each)
Question 5, Policy Iteration: 2 / 2 pts (5.1–5.2: 1 / 1 pt each)
Question 6, Policy Iteration: 5 / 5 pts (6.1–6.5: 1 / 1 pt each)
Question 7, Policy Iteration: 5 / 5 pts (7.1–7.5: 1 / 1 pt each)
Question 8: 1 / 1 pt (8.1 Policy Iteration with Nondeterministic Actions: 1 / 1 pt)