HW 9 MDPs
University of Virginia
Q1 Notes in the Assignment and How to Answer
0 Points
Many of the questions in this homework have answers that are decimal numbers. To ensure an exact match, please follow the formatting rules below for your numerical answers.
Do not round decimals. None of the answers are infinite decimals, so include full precision (all answers have fewer than 5 places after the decimal).
Do not include any leading or trailing 0s unless they are necessary to show the location of the decimal.
If the number is an integer, do not include a decimal.
Examples:
.1234
-.001
10.4
-10
0
Q2 Value Iteration
6 Points
Consider the following one-dimensional gridworld with states a through e, where actions (LEFT or RIGHT) are deterministic (they always work as expected).
For state a, there is an exit action available that results in going to a terminal state and collecting a reward of 10. For state e, the reward for the exit action is 1. Exit actions are successful 100% of the time.
The discount factor for this problem is γ = 1.
Q2.1
1 Point
V_0(d) = 0
Q2.2
1 Point
V_1(d) = 0
Q2.3
1 Point
V_2(d) = 1
Q2.4
1 Point
V_3(d) = 1
Q2.5
1 Point
V_4(d) = 10
Q2.6
1 Point
V_5(d) = 10
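Not part of the original assignment: a minimal value-iteration sketch in Python that reproduces the V_k(d) sequence above, assuming the gridworld is states a through e in a row with exits at a (reward 10) and e (reward 1), γ = 1, and no living reward.

```python
# Value iteration on the (assumed) five-state gridworld from Q2.
states = ["a", "b", "c", "d", "e"]
gamma = 1.0
exit_reward = {"a": 10.0, "e": 1.0}

V = {s: 0.0 for s in states}               # V_0 is all zeros
for k in range(1, 6):
    V_new = {}
    for i, s in enumerate(states):
        candidates = []
        if s in exit_reward:               # exit: collect reward, then terminal
            candidates.append(exit_reward[s])
        if i > 0:                          # deterministic LEFT, no living reward
            candidates.append(gamma * V[states[i - 1]])
        if i < len(states) - 1:            # deterministic RIGHT
            candidates.append(gamma * V[states[i + 1]])
        V_new[s] = max(candidates)
    V = V_new
    print(f"V_{k}(d) = {V['d']}")          # 0, 1, 1, 10, 10 for k = 1..5
```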
Q3 Value Iteration
5 Points
Consider the following one-dimensional gridworld with states a through e, where actions (LEFT or RIGHT) are deterministic (they always work as expected). For state a, there is an exit action available that results in going to a terminal state and collecting a reward of 10. For state e, the reward for the exit action is 1. Exit actions are successful 100% of the time. All other rewards (the living penalty) are 0.
The discount factor for this problem is γ = 0.2.
Q3.1
1 Point
V*(a) = 10
Q3.2
1 Point
V*(b) = 2
Q3.3
1 Point
V*(c) = .4
Q3.4
1 Point
V*(d) = .2
Q3.5
1 Point
V*(e) = 1
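The one nontrivial case is state d, where the heavily discounted far exit loses to the near exit; worked out from the values above:

V*(d) = max( γ · V*(c), γ · V*(e) ) = max( 0.2 · 0.4, 0.2 · 1 ) = max( .08, .2 ) = .2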
Q4 Value Iteration with Nondeterministic Actions
5 Points
This question is involved, so I advise working it out on paper.
Consider the following transition diagram, transition function, and reward function for an MDP. CW means clockwise and CCW means counterclockwise.
s    action    s'    T(s, a, s')    R(s, a, s')
A    CW        B     1.0            0
A    CCW       C     1.0            -2.0
B    CW        C     0.4            -1.0
B    CW        A     0.6            2.0
B    CCW       A     0.6            2.0
B    CCW       C     0.4            -1.0
C    CW        A     0.6            2.0
C    CW        B     0.4            2.0
C    CCW       A     0.4            2.0
C    CCW       B     0.5            0
The discount factor is γ = 0.5.
Q4.1
1 Point
After iteration k of value iteration, the following values have been computed:
V_k(A) = 0.4
V_k(B) = 1.4
V_k(C) = 2.16
What is V_{k+1}(A)?
.7
Q4.2
1 Point
Assume that we have run value iteration to completion (convergence) and found the following value function:
V*(A) = 0.881
V*(B) = 1.761
V*(C) = 2.616
Compute Q*(A, CW).
.8805
Q4.3
1 Point
Compute Q*(A, CCW).
-.692
Q4.4
1 Point
Compute Q*(B, CW).
1.5875
Q4.5
1 Point
What is the optimal action for state A?
clockwise (CW)
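Not part of the original assignment: a short Python sketch that runs the Bellman backups for Q4.1 through Q4.4 directly from the transition table above, using the converged values given in Q4.2.

```python
# Q-value backup for the Q4 MDP:
# Q(s, a) = sum over s' of T(s,a,s') * (R(s,a,s') + gamma * V(s')).
gamma = 0.5

# (s, action) -> [(s', T(s, a, s'), R(s, a, s')), ...], copied from the table above
T = {
    ("A", "CW"):  [("B", 1.0, 0.0)],
    ("A", "CCW"): [("C", 1.0, -2.0)],
    ("B", "CW"):  [("C", 0.4, -1.0), ("A", 0.6, 2.0)],
    ("B", "CCW"): [("A", 0.6, 2.0), ("C", 0.4, -1.0)],
    ("C", "CW"):  [("A", 0.6, 2.0), ("B", 0.4, 2.0)],
    ("C", "CCW"): [("A", 0.4, 2.0), ("B", 0.5, 0.0)],
}

def q_value(s, a, V):
    return sum(p * (r + gamma * V[s2]) for s2, p, r in T[(s, a)])

# Q4.1: one backup from V_k gives V_{k+1}(A) = max over a of Q(A, a)
V_k = {"A": 0.4, "B": 1.4, "C": 2.16}
print(max(q_value("A", a, V_k) for a in ("CW", "CCW")))   # 0.7

# Q4.2-Q4.4: Q* values computed from the converged V*
V_star = {"A": 0.881, "B": 1.761, "C": 2.616}
print(q_value("A", "CW", V_star))    # 0.8805
print(q_value("A", "CCW", V_star))   # -0.692
print(q_value("B", "CW", V_star))    # 1.5875
```

Since Q*(A, CW) = .8805 exceeds Q*(A, CCW) = -.692, the optimal action at A is clockwise, matching Q4.5.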
Q5 Policy Iteration
2 Points
Consider the gridworld where left and right actions are successful 100% of the time.
Specifically, the available actions in each state are to move to the neighboring grid squares. From state a, there is also an exit action available, which results in going to a terminal state and collecting a reward of 10. Similarly, in state e, the reward for the exit action is 1. Exit actions are successful 100% of the time. The discount factor (γ) is 1.
Consider the policy illustrated below:
Hint: The 1 in V^{π_1} identifies the policy (do not think about timesteps for this problem). There is no living reward, so convergence is easily achieved.
Q5.1
1 Point
Compute V^{π_1}(c).
1
Q5.2
1 Point
Compute V^{π_1}(a).
1
Q6 Policy Iteration
5 Points
Consider the gridworld where left and right actions are successful 100% of the time.
Specifically, the available actions in each state are to move to the neighboring grid squares. From state a, there is also an exit action available, which results in going to a terminal state and collecting a reward of 10. Similarly, in state e, the reward for the exit action is 1. Exit actions are successful 100% of the time. The discount factor (γ) is 1.
Consider the policy illustrated below:
Hint: The 2 in V^{π_2} identifies the policy, so for this problem do not think about the timestep. There is no living reward, so convergence is easily achieved.
Q6.1
1 Point
Compute V^{π_2}(a).
10
Q6.2
1 Point
Compute V^{π_2}(b).
10
Q6.3
1 Point
Compute V^{π_2}(c).
10
Q6.4
1 Point
Compute V^{π_2}(d).
1
Q6.5
1 Point
Compute V^{π_2}(e).
1
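Not part of the original assignment: a sketch of iterative policy evaluation for Q6. The policy figure is missing from this extract, so the policy used below (b and c move LEFT toward a's exit, d moves RIGHT toward e's exit) is an assumption inferred from the answers.

```python
# Iterative policy evaluation for the (assumed) policy pi_2 from Q6.
states = ["a", "b", "c", "d", "e"]
gamma = 1.0
exit_reward = {"a": 10.0, "e": 1.0}
# ASSUMED policy, reconstructed from the answers; the original figure is absent.
policy = {"a": "EXIT", "b": "LEFT", "c": "LEFT", "d": "RIGHT", "e": "EXIT"}

V = {s: 0.0 for s in states}
for _ in range(10):                      # a few in-place sweeps converge here
    for i, s in enumerate(states):
        a = policy[s]
        if a == "EXIT":
            V[s] = exit_reward[s]        # exit: collect reward, then terminal
        elif a == "LEFT":
            V[s] = gamma * V[states[i - 1]]
        else:                            # RIGHT
            V[s] = gamma * V[states[i + 1]]

print(V)  # {'a': 10.0, 'b': 10.0, 'c': 10.0, 'd': 1.0, 'e': 1.0}
```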
Q7 Policy Iteration
5 Points
Consider the gridworld where left and right actions are successful 100% of the time.
Specifically, the available actions in each state are to move to the neighboring grid squares. From state a, there is also an exit action available, which results in going to the terminal state and collecting a reward of 10. Similarly, in state e, the reward for the exit action is 1. Exit actions are successful 100% of the time.
The discount factor is 0.9.
Consider that one round of policy iteration has been completed, producing the policy (π_i) shown below.
Q7.1
1 Point
Compute V^{π_i}(a).
10
Q7.2
1 Point
Compute V^{π_i}(b).
9
Q7.3
1 Point
Compute V^{π_i}(c).
0
Q7.4
1 Point
Compute V^{π_i}(d).
0
Q7.5
1 Point
Compute V^{π_i}(e).
1
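With γ = 0.9, only b's value shows the discount at work: one step left to a (no living reward), then the exit.

V^{π_i}(b) = 0 + 0.9 · V^{π_i}(a) = 0.9 · 10 = 9

The zeros at c and d indicate that, under this π_i, those states never reach an exit (the policy figure is not included in this extract).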
Q8
1 Point
Consider the following transition diagram, transition function, and reward function for an MDP.
This model will use a discount factor (γ) of 0.5.
s    action    s'    T(s, a, s')    R(s, a, s')
A    CW        B     0.8            0
A    CW        C     0.2            2.0
A    CCW       B     0.4            1.0
A    CCW       C     0.6            0.0
B    CW        C     1.0            -1.0
B    CCW       A     0.6            -2.0
B    CCW       C     0.4            1.0
C    CW        A     1.0            -2.0
C    CCW       A     0.2            0.0
C    CCW       B     0.8            -1.0
The table below illustrates the current policy as part of a policy evaluation. The second table shows the current estimates (V_k^π) of the states when following the current policy.

state    action
A        CCW
B        CCW
C        CCW

state    value
A        0.0
B        -.84
C        -1.08
Q8.1 Policy Iteration with Nondeterministic Actions
1 Point
Compute V_{k+1}^π(A).
-.092
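Worked from the tables above, the policy-evaluation backup for A under π(A) = CCW:

V_{k+1}^π(A) = Σ_{s'} T(A, CCW, s') · [ R(A, CCW, s') + γ · V_k^π(s') ]
             = 0.4 · (1.0 + 0.5 · (-.84)) + 0.6 · (0.0 + 0.5 · (-1.08))
             = 0.4 · 0.58 + 0.6 · (-0.54)
             = .232 - .324 = -.092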
Graded
HW 9 MDPs
Student: Matt Wong
Total Points: 29 / 29 pts
Question 1
Notes in the Assignment and How to Answer
0 / 0 pts
Question 2
Value Iteration
6 / 6 pts
2.1
(no title)
1 / 1 pt
2.2
(no title)
1 / 1 pt
2.3
(no title)
1 / 1 pt
2.4
(no title)
1 / 1 pt
2.5
(no title)
1 / 1 pt
2.6
(no title)
1 / 1 pt
Question 3
Value Iteration
5 / 5 pts
3.1
(no title)
1 / 1 pt
3.2
(no title)
1 / 1 pt
3.3
(no title)
1 / 1 pt
3.4
(no title)
1 / 1 pt
3.5
(no title)
1 / 1 pt
Question 4
Value Iteration with Nondeterministic Actions
5 / 5 pts
4.1
(no title)
1 / 1 pt
4.2
(no title)
1 / 1 pt
4.3
(no title)
1 / 1 pt
4.4
(no title)
1 / 1 pt
4.5
(no title)
1 / 1 pt
Question 5
Policy Iteration
2 / 2 pts
5.1
(no title)
1 / 1 pt
5.2
(no title)
1 / 1 pt
Question 6
Policy Iteration
5 / 5 pts
6.1
(no title)
1 / 1 pt
6.2
(no title)
1 / 1 pt
6.3
(no title)
1 / 1 pt
6.4
(no title)
1 / 1 pt
6.5
(no title)
1 / 1 pt
Question 7
Policy Iteration
5 / 5 pts
7.1
(no title)
1 / 1 pt
7.2
(no title)
1 / 1 pt
7.3
(no title)
1 / 1 pt
7.4
(no title)
1 / 1 pt
7.5
(no title)
1 / 1 pt
Question 8
(no title)
1 / 1 pt
8.1
Policy Iteration with Nondeterministic Actions
1 / 1 pt