Question
Don't use AI to answer; I will report your answer. Solve it ASAP, with explanation and calculation.
Backward View
Below we show a gridworld and a particular trajectory taken by the agent. The agent gets a reward of 0 at
every step, except for the last step when it reaches the goal, receiving a reward of 1. We label some of the
states and will refer to those states by the labeled number. We use a tabular representation, meaning that the
features are represented as a one-hot vector (i.e. the feature corresponding to each square is 1 if the agent is
in that square, and 0 otherwise, meaning that only one feature will be non-zero at any given time).
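[Figure: gridworld showing the agent's trajectory, with the goal state G and the labeled states 1-5; image not reproduced.]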
Let us look more closely at the goal state and states 1-5, and consider the one-hot features just for those 6
states. When the agent is at state 5, its features will be x(5) = (0, 0, 0, 0, 0, 1), and the features for state 1 will
be x(1) = (0, 1, 0, 0, 0, 0). This means that ∇V(5, w) = (0, 0, 0, 0, 0, 1).
For this exercise, discount γ = 0.9 and trace decay λ = 0.3.
In the trajectory given above, what will be the last component of z, the eligibility trace, after processing state
3? That is, what will be the value of the component of z associated with state 5 after the component
associated with state 3 has been added to it?
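One way to check the trace arithmetic is to simulate the accumulating-trace update z ← γλz + x(S) directly. The sketch below is only illustrative: the trajectory figure did not survive in the text, so the visiting order 5 → 4 → 3 is an assumption inferred from the wording of the question, and `one_hot` and `trace_after` are hypothetical helper names.

```python
GAMMA = 0.9     # discount, from the problem statement
LAMBDA = 0.3    # trace decay, from the problem statement
N_FEATURES = 6  # index 0 = goal, indices 1..5 = states 1..5


def one_hot(state):
    """Return the one-hot feature vector x(state) as a list."""
    return [1.0 if i == state else 0.0 for i in range(N_FEATURES)]


def trace_after(visited):
    """Accumulating eligibility trace after processing each state in `visited`."""
    z = [0.0] * N_FEATURES
    for s in visited:
        x = one_hot(s)
        # Decay every component by gamma * lambda, then add the current features.
        z = [GAMMA * LAMBDA * zi + xi for zi, xi in zip(z, x)]
    return z


# ASSUMPTION: the agent visits 5, then 4, then 3 (the figure is not available).
z = trace_after([5, 4, 3])
print(z[5])  # component of z associated with state 5
```

Under that assumed order, the state-5 component is set to 1 when state 5 is processed and is then decayed twice, giving (γλ)² = 0.27² = 0.0729 up to floating-point rounding.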
TD Error
Now consider that the states are initialized with the following values: V(1, w) = 10, V(2, w) = 15,
V(3, w) = 20, V(4, w) = 25, V(5, w) = 30, and the discount rate γ = 0.9.
What is the TD-error δ for the transition from state 5 to state 4?
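The one-step TD error is δ = R + γ·V(S', w) − V(S, w). A minimal sketch of that arithmetic for the transition from state 5 to state 4, using the values stated in the question and the reward of 0 that the problem assigns to every non-final step; `td_error` is a hypothetical helper name.

```python
GAMMA = 0.9  # discount rate, from the problem statement

# Initial state values V(s, w), from the problem statement.
V = {1: 10.0, 2: 15.0, 3: 20.0, 4: 25.0, 5: 30.0}


def td_error(reward, v_next, v_current, gamma=GAMMA):
    """One-step TD error: reward + gamma * V(next) - V(current)."""
    return reward + gamma * v_next - v_current


# Transition 5 -> 4 is not the final step, so the reward is 0.
delta = td_error(reward=0.0, v_next=V[4], v_current=V[5])
print(delta)
```

With these inputs the expression is 0 + 0.9 × 25 − 30 = −7.5.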
Update
Now consider that the states are initialized with the following values: V(1, w) = 10, V(2, w) = 15,
V(3, w) = 20, V(4, w) = 25, V(5, w) = 30, V(goal, w) = 0, the discount rate γ = 0.9, and the learning
rate α = 0.1.
What will be the new value of state 5 (i.e. V(5, w)) if we make updates according to TD(λ) at each step with
λ = 0.3, at the end of the episode? Please specify the value up to 4 decimal digits.
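The update question can be checked by running online TD(λ) with accumulating traces over the whole episode. This is a sketch under a stated assumption, not a definitive solution: the trajectory figure is missing, so the path 5 → 4 → 3 → 2 → 1 → goal (each state visited once) is assumed, and `td_lambda_episode` is a hypothetical helper name.

```python
GAMMA, LAMBDA, ALPHA = 0.9, 0.3, 0.1  # from the problem statement
GOAL = 0  # index 0 = goal, indices 1..5 = states 1..5


def td_lambda_episode(w, trajectory, rewards):
    """Run online TD(lambda) with accumulating traces over one episode.

    `trajectory` lists the visited states in order, ending at the goal;
    `rewards[t]` is the reward for trajectory[t] -> trajectory[t + 1].
    With one-hot features, V(s, w) is simply w[s].
    """
    w = list(w)
    z = [0.0] * len(w)
    for t in range(len(trajectory) - 1):
        s, s_next = trajectory[t], trajectory[t + 1]
        # Decay all traces, then add the one-hot gradient of the current state.
        z = [GAMMA * LAMBDA * zi for zi in z]
        z[s] += 1.0
        delta = rewards[t] + GAMMA * w[s_next] - w[s]
        # Every weight moves in proportion to its eligibility.
        w = [wi + ALPHA * delta * zi for wi, zi in zip(w, z)]
    return w


w0 = [0.0, 10.0, 15.0, 20.0, 25.0, 30.0]  # V(goal), V(1), ..., V(5)
# ASSUMPTION: the trajectory is 5 -> 4 -> 3 -> 2 -> 1 -> goal.
trajectory = [5, 4, 3, 2, 1, GOAL]
rewards = [0.0, 0.0, 0.0, 0.0, 1.0]  # reward 1 only on reaching the goal

w_final = td_lambda_episode(w0, trajectory, rewards)
print(round(w_final[5], 4))  # new V(5, w) at the end of the episode
```

Under the assumed path, the TD errors are −7.5, −7, −6.5, −6 and −9, weighted by state 5's trace values 1, 0.27, 0.27², 0.27³ and 0.27⁴, which gives V(5, w) ≈ 28.9970. A different trajectory in the original figure would change this result.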