Question
Consider a simple Markov Decision Process (MDP) with discount factor gamma = 1. The MDP has three states (x, y, and z), with rewards -1, -2, and 0, respectively. State z is a terminal state. In states x and y there are two possible actions: a1 and a2.
The transition model is as follows:
- In state x, action a1 moves the agent to state y with probability 0.9 and makes the agent stay put with probability 0.1.
- In state y, action a1 moves the agent to state x with probability 0.9 and makes the agent stay put with probability 0.1.
- In either state x or state y, action a2 moves the agent to state z with probability 0.1 and makes the agent stay put with probability 0.9.
Please answer the following questions:
- Draw a picture of the MDP.
- What can be determined qualitatively about the optimal policy in states x and y?
- Apply the policy iteration algorithm discussed in class, showing each step in full, to determine the optimal policy and the values of states x and y (a code sketch follows the questions). Assume that the initial policy has action a2 in both states.
- What happens to policy iteration if instead, the initial policy has action a1 in both states? Does discounting help? Does the optimal policy depend on the discount factor?
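
Since the exercise asks for policy iteration worked step by step, here is a minimal Python sketch that mirrors that procedure: exact policy evaluation by solving the linear system V = R + gamma * P_pi V over the non-terminal states, followed by greedy one-step improvement, repeated until the policy is stable. All names (`REWARD`, `P`, `evaluate`, `q`, `policy_iteration`) are mine, and the entry for action a1 in state y encodes the assumed reading of the transition model above (it returns the agent to state x); treat this as a way to check hand computations, not as the official solution.

```python
import numpy as np

# Per-step reward received in each state; z is terminal with V(z) = 0.
REWARD = {"x": -1.0, "y": -2.0, "z": 0.0}

# Transition model: P[(state, action)] -> {next_state: probability}.
P = {
    ("x", "a1"): {"y": 0.9, "x": 0.1},
    ("y", "a1"): {"x": 0.9, "y": 0.1},   # assumed reading of the garbled line
    ("x", "a2"): {"z": 0.1, "x": 0.9},
    ("y", "a2"): {"z": 0.1, "y": 0.9},
}

def evaluate(policy, gamma=1.0):
    """Exact policy evaluation: solve V = R + gamma * P_pi V over {x, y}."""
    idx = {"x": 0, "y": 1}
    A = np.eye(2)
    b = np.array([REWARD["x"], REWARD["y"]])
    for s, i in idx.items():
        for s2, p in P[(s, policy[s])].items():
            if s2 in idx:                # terms for terminal z vanish (V(z) = 0)
                A[i, idx[s2]] -= gamma * p
    v = np.linalg.solve(A, b)
    return {"x": v[0], "y": v[1], "z": 0.0}

def q(s, a, V, gamma=1.0):
    """One-step lookahead value of taking action a in state s."""
    return REWARD[s] + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items())

def policy_iteration(policy, gamma=1.0):
    """Alternate evaluation and greedy improvement until the policy is stable."""
    while True:
        V = evaluate(policy, gamma)
        improved = {s: max(("a1", "a2"), key=lambda a: q(s, a, V, gamma))
                    for s in ("x", "y")}
        if improved == policy:
            return policy, V
        policy = improved

print(policy_iteration({"x": "a2", "y": "a2"}))   # initial policy: a2 everywhere
```

Under these assumptions, the sketch converges to a2 in x and a1 in y, with V(x) = -10 and V(y) = -11/0.9 ≈ -12.22. Starting instead from a1 in both states with gamma = 1 makes `np.linalg.solve` raise a singular-matrix error: that policy never reaches z, so its values diverge; with any gamma < 1 the system becomes solvable again, which is what the final question about discounting is probing.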