COMS W4701: Artificial Intelligence, Spring 2023
Homework 3

Instructions: Compile all solutions, plots, and analysis for this assignment in a single PDF file (typed, or handwritten legibly if absolutely necessary). Problems 2 and 3 will ask you to use and minimally modify provided scripts; you will not be turning in your code for these problems. For Problem 4, implement your solutions directly in the provided Python file(s) and turn these in along with the PDF file to Gradescope. Please be mindful of the deadline, as late submissions are not accepted, as well as our course policies on academic honesty.

Problem 1: Twenty-One (16 points)

Consider a one-player version of the game twenty-one as a Markov decision process. The objective is to draw cards one at a time from an infinite deck of playing cards and acquire a card sum as large as possible without going over 21. For now we will have ten integer states in {12, ..., 21} representing the card sum (sums smaller than 12 are trivially played). At each turn we can take one of two actions from state s. Stopping yields a reward equal to s and immediately ends the game. Hitting yields zero reward, and we will either transition to a state s' with probability 1/13, where s < s' ≤ 21, or immediately end the game ("bust") with probability (s - 8)/13. (Note that these probabilities are consistent: there are 21 - s reachable sums, and (21 - s)/13 + (s - 8)/13 = 1.)

1. Write down the Bellman optimality equations for the value V(s) of each state. While you can do so by writing 10 separate equations, each equation form is similar enough that you can write one compact expression for V(s) that applies to all of them (use summation notation).
2. We perform value iteration starting with V_0(s) = 0 for all states. What are the values V_1(s) in the next iteration? Have we found the optimal values V*?
3. Now consider taking a policy iteration approach. Suppose that we have an initial policy π_1, which is to hit in every state. What are the associated values V^{π_1}(s)? You should be able to reason this out without having to solve a linear system or write an iterative program.
4. Perform the next step of policy iteration to find π_2. Have we found the optimal policy π*?
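For reference (this is the standard textbook form, not specific to this problem), the Bellman optimality equation for a generic MDP with rewards R(s, a), transition probabilities P(s' | s, a), and discount factor γ is

V*(s) = max_a [ R(s, a) + γ Σ_{s'} P(s' | s, a) V*(s') ].

Your compact expression in Problem 1 should specialize this to the two actions (stop, hit) and the transition model above.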
Problem 2: Twenty-One Redux (18 points)

We will now modify the twenty-one MDP to incorporate a special rule regarding the ace card. If a player's hand contains an ace, it can add either 1 or 11 to the card sum, provided that the total sum does not exceed 21. If the latter case is possible, we say that the hand contains a usable ace. The number of states is now doubled to twenty, and each can be represented by a tuple s = (sum, ace), where sum is the card sum as before and ace can be either True or False.

Suppose we take an action from state s = (sum, ace). As before, stopping returns reward equal to sum and immediately ends the game. The possible transitions for hitting are as follows, all with zero reward:

• Transition to state s' = (sum', ace), where sum < sum' ≤ 21, with probability 1/13.
• If ace is True, transition to state s' = (sum', False), where 12 ≤ sum' < sum, with probability 1/13, or to state s' = (sum, False) with probability 4/13.
• Else if ace is False, immediately end the game with probability (sum - 8)/13.

As an example, let s = (16, True). Our hand has a usable ace (ace value is 11) and card sum 16. We then choose to hit. We may draw a value between 1 and 5, which brings us to states (17, True), ..., (21, True) with probability 1/13 each. We may draw a value between 6 and 9, which brings the card sum over 21, so we "remove" usability of the ace and decrease the excessive card sum by 10. This brings us to states (12, False), ..., (15, False) with probability 1/13 each. Finally, we may draw a value of 10, which brings our state to (16, False) with probability 4/13.

The provided twentyone.ipynb script initializes an array storing the 20 state values. The first 10 are the ace = False states, and the last 10 are the ace = True states. Given a discount factor gamma, the value_iteration() method can be used to find the optimal values for this problem. The code outside the main loop assigns the transition probabilities for hitting in a 20 × 20 array, allowing us to compute the Bellman update using a simple matrix-vector product.

1. Run three instances of value iteration with the following values of the discount factor: γ = 1, γ = 0.9, and γ = 0.5. For each instance, use matplotlib to produce a separate scatter plot with the x-axis ranging from 12 to 21. Use two different markers to plot the values of states with and without a usable ace, and show a legend as well. Copy these plots to your PDF.
2. What are the optimal values and actions of the states with no usable ace? Explain why they either differ from or remain the same as those of the previous problem (depending on your observations), when aces could only have a value of 1.
3. For γ = 1, what are the optimal actions in each state with a usable ace? Explain the differences in strategy compared to the corresponding states without a usable ace.
4. Explain how and why the optimal strategies in usable-ace states change as we decrease γ.
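To make the matrix-vector Bellman update concrete, here is a minimal illustrative sketch (not the provided notebook code) that applies the same idea to just the 10 no-ace states of Problem 1; twentyone.ipynb builds the analogous 20 × 20 matrix that also covers the usable-ace transitions described above. All variable names here are assumptions for illustration.

import numpy as np

# Hit-transition matrix for the 10 no-ace states {12, ..., 21}: from sum s we reach each
# larger sum s' <= 21 with probability 1/13; the remaining mass (s - 8)/13 is the bust chance.
sums = np.arange(12, 22)
P = np.zeros((10, 10))
for i, s in enumerate(sums):
    for j, s_next in enumerate(sums):
        if s < s_next:
            P[i, j] = 1 / 13

gamma = 1.0
V = np.zeros(10)
for _ in range(100):
    V_new = np.maximum(sums, gamma * (P @ V))  # best of stopping (reward = sum) vs. hitting
    converged = np.allclose(V_new, V)
    V = V_new
    if converged:
        break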
Problem 3: Bernoulli Bandits (18 points)

You will be simulating a K-armed Bernoulli bandit in bandits.ipynb; each arm gives either a reward of 1 with a set probability p or 0 with probability 1 - p. means is a list of the p probabilities for each arm. strat and param describe the strategy used to play this bandit (ε-greedy or UCB) along with the parameter value (ε or α), and the bandit is simulated M times for T timesteps each. The returned quantities are the average regret and the average frequency with which the best (highest-mean) arm is chosen at each timestep.

execute() simulates the bandit given the same arguments above, except that params is expected to be a list, so that we can compare the simulation results for different values of ε or α. The returned values from bernoulli_bandit() are plotted vs. time on separate plots (note that regret is plotted cumulatively). The x-axis is logarithmic, so that the trends are better visualized over time. Thus, logarithmic growth will appear to be linear on these plots, while linear growth will appear to be exponential.

For each part of this problem, you will only need to set the arguments as instructed and then call execute(). You will then copy these plots into your submission and provide analysis. Make sure to have a good understanding of what each plot is showing you before responding to the questions.

1. Let's first try an "easier" problem: a 2-armed bandit with means 0.3 and 0.7, simulated for T = 10000 timesteps and averaged over M = 100 simulations. Use ε-greedy with ε values 0.1, 0.2, 0.3, and then use UCB with α values 0.1, 0.2, 0.3 (you can ignore any divide-by-zero warnings). Show the two sets of plots and briefly answer the following questions. For ε-greedy, which parameter value does better on the two metrics over the long term? Why might your answer change if we look at a shorter time period (e.g., t < 100)? How does the value of α affect whether UCB does better or worse than ε-greedy? How does α affect the order of growth of regret?
2. Now let's consider a "harder" 2-armed bandit with means 0.5 and 0.55. Perform the same simulations as in the previous part with all other values unchanged. Show the two sets of plots and briefly answer the following questions. Compare ε-greedy's performance on this problem vs. the previous one on the two metrics. For this problem, do we need a greater or smaller (or about the same) number of trials to maximize the frequency of playing the best arm? Compare the performance of ε-greedy vs. UCB on this problem on the two metrics. Which generally seems to be less sensitive to varying parameter values, and why?
3. For the last experiment, simulate a 4-armed bandit with means 0.2, 0.4, 0.6, 0.8, keeping all other values the same. Do so using both ε-greedy and UCB, and show both sets of plots. Briefly answer the following questions. Among the three bandit problems, explain why ε-greedy performs the worst on this one in terms of regret, particularly for larger values of ε. How does UCB compare to ε-greedy here when α is properly chosen (i.e., look at the best performance of UCB)? Explain any differences that you observe.

Problem 4: Frozen Lake (48 points)

Frozen Lake is a classical test example for reinforcement learning algorithms included in the Gym API. It is a grid-world problem in which most states ("Frozen") yield zero reward, the "Goal" is a terminal state that yields +1 reward, and all other terminal states ("Holes") yield zero reward. The agent may move in one of the four cardinal directions from any Frozen state, but transitions may be stochastic and unknown to the agent, so it will employ reinforcement learning.

First set up the poetry virtual environment as in past homeworks; it is essential that you do so to avoid importing and setting up the Gym libraries yourself. You will be implementing certain functions in the QLearningAgent class in agent.py. To run your code and simulate the environment, you can modify parameters in and run main.py. The agent will then train for 10000 episodes and evaluate its learned policy on 100 episodes. If the render parameter is set to True, a small graphic will pop up showing the agent's trajectory during each evaluation episode. In addition, the evaluation success rate, learned state values (remember that V(s) = max_a Q(s, a)), true state values (computed using value iteration), and value error norm will be printed to the console.
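For context only, a FrozenLake environment of the kind described above is created through the standard Gym API roughly as follows; in this assignment the provided build_env and main.py presumably take care of this, so treat the snippet as background rather than something you need to write.

import gym  # or: import gymnasium as gym, depending on which version the course environment uses

env = gym.make("FrozenLake-v1", map_name="8x8", is_slippery=True)
print(env.observation_space.n, env.action_space.n)  # 64 grid states and 4 actions for the 8x8 map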
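Before implementing the routines below, recall the standard tabular Q-learning update for a transition (s, a, r, s') with learning rate α and discount factor γ; the quantity δ is the temporal difference error that the learn() routine described in 4.1 computes and returns:

δ = r + γ max_{a'} Q(s', a') - Q(s, a)
Q(s, a) ← Q(s, a) + α δ

(For a terminal next state, the γ max_{a'} Q(s', a') term is taken to be zero.)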
4.1: Q-Learning Routines (14 points)

Implement the basic Q-learning routines in get_action() and learn(). Given a state, get_action() returns an action as an index value between 0 and the number of actions. If train is True, selection should be done following ε-greedy; otherwise, it should be done in a purely greedy manner. Both methods should reference policy, a |S| × |A| NumPy array holding the Q-values. learn() computes and returns the temporal difference error, given the current state, action taken, reward observed, and next state. The appropriate Q-value should also be updated in policy.

4.2: Simulation Routine (10 points)

Complete the QLearningAgent by implementing run(), which runs the Frozen Lake simulation max_episodes times, each for up to max_steps steps. You should use the routines you implemented above, and the function should return a dictionary mapping each episode number to its total reward. You should reference the RandomAgent class to see the overall structure of this function.

When finished, you can test your implementation on FrozenLake-v1 with the following configuration and default parameter values in main.py. You should see a 100% success rate and completely converged state values.

map_name = "4x4"
is_slippery = False
agent = agents.QLearningAgent(build_env, params)

4.3: Decaying Exploration (4 points)

Let's make FrozenLake a bit harder. Try running the following configuration, first with render set to True so that you can visualize the agent for a few episodes, then with render set to False to see the agent's average performance over 100 episodes. You will probably see a dismal success rate, and most of the state values will be pretty far off from where they should be.

map_name = "8x8"
is_slippery = True

What happened? The grid is larger, so it will take longer for the agent to find the goal. To make things worse, state transitions are now stochastic: if the agent intends to move up, it may end up moving left, right, or up, each with probability 1/3. And as before, the agent only receives a positive reward of +1 when it reaches the goal state and 0 otherwise. This is called the sparse reward problem. The agent learns very little during early episodes and may fail to learn anything at all if it does not reach the goal. All of these issues make convergence to the optimal values a lot slower, as the agent needs to train on many more samples than in the previous problem.

We need to employ smarter exploration strategies. One way to do so is to use a decaying ε-greedy Q-learning algorithm. We start with a large value of epsilon to maximize exploration in the beginning, and with each subsequent episode we multiply epsilon by a factor of eps_decay. Note that setting eps_decay to 1 simply defaults the method back to constant ε-greedy. In addition, we may also set a lower bound min_eps in case we do not want exploration to go away completely in the last set of episodes.
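As a rough sketch of this decay schedule, applied once at the end of each training episode (the variable names and numbers are illustrative, not necessarily those in agent.py); the same pattern also works for the learning-rate decay in 4.4:

eps, eps_decay, min_eps = 1.0, 0.999, 0.05
for episode in range(10000):
    # ... run one training episode here ...
    eps = max(min_eps, eps * eps_decay)  # decay exploration, but never below the floor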
Modify your Q-learning implementation to incorporate decaying ε-greedy. It should be relatively simple to do so and should require just a couple of lines of code.

4.4: Decaying Learning Rate (4 points)

Decaying ε will help a bit, but it will still not be enough. You can experiment by picking a decay rate that decreases ε appreciably but not completely over 10000 episodes, but you should see that the state values still do not consistently converge. Another obstacle is the constant learning rate. Even though we have decaying exploration, the stochasticity of the problem means that we will continue performing updates even when we take optimal actions, making our Q-values unstable.

The idea here is the same as with decaying ε: multiply the learning rate by a factor lr_decay after each episode, until it reaches min_lr. Modify your Q-learning implementation to incorporate a decaying learning rate. It should be relatively simple to do so and should require just a couple of lines of code.

4.5: Analysis (16 points)

Let's now study the effectiveness of the learning algorithm and the improvements. Run each of the following simulation settings, report your findings, and briefly answer any questions. Again, remember to set "render": False to complete the evaluation phase of each simulation.

1. Use a constant ε = 0.5 and a constant learning rate of 0.1 to simulate the 4 × 4 and 8 × 8 worlds, both with is_slippery=False. What are the success rates and value error norms, and why are they so drastically different? What happens when you increase ε to 0.99?
2. Set is_slippery=True and simulate the 4 × 4 world with the parameter values above (try both values of ε), and report your findings. Why can't we get a 100% success rate anymore?
3. Let's move on to the slippery 8 × 8 world. How does constant ε-greedy perform for each of the two values above? Report your findings.
4. For this last part, let's see if our decaying parameter implementations can help on the slippery 8 × 8 world. Set the starting epsilon and learning rate values to 1. Then, for each of epsilon_decay and learning_rate_decay, experiment with the values 0.99, 0.999, and 0.9999 (and possibly others if you like). Report the combination of values that you found to give the best results for success rate and value error norm. It is acceptable if multiple values appear to perform well for each parameter.

Submission

You should have one PDF document containing your analysis for each problem, as well as the completed Python file agent.py implementing Problem 4. Submit the document and the code file to the respective assignment bins on Gradescope. For full credit, you must tag your pages for each problem on the former (the PDF document).