COMS W4701: Artificial Intelligence, Spring 2023
Homework 3
Instructions:

Compile all solutions, plots, and analysis for this assignment in a single PDF file (typed, or handwritten legibly if absolutely necessary). Problems 2 and 3 will ask you to use and minimally modify provided scripts; you will not be turning in your code for these problems. For problem 4, implement your solutions directly in the provided Python file(s) and turn these in along with the PDF file to Gradescope. Please be mindful of the deadline (late submissions are not accepted), as well as our course policies on academic honesty.
Problem 1: Twenty-One (16 points)

Consider a one-player version of the game twenty-one as a Markov decision process. The objective is to draw cards one at a time from an infinite deck of playing cards and acquire a card sum as large as possible without going over 21. For now we will have ten integer states in {12, ..., 21} representing the card sum (sums smaller than 12 are trivially played). At each turn we can take one of two actions from state s. Stopping yields a reward equal to s and immediately ends the game. Hitting yields zero reward, and we will either transition to a state s′ with probability 1/13, where s < s′ ≤ 21, or immediately end the game (“bust”) with probability (s − 8)/13.
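As a quick sanity check (not required as part of your solutions), the hitting outcomes from any state s are properly normalized: there are 21 − s reachable states s′, so the probabilities sum to (21 − s)/13 + (s − 8)/13 = 13/13 = 1.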
1. Write down the Bellman optimality equations for the value V(s) of each state. While you can do so by writing 10 separate equations, each equation form is similar enough that you can write one compact expression for V(s) that applies to all of them (use summation notation).

2. We perform value iteration starting with V_0(s) = 0 for all states. What are the values V_1(s) in the next iteration? Have we found the optimal values V*?

3. Now consider taking a policy iteration approach. Suppose that we have an initial policy π_1, which is to hit in every state. What are the associated values V^{π_1}(s)? You should be able to reason this out without having to solve a linear system or write an iterative program.

4. Perform the next step of policy iteration to find π_2. Have we found the optimal policy π*?
Problem 2: Twenty-One Redux (18 points)

We will now modify the twenty-one MDP to incorporate a special rule regarding the ace card. If a player’s hand contains an ace, it can add either 1 or 11 to the card sum, provided that the total sum does not exceed 21. If the latter case is possible, we say that the hand contains a usable ace. The number of states is now doubled to twenty, and each can be represented by a tuple s = (sum, ace), where sum is the card sum as before and ace can be either True or False.

Suppose we take an action from state s = (sum, ace). As before, stopping returns reward equal to sum and immediately ends the game. The possible transitions for hitting are as follows, all with zero reward:
• Transition to state s′ = (sum′, ace), where sum < sum′ ≤ 21, with probability 1/13.

• If ace is True, transition to state s′ = (sum′, False), where 12 ≤ sum′ < sum, with probability 1/13, or to state s′ = (sum, False) with probability 4/13.

• Else if ace is False, immediately end the game with probability (sum − 8)/13.
As an example, let s = (16, True). Our hand has a usable ace (ace value is 11) and card sum 16. We then choose to hit. We may draw a value between 1 and 5, which brings us to states (17, True), ..., (21, True) with probability 1/13 each. We may draw a value between 6 and 9, which brings the card sum over 21, so we “remove” usability of the ace and decrease the excessive card sum by 10. This brings us to states (12, False), ..., (15, False) with probability 1/13 each. Finally, we may draw a value of 10, which brings our state to (16, False) with probability 4/13.
The provided twentyone.ipynb script initializes an array storing the 20 state values. The first 10 are the ace = False states, and the last 10 are the ace = True states. Given a discount factor gamma, the value_iteration() method can be used to find the optimal values for this problem. The code outside the main loop assigns the transition probabilities for hitting in a 20 × 20 array, allowing us to compute the Bellman update using a simple matrix-vector product.
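For intuition, a vectorized Bellman update of this kind might look like the sketch below. The variable names (P, stop_reward) are illustrative rather than the notebook’s own, and P is left as a zero placeholder instead of being filled with the hitting probabilities described above.

    import numpy as np

    n_states = 20
    # P[s, s']: probability of reaching s' when hitting from s (placeholder here;
    # the notebook fills a 20 x 20 array with the probabilities described above).
    P = np.zeros((n_states, n_states))
    # Stopping reward equals the card sum; no-ace states first, then usable-ace states.
    stop_reward = np.concatenate([np.arange(12, 22), np.arange(12, 22)]).astype(float)
    gamma = 0.9

    V = np.zeros(n_states)
    for _ in range(1000):
        hit_value = gamma * (P @ V)                  # hitting has zero immediate reward
        V_new = np.maximum(stop_reward, hit_value)   # Bellman optimality update over {stop, hit}
        if np.max(np.abs(V_new - V)) < 1e-10:        # stop once the values have converged
            break
        V = V_new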
1. Run three instances of value iteration with the following values of discount factor: γ = 1, γ = 0.9, and γ = 0.5. For each instance, use matplotlib to produce a separate scatter plot with the x-axis ranging from 12 to 21. Use two different markers to plot the values of states with and without a usable ace, and show a legend as well. Copy these plots to your PDF. (A minimal plotting sketch is given after this list.)

2. What are the optimal values and actions of the states with no usable ace? Explain why they either differ or remain the same (depending on your observations) from those of the previous problem, when aces could only have a value of 1.

3. For γ = 1, what are the optimal actions in each state with a usable ace? Explain the differences in strategy compared to the corresponding states without a usable ace.

4. Explain how and why the optimal strategies in usable ace states change as we decrease γ.
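A minimal plotting sketch for part 1 is shown below, assuming the 20-element value array lays out the ace = False states first (as described above); values is only a placeholder for the array produced by value_iteration().

    import numpy as np
    import matplotlib.pyplot as plt

    values = np.zeros(20)        # placeholder; use the array returned by value_iteration()
    sums = np.arange(12, 22)

    plt.scatter(sums, values[:10], marker='o', label='no usable ace')   # ace = False states
    plt.scatter(sums, values[10:], marker='^', label='usable ace')      # ace = True states
    plt.xlabel('card sum')
    plt.ylabel('state value')
    plt.title('gamma = 1')
    plt.legend()
    plt.show()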
Problem 3: Bernoulli Bandits (18 points)

You will be simulating a K-armed Bernoulli bandit in bandits.ipynb; each arm gives either a reward of 1 with a set probability p or 0 with probability 1 − p. means is a list of the p probabilities for each arm. strat and param describe the strategy used to play this bandit (ε-greedy or UCB) along with the parameter value (ε or α), and the bandit is simulated M times for T timesteps each. The returned quantities are the average regret and the average frequency that the best (highest mean) arm is chosen at each timestep.

execute() simulates the bandit given the same arguments above, except that params is expected to be a list, so that we can compare the simulation results for different values of ε or α. The returned values from bernoulli_bandit() are plotted vs time on separate plots (note that regret is plotted cumulatively). The x-axis is logarithmic, so that the trends are better visualized over time. Thus, logarithmic growth will appear to be linear on these plots, while linear growth will appear to be exponential.
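Before running the experiments, it may help to recall what the two strategies are doing. The following is a generic sketch of ε-greedy and UCB arm selection on a Bernoulli bandit, not the notebook’s code; the function names and the particular UCB bonus (scaled by α) are assumptions.

    import numpy as np

    rng = np.random.default_rng(0)

    def select_arm(t, counts, estimates, strat, param):
        if strat == "epsilon-greedy":
            # Explore uniformly with probability epsilon, otherwise exploit the best estimate.
            if rng.random() < param:
                return int(rng.integers(len(counts)))
            return int(np.argmax(estimates))
        # UCB: play every arm once, then add a confidence bonus scaled by alpha.
        if np.any(counts == 0):
            return int(np.argmin(counts))
        return int(np.argmax(estimates + np.sqrt(param * np.log(t) / counts)))

    def run_bandit(means, strat, param, T):
        counts = np.zeros(len(means))
        estimates = np.zeros(len(means))
        regret = 0.0
        for t in range(1, T + 1):
            a = select_arm(t, counts, estimates, strat, param)
            reward = float(rng.random() < means[a])               # Bernoulli reward
            counts[a] += 1
            estimates[a] += (reward - estimates[a]) / counts[a]   # incremental mean update
            regret += max(means) - means[a]                       # expected regret of the chosen arm
        return regret

    print(run_bandit([0.3, 0.7], "epsilon-greedy", 0.1, 10000))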
For each part of this problem, you will only need to set the arguments as instructed and then call execute(). You will then copy these plots into your submission and provide analysis. Make sure you have a good understanding of what each plot is showing you before responding to the questions.
1. Let’s first try an “easier” problem: a 2-armed bandit with means 0.3 and 0.7, simulated for T = 10000 timesteps and averaged over M = 100 simulations. Use ε-greedy with ε values 0.1, 0.2, 0.3, and then use UCB with α values 0.1, 0.2, 0.3 (you can ignore any divide-by-zero warnings). Show the two sets of plots and briefly answer the following questions.

• For ε-greedy, which parameter value does better on the two metrics over the long term? Why might your answer change if we look at a shorter time period (e.g., t < 100)?

• How does the value of α affect whether UCB does better or worse than ε-greedy? How does α affect the order of growth of regret?
2. Now let’s consider a “harder” 2-armed bandit with means 0.5 and 0.55. Perform the same simulations as in the previous part with all other values unchanged. Show the two sets of plots and briefly answer the following questions.

• Compare ε-greedy’s performance on this problem vs the previous one on the two metrics. For this problem, do we need a greater or smaller (or about the same) number of trials to maximize the frequency of playing the best arm?

• Compare the performance of ε-greedy vs UCB on this problem on the two metrics. Which generally seems to be less sensitive to varying parameter values, and why?
3. For the last experiment, simulate a 4-armed bandit with means 0.2, 0.4, 0.6, 0.8, keeping all other values the same. Do so using both ε-greedy and UCB, and show both sets of plots. Briefly answer the following questions.

• Among the three bandit problems, explain why ε-greedy performs the worst on this one in terms of regret, particularly for larger values of ε.

• How does UCB compare to ε-greedy here when α is properly chosen (i.e., look at the best performance of UCB)? Explain any differences that you observe.
Problem 4: Frozen Lake (48 points)

Frozen Lake is a classical test example for reinforcement learning algorithms included in the Gym API. It is a grid world problem in which most states (“Frozen”) yield zero reward, the “Goal” is a terminal state that yields +1 reward, and all other terminal states (“Holes”) yield zero reward. The agent may move in one of the four cardinal directions from any Frozen state, but transitions may be stochastic and unknown to the agent, so it will employ reinforcement learning.

First set up the poetry virtual environment as in past homeworks. It is essential that you do so to avoid importing and setting up the Gym libraries yourself. You will be implementing certain functions in the QLearningAgent class in agent.py. To run your code and simulate the environment, you can modify parameters in and run main.py. The agent will then train for 10000 episodes and evaluate its learned policy on 100 episodes. If the render parameter is set to True, a small graphic will pop up showing the agent’s trajectory during each evaluation episode. In addition, the evaluation success rate, learned state values (remember that V(s) = max_a Q(s, a)), true state values (computed using value iteration), and value error norm will be printed to the console.
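For reference, a minimal sketch of how a FrozenLake environment is typically constructed and stepped through the Gym API is shown below; main.py and build_env already handle this for you, and depending on the Gym version installed by poetry, reset() may return only the observation and step() a 4-tuple.

    import gym   # the provided environment may instead use gymnasium; these calls are the same there

    env = gym.make("FrozenLake-v1", map_name="4x4", is_slippery=False)
    obs, info = env.reset()                      # newer Gym/Gymnasium versions return (obs, info)
    done = False
    while not done:
        action = env.action_space.sample()       # random action; the agent should learn better ones
        obs, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated
    env.close()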
4.1: Q-Learning Routines (14 points)

Implement the basic Q-learning routines in get_action() and learn(). Given a state, get_action() returns an action as an index value between 0 and the number of actions. If train is True, selection should be done following ε-greedy; otherwise, it should be done in a purely greedy manner. Both methods should reference policy, a |S| × |A| NumPy array holding the Q-values.

learn() computes and returns the temporal difference error, given the current state, the action taken, the reward observed, and next_state. The appropriate Q-value should also be updated in policy.
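For orientation, here is a minimal sketch of standard ε-greedy selection and the tabular Q-learning update; the attribute names (self.policy, self.epsilon, self.lr, self.gamma) are assumptions and may not match the provided class, and the methods are shown outside the class for brevity.

    import numpy as np

    def get_action(self, state, train=True):
        if train and np.random.rand() < self.epsilon:
            return int(np.random.randint(self.policy.shape[1]))   # explore: uniform random action
        return int(np.argmax(self.policy[state]))                 # exploit: greedy on current Q-values

    def learn(self, state, action, reward, next_state):
        # TD error: r + gamma * max_a' Q(s', a') - Q(s, a)
        td_error = reward + self.gamma * np.max(self.policy[next_state]) - self.policy[state, action]
        self.policy[state, action] += self.lr * td_error          # move Q(s, a) toward the TD target
        return td_error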
4.2: Simulation Routine (10 points)

Complete the QLearningAgent by implementing run(), which runs the frozen lake simulation max_episodes times, each up to max_steps. You should use the routines you implemented above, and a dictionary mapping each episode number to total rewards is returned. You should reference the RandomAgent class to see the overall structure of this function. When finished, you can test your implementation on FrozenLake-v1 with the following configuration and default parameter values in main.py. You should see a 100% success rate and completely converged state values.

    map_name = "4x4"
    is_slippery = False
    agent = agents.QLearningAgent(build_env, params)
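As a rough sketch of the episode loop (again, mirror RandomAgent for the real structure; the environment interface and attribute names such as self.env, self.max_episodes, and self.max_steps are assumptions here):

    def run(self):
        rewards = {}
        for episode in range(self.max_episodes):
            state, _ = self.env.reset()          # may be just `state = self.env.reset()` in older Gym
            total_reward = 0.0
            for _ in range(self.max_steps):
                action = self.get_action(state, train=True)
                next_state, reward, terminated, truncated, _ = self.env.step(action)
                self.learn(state, action, reward, next_state)
                total_reward += reward
                state = next_state
                if terminated or truncated:
                    break
            rewards[episode] = total_reward      # dictionary mapping episode number to total reward
        return rewards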
4.3: Decaying Exploration (4 points)

Let’s make FrozenLake a bit harder. Try running the following configuration, first with render set to True so that you can visualize the agent for a few episodes, then with render set to False to see the agent’s average performance over 100 episodes. You will probably see a dismal success rate, and most of the state values are pretty far off from where they should be.
    map_name = "8x8"
    is_slippery = True
What happened? The grid is larger, so it will take longer for the agent to find the goal. To make things worse, state transitions are now stochastic: if the agent intends to move up, it may end up moving left, right, or up, each with probability 1/3. And as before, the agent only receives positive reward +1 when it reaches the goal state and 0 otherwise. This is called the sparse reward problem. The agent learns very little during early episodes and may fail to learn anything if it does not reach the goal. All of these issues make convergence to the optimal values a lot slower, as the agent would need to train on a lot more samples than in the previous problem.
We need to employ smarter exploration strategies. One way to do so is to use a decaying ε-greedy Q-learning algorithm. We will start with a large value of epsilon to maximize exploration in the beginning, and with each subsequent episode we will multiply epsilon by a factor of eps_decay. Note that setting eps_decay to 1 will simply default the method back to constant ε-greedy. In addition, we may also have a lower bound min_eps in case we do not want exploration to go away completely in the last set of episodes.
Modify your Q-learning implementation to incorporate decaying ε-greedy. It should be relatively simple to do so and require just a couple lines of code.
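As a sketch (attribute names assumed), the decay can be applied once at the end of each training episode:

    # End of each training episode; eps_decay = 1 recovers constant epsilon-greedy.
    self.epsilon = max(self.min_eps, self.epsilon * self.eps_decay)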
4.4: Decaying Learning Rate (4 points)

Decaying ε will help a bit, but it will still not be enough. You can experiment a bit by picking a decay rate that will decrease ε appreciably but not completely over 10000 episodes, but you should see that the state values still do not consistently converge. Another obstacle is the constant learning rate. Even though we have decaying exploration, the stochasticity of the problem means that we will continue performing updates when we take optimal actions, making our Q-values unstable. The idea here is the same as with decaying ε: multiply the learning rate by a factor lr_decay after each episode, up until it reaches min_lr.
Modify your Q-learning implementation to incorporate a decaying learning rate. It should be relatively simple to do so and require just a couple lines of code.
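The learning-rate decay is analogous to the ε decay above (attribute names again assumed):

    # End of each training episode, alongside the epsilon decay.
    self.lr = max(self.min_lr, self.lr * self.lr_decay)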
4.5: Analysis (16 points)

Let’s now study the effectiveness of the learning algorithm and improvements. Run each of the following simulation settings, report your findings, and briefly answer any questions. Again, remember to set "render": False to complete the evaluation phase of each simulation.
1. Use a constant ε = 0.5 and constant learning rate 0.1 to simulate the 4 × 4 and 8 × 8 worlds, both with is_slippery=False. What are the success rates and value error norms, and why are they so drastically different? What happens when you increase ε to 0.99?

2. Set is_slippery=True and simulate the 4 × 4 world with the parameter values above (try both values of ε), and report your findings. Why can’t we get a 100% success rate anymore?

3. Let’s move on to the slippery 8 × 8 world. How does constant ε-greedy perform for each of the two values above? Report your findings.
4. For this last part, let’s see if our decaying parameter implementations can help on the slippery 8 × 8 world. Set the starting epsilon and learning_rate values to 1. Then for each of epsilon_decay and learning_rate_decay, experiment with values 0.99, 0.999, and 0.9999 (and possibly others if you like). Report the combination of values that you found to give you the best results for success rate and value error norm. It is acceptable if multiple values appear to perform well for each parameter.
Submission

You should have one PDF document containing your analysis for each problem, as well as the completed Python file agent.py implementing problem 4. Submit the document and code file to the respective assignment bins on Gradescope. For full credit, you must tag your pages for each given problem on the former.