Lab 3
1. For Epoch 1 Sample Trial 1, the trial ends with a reward of 10 after the 1st trajectory. The learning rate is 0.6 in this trial, and the Q-value for state 1, action 1 remains -1.1 the entire time. (COMPARING) Learning rates of 0.6 and 1.1 give the same reward in the end. A learning rate of 1.1 could produce intense value changes, like the state changes in Lab 1, for example by exploring strategically. The decrease from 1.1 to 0.6 could represent learning instead of exploring: 1.1 could represent a beginner agent with a high learning rate, while 0.6 could represent a more experienced agent. After Epoch 1 Sample Trial 1, the Q-values were updated within the BasicQLearner, so the agent could be learning more as the learning rate decreases. (CONTRASTING) After the second trajectory, the agent goes for the actions with the higher Q-values. The final reward is 20, higher than the Epoch 1 sample, which makes sense because Sample Trial 2 indicates a similar learning pattern.

2. B. The first trajectory uses a learning rate of 1.1. The reward was -1, so the agent only waited once, the waiting penalty being -1. The final reward then ends up at -10001, with a crash in the second part at the 6th state. After the first trajectory, the learning rate is 0.6 in Epoch Sample Trial 1. The decrease in the learning rate from 1.1 to 0.6 could mean the agent is learning more, as in Task 1. There is more exploration and there are more filled-in Q-values, which could indicate that the agent is more knowledgeable about the environment in the small parking MDP. The final reward for this trajectory is 1 less than the final reward for the 4th trajectory, while the final reward for the 3rd trajectory (Epoch Sample Trial 2) is much worse at -10002. Sample Trial 1 starts to favor the higher Q-values more often, which may mean the agent is gaining experience, whereas Sample Trial 2 includes the agent crashing. The -1000 component is a bit unclear, but it could indicate the agent pulling out of the parking spot and scratching a car. The agent seems more careful and knowledgeable in Sample Trial 1 than in Sample Trial 2, although both trials end in a state where a car in the lot could have been scratched. The agent in Sample Trial 2 seems to struggle more with learning than in Task 1; however, in the first trajectory the total reward is much worse than in the second, which is similar to what was seen in Task 1. This differs from Task 1, where the final reward improves over time; here the agent may instead have learned how to avoid losses. The average reward improves over the trials, but the final reward does not, and the standard deviation may also indicate that more learning is going on over time.
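Both of these observations hinge on how the learning rate scales each Q-value update. As a minimal sketch, assuming a standard tabular Q-learning rule (the table shape, discount factor, and reward value below are illustrative, not the lab's actual BasicQLearner settings):

```python
import numpy as np

# Minimal tabular Q-learning update. Table shape, gamma, and the reward used
# in the example are assumptions for illustration, not values from the lab.
def q_update(Q, state, action, reward, next_state, alpha=0.6, gamma=0.9):
    """Move Q[state, action] a fraction alpha of the way toward the target."""
    target = reward + gamma * np.max(Q[next_state])        # bootstrapped estimate
    Q[state, action] += alpha * (target - Q[state, action])
    return Q

Q = np.zeros((8, 2))                      # assumed small table: 8 states, 2 actions
Q = q_update(Q, state=1, action=1, reward=-1, next_state=2)
print(Q[1, 1])                            # -0.6 after one waiting penalty of -1
# With alpha = 0.6, each update moves 60% of the way toward the target; a rate
# above 1 (e.g. 1.1) overshoots the target, giving more volatile updates, which
# fits the "beginner vs. experienced agent" reading above.
```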
D.

3. B.
D.
E. In the learning curves with varying greed, the average rewards are more spread out across the different learners (the sketch after this answer illustrates how the greed parameter enters action selection). The much less greedy Q-learner starts with the lowest average rewards and then earns higher average rewards as it appears to learn more, but it never ends up with positive rewards, which could be because of its low level of greediness. In terms of these graphs, being more greedy seems to be the better idea: a less greedy agent may end up waiting more or crashing because it is not being tough when parking. The less greedy Q-learner starts with low average rewards, though not as low as the much less greedy Q-learner, and it ends up with consistently positive rewards, which could mean it learns effectively. Lastly, the greedier Q-learner starts out negative near -1000 and then also achieves consistently positive rewards, which could mean it is learning effectively and that its balance between exploration and exploitation improves over time. All three Q-learners start out with negative rewards, and apart from the much less greedy Q-learner, they end up with positive rewards. This is a result of varying greed, and higher greed seems to do best when parking. In a busy parking lot with the busy rate at 0.9, being more greedy seems to get the best result: to find a parking space the quickest, you need to think only about yourself.

Additionally, with the varying learning-rate hyperparameters, all three learners have very similar learning curves. The High LR Q-learner starts out with a lower average reward than the Middle LR and Low LR Q-learners. The High LR Q-learner has the highest learning rate at 1.0, while the Middle LR Q-learner is at 0.1 and the Low LR Q-learner is at 0.01. The High LR Q-learner having the highest learning rate makes sense because it starts out with the lowest average reward, so the agent needs more learning to reach a positive average reward. The Middle LR Q-learner starts out highest of the three, but it still starts with a negative average reward. With the middle and low learners having similar learning rates, their curves are very similar: both agents seem to learn at similar rates and may rely more on already-learned Q-values.
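As a rough illustration of how the greed setting biases action selection (a sketch under assumed names; greed_prob and the values used are not the lab's actual QLearner parameters):

```python
import random

def choose_action(Q, state, actions, greed_prob=0.9):
    """Exploit the best-known action with probability greed_prob, else explore.

    A 'greedier' learner uses a higher greed_prob, so it leans on its learned
    Q-values; a 'much less greedy' learner keeps exploring, which in a busy lot
    means it keeps paying waiting and crash penalties. Values are illustrative.
    """
    if random.random() < greed_prob:
        return max(actions, key=lambda a: Q.get((state, a), 0.0))   # exploit
    return random.choice(actions)                                    # explore
```

Under this reading, the greedier Q-learner corresponds to a high greed_prob and the much less greedy Q-learner to a low one, which is consistent with the least greedy learner never reaching positive average rewards at a 0.9 busy rate.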
4. B.
D.
E. For the GreedierQLearner, the reward remains positive in the mostlyEmptyParkingLot. The probability of greed is very high, so a positive reward makes sense here: it could represent the agent finding a parking spot very quickly with nothing negative happening. Even though the learning rate is 0.1, the agent seems to react well in this lot. In the halfBusyParkingLot, the GreedierQLearner moves between positive and negative rewards consistently at the beginning. The learning rate is the same, but in a busier parking lot it may be harder for the agent to find a spot successfully. The majority of the average rewards remain positive, but the agent starts out oscillating between negative and positive, which could represent scratching another car or having to wait longer. Lastly, the VERYbusyParkingLot produces the most negative average rewards for the GreedierQLearner. With a busy rate of 0.99, the agent could struggle more to find open spots and to learn how to deal with them. The average rewards range into the high negatives, which could also represent having to wait longer for a spot.

For the LessGreedyQLearner, the MDPs seem to produce better performance than they did for the GreedierQLearner. The VeryBusyLot is where this Q-learner performs worst: the agent starts out negative in the -1000s and appears to achieve higher average rewards as the epoch count grows. In the graph, the agent may be having difficulty at first finding a parking spot in such a crowded lot; it eventually adjusts, greedy as it is, and finds spots more quickly. The dip and recovery in the curve could represent the agent learning how to find a spot and then forgetting what to do. The learning rate is 0.1, so it does not need to learn much per update. In the HalfBusyLot, the agent starts out successfully with positive average rewards, but over time it struggles with the MDP, moving in and out of the negatives: it either finds a spot quickly or has to wait longer, and it may crash sometimes. This makes sense for a less greedy learner in a lot like that, because it has more patience but fewer skills for parking quickly. Being greedy is not the best choice in general, but in a lot it is the best thing to do if the agent wants to park quicker. The rewards become less negative, which can indicate that the agent is learning and exploiting more. In the MostlyEmptyLot, the Q-learner responds mostly positively to the MDP; there is one point where the average reward is negative, which could indicate a crash or a longer wait. This makes sense with fewer people around, which means less stress and more open parking spots.

For the MuchLessGreedyQLearner, the MDPs produce average performance except for the VeryBusyLot, where performance is bad. In that graph, this Q-learner seems to crash far more often than any of the other Q-learners, a result of being less greedy when parking. At many points its average rewards are in the high negatives, and this makes up the majority of the curve. It reacts badly to a high busy rate: it lacks the confidence and toughness to fight for parking spots in such a busy area. In the HalfBusyLot, the MDP also has a bad effect on this Q-learner, with a lot of crashing and longer waits; there are points where the agent receives a positive reward, but they are minimal in the graph. With fewer people around it may be easier to find parking spots, but not entirely. In the mostlyEmptyLot, the MDP still seems to have a bad effect on the Q-learner. While this is surprising, more positive average rewards can be seen, although the Q-learner still crashes often. It could be exploring more, which could lead to negative rewards, and it may not handle learning from negative feedback or transitioning between policies well. This Q-learner reacted the worst in all three environments compared to the other two. Overall, the average rewards are mostly negative, and being less greedy does not seem to work well with these MDPs.
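These comparisons all come from learning curves of average reward per epoch in each lot. A minimal sketch of such an experiment loop, assuming a hypothetical ParkingMDP interface (reset/step/actions) and illustrative parameter values rather than the lab's real classes:

```python
import random
from collections import defaultdict

def run_learning_curve(env, greed_prob, alpha=0.1, gamma=0.9,
                       epochs=100, trials_per_epoch=20):
    """Return the average total reward per epoch for one greed-biased Q-learner."""
    Q = defaultdict(float)                          # Q[(state, action)] -> value
    curve = []
    for _ in range(epochs):
        total = 0.0
        for _ in range(trials_per_epoch):
            s, done = env.reset(), False
            while not done:
                if random.random() < greed_prob:    # exploit learned values
                    a = max(env.actions(s), key=lambda a: Q[(s, a)])
                else:                               # explore
                    a = random.choice(env.actions(s))
                s2, r, done = env.step(a)           # r includes wait/crash penalties
                best_next = 0.0 if done else max(Q[(s2, a2)] for a2 in env.actions(s2))
                Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
                total += r
                s = s2
        curve.append(total / trials_per_epoch)
    return curve

# Hypothetical comparison across busy rates and greed levels:
# for busy_rate in (0.5, 0.9, 0.99):
#     env = ParkingMDP(busy_rate=busy_rate)         # assumed constructor
#     for greed in (0.95, 0.7, 0.3):                # greedier -> much less greedy
#         print(busy_rate, greed, run_learning_curve(env, greed)[-1])
```

Plotted per environment, curves like these would show the pattern described above: the busier the lot, the more the final average reward depends on how greedy (exploitative) the learner is.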