Homework 1: Cliff walking

Due 2/13/2012.

This assignment is to code the Cliff Walking example (Example 6.6) in Sutton and Barto, "Reinforcement Learning: An Introduction", Chapter 6. You should solve the problem using both SARSA and Q-learning. Use epsilon-greedy exploration with epsilon = 0.1 (the agent takes a random action 10 percent of the time in order to explore).

The programming should be done in MATLAB. Students may get access to MATLAB here. Alternatively, students may code in Python (using NumPy). Students who would rather code in a different language should see Dr. Platt.
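As a rough guide, the sketch below shows epsilon-greedy action selection and the two tabular update rules in Python/NumPy (one of the permitted languages). The function names, the step size alpha, and the discount gamma are illustrative assumptions, not part of the assignment; only epsilon = 0.1 is specified.

    import numpy as np

    rng = np.random.default_rng()

    def epsilon_greedy(Q, s, epsilon=0.1):
        # Explore with probability epsilon; otherwise act greedily on Q[s].
        if rng.random() < epsilon:
            return int(rng.integers(Q.shape[1]))
        return int(np.argmax(Q[s]))

    def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.5, gamma=1.0):
        # On-policy target: bootstraps from the action actually taken in s_next.
        Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])

    def q_learning_update(Q, s, a, r, s_next, alpha=0.5, gamma=1.0):
        # Off-policy target: bootstraps from the greedy action in s_next.
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])

Here Q is a (number of states) x (number of actions) array; leaving the entries for the goal state at zero gives the correct zero-valued bootstrap target at the end of an episode.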

Students should submit their homework via email to the TA (zihechen@buffalo.edu) in the form of a ZIP file that includes the following:

1. A PDF of a plot of the gridworld that illustrates the paths found by Q-learning and SARSA. It should look like the diagram in Figure 6.13 in SB.

2. A PDF of a plot of reward per episode for SARSA and Q-learning.

3. A text file showing output from a sample run of your code.

4. A directory containing all source code for your project.

Update

The cliff walking section of the online book does not specify a different reward in the goal state compared with the reward in other states. This is because, in principle, it does not matter what the reward in the goal state is: the agent receives a reward of -1 on each time step prior to reaching the goal state, so it should learn to act so as to minimize the time it takes to reach the goal. If you want, it is acceptable to add an extra positive reward for reaching the goal state, but this is not necessary.
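For concreteness, one possible way to encode this reward structure (following Example 6.6: -1 on every transition, and -100 plus a return to the start for stepping into the cliff region) is sketched below in Python. The grid coordinates and the goal_bonus parameter are illustrative assumptions; goal_bonus defaults to 0 since the extra reward is not required.

    # Cliff-walking grid from Example 6.6: 4 rows x 12 columns, start at the
    # bottom-left corner, goal at the bottom-right, cliff along the bottom edge.
    START = (3, 0)
    GOAL = (3, 11)
    CLIFF = {(3, c) for c in range(1, 11)}

    def reward_and_next_state(s_next, goal_bonus=0.0):
        # -1 on every step; stepping into the cliff costs -100 and resets the
        # agent to the start. goal_bonus is the optional extra reward discussed
        # above (0 by default, since it is not necessary).
        if s_next in CLIFF:
            return -100, START
        if s_next == GOAL:
            return -1 + goal_bonus, s_next
        return -1, s_next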