These videos represent the current state of the reinforcement learning portion of the MARS project Adaptive Intelligent Mobile Robots (PI: Leslie Pack Kaelbling, MIT), as of October 2001. Technical details of the JAQL system are available here, and a PhD thesis describing it in gory detail is also available (in PS and PDF). Questions about either of these documents, the videos, or the demo should be directed to Bill Smart.

The Task

The task for this demo was to get the robot to learn to navigate through a series of obstacles while trying to reach a given goal point. At the start of each experiment, the robot is "shown" a goal location (using direct joystick control), which is memorized by the odometry subsystem. We use odometry to specify the goal because, for the purposes of this demo, it is easier and more robust than specifying (and searching for) a physical target. The robot is then driven to some other location, and must try to reach the goal position again. The speed of the robot is constant, and the learning system must learn an appropriate rotation speed based on the current inputs. The position of the obstacles changes on every run (i.e. we are not trying to learn a fixed path through the obstacles).

The RL details of the task are as follows. The reward is zero everywhere, except at the goal state (1) and when hitting an obstacle (-1). This is not the easiest reward function for this task, but was chosen to make learning more challenging; a more informative reward function should make learning easier and faster. The inputs to the learning system are the heading and distance to the goal state, and the heading and distance to any obstacles that are detected. Obstacles are 3in wide posts that can be detected and isolated from the background by the scanning laser rangefinder on the robot. A separate module identifies these obstacles, and returns the heading and distance of each one to the learning system.
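The reward and input structure can be summarized in a short sketch. The distance thresholds and function names below are illustrative assumptions, not the actual JAQL code:

```python
# Minimal sketch of the reward function and learner inputs described above.
GOAL_RADIUS = 0.25       # assumed: distance (m) at which the goal counts as reached
COLLISION_RADIUS = 0.10  # assumed: distance (m) at which an obstacle counts as hit

def reward(goal_distance, obstacle_distances):
    """Zero everywhere, +1 at the goal state, -1 on hitting an obstacle."""
    if goal_distance < GOAL_RADIUS:
        return 1.0
    if any(d < COLLISION_RADIUS for d in obstacle_distances):
        return -1.0
    return 0.0

def learner_inputs(goal_heading, goal_distance, obstacles):
    """Heading and distance to the goal, plus heading and distance
    for each obstacle reported by the obstacle-detection module."""
    inputs = [goal_heading, goal_distance]
    for heading, distance in obstacles:
        inputs.extend([heading, distance])
    return inputs
```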

Initial (Phase One) Training

Initial, or phase one, training is done by directly driving the robot with a joystick. The information gathered during this direct control is used (on-line, as it is generated) to bootstrap the reinforcement learning value function approximation. Phase one learning was done both with and without obstacles. It is important to note that we are not trying to demonstrate the best possible path in this training, since the learning system is not trying to learn the paths that it is being shown. Instead, paths that are less than perfect are actually better for the learning system (see the thesis (PS, PDF) for more details). After every 5 training runs, the performance of the current best policy was evaluated.
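Roughly, phase one can be pictured as applying the usual Q-learning backup to each transition generated while the human drives. The sketch below assumes a generic predict/train interface to the value-function approximator, plus illustrative discount and learning-rate values; the thesis describes the real mechanism:

```python
# Rough sketch of a phase one update, assuming a standard Q-learning backup
# applied on-line to each transition observed under joystick control.
# The q_approx object stands in for the value-function approximator; its
# predict/train interface and the constants below are assumptions.
GAMMA = 0.99  # assumed discount factor
ALPHA = 0.2   # assumed learning rate

def phase_one_update(q_approx, state, action, reward, next_state, actions):
    """Fold one supplied (state, action, reward, next_state) transition
    into the value-function approximation."""
    best_next = max(q_approx.predict(next_state, a) for a in actions)
    target = reward + GAMMA * best_next
    old_value = q_approx.predict(state, action)
    q_approx.train(state, action, old_value + ALPHA * (target - old_value))
```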
After 5 phase one training runs
After five training runs, the learned policy is good enough to reach the goal, although the path that it follows is very erratic.
After 10 phase one training runs
After 10 runs, the path to the goal is shorter, and considerably smoother, although it is still not perfect.

Phase Two Learning

After phase one learning, the learning system takes control of the robot, following the current best learned policy (with additional exploration, as in a typical reinforcement learning system).
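One typical way to add such exploration is epsilon-greedy action selection over a discretized set of rotation speeds; the sketch below assumes that scheme and the same predict interface as above, and is not a claim about how the system itself explores:

```python
import random

EPSILON = 0.1                                  # assumed exploration rate
ROTATION_SPEEDS = [-0.6, -0.3, 0.0, 0.3, 0.6]  # assumed action set (rad/s)

def choose_rotation(q_approx, state):
    """Follow the current best learned policy, with occasional random
    exploratory actions."""
    if random.random() < EPSILON:
        return random.choice(ROTATION_SPEEDS)
    return max(ROTATION_SPEEDS, key=lambda a: q_approx.predict(state, a))
```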
After 10 phase two runs
The robot still reaches the goal, slightly more smoothly than at the end of phase one.
After 10 phase two runs - failure
Although the learning system continues to improve in general, some starting positions still cause it to fail to reach the goal exactly. In this case, it fails to come close enough to stop at the goal, and begins to circle around it. Because of the dynamics of the system and the way the experiment is designed, it is not possible to reach the goal from this circling behavior.
After 30 phase two learning runs
At this point, the robot is able to reach the goal state smoothly from any starting position. Note that the final behavior is better than any of the phase one example trajectories (which were not intended to be optimal).
After 30 phase two learning runs
A different starting position.

With Obstacles

This set of experiments used a single obstacle in the path of the robot. This makes the task much harder, since a good policy must now avoid this obstacle while still driving towards the goal.
After 5 phase one runs
The robot has learned that obstacles are bad, but makes an exaggerated turn in order to avoid this one.
After 10 phase one runs
The turns are still exaggerated, but are becoming smaller.
After 10 phase two runs
The size of the detour that the robot makes continues to get smaller during phase two.
After 30 phase two runs
Smaller still.
After 40 phase two runs
Finally, the robot learns to avoid the obstacle without making a huge turn in the process. The trajectory to the goal state has also become much smoother.

Two Obstacles

We're currently working on the two-obstacle case. This is proving to be significantly more difficult than the single-obstacle case, mostly because we're now working in a 7-dimensional space, rather than a 5-dimensional one. We do have some evidence of learning, with the robot differentiating between one and two obstacles. In the first example, the robot turns to its right to avoid the two obstacles. In the second example, starting from a similar position, the robot turns to its left, since there is only one obstacle to be avoided.
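A minimal sketch of where those dimensions come from, assuming the learner's input is the state vector plus the rotation-speed action (the breakdown and names below are illustrative assumptions):

```python
def q_input(goal, obstacles, rotation_speed):
    vec = list(goal)                     # goal heading and distance: 2 dims
    for heading, distance in obstacles:  # 2 dims per detected obstacle
        vec.extend([heading, distance])
    vec.append(rotation_speed)           # 1 dim for the action
    return vec

print(len(q_input((0.3, 2.0), [(0.1, 1.0)], 0.2)))               # 5 (one obstacle)
print(len(q_input((0.3, 2.0), [(0.1, 1.0), (-0.2, 1.5)], 0.2)))  # 7 (two obstacles)
```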
Bill Smart