Adaptive Intelligent Mobile Robots

These videos represent the current state of the reinforcement learning
portion of the MARS project Adaptive Intelligent Mobile Robots (PI: Leslie Pack Kaelbling,
MIT), as of October 2001. Technical details of the JAQL system are
available here,
and a PhD thesis describing it in gory detail is also available (in PS and PDF). Questions about either of these
documents, the videos, or the demo should be directed to Bill Smart.
The Task
The task for this demo was to get the robot to learn to navigate
through a series of obstacles, while trying to reach a given goal
point. At the start of each experiment, the robot is
"shown" a goal location (using direct joystick control),
which is memorized by the odometry subsystem. We use odometry to
specify the goal because, for the purposes of this demo, it is easier
and more robust than specifying (and searching for) a physical target.
The forward speed of the robot is held constant, and the learning system
must learn an appropriate rotation speed based on the current inputs.
The position of the obstacles changes on every run (i.e. we are
not trying to learn a fixed path through the obstacles).
The RL details of the task are as follows. The reward is zero everywhere,
except at the goal state (+1) and on hitting an obstacle (-1). This
is not the easiest reward function for this task; it was chosen deliberately
to make learning more challenging. A more informative reward function should
make learning easier and faster.
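As a concrete illustration of this reward structure, a minimal sketch in
Python might look like the following (the goal radius, the collision flag,
and the function name are assumptions made for the example, not details of
the actual system):

    GOAL_RADIUS = 0.25   # assumed threshold (metres) for "reaching" the goal

    def reward(distance_to_goal, hit_obstacle):
        """Sparse reward: +1 at the goal, -1 on a collision, 0 everywhere else."""
        if hit_obstacle:
            return -1.0
        if distance_to_goal < GOAL_RADIUS:
            return 1.0
        return 0.0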
After being shown the goal, the robot is driven to some other starting location,
and must try to reach the goal position again. The inputs to the learning system are the
heading and distance to the goal state, and the heading and distance
to any obstacles that are detected.
Obstacles are 3-inch-wide posts that can be detected and isolated from
the background by the scanning laser rangefinder on the robot. A
separate module identifies these obstacles, and returns the heading
and distance of each one to the learning system.
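These inputs can be pictured as a flat vector of headings and distances, one
pair for the goal and one pair per detected obstacle. The sketch below shows
that assembly; the function and variable names are illustrative assumptions,
not the real interface between the obstacle detector and the learning system:

    def build_inputs(goal, obstacles):
        """Assemble the learning system's inputs: heading and distance to the
        goal, followed by heading and distance to each detected obstacle.
        `goal` and each obstacle are assumed to be (heading, distance) pairs."""
        inputs = [goal[0], goal[1]]
        for heading, distance in obstacles:
            inputs.extend([heading, distance])
        return inputs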
Initial (Phase One) Training
Initial, or phase one, training is done by directly driving the robot
with a joystick. The information gathered during this direct control
is used (on-line, as it is generated) to bootstrap the reinforcement
learning value-function approximation. Phase one
learning was done both with and without obstacles. It is important to
note that we are not trying to get the best path in this training,
since the learning system is not trying to learn the paths that it is
being shown. Instead, paths that are less than perfect are actually
better for the learning system (see the thesis (PS, PDF)
for more details).
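One way to picture phase one is that every transition generated under
joystick control is pushed through the ordinary value-function update, even
though the learner did not choose the actions itself. The sketch below
captures that idea with a plain Q-learning backup over a few sampled rotation
speeds; the constants, the sampled-action maximisation, and the
predict/update interface are assumptions for illustration, not the actual
JAQL update (see the thesis for the real details):

    GAMMA = 0.99             # assumed discount factor
    ALPHA = 0.1              # assumed learning rate
    CANDIDATE_ROTATIONS = [-0.4, -0.2, 0.0, 0.2, 0.4]   # assumed rotation-speed samples (rad/s)

    def phase_one_update(q, state, action, rwd, next_state):
        """Push one joystick-generated transition through a standard Q-learning
        backup.  `q` is any approximator exposing predict(state, action) and
        update(state, action, value) -- an illustrative interface, not JAQL's."""
        best_next = max(q.predict(next_state, a) for a in CANDIDATE_ROTATIONS)
        target = rwd + GAMMA * best_next
        old = q.predict(state, action)
        q.update(state, action, old + ALPHA * (target - old))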
After every 5 training runs, the performance of the current best
policy was evaluated.
- After 5 phase one training runs
- After five training runs, the learned policy is good enough
to reach the goal, although the path that it follows is very
erratic.
- After 10 phase one training runs
- After 10 runs, the path to the goal is shorter, and considerably
smoother, although it is still not perfect.
Phase Two Learning
After phase one learning, the learning system takes control of the
robot, following the current best learned policy (with additional
exploration, as in a typical reinforcement learning system).
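In outline, phase two is a standard on-robot control loop: at each step the
robot takes the best action under the current value function, with occasional
exploratory deviations. The sketch below uses epsilon-greedy selection over a
few sampled rotation speeds purely as an example; the exploration scheme,
constants, and Q-function interface are assumptions, not a description of the
real controller:

    import random

    EPSILON = 0.1                                       # assumed exploration rate
    CANDIDATE_ROTATIONS = [-0.4, -0.2, 0.0, 0.2, 0.4]   # assumed rotation-speed samples (rad/s)

    def choose_rotation(q, state):
        """Pick a rotation speed: usually the greedy action under the learned
        Q-function, occasionally a random exploratory one."""
        if random.random() < EPSILON:
            return random.choice(CANDIDATE_ROTATIONS)
        return max(CANDIDATE_ROTATIONS, key=lambda a: q.predict(state, a))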
- After 10 phase two runs
- The robot still reaches the goal, slightly more smoothly than
at the end of phase one.
- After 10 phase two runs - failure
- Although the learning system continues to improve in general,
some starting positions still cause it to fail to get to the goal
exactly. In this case, it fails to come close enough to stop
at the goal, and begins to circle around it. Because of the
dynamics of the system, and the way in which the experiment
is designed, it is not possible to reach the goal from this
circling behavior.
- After 30 phase two learning runs
- At this point, the robot is able to reach the goal state smoothly
from any starting position. Note that the final behavior is
better than any of the phase one example trajectories (which
were not intended to be optimal).
- After 30 phase two learning runs
- A different starting position.
With Obstacles
This set of experiments used a single obstacle in the path of the
robot. This makes the task much harder, since a good policy must
now avoid the obstacle while still driving towards the goal.
- After 5 phase one runs
- The robot has learned that obstacles are bad, but makes an
exaggerated turn in order to avoid the obstacle.
- After 10 phase one runs
- The turns are still exaggerated, but are becoming smaller.
- After 10 phase two runs
- The size of the detour that the robot makes continues to get
smaller during phase two.
- After 30 phase two runs
- Smaller still.
- After 40 phase two runs
- Finally, the robot learns to avoid the obstacle without making
a huge turn in the process. The trajectory to the goal state
has also become much smoother.
Two Obstacles
We're currently working on the two-obstacle case. This is proving to
be significantly more difficult than the single-obstacle case, mostly because
we're now working in a 7-dimensional space, rather than a
5-dimensional one. We have some
evidence of learning, where the robot differentiates between one
and two obstacles. In the first example, the robot turns to its right
to avoid the two obstacles. In the second example, the robot turns to
its left, since there is only one obstacle to be avoided, from a
similar starting position.
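One accounting consistent with these 5- and 7-dimensional figures is two
inputs for the goal, two per detected obstacle, and the rotation action
itself appended as an input to the value-function approximator; this
breakdown is an assumption for illustration rather than something stated
here:

    def q_input_dimension(num_obstacles):
        """Assumed accounting: (goal heading, goal distance), plus (heading,
        distance) per obstacle, plus the rotation action as a Q-function input."""
        return 2 + 2 * num_obstacles + 1

    assert q_input_dimension(1) == 5   # one-obstacle experiments
    assert q_input_dimension(2) == 7   # two-obstacle experiments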
Bill Smart