Bounds on sample size for policy evaluation in Markov environments
Leonid Peshkin and Sayan Mukherjee
MIT Artificial Intelligence Laboratory
545 Technology Square
Cambridge, MA 02139
{pesha,sayan}@ai.mit.edu
Abstract:
Reinforcement learning is the problem of finding an optimal course of action
in a Markovian environment without knowledge of the environment's dynamics.
Stochastic optimization algorithms used in the field rely on estimates of the
value of a policy. Typically, the value of a policy is estimated from the
results of simulating that very policy in the environment. This approach
requires a large amount of simulation, since every point considered in the
policy space must be simulated separately. In this paper, we develop value
estimators that use data gathered under one policy to estimate the value of
another policy, resulting in much more data-efficient algorithms. We consider
the question of accumulating sufficient experience and give PAC-style bounds.
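
To make the idea of reusing data across policies concrete, the following is a minimal sketch of one standard technique, importance sampling, in which returns observed under a data-collecting policy are reweighted by the ratio of action probabilities under the two policies. The abstract does not specify the estimators developed in the paper, so treating importance sampling as the mechanism is an assumption here; the names `importance_sampling_value`, `behavior`, and `target` are hypothetical and chosen only for illustration.

```python
from typing import Callable, List, Sequence, Tuple

# A trajectory is a list of (state, action, reward) triples.
Step = Tuple[int, int, float]
# A policy maps (state, action) to the probability of taking that action.
Policy = Callable[[int, int], float]


def importance_sampling_value(
    trajectories: Sequence[List[Step]],
    behavior: Policy,
    target: Policy,
) -> float:
    """Estimate the expected return of `target` from data gathered under `behavior`.

    Assumes behavior(s, a) > 0 for every (s, a) that appears in the data.
    """
    total = 0.0
    for trajectory in trajectories:
        weight = 1.0      # likelihood ratio of the whole trajectory
        ret = 0.0         # undiscounted return of the trajectory
        for state, action, reward in trajectory:
            weight *= target(state, action) / behavior(state, action)
            ret += reward
        total += weight * ret
    return total / len(trajectories)


# Hypothetical one-step example: two actions, reward 1 only for action 1.
def uniform(state: int, action: int) -> float:
    return 0.5


def always_one(state: int, action: int) -> float:
    return 1.0 if action == 1 else 0.0


data = [[(0, 0, 0.0)], [(0, 1, 1.0)], [(0, 1, 1.0)], [(0, 0, 0.0)]]
estimate = importance_sampling_value(data, uniform, always_one)
```

In this toy case the data were gathered by acting uniformly at random, yet the estimate describes the policy that always takes action 1, without simulating that policy. How many trajectories are needed before such an estimate can be trusted is the kind of question the PAC-style bounds address.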
Leonid Peshkin
2003-09-24