Doctoral dissretation by Leon Peshkin

``Reinforcement Learning by Policy Search''

Adviser: Prof. Leslie Kaelbling (MIT)
Reader: Prof. John Tsitsiklis (MIT)
Reader: Prof. Tomas Dean (Brown U.)

The most recent edition: (PS, PDF), titlepage PDF;
Abstract: PDF,
(translations to Italian (PDF, PS) and Russian (KPATKOE PYCCKOE BBEDENUE)).
One objective of artificial intelligence is to model the behavior of an
intelligent agent interacting with its environment. The environment's
transformations could be modeled as a Markov chain, whose state is partially
observable to the agent and affected by its actions; such processes are known
as partially observable Markov decision processes (POMDPs). While the
environment's dynamics are assumed to obey certain rules, the agent does not
know them and must learn.

In this dissertation we focus on the agent's adaptation as captured by the
reinforcement learning framework. Reinforcement learning means learning a
policy---a mapping of observations into actions---based on feedback from the
environment. The learning can be viewed as browsing a set of policies while
evaluating them by trial through interaction with the environment.

The set of policies being searched is constrained by the architecture of the
agent's controller. POMDPs require a controller to have a memory. We
investigate various architectures for controllers with memory, including
controllers with external memory, finite state controllers and distributed
controllers for multi-agent system. For these various controllers we work out
the details of the algorithms which learn by ascending the gradient of
expected cumulative reinforcement. 

Building on statistical learning theory and experiment design theory, a
policy evaluation algorithm is developed for the case of experience re-use.
We address the question of sufficient experience for uniform convergence of
policy evaluation and obtain sample complexity bounds for various estimators.
Finally, we demonstrate the performance of the proposed algorithms on several
domains, the most complex of which is simulated adaptive packet routing in a
telecommunication network.

Keywords: Markov processes, MDP, POMDP, policy search, gradient methods, 
reinforcement learning, adaptive systems, stochastic control, adaptive behavior, 
importance sampling, optimization, decision making under uncertainty.

Small Print

The documents distributed here have been provided as a means to ensure timely dissemination of scholarly and technical work on a noncommercial basis. Copyright and all rights therein are maintained by the authors or by other copyright holders, notwithstanding that they have offered their works here electronically. It is understood that all persons copying this information will adhere to the terms and constraints invoked by each author's copyright. These works may not be reposted without the explicit permission of the copyright holder. ( Notice borrowed from Sham Kakade who borrowed it from Peter Dayan who borrowed it from Dave Plaut ).