Nonparametric Bayesian Approaches for Reinforcement Learning in Partially Observable Environments

The partially-observable Markov decision process (POMDP) framework has been successful in many planning domains where the agent must choose between actions that maximize immediate reward and actions that provide information about the environment. Unfortunately, POMDPs are defined by a large number of parameters that are difficult to set a priori; gathering enough training data may also be prohibitively expensive. Thus, it is most realistic to consider that an agent acting in a partially observable environment will have to do some learning, that is, determine some of the world's properties in an online manner through its interactions with the environment.

Reinforcement learning in partially observable domains is a difficult problem, however, as the agent never has access to the true state of the environment. In these situations, Bayesian approaches are useful as they help the agent make decisions based on distributions of possible environmental models (instead of just considering a single model). The priors associated with Bayesian techniques typically guide the use of the data to learn models more quickly than other approaches. Nonparametric approaches can alleviate the issues of model size by allowing the model to grow as more data is observed. Growing complexity in a data-directed fashion is particularly attractive for online settings where computational costs are an important consideration; nonparametric approaches can help the agent ignore parameters for which it has no data to model.
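To make the idea of a model that grows with the data concrete, the sketch below samples from a Chinese restaurant process, the clustering view of a Dirichlet process prior. It is an illustration only, not the iPOMDP model itself; the function name `crp_assignments` and the concentration parameter `alpha` are assumptions introduced here. Each new data point joins an existing component with probability proportional to that component's size, or opens a new component with probability proportional to `alpha`, so no component exists until some data point needs it.

```python
import random


def crp_assignments(n_customers, alpha, seed=0):
    """Sample component assignments from a Chinese restaurant process.

    Hypothetical helper for illustration; alpha > 0 is the concentration.
    Returns (assignments, counts), where counts[k] is the size of component k.
    """
    rng = random.Random(seed)
    counts = []        # counts[k] = number of points assigned to component k
    assignments = []   # assignments[i] = component index of point i
    for _ in range(n_customers):
        # Existing component k has weight counts[k]; a new one has weight alpha.
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)   # open a new component
        else:
            counts[k] += 1     # join an existing component
        assignments.append(k)
    return assignments, counts


if __name__ == "__main__":
    for n in (10, 100, 1000):
        _, counts = crp_assignments(n, alpha=1.0)
        print(n, "points ->", len(counts), "components")
```

The number of components is not fixed in advance; it is determined by the sampled assignments and tends to increase slowly as more points are added.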

While often data-efficient, Bayesian nonparametric (and often just Bayesian) approaches to reinforcement learning have seen limited use due to their high computational complexity. The contributions of this work are two-fold: First, we will develop Bayesian nonparametric models that are particularly suited for reinforcement learning applications. Second, we will develop efficient, online algorithms for these models so that they can be applied to real-world scenarios. The expected contributions are outlined below:

- Model-based Learning with Unbounded State and Observation Representations (iPOMDP)
  - Action selection: when should model-variance-reducing actions be used? (BEB-style approximations?)
  - Belief update/inference: fast, online methods for resampling iHMMs/HMMs
  - Active learning: when is the best time to ask for rewards, if they are not always given?
  - Can this idea extend to proposals for feature selection in more conventional planners?

- Model-free Learning with Unbounded Policy Representations (in the planning-as-inference framework)
  - Adapting the approach to an online setting
  - Demonstrating its use on a variety of active learning tasks (when should you ask for policy help?) and imitation learning tasks (how can we take advantage of a few demonstrated trajectories?)

- Limited-Precision Priors for Faster Learning
  - Prior bias is toward low-precision models, with precision added as needed (compare to something like Pitman-Yor (PY) models for the iPOMDP, which right now uses a Dirichlet process (DP))
  - Developing an efficient method for inference?
  - Managing the model-space exploration/exploitation trade-off (unlike the DP-based iPOMDP, these models probably won't just learn with very little prodding; however, they are probably easier to run Metropolis-Hastings (MH) over)
  - Can we leverage fast algorithms for near-deterministic POMDPs? (I think this is relatively doubtful, but worth looking into… I think most of the benefit of these priors will come from faster learning rates, not faster planning rates.)
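
For reference, the belief update mentioned in the first group of the outline can be sketched for the simplest case: a finite POMDP whose model is known. This is the standard Bayes filter, b'(s') ∝ O(o | s', a) Σ_s T(s' | s, a) b(s), and not the online iHMM resampling scheme proposed above. The function name `belief_update` and the nested-list layouts `T[a][s][s2]` and `O[a][s2][o]` are assumptions made for this illustration.

```python
def belief_update(belief, action, observation, T, O):
    """One step of the standard discrete POMDP belief update.

    Assumed layouts (hypothetical, for illustration):
      T[a][s][s2] = P(s2 | s, a)   transition probabilities
      O[a][s2][o] = P(o | s2, a)   observation probabilities
    """
    n = len(belief)
    # Predict: push the current belief through the transition model.
    predicted = [
        sum(belief[s] * T[action][s][s2] for s in range(n))
        for s2 in range(n)
    ]
    # Correct: weight each successor state by the observation likelihood.
    unnorm = [O[action][s2][observation] * predicted[s2] for s2 in range(n)]
    z = sum(unnorm)
    if z == 0.0:
        raise ValueError("observation has zero probability under the model")
    return [u / z for u in unnorm]


if __name__ == "__main__":
    # Two states, one action that leaves the state unchanged, two observations.
    T = [[[1.0, 0.0], [0.0, 1.0]]]
    O = [[[0.8, 0.2], [0.3, 0.7]]]
    print(belief_update([0.5, 0.5], action=0, observation=0, T=T, O=O))
```

In the learning setting considered here the agent does not know `T` and `O`, so a filter of this kind would have to be run under each sampled model, which is part of why fast online inference matters.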