Introduction to Program Synthesis

© Armando Solar-Lezama. 2022. All rights reserved.

Lecture 23: Program Synthesis with Reinforcement Learning

In the previous lecture, we saw how a neural network could be trained with supervised learning to produce a program from a given specification. That approach to synthesis requires a corpus that includes both specifications and their corresponding programs in order for the network to learn how to map from one to the other. An alternative approach is to use reinforcement learning (RL). In RL, the idea is that instead of a dataset describing the output for every input, the learner has access to an environment that produces rewards or penalties for every action of the learner. This allows for a form of self-training where the learner discovers the best actions to take.

A reinforcement learning primer

In reinforcement learning, the goal is to learn how to control an agent interacting with a possibly non-deterministic environment. The environment is defined by a state space $\mathcal{S}$, a set of actions $\mathcal{A}$, and a transition function $T : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \rightarrow \mathbb{R}$. Given a source state $s$, an action $a$, and a target state $s'$, the value $T(s, a, s')$ is the probability that if the agent is in state $s$ and performs action $a$, it will transition to state $s'$. In order for $T$ to be well formed, for every initial state and every action, the sum of $T$ over all target states must add up to 1. \[ \forall s. \forall a. \left(\sum_{s'} T(s, a, s') \right) = 1 \]
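To make the definitions concrete, the following is a minimal Python sketch of a small MDP given as an explicit transition table, together with the well-formedness check above; the states, actions, and probabilities are made up purely for illustration.

```python
import itertools

# A tiny MDP with explicit states and actions (names are illustrative only).
STATES = ["s0", "s1", "s2"]
ACTIONS = ["left", "right"]

# T[(s, a, s_next)] = probability of reaching s_next from s after action a.
T = {
    ("s0", "left",  "s0"): 1.0,
    ("s0", "right", "s1"): 0.8,
    ("s0", "right", "s2"): 0.2,
    ("s1", "left",  "s0"): 1.0,
    ("s1", "right", "s2"): 1.0,
    ("s2", "left",  "s2"): 1.0,
    ("s2", "right", "s2"): 1.0,
}

def transition_prob(s, a, s_next):
    """T(s, a, s'): probability of transitioning to s' from s on action a."""
    return T.get((s, a, s_next), 0.0)

def is_well_formed(states, actions):
    """Check that the probabilities over target states sum to 1 for every (s, a)."""
    return all(
        abs(sum(transition_prob(s, a, sn) for sn in states) - 1.0) < 1e-9
        for s, a in itertools.product(states, actions)
    )

assert is_well_formed(STATES, ACTIONS)
```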

In most practical applications of RL, the environment is represented by a simulator: a complex piece of code that is not differentiable or analyzable and that internally keeps track of the state of the system. For example, if we want to train an agent to play pong, the state would be the positions of the ball and the paddles. The actions would correspond to the commands to move the paddle up or down, and the transition function would be the simulator that determines the next state from the current state and the action. The uncertainty in pong comes from the adversary, which may behave non-deterministically.
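In code, such an environment is usually exposed through a small step-based interface rather than an explicit transition table. The sketch below shows one plausible shape for such an interface; the class, its fields, and the dynamics are hypothetical stand-ins, not a real pong implementation.

```python
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class PongState:
    """Hypothetical observable state: ball position and the two paddle heights."""
    ball_x: float
    ball_y: float
    my_paddle: float
    their_paddle: float

class PongSimulator:
    """A stand-in for a black-box simulator; its internals are not differentiable."""

    def reset(self) -> PongState:
        return PongState(0.5, 0.5, 0.5, 0.5)

    def step(self, state: PongState, action: str) -> PongState:
        # Move our paddle deterministically; the opponent moves randomly,
        # which is where the non-determinism of the environment comes from.
        delta = {"up": 0.05, "down": -0.05, "stay": 0.0}[action]
        return PongState(
            ball_x=state.ball_x,   # ball dynamics omitted in this sketch
            ball_y=state.ball_y,
            my_paddle=min(1.0, max(0.0, state.my_paddle + delta)),
            their_paddle=min(1.0, max(0.0, state.their_paddle
                                      + random.choice([-0.05, 0.0, 0.05]))),
        )
```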

The goal in RL is to learn a policy $\pi: \mathcal{S} \rightarrow \mathcal{A}$ that for each state determines an action for the agent, and which maximizes a reward $R: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$ that assigns a score to each state/action pair. In most interesting applications of RL, the reward is sparse: only a handful of state/action pairs have non-zero rewards. For example, in pong, you may get a reward every time your paddle hits the ball. In some games, the reward may be zero until you reach the end of the game; only at that point do you get a positive or negative reward depending on whether you won or lost.

In learning a policy, it is often useful to compute a value function. A value function $V_\pi(s)$ computes the expected reward if you follow a policy $\pi$ starting from state $s$. For an infinite-horizon game (a game that you expect to play forever, or at least for a very long time) it is common to compute a discounted reward. This means that a reward gets multiplied by a factor $\gamma < 1$ after every timestep, so rewards far into the future are less valuable than immediate rewards. The value function then satisfies the following recurrence relation: \[ V_\pi(s) = R(s, \pi(s)) + \gamma * \left( \sum_{s'} T(s, \pi(s), s')* V_\pi(s') \right) \] That is, the value at state $s$ is the reward for the action we take at state $s$ plus the discounted expected value of the states we transition to from $s$.
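One standard way to compute $V_\pi$ from this recurrence is iterative policy evaluation: start with $V = 0$ everywhere and repeatedly apply the right-hand side until the values stop changing. Here is a minimal sketch, assuming the tabular representation from the earlier snippet (a `transition_prob` function, a `reward` function, and a `policy` dictionary are passed in):

```python
def evaluate_policy(states, transition_prob, reward, policy, gamma=0.9, iters=1000):
    """Iteratively apply the Bellman recurrence for V_pi until it (approximately) converges."""
    V = {s: 0.0 for s in states}
    for _ in range(iters):
        V = {
            s: reward(s, policy[s])
               + gamma * sum(transition_prob(s, policy[s], sn) * V[sn] for sn in states)
            for s in states
        }
    return V

# Example usage with the toy MDP above: reward 1 for acting in s2, 0 elsewhere.
# V = evaluate_policy(STATES, transition_prob,
#                     lambda s, a: 1.0 if s == "s2" else 0.0,
#                     {"s0": "right", "s1": "right", "s2": "left"})
```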

Another useful function when learning a policy is the action-value function, also often referred to as the Q function, $Q_\pi: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$. The function $Q_\pi(s, a)$ computes the expected reward if, starting from state $s$, you first take action $a$ and subsequently just follow the policy $\pi$. \[ Q_\pi(s, a) = R(s, a) + \gamma * \left( \sum_{s'} T(s, a, s')* V_\pi(s') \right) \]
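Once $V_\pi$ has been computed, this definition translates directly into code; a small sketch in the same tabular setting as above:

```python
def q_value(s, a, V, transition_prob, reward, states, gamma=0.9):
    """Q_pi(s, a): take action a once from s, then follow the policy whose value function is V."""
    return reward(s, a) + gamma * sum(transition_prob(s, a, sn) * V[sn] for sn in states)
```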

When $T$ is deterministic, we can use the shorthand $T(s,a)$ to refer to the state to which the agent transitions from $s$ on action $a$ with probability one. In that case, $V$ and $Q$ simplify to \[ V_\pi(s) = R(s, \pi(s)) + \gamma * V_\pi(T(s, \pi(s))) \\ Q_\pi(s, a) = R(s, a) + \gamma * V_\pi(T(s, a)) \] There are different flavors of reinforcement learning, but generally the goal is to learn the policy $\pi$ and the value function $V_\pi$ in order to effectively control the agent and guide it to a good reward.
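In the deterministic case, $V_\pi(s)$ can also be estimated by simply rolling the policy forward from $s$ and summing discounted rewards, as in this sketch, where `step` plays the role of the deterministic $T(s, a)$ and the rollout is truncated at a finite horizon:

```python
def rollout_value(s, policy, step, reward, gamma=0.9, horizon=100):
    """Estimate V_pi(s) in a deterministic environment by rolling the policy forward."""
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        a = policy(s)
        total += discount * reward(s, a)
        discount *= gamma
        s = step(s, a)   # deterministic shorthand T(s, a)
    return total
```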

Basic Monte Carlo Tree Search (MCTS) and AlphaZero

One of the best known instances of deep reinforcement learning is AlphaZero, which builds on an algorithm known as Monte Carlo Tree Search (MCTS). The goal of AlphaZero is to find a probabilistic policy $P: \mathcal{S} \rightarrow \mathcal{A} \rightarrow \mathbb{R}$ that for every state $s$ and every action $a$ produces a value between zero and one representing the probability of taking action $a$ at state $s$. So, for example, we can compute a policy $\pi$ that always takes the highest-probability action as $\pi(s)=argmax_{a} P_\theta(s,a)$, although as we shall see, the computation of the actual actions to take at a given state will be a little more involved. Additionally, the algorithm will compute a value function $V_\pi(s)$ that estimates the reward that the policy will achieve starting at state $s$.

The algorithm is based on deep learning, so the probabilistic policy and value functions are represented by parametric functions $P_\theta$ and $V_\theta$. In practice, $P$ and $V$ are represented by a single neural network and $\theta$ is a very large collection of parameters that controls both functions, but that is a low-level implementation detail. The goal of the learning algorithm is to discover a good set of parameters $\theta$ such that the policy achieves high rewards and the value function $V_\theta$ accurately predicts the reward of following the policy for the remainder of the game starting at state $s$.

The key building block in the algorithm is MCTS. MCTS performs some local exploration in the neighborhood of a state to discover improved estimates of $P$ and $V$, which can then be used to adjust $\theta$. The basics of MCTS are illustrated in the figure. In the algorithm, you start at a state $s_0$ and iteratively search the space guided by an approximation of the $Q$ function built on the fly. The algorithm maintains a map representing the Q function $Q(s,a)$ and a counter $N(s,a)$ that tracks how many times action $a$ has been taken from state $s$. We use the shorthand $N(s)$ to refer to the total number of times state $s$ has been visited: $N(s) = \sum_a N(s,a)$.
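These search statistics can be kept in simple dictionaries keyed by (state, action) pairs; a minimal sketch of the bookkeeping (states are assumed to be hashable):

```python
from collections import defaultdict

Q = defaultdict(float)   # Q[(s, a)]: running estimate of the action value
N = defaultdict(int)     # N[(s, a)]: number of times action a was taken from s

def visits(s, actions):
    """N(s): total number of times state s has been visited."""
    return sum(N[(s, a)] for a in actions)
```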

The algorithm is traditionally explained in terms of four phases:

Selection. Traverses a path through the visited nodes in search of a node at the boundary between visited and unvisited nodes. In searching for this path, the algorithm uses the following recursive formula: \[ search(s) = \begin{cases} search(T(s,action))~\mbox{where}~action = argmax_a \left( Q(s,a) + C*P(s,a)*\frac{\sqrt{N(s)}}{1+N(s,a)} \right) & \mbox{if all children of $s$ have been visited} \\ \\ s & \mbox{if $s$ has unvisited children} \end{cases} \] In other words, when taking an action from a state whose children have all been visited, the algorithm maximizes a quantity that considers both the $Q$ value of the proposed action and a measure of how desirable the current policy regards action $a$, adjusted by a measure of how many times that action has been explored relative to the other actions available at this level. The constant $C$ gives more or less weight to the second component relative to the first.

Expansion. Once selection reaches the boundary between visited and unvisited nodes, a new node that has not been visited before is selected, and its $Q$ and $N$ entries are initialized.

Simulation. In this step, the system computes an estimate of the value of the newly expanded node. This can be done in many different ways, but in the case of AlphaGo/AlphaZero, the value is read directly from the latest version of the value function, unless the end of the game has been reached, in which case the game is simply scored explicitly.

Backpropagation. Once a value has been computed for the expanded node, this value can be used to update the estimate of $Q$ for every state/action pair on the path from the origin to the expanded node. For example, if values are not discounted over time (i.e. $\gamma=1$), this just means updating $Q(s,a)$ by adjusting its old value to $(Q(s,a)*N(s,a) + V)/(N(s,a)+1)$ and incrementing $N(s,a)$. The sketch below puts the four phases together.
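The following is a minimal sketch of a single MCTS iteration for a deterministic environment with $\gamma = 1$. The hooks `step`, `legal_actions`, `is_terminal`, `score`, `prior`, and `value` are hypothetical stand-ins for the simulator and for the networks $P_\theta$ and $V_\theta$; for self-containedness the class keeps its own copies of the $Q$ and $N$ tables.

```python
import math
from collections import defaultdict

class MCTS:
    """One possible implementation of the four phases above, for a deterministic
    environment. All environment and network hooks are passed in as plain functions."""

    def __init__(self, step, legal_actions, is_terminal, score, prior, value, C=1.0):
        self.step = step                  # deterministic transition: T(s, a) -> s'
        self.legal_actions = legal_actions
        self.is_terminal = is_terminal    # has the game ended at s?
        self.score = score                # explicit score of a finished game
        self.prior = prior                # P_theta(s, a): the policy network
        self.value = value                # V_theta(s): the value network
        self.C = C
        self.Q = defaultdict(float)       # Q[(s, a)]
        self.N = defaultdict(int)         # N[(s, a)]

    def visits(self, s):
        return sum(self.N[(s, a)] for a in self.legal_actions(s))

    def fully_expanded(self, s):
        return all(self.N[(s, a)] > 0 for a in self.legal_actions(s))

    def iteration(self, s0):
        # Selection: walk down fully visited nodes, maximizing the score from above.
        path, s = [], s0
        while not self.is_terminal(s) and self.fully_expanded(s):
            a = max(
                self.legal_actions(s),
                key=lambda a: self.Q[(s, a)]
                + self.C * self.prior(s, a) * math.sqrt(self.visits(s)) / (1 + self.N[(s, a)]),
            )
            path.append((s, a))
            s = self.step(s, a)

        # Expansion: take one previously unvisited action from the frontier node.
        if not self.is_terminal(s):
            a = next(a for a in self.legal_actions(s) if self.N[(s, a)] == 0)
            path.append((s, a))
            s = self.step(s, a)

        # Simulation: read the value from the value network, or score a finished game.
        v = self.score(s) if self.is_terminal(s) else self.value(s)

        # Backpropagation: fold v into the running averages along the path.
        for (s, a) in path:
            self.Q[(s, a)] = (self.Q[(s, a)] * self.N[(s, a)] + v) / (self.N[(s, a)] + 1)
            self.N[(s, a)] += 1
        return v
```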

These steps are repeated a few hundred times. After this is done enough times, the function $Q$ can be used to provide an improved estimate of both the value function and the policy for the nodes visited during the search. The parameters of $P$ and $V$ can then be adjusted to bring them closer to these improved estimates. By iteratively repeating this process of running MCTS and then adjusting the parameters to more closely match the $P$ and $V$ suggested by the $Q$ function, the algorithm converges to a good policy and value function.
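For example, one common way to read an improved policy off the search statistics (and the one used in AlphaZero) is to make the new policy proportional to exponentiated visit counts; following the description above, the value target can be taken as a visit-weighted average of the $Q$ estimates. The sketch below assumes the `MCTS` class from the previous snippet and a state that has already been visited; the training loss that moves $\theta$ toward these targets (typically cross-entropy on the policy plus squared error on the value) is not shown.

```python
def improved_policy_target(mcts, s, temperature=1.0):
    """New policy target at s, proportional to N(s, a)^(1/temperature)."""
    counts = {a: mcts.N[(s, a)] ** (1.0 / temperature) for a in mcts.legal_actions(s)}
    total = sum(counts.values())
    return {a: c / total for a, c in counts.items()}

def improved_value_target(mcts, s):
    """New value estimate at s: visit-weighted average of the Q values."""
    return sum(mcts.N[(s, a)] * mcts.Q[(s, a)]
               for a in mcts.legal_actions(s)) / mcts.visits(s)
```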

Reinforcement learning for Program Synthesis

The simplest way to formulate program synthesis as a reinforcement learning problem is to define the state space $\mathcal{S}$ as the set of partial programs, with actions corresponding to growing the program. For example, in the context of top-down search, an action corresponds to selecting a hole in the program and expanding it using one of the available rules. For languages where programs correspond to linear sequences of instructions, an action can simply correspond to appending an instruction at the end of the partially constructed program. In this context, the reward is sparse: all states corresponding to partial programs have a reward of zero, while states corresponding to completed programs have positive or negative rewards depending on whether the program is correct or incorrect.
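As a concrete illustration, here is a minimal sketch of such an environment for a toy linear language, where the state is the list of instructions appended so far and the reward only appears once the program is complete; the instruction set, the interpreter, and the length bound are all invented for the example.

```python
from typing import List, Tuple

INSTRUCTIONS = ["inc", "dec", "double"]   # toy instruction set (illustrative)
MAX_LEN = 5                               # episode ends when the program reaches this length

def run(program: List[str], x: int) -> int:
    """A toy interpreter for the linear language."""
    for instr in program:
        x = {"inc": x + 1, "dec": x - 1, "double": 2 * x}[instr]
    return x

def step(state: List[str], action: str) -> Tuple[List[str], bool]:
    """Append one instruction to the partial program; done once MAX_LEN is reached."""
    new_state = state + [action]
    return new_state, len(new_state) == MAX_LEN

def reward(program: List[str], examples: List[Tuple[int, int]], done: bool) -> float:
    """Sparse reward: zero for partial programs, +1/-1 once the program is complete."""
    if not done:
        return 0.0
    return 1.0 if all(run(program, x) == y for x, y in examples) else -1.0

# Example specification as input/output pairs:
# reward(["inc", "double"], [(1, 4), (2, 6)], done=True) == 1.0
```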

The first paper to take this view was written by Bunel, Hausknecht, Devlin, Singh and Kohli [BunelHDSK18]. Their result was improved by Chen, Liu and Song [ChenLS19], who observed that by incorporating the state of the execution as part of the state, you could leverage the program interpreter to give the policy a better sense of what the program so far can actually compute.

In this lecture, we focus on the 2019 paper by Ellis, Nye, Pu, Sosa, Tenenbaum and Solar-Lezama [EllisNPSTS19], which was the first to directly apply the ideas of AlphaGo to program synthesis. Similar to Chen et al., the paper also uses the program state as part of the state representation, but it goes one step further by not including the program text at all. The intuition for this is the same as the intuition behind observational equivalence from Lecture 3: programs that produce the same outputs on the given inputs do not have to be distinguished, and collapsing them into an equivalence class helps with symmetries. Thus, in this formulation, the state $s$ is the state computed by the program so far, and an action is an instruction that transforms this state into a new state. Like AlphaGo, the goal is to learn both a policy and a value function, but the approach is simpler than AlphaGo in that it uses something simpler than MCTS to compute the updates to the policy and value functions.

Instead, the paper first uses imitation learning to pre-train an initial policy, and then uses reinforcement learning to improve it: it computes rollouts of the policy and adjusts the policy and value functions based on the reward obtained by each rollout. The resulting policy and value functions can then be used to search the space more efficiently.
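The lecture does not spell out the exact update rule, but a standard way to turn a rollout and its reward into updates for a policy and a value network is a REINFORCE-style policy gradient with the value function as a baseline. The sketch below shows the shape of such an update in PyTorch; the `policy_net`, `value_net`, and `env` objects and their interfaces are assumptions for illustration, not the paper's actual API.

```python
import torch

def rollout_update(policy_net, value_net, optimizer, env, max_steps=20, gamma=1.0):
    """One REINFORCE-with-baseline update from a single rollout
    (a generic sketch, not the exact procedure from the paper)."""
    state = env.reset()
    log_probs, values, rewards = [], [], []
    for _ in range(max_steps):
        logits = policy_net(state)                        # scores over actions
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        values.append(value_net(state))
        state, r, done = env.step(action.item())          # assumed env interface
        rewards.append(r)
        if done:
            break

    # Discounted return-to-go for every step of the rollout.
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns = torch.tensor(list(reversed(returns)))

    values = torch.stack(values).squeeze(-1)
    log_probs = torch.stack(log_probs)
    advantage = returns - values.detach()                 # value function as baseline
    loss = -(log_probs * advantage).mean() + (returns - values).pow(2).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```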