Introduction to Program Synthesis

© Armando Solar-Lezama. 2022. All rights reserved. © Theo X. Olausson. 2025. All rights reserved.


Lecture 14: Program Synthesis with Reinforcement Learning

In the preceding lectures, we have seen how to use language models to produce programs from specifications. We didn't talk much about how these models are trained in the first place, but that was simply because the training process is not itself that interesting: we essentially have an (incredibly vast) corpus of text, i.e. strings $t_1 \ldots t_n$, and we use supervised learning to maximize $p_\theta(t_i | t_1 \ldots t_{i-1})$ for all $i$. The fact that training the model in this simplistic fashion on large mixes of natural language and programs elicits a general capacity to turn textual specifications into programs is remarkable, but unfortunately it is not the focus of this course.

There is however another way to train a model to produce programs, and that is by using reinforcement learning (RL). Unlike the supervised learning approach, RL does not require a corpus of programs to train on. Instead, the idea is that the model has access to an environment that produces rewards or penalties for every action that the model takes. This allows for a form of self-training, where the model discovers the best (i.e., most rewarding) actions to take in different situations.

A reinforcement learning primer

In reinforcement learning, the goal is to learn how to control an agent interacting with a possibly non-deterministic environment. The environment is defined by a state space $\mathcal{S}$, an action space $\mathcal{A}$, and a transition function $T : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \rightarrow \mathbb{R}$. Given a source state $s$, an action $a$, and a target state $s'$, the value $T(s, a, s')$ is the probability that if the agent is in state $s$ and performs action $a$, it will transition to state $s'$. In order for $T$ to be well formed, it must be the case that for every state and every action, the sum of $T$ over all target states adds up to 1. \[ \forall s. \forall a. \left(\sum_{s'} T(s, a, s') \right) = 1 \]
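To make the definition concrete, here is a minimal Python sketch (not from the lecture) of a finite transition function represented as nested dictionaries, together with the well-formedness check; the state and action names are made up purely for illustration.

```python
# A minimal sketch of a finite MDP's transition function T(s, a, s'),
# represented as nested dictionaries.  States and actions are hypothetical.
T = {
    ("s0", "left"):  {"s0": 0.2, "s1": 0.8},
    ("s0", "right"): {"s2": 1.0},
    ("s1", "left"):  {"s0": 1.0},
    ("s1", "right"): {"s1": 0.5, "s2": 0.5},
}

def is_well_formed(T, tol=1e-9):
    """Check that for every (state, action) the outgoing probabilities sum to 1."""
    return all(abs(sum(dist.values()) - 1.0) < tol for dist in T.values())

assert is_well_formed(T)
```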

In most practical applications of RL, the environment is represented with a simulator—a complex piece of code that is not differentiable or analyzable and that internally keeps track of the state of the system. For example, if we want to train an agent to play pong, the state would be the position of the ball and the paddles. The actions would correspond to the commands to move the paddle up or down, and the transition function would be the simulator that determines the next state from the current state and the action. The uncertainty in pong comes from the adversary, which may behave non-deterministically.

The goal in RL is to learn a policy $\pi: \mathcal{S} \rightarrow \mathcal{A}$ that for each state determines an action for the agent, and which maximizes a reward $R: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$ that assigns a score to each state/action pair. In most interesting applications of RL, the reward is sparse: only a handful of state/action pairs have non-zero rewards. For example, in pong, you may get a reward every time your paddle hits the ball. In some games, the reward may be zero until you reach the end of the game—only at that point do you get a positive or negative reward depending on whether you won or lost.

In learning a policy, it is often useful to compute a value function. A value function $V_\pi(s)$ computes the expected reward if you follow a policy $\pi$ starting from state $s$. For an infinite horizon game—a game that you expect to play forever, or at least for a very long time—it is common to compute a discounted reward. This means that a reward gets multiplied by a factor $\gamma < 1$ after every timestep, so rewards far into the future are worth less than immediate rewards. The value function will then satisfy the following recurrence relation: \[ V_\pi(s) = R(s, \pi(s)) + \gamma * \left( \sum_{s'} T(s, \pi(s), s')* V_\pi(s') \right) \] That is, the value at state $s$ is the reward for the action we take at $s$ plus the discounted expected value of the states we transition to from $s$.

Another useful function when learning a policy is the action value function, also often referred to as the Q function, $Q_\pi: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$. The function $Q_\pi(s, a)$ computes the expected reward if, starting from state $s$, you first take action $a$ and subsequently just follow the policy $\pi$. \[ Q_\pi(s, a) = R(s, a) + \gamma * \left( \sum_{s'} T(s, a, s')* V_\pi(s') \right) \]

When $T$ is deterministic, we can use the shorthand $T(s,a)$ to refer to the state to which the environment transitions from $s$ on action $a$ with probability one. In that case, $V$ and $Q$ simplify to \[ V_\pi(s) = R(s, \pi(s)) + \gamma * V_\pi(T(s, \pi(s))) \\ Q_\pi(s, a) = R(s, a) + \gamma * V_\pi(T(s, a)) \] There are different flavors of reinforcement learning, but generally the goal is to learn the policy $\pi$ and the value function $V_\pi$ in order to effectively control the agent and guide it to a good reward.
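As a concrete illustration of these recurrences, the following sketch evaluates $V_\pi$ and $Q_\pi$ for a tiny, made-up MDP by iterating the equations above until they approximately reach a fixed point; all of the states, rewards, and the policy here are hypothetical.

```python
# A small, self-contained sketch of policy evaluation: iterating the
# recurrences for V_pi and Q_pi.  The toy MDP, rewards, and policy are made up.
GAMMA = 0.9

# T[(s, a)] maps successor states to probabilities; R[(s, a)] is the reward.
T = {
    ("s0", "go"):   {"s1": 1.0},
    ("s0", "stay"): {"s0": 1.0},
    ("s1", "go"):   {"s0": 0.3, "s1": 0.7},
    ("s1", "stay"): {"s1": 1.0},
}
R = {("s0", "go"): 0.0, ("s0", "stay"): 0.0, ("s1", "go"): 1.0, ("s1", "stay"): 0.5}
pi = {"s0": "go", "s1": "go"}          # a fixed deterministic policy

V = {s: 0.0 for s in pi}               # start from the all-zero value function
for _ in range(1000):                  # iterate the recurrence towards a fixed point
    V = {s: R[(s, pi[s])] + GAMMA * sum(p * V[s2] for s2, p in T[(s, pi[s])].items())
         for s in pi}

def Q(s, a):
    """Expected return of taking action a at s, then following pi."""
    return R[(s, a)] + GAMMA * sum(p * V[s2] for s2, p in T[(s, a)].items())
```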

Basic Monte Carlo Tree Search (MCTS) and AlphaZero

One of the best known instances of deep reinforcement learning is AlphaZero [AlphaZero], which builds on an algorithm known as Monte Carlo Tree Search (MCTS) [MCTS]. The goal of AlphaZero is to find a probabilistic policy $P: \mathcal{S} \rightarrow \mathcal{A} \rightarrow \mathbb{R}$ that for every state $s$ and every action $a$ produces a value between zero and one representing the probability of taking action $a$ at state $s$. So, for example, we can compute a policy $\pi$ that always takes the highest probability action as $\pi(s)=argmax_{a} P_\theta(s,a)$, although as we shall see, the computation of the actual actions to take at a given state will be a little more involved. Additionally, the algorithm will compute a value function $V_\pi(s)$ that estimates the reward that the policy will achieve starting at state $s$.

The algorithm is based on deep learning, so the probabilistic policy and value functions will be represented by parametric functions $P_\theta$ and $V_\theta$. In practice, $P$ and $V$ are represented by a single neural network and $\theta$ is a very large collection of parameters that controls both functions, but that is a low-level implementation detail. The goal of the learning algorithm is to discover a good parameter $\theta$ such that the policy achieves high rewards and the value function $V_\theta$ accurately predicts the reward of following the policy for the remainder of the game starting at state $s$.

The key building block in the algorithm is MCTS. MCTS performs some local exploration in the neighborhood of a state to discover improved estimates of $P$ and $V$, which can then be used to adjust $\theta$. The basics of MCTS are illustrated in the figure. In the algorithm, you start at a state $s_0$ and iteratively search the space guided by an approximation of the $Q$ function built on the fly. The algorithm maintains a map representing the Q function $Q(s,a)$ and a counter $N(s,a)$ that tracks how many times action $a$ has been taken from state $s$. We use the shorthand $N(s)$ to refer to the total number of times state $s$ has been visited $N(s) = \sum_a N(s,a)$.

The algorithm is traditionally explained in terms of four phases:

Selection. Traverses a path through the visited nodes in search of a node at the boundary between visited and not-visited nodes. In searching for this path, the algorithm uses the following recursive formula: \[ search (s) = \begin{cases} search(T(s,action))~\mbox{where}~action = argmax_a \left( Q(s,a) + C*P(s,a)*\frac{\sqrt{N(s)}}{1+N(s,a)} \right) \\ \\ s ~~\mbox{if s has unvisited children} \\ \end{cases} \] In other words, at each state whose children have all been visited, the algorithm takes the action that maximizes a quantity combining the $Q$ value of the proposed action with a measure of how desirable the current policy regards action $a$, adjusted by how many times that action has been explored relative to the other actions available at this level. The constant $C$ gives more or less weight to the second component relative to the first.

Expansion. Once selection reaches a boundary between visited and non-visited nodes, a new node is selected that has not been visited before and its $Q$ and $N$ measures are initialized.

Simulation. In this step, the system computes an estimate of the value of the currently expanded node. This can be done in many different ways, but in the case of AlphaGo/AlphaZero, the value is read directly from the current version of the value function, unless the end of the game has been reached, in which case the game is just scored explicitly.

Backpropagation. Once a value has been computed for the node, this value can be used to update the estimate of $Q$ for each node in the path from the origin to the expanded node. For example, if values are not discounted over time (i.e. $\gamma=1$), then this just means updating $Q(s,a)$ to $(Q(s,a)*N(s,a) + V)/(N(s,a)+1)$ and incrementing the counter $N(s,a)$.

These steps are repeated a few hundred times. After that, the function $Q$ can be used to provide an improved estimate of both the value function and the policy for the nodes visited during search. The parameters for $P$ and $V$ can then be adjusted to bring them closer to these improved estimates. By iteratively repeating this process of running MCTS and then adjusting the parameters to more closely match the $P$ and $V$ suggested by the $Q$ function, the algorithm converges to a good policy and value function.
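To make the four phases concrete, here is a compact Python sketch of the loop. It is a simplification rather than AlphaZero's actual implementation: it tracks visited states instead of an explicit tree, assumes $\gamma = 1$, and takes the environment helpers `actions`, `step`, `is_terminal`, `score` and the network functions `policy_prior` ($P$) and `value_estimate` ($V$) as hypothetical arguments.

```python
# A compact sketch of the MCTS loop described above (not AlphaZero's actual code).
import math
from collections import defaultdict

def mcts(root, actions, step, is_terminal, score, policy_prior, value_estimate,
         num_simulations=200, c=1.0):
    Q = defaultdict(float)   # Q[(s, a)]: running estimate of the action value
    N = defaultdict(int)     # N[(s, a)]: visit count for (s, a)
    visited = set()

    def N_total(s):
        return sum(N[(s, a)] for a in actions(s))

    for _ in range(num_simulations):
        s, path = root, []
        # Selection: walk down visited states, maximizing Q plus an exploration bonus.
        while s in visited and not is_terminal(s):
            a = max(actions(s), key=lambda a: Q[(s, a)] +
                    c * policy_prior(s, a) * math.sqrt(N_total(s)) / (1 + N[(s, a)]))
            path.append((s, a))
            s = step(s, a)
        # Expansion + simulation: mark the new state as visited and estimate its
        # value, either from the value function or by scoring a finished game.
        visited.add(s)
        v = score(s) if is_terminal(s) else value_estimate(s)
        # Backpropagation: fold the value into Q along the path (gamma = 1).
        for (s_i, a_i) in path:
            Q[(s_i, a_i)] = (Q[(s_i, a_i)] * N[(s_i, a_i)] + v) / (N[(s_i, a_i)] + 1)
            N[(s_i, a_i)] += 1

    return Q, N
```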

Reinforcement learning for Program Synthesis

The simplest way to formulate program synthesis as a reinforcement learning problem is to define the state space $\mathcal{S}$ as the set of partial programs, with actions corresponding to growing the program. For example, in the context of top-down search, an action corresponds to selecting a hole in the program and expanding it using one of the available rules. For languages where programs correspond to linear sequences of instructions, an action can simply correspond to appending an instruction at the end of the partially constructed program. In this context, the reward is sparse: all states corresponding to partial programs have a reward of zero, and states corresponding to completed programs have positive or negative rewards depending on whether the program is correct or incorrect.
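For the linear-sequence case, a minimal sketch of this formulation might look as follows; the instruction set, the length bound, and the `run_program` helper are purely illustrative placeholders, not taken from any of the papers discussed here.

```python
# A sketch of the straight-line-program formulation: the state is the partial
# program (a tuple of instructions), an action appends an instruction, and the
# reward is zero until the program is complete.
INSTRUCTIONS = ["inc", "dec", "dup", "swap"]   # hypothetical instruction set
MAX_LEN = 5

def actions(state):
    return INSTRUCTIONS if len(state) < MAX_LEN else []

def step(state, instruction):
    return state + (instruction,)

def reward(state, tests, run_program):
    if len(state) < MAX_LEN:
        return 0.0                              # sparse: no reward for partial programs
    ok = all(run_program(state, inp) == out for inp, out in tests)
    return 1.0 if ok else -1.0
```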

The first paper to take this view was written by Bunel, Hausknecht, Devlin, Singh and Kohli [BunelHDSK18]. Their result was improved by Chen, Liu and Song [ChenLS19], who observed that by incorporating the state of the execution as part of the state, you could leverage the program interpreter to give the policy a better sense of what the program so far can actually compute.

Here, we focus on the 2019 paper by Ellis, Nye, Pu, Sosa, Tenenbaum and Solar-Lezama [EllisNPSTS19], which was the first to directly apply the ideas of AlphaGo to program synthesis. Similar to Chen et al., the paper also uses the program state as part of the state representation, but it goes one step further by not including the program text at all. The intuition for this is the same as the intuition behind observational equivalence from Lecture 3—programs that produce the same outputs on the given inputs do not have to be distinguished, and collapsing them into an equivalence class helps with symmetries. Thus, in this formulation, the state $s$ is the state computed by the program so far, and an action is an instruction that transforms this state to a new state. Like AlphaGo, the goal is to learn both a policy and a value function. The approach is simpler than AlphaGo, however, in that it uses a lighter-weight procedure than MCTS to compute the updates to the policy and value functions.

The paper first uses imitation learning to obtain an initial policy, and then uses reinforcement learning to improve it by computing rollouts of the policy and adjusting the policy and value functions based on the reward computed for each rollout. The resulting policy and value functions can then be used to search the space more efficiently.
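The sketch below illustrates the general "roll out, score, adjust" pattern with a REINFORCE-style update and a value-function baseline; it is not the paper's exact training procedure, and `policy_net`, `value_net`, and the `env_*` hooks are hypothetical placeholders for the network and the execution environment.

```python
# A generic REINFORCE-style sketch of rollout-based policy improvement.
import torch

def rl_step(policy_net, value_net, optimizer, env_reset, env_step, env_reward,
            max_steps=20):
    state = env_reset()
    log_probs, states = [], []
    for _ in range(max_steps):                      # roll out the current policy
        logits = policy_net(state)
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        states.append(state)
        state, done = env_step(state, action.item())
        if done:
            break
    R = env_reward(state)                           # sparse reward at the end
    # Policy loss: push up the log-probability of the rollout, weighted by how
    # much better the reward was than the value function's prediction.
    baselines = torch.stack([value_net(s) for s in states]).squeeze(-1)
    advantage = R - baselines
    policy_loss = -(advantage.detach() * torch.stack(log_probs)).sum()
    value_loss = (advantage ** 2).sum()             # fit V to the observed reward
    optimizer.zero_grad()
    (policy_loss + value_loss).backward()
    optimizer.step()
```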

Reinforcement Learning for LLM Post-Training

In the above discussion, we have focused on how to use reinforcement learning to train a program synthesis model from scratch on a specific task. You may wonder how this relates to the use of LLMs for program synthesis, which has been the focus of the preceding lectures. The truth is that for a long time, little success was had in marrying these two techniques. In just the last year or so, however, scale (and increasingly capable base models) has suddenly made the combination viable.

You may have heard of "thinking models" like OpenAI's o-series models, Google's recent Gemini 2.5 Pro, or DeepSeek-R1. These models are all believed to have been trained using essentially the same, simple formulation of reinforcement learning. The state space $\mathcal{S}$ is the set of all strings of tokens of finite length; formally, if $\mathcal{V}$ is the vocabulary of the model, then $\mathcal{S} = \mathcal{V}^*$. The action space $\mathcal{A}$ is simply the set of all tokens in the vocabulary $\mathcal{V}$. The transition function $T$ is defined as $T(s, a, s') = 1$ if $s'$ is the result of appending token $a$ to string $s$, and zero otherwise; more legibly, the transition function deterministically appends the chosen token (action) to the current string (the state). The models only differ substantially in how they define the reward function $R$, and indeed different tasks call for different reward functions. In many NLP tasks, the reward function has to somehow mimic human preferences, which is not an easy feat to accomplish. Fortunately, as far as synthesis is concerned, the reward function can in many cases be objectively defined as the correctness of the program, as measured against some unit tests or a set of pre- and post-conditions.
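As a rough illustration of how objective such a reward can be, here is a sketch of the token-level transition function and a test-based reward; `is_finished`, `extract_program`, and `run_tests` are hypothetical helpers, and a real system would also need sandboxing and timeouts.

```python
# A sketch of the token-level MDP and a test-based reward for program synthesis.
def transition(state, token):
    """Deterministic transition: append the chosen token to the string so far."""
    return state + (token,)

def reward(state, is_finished, extract_program, run_tests, tests):
    if not is_finished(state):
        return 0.0                       # no signal until generation ends
    program = extract_program(state)     # e.g. pull the code block out of the text
    return 1.0 if run_tests(program, tests) else 0.0
```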

It is tempting to look at the above formulation and say that it is obvious; perhaps you too found yourself surprised that this has only become a popular paradigm in the last year or so. However, it is worth appreciating how remarkable it is that this simple formulation can be used to fit models with hundreds of billions of parameters, acting in state spaces so large that they are impossible to enumerate, using reward functions that are so sparse they may only have a non-zero value for a single state. Meta's Llama-3 has a vocabulary of roughly $128{,}000$ tokens; even limiting the state space to strings of length 100, we have a state space of size $128{,}000^{100} \approx 10^{511}$, which dwarfs the roughly $10^{80}$ atoms in the observable universe. How many of these $128{,}000^{100}$ strings are actually valid programs? How many of those programs implement the exact specification you requested?
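If you want to check the back-of-the-envelope estimate above yourself, a couple of lines of Python suffice:

```python
# Quick check of the state-space size quoted above.
import math
log10_states = 100 * math.log10(128_000)   # log10(128,000^100)
print(round(log10_states))                 # ~511, i.e. about 10^511 states
```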

Getting this to work is a remarkable feat of engineering and a testament to the power of scale, but more than anything else, it requires starting with a model that is already very, very capable of understanding language and programs.

Interested readers are encouraged to read [rlhf2024] and [murphy2025reinforcementlearningoverview] for more details on this quickly evolving field.