Changelog:
- 6/12/25: Fixed typos throughout.
- 6/12/25: Added section on SMC.
Lecture 5: Inductive Synthesis with Stochastic Search
Back in 2013, a group at Stanford started publishing a series of papers showing the potential of stochastic search techniques to attack complex synthesis problems that were beyond the capabilities of other synthesis methods. The first of these papers, published in ASPLOS 2013 by Eric Schkufza, Rahul Sharma and Alex Aiken [Schkufza0A13], demonstrated the use of stochastic search for superoptimization, where the goal is to find a sequence of x86 instructions that is close to optimal for a particular task. The basis for much of this work is a stochastic search technique known as Markov Chain Monte Carlo (MCMC). This technique also forms the basis of the search algorithms used by many probabilistic programming languages such as Church. Back in 2009, Persi Diaconis wrote a very nice tutorial on MCMC that provides a good introduction to the area [Diaconis09]. In this lecture we will cover some of the highlights, starting with some notation for Markov chains before moving on to their application in program synthesis. Towards the end of the lecture, we will also briefly introduce a related technique called Sequential Monte Carlo (SMC). SMC has not seen much use in inductive synthesis per se, but it has recently become relevant in the context of large language models; we will see more of this in Unit 2.
Markov Chains
Stationary distributions
Search with Metropolis-Hastings
The fundamental theorem of Markov chains suggests an approach for synthesizing programs. Let $\chi$ be the space of programs. We are going to engineer a transition matrix $K(x,y)$ whose stationary distribution $\pi(x)$ is high for “good programs” and low for “bad programs”. Then we can pick a random start state $x_0$ and simulate the Markov process for $n$ steps for some large $n$. By the fundamental theorem, the probability that $x_n$ is a good program will be higher than the probability that it is a bad program. The key for this strategy to work, then, is to engineer a transition matrix $K$ that has the desired $\pi(x)$ as its stationary distribution.
The most popular way of doing this is the Metropolis-Hastings algorithm (MH), first presented in a completely different context by Metropolis, Rosenbluth, Rosenbluth, Teller and Teller [Metropolis53] and later generalized by Wilfred Hastings. The starting point for the Metropolis-Hastings approach is a Markov matrix $J(x,y)$ known as a proposal distribution. The only formal requirement on the proposal distribution is that $J(x,y)>0$ iff $J(y,x)>0$, although in practice the specific form of the proposal distribution can have a big impact on how fast the algorithm converges. Starting with the proposal distribution $J(x,y)$ and the desired stationary distribution $\pi(x)$, the approach defines a quantity $A(x,y)=\frac{\pi(y)*J(y,x)}{\pi(x)*J(x,y)}$ called the acceptance ratio, and from that it defines the distribution $K(x,y)$ as follows.
\[
K(x,y) =
\begin{cases}
J(x,y) & \mbox{if } x \neq y \mbox{ and } A(x,y) \geq 1 \\
J(x,y)*A(x,y) & \mbox{if } x \neq y \mbox{ and } A(x,y) < 1 \\
J(x,y)+\sum_{z:A(x,z)<1} J(x,z)(1-A(x,z)) & \mbox{if } x = y
\end{cases}
\]
The argument for why this $K(x,y)$ will have the desired stationary distribution
$\pi(x)$ is relatively straightforward. The key observation is
that $\pi(x)*K(x,y)=\pi(y)*K(y,x)$. From that
observation it follows that $\sum_{x} \pi(x)*K(x,y)=\sum_{x} \pi(y)*K(y,x) = \pi(y)\sum_{x} K(y,x)= \pi(y)$,
so $\pi K = \pi$ (treating $\pi$ as a row vector); that is, $\pi$ is a stationary distribution of $K$.
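To make the detailed-balance argument concrete, the following sanity check (with made-up numbers; the three-state chain and the uniform proposal are our own illustration) constructs $K$ from the three cases above and verifies both detailed balance and stationarity:

```python
# Verify, on a tiny 3-state chain, that the K(x, y) defined above satisfies
# detailed balance and has pi as its stationary distribution.
pi = [0.2, 0.5, 0.3]                       # target stationary distribution
J = [[1 / 3] * 3 for _ in range(3)]        # uniform proposal: J(x, y) = 1/3
n = len(pi)

K = [[0.0] * n for _ in range(n)]
for x in range(n):
    for y in range(n):
        if x == y:
            continue
        A = (pi[y] * J[y][x]) / (pi[x] * J[x][y])    # acceptance ratio
        K[x][y] = J[x][y] if A >= 1 else J[x][y] * A
    K[x][x] = 1.0 - sum(K[x])              # leftover mass from rejected moves

# Detailed balance: pi(x) * K(x, y) == pi(y) * K(y, x)
for x in range(n):
    for y in range(n):
        assert abs(pi[x] * K[x][y] - pi[y] * K[y][x]) < 1e-12

# Stationarity: sum_x pi(x) * K(x, y) == pi(y)
for y in range(n):
    assert abs(sum(pi[x] * K[x][y] for x in range(n)) - pi[y]) < 1e-12
```

Note that the diagonal entry is computed as the leftover probability mass in each row, which is exactly the $J(x,y)+\sum_{z:A(x,z)<1} J(x,z)(1-A(x,z))$ term, since each row of $K$ must sum to one.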
An important advantage of this definition of $K$ is that it
is relatively easy to sample from this distribution, assuming
we can easily sample from the proposal distribution $J$.
Given a state $x$, the approach is to take a sample from the
distribution $J(x,y)$, and compute the value of $A(x,y)$ by comparing
the value of $\pi(x)$ with the value of $\pi(y)$. If $A(x,y) \geq 1$,
which usually means that $\pi(y)$ is better than $\pi(x)$, then
we transition to state $y$. If $A(x,y) < 1$, then we transition
to $y$ with probability $A(x,y)$ and otherwise we stay in $x$.
Another important feature of the MH approach is that $\pi$
only appears as a ratio $\pi(y)/\pi(x)$. This means that
the algorithm can work even if the function $\pi$ is not
normalized. In fact, most approaches just use a scoring
function as opposed to a distribution.
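The sampling loop described above can be sketched in a few lines of Python. This is our own minimal illustration, not code from any of the systems discussed: the names `score` and `propose` are placeholders, and the proposal is assumed symmetric ($J(x,y)=J(y,x)$), so the acceptance ratio reduces to a ratio of unnormalized scores.

```python
import math
import random

def metropolis_hastings(score, propose, x0, steps, rng=random):
    """Generic MH loop. `score` is an unnormalized, strictly positive
    scoring function, and `propose` samples from a *symmetric* proposal
    J(x, y), so the acceptance ratio A(x, y) is score(y) / score(x)."""
    x = x0
    for _ in range(steps):
        y = propose(x)
        a = score(y) / score(x)        # acceptance ratio A(x, y)
        if a >= 1 or rng.random() < a:
            x = y                      # accept and move to y
        # otherwise reject and stay at x
    return x

# Toy usage: sample integers on a ring of size 10 with score exp(-|x - 7|);
# the chain spends most of its time near 7.
score = lambda x: math.exp(-abs(x - 7))
propose = lambda x: (x + random.choice([-1, 1])) % 10
print(metropolis_hastings(score, propose, x0=0, steps=1000))
```

Notice that `score` is never normalized, which is exactly the property that makes MH practical for search: we only ever need to compare the scores of two candidate programs.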
Metropolis-Hastings for Program Synthesis
Applying the MH approach to program synthesis boils down to defining the program space, a desired stationary distribution $\pi$ and a proposal distribution $J$. Given a programming-by-example problem, it is tempting to simply define $J$ as a uniform distribution (so transitioning from any program to any other program has the same probability) and $\pi$ as having probability $1$ for the correct program and zero for every other program. Unfortunately, it is not so simple. For one, $K$ would not be well defined, since the MH approach would lead to a division by zero (note that the requirement of the fundamental theorem of Markov chains implies that $\pi(x)$ cannot be zero for any $x$). We could make $\pi(x)$ for incorrect programs $x$ be some $\epsilon>0$, so that at least we satisfy the formal requirements of the algorithm, but even that would not be particularly effective as a search mechanism. Under such a distribution, the search would devolve to randomly generating programs and testing them against the examples; hardly an efficient search strategy.
In order for MH to actually be effective as a search strategy, we need a stationary distribution $\pi$ that allows us to tell whether a program is getting closer to being correct, and we need a proposal distribution $J$ that gives priority to programs whose behavior is similar to that of the current program, so that we do not lose track of what we had already learned from the current search by jumping to a completely different program every time. (Jumping to a completely different program occasionally is good, because it helps the algorithm escape from local minima.) As an example, consider the system of Schkufza, Sharma and Aiken. In that system, the goal is to synthesize a sequence of assembly instructions $\mathcal{R}$ that is equivalent to some reference program $\mathcal{T}$ but more efficient.
The program space: The program space for the paper corresponds to sequences of assembly instructions of a bounded length. The bound on the length is what makes the space finite.
The proposal distribution: The proposal distribution is defined in terms of a set of moves, each applied with a given probability:
- $p_c$: the probability of swapping the opcode of an instruction in $\mathcal{T}$ for a compatible opcode, one that requires the same number and type of parameters. This means that given a program $\mathcal{T}$, if there are $K$ ways of swapping any of its opcodes for another legal one, then the actual probability $J(\mathcal{T}, \mathcal{R})$ for another program $\mathcal{R}$ that differs from $\mathcal{T}$ by the opcode of one of its instructions would be $p_c / K$.
- $p_o$: the probability of replacing one random operand from one random instruction in $\mathcal{T}$ with an alternative legal operand.
- $p_s$: the probability of transforming from $\mathcal{T}$ to $\mathcal{R}$ by swapping two instructions.
- $p_i$: the probability of replacing an instruction in $\mathcal{T}$ with a random instruction.
- $p_u$: the probability of replacing an instruction in $\mathcal{T}$ with a NOOP.
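To make these moves concrete, here is a rough sketch of how such a proposal distribution might be implemented over a toy instruction representation. Everything here is invented for illustration: the opcode and operand tables are placeholders, the move probabilities are arbitrary, and the real system works over actual x86-64 instructions.

```python
import random

# Toy instruction representation: (opcode, operand list). The tables below
# are illustrative placeholders, not the real system's instruction set.
COMPATIBLE = {"add": ["add", "sub"], "sub": ["sub", "add"],
              "mov": ["mov"], "nop": ["nop"]}
OPERANDS = ["rax", "rbx", "rcx", "rdx"]
MOVES = [("opcode", 0.25), ("operand", 0.25), ("swap", 0.25),
         ("instruction", 0.15), ("unused", 0.10)]   # p_c, p_o, p_s, p_i, p_u

def random_instruction():
    return (random.choice(["add", "sub", "mov"]),
            [random.choice(OPERANDS), random.choice(OPERANDS)])

def propose(prog):
    """One proposal step: copy the program and apply a single random move."""
    prog = [(op, args[:]) for op, args in prog]
    i = random.randrange(len(prog))
    move = random.choices([m for m, _ in MOVES],
                          weights=[p for _, p in MOVES])[0]
    if move == "opcode":            # p_c: swap for a compatible opcode
        op, args = prog[i]
        prog[i] = (random.choice(COMPATIBLE[op]), args)
    elif move == "operand":         # p_o: replace one random operand
        op, args = prog[i]
        if args:
            args[random.randrange(len(args))] = random.choice(OPERANDS)
    elif move == "swap":            # p_s: swap two instructions
        j = random.randrange(len(prog))
        prog[i], prog[j] = prog[j], prog[i]
    elif move == "instruction":     # p_i: fresh random instruction
        prog[i] = random_instruction()
    else:                           # p_u: replace with a NOOP
        prog[i] = ("nop", [])
    return prog
```

Note that every move is local: the proposed program differs from the current one by at most one or two instructions, which is what keeps the chain from discarding the progress encoded in the current program.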
Metropolis-Hastings with ASTs
The MH approach has also been used to search simpler expression trees in an expression language. For example, the survey by Alur et al. [AlurBJMRSSSTU13], which introduced the syntax-guided synthesis format (SyGuS), describes a stochastic synthesis approach over ASTs. As before, the approach is summarized by the space of programs, the proposal distribution and the stationary distribution.
The program space: In this setting, the space of programs corresponds to the space of ASTs up to a size bound $N$.
The proposal distribution: As in the previous example, the proposal distribution is described in terms of the probabilistic process by which a program simulating the Markov process would transition from a program $e$ to a program $e'$. In this process, the algorithm first picks an AST node $v$ uniformly at random from the AST for $e$. Let $e_v$ be the entire sub-expression rooted at $v$. The expression at $v$ is then replaced by a randomly generated expression of the same size as $e_v$.
The stationary distribution: The stationary distribution is defined in terms of a cost function $C(e)$.
\[ \pi(e) = \frac{1}{Z} \exp(-\beta C(e)) \]
where the cost $C(e)$ is equal to the number of examples that the function got wrong.
Discussion: This very generic algorithm is quite simple, but it is not very effective, as the comparison with other algorithms in the survey [AlurBJMRSSSTU13] illustrated. Although the root causes for the poor performance were not explored in the paper, we can suggest a few possibilities. First, the stationary distribution (the scoring function) is not precise enough. Programs that are close to correct but still fail all the examples will get the same low score as completely wrong programs. In the superoptimization paper, by contrast, the fact that a program will potentially touch many different memory locations and registers means that almost-correct programs have a higher likelihood of getting at least some bits of the output correct.
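The node-replacement proposal and the cost-based score can be sketched as follows for a toy expression language. The language, the tuple representation, and all helper names are our own illustration, not the survey's:

```python
import math
import random

# Toy expression language: an AST is the variable "x", a small integer
# constant, or a tuple (op, left, right) with op in {"+", "*"}.

def size(e):
    return 1 if not isinstance(e, tuple) else 1 + size(e[1]) + size(e[2])

def random_expr(n):
    """Random AST with exactly n nodes (n must be odd in this grammar)."""
    if n == 1:
        return random.choice(["x", 0, 1, 2, 3])
    k = random.randrange(1, n - 1, 2)        # odd size for the left child
    return (random.choice("+*"), random_expr(k), random_expr(n - 1 - k))

def nodes(e, path=()):
    """Yield the path to every node, so one can be picked uniformly."""
    yield path
    if isinstance(e, tuple):
        yield from nodes(e[1], path + (1,))
        yield from nodes(e[2], path + (2,))

def replace(e, path, sub):
    if not path:
        return sub
    op, l, r = e
    return (op, replace(l, path[1:], sub), r) if path[0] == 1 else \
           (op, l, replace(r, path[1:], sub))

def propose(e):
    """Pick a random node v and re-randomize the subtree e_v, preserving
    its size -- the move described in the survey."""
    path = random.choice(list(nodes(e)))
    sub = e
    for step in path:
        sub = sub[step]
    return replace(e, path, random_expr(size(sub)))

def evaluate(e, x):
    if e == "x":
        return x
    if not isinstance(e, tuple):
        return e
    l, r = evaluate(e[1], x), evaluate(e[2], x)
    return l + r if e[0] == "+" else l * r

def score(e, examples, beta=1.0):
    """Unnormalized pi(e) = exp(-beta * C(e)), with C(e) the number of
    input/output examples the expression gets wrong."""
    return math.exp(-beta * sum(evaluate(e, x) != y for x, y in examples))
```

Since MH only ever uses the ratio $\pi(e')/\pi(e)$, the normalizing constant $Z$ never needs to be computed; `score` returns the unnormalized value directly.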
Another thing we observe in this approach is that the proposal distribution can really only make big changes to the program. If a program is close to correct, and only one node in the middle of the AST has to be changed, then the proposal distribution cannot make this local change without re-randomizing the entire subtree below that AST node. In general, the MH approach can be a very powerful search method, particularly in cases where we need to search for programs larger than what explicit search techniques can handle, and where we have enough intuition about the space to define good proposal distributions and scoring functions. Carefully defining these two functions is crucial for this method to actually work.
Sequential Monte Carlo
In summary, MCMC is a way of transforming a single sample from the proposal distribution $J$ into a representative sample from a stationary distribution $\pi$. From a physical perspective, we can think of this as observing how a particle changes its position over time, with the stationary distribution describing the stable positions of the particle in the long run. Sequential Monte Carlo (SMC) is essentially a generalization of that idea, where we instead start with a set of $N$ particles and we have some idea of how the particles should evolve over time. Importantly, unlike what would happen if we simply ran several instances of MCMC in parallel, in SMC the particles are not independent. Instead, they interact with each other through a process of resampling, where particles that "look more promising" are more likely to be retained in the next step. In this way, SMC adaptively allocates more compute to the most promising particles. Formally speaking, SMC begins with:
- An initial distribution $J_0(x)$ from which we are able to sample.
- A set of particles $X^{(0)} = \{x^{(0)}_1, x^{(0)}_2, \ldots, x^{(0)}_N\}$ sampled independently from $J_0$.
- A transition kernel $J_t(x^{(t+1)} \mid x^{(t)})$ from which we are also able to sample. (Oftentimes, it is convenient to assume that this kernel does not depend on $t$.)
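Putting these ingredients together also requires a weight function that scores each particle at every step, which is what drives the resampling. Assuming such a function, a minimal SMC loop might look like the following sketch; all names are placeholders of our own, not from any particular system:

```python
import math
import random

def smc(sample_init, transition, weight, n_particles, steps, rng=random):
    """Minimal SMC loop (a sketch). `weight` scores a particle; after each
    transition the population is resampled in proportion to those scores,
    so promising particles are duplicated and weak ones are dropped."""
    particles = [sample_init() for _ in range(n_particles)]
    for _ in range(steps):
        # Move every particle independently with the transition kernel J_t.
        particles = [transition(x) for x in particles]
        # Resample: draw N particles with probability proportional to
        # weight, reallocating compute toward the most promising ones.
        ws = [weight(x) for x in particles]
        particles = rng.choices(particles, weights=ws, k=n_particles)
    return particles

# Toy usage: random-walk particles on the integers, weighted toward 10;
# after a few dozen rounds most of the population clusters near 10.
out = smc(sample_init=lambda: 0,
          transition=lambda x: x + random.choice([-1, 0, 1]),
          weight=lambda x: math.exp(-abs(x - 10)),
          n_particles=200, steps=40)
```

The contrast with running $N$ independent MCMC chains is visible in the resampling line: a particle's survival depends on its weight relative to the rest of the population, which is what couples the particles together.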