Introduction to Program Synthesis

© Theo X. Olausson, 2025. All rights reserved.

Lecture 13: Program Synthesis and Evolutionary Search

We have made quite a bit of headway in the last few lectures. We have seen how to frame program synthesis as a language modeling task, and discussed several language modeling techniques that can be used to do so; we have discussed how to obtain samples from that language model that satisfy hard constraints, such as input/output examples or logical formulae; and how to adapt the model to new domains, such as a new domain-specific programming language, through finetuning, RAG and TTT. But one thing we have not yet thought much about is how to search for programs that satisfy soft (but quantifiable) constraints, such as minimizing the number of operations used in the program, or maximizing the reward we get from executing the program in some environment. For example, if our program is a game-playing agent, we might want to maximize the score it achieves in the game. In such situations, we may have little to no idea what the ideal program should even look like; finding improvements will thus require a type of hill-climbing, in which we iteratively improve a candidate program by making small changes to it, and evaluating the result.

These types of constraints actually pop up quite frequently when we talk about program synthesis outside of the context of software engineering. For example, in scientific contexts, we might want to synthesize a program that simulates a physical system, and we might want to minimize the number of operations in the program to make it run faster. Or perhaps the program we are looking for is a scheduling algorithm, and we want to maximize the throughput of the schedule. Program synthesis is a very appealing approach to these problems, because it allows us to discover solutions that are not only more efficient than the ones we would have come up with ourselves, but which (unlike those produced by many other machine learning approaches) remain human-readable, verifiable and debuggable.

The techniques we have discussed so far in Unit 2 have recently proven to be quite effective in such situations; hill climbing is essentially a generalization of the self-improvement procedure that we already saw in Lecture 9. However, it can be made much more effective if combined with a search strategy that can explore the space of programs more efficiently than just making small changes to a single candidate program. In today's lecture, we will see how language model-driven program synthesis can be combined with evolutionary search to enable such advancements. As two examples of this, we will look at Google DeepMind's recent FunSearch and AlphaEvolve projects, which were able to synthesize algorithms that exceeded the best previously known algorithms for important problems such as matrix multiplication and bin-packing, with significant ramifications.

Evolutionary Search

Suppose we want to search a space $\chi$ to find a solution $x^* \in \chi$ that maximizes some objective function $f: \chi \to \mathbb{R}$. If $f$ is differentiable and sufficiently "nice" (e.g. convex), we could hope to use gradient descent to find $x^*$, starting from some initial point $x_0 \in \chi$. However, in reality, $f$ is often not differentiable, or not even continuous, and the space $\chi$ is often discrete, or at least not well-structured. This is indeed the case, for example, if $\chi$ is the space of all programs in some programming language, and $f$ is the negative of the number of operations in the program (so that maximizing $f$ means finding the shortest program). In such cases, how should we best proceed to find $x^*$?

Evolutionary search is a type of search strategy that is inspired by the process of natural evolution, and which can be useful in such situations. The core idea is to use a population of candidate solutions, which are iteratively improved over time through a process of selection, mutation, and recombination. At a very high level, the algorithm works as follows:

  1. Initialize a population of candidate solutions $P = \{x_1, x_2, \ldots, x_n\}$, where $n$ is the population size.
  2. Evaluate the fitness of each candidate solution $x_i$ in the population, using the objective function $f$, obtaining fitness scores $f_i \triangleq f(x_i)$.
  3. Repeat until convergence:
    1. Select a subset of the best-performing candidates from the population. (For example, we could select the top-k candidates with the highest fitness scores $f_i$.)
    2. Apply crossover (combine previous solutions into a new solution) or mutation (randomly modify a solution) operators to create a new population of candidate solutions $x'_1, x'_2, \ldots, x'_n$.
    3. Evaluate the fitness for each candidate in the new population, obtaining fitness scores $f'_i \triangleq f(x'_i)$.
    4. Add the new samples to the population. (Optionally, keep the population size fixed by removing the worst-performing candidates.)

There are several improvements that can be made to this basic algorithm, such as clustering the population into independent "islands" to ensure diversity in the search space, but we will not go into those details here.
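
The loop above can be sketched in a few lines of Python. The `init`, `fitness`, `mutate`, and `crossover` arguments are placeholders for whatever the problem at hand supplies; this is a minimal sketch, not a production implementation:

```python
import random

def evolutionary_search(init, fitness, mutate, crossover,
                        pop_size=50, top_k=10, generations=100):
    # Step 1-2: initialize a population of candidates and score each one.
    population = [init() for _ in range(pop_size)]
    scores = [fitness(x) for x in population]

    for _ in range(generations):
        # Step 3.1 (selection): keep the top-k candidates by fitness.
        ranked = sorted(zip(scores, population), key=lambda p: p[0], reverse=True)
        parents = [x for _, x in ranked[:top_k]]

        # Step 3.2 (variation): apply crossover or mutation to create
        # new candidate solutions.
        children = []
        for _ in range(pop_size):
            if random.random() < 0.5:
                children.append(crossover(random.choice(parents),
                                          random.choice(parents)))
            else:
                children.append(mutate(random.choice(parents)))

        # Steps 3.3-3.4 (evaluation): score the children, then fold them
        # into the population, keeping the population size fixed.
        population = population + children
        scores = scores + [fitness(x) for x in children]
        survivors = sorted(zip(scores, population),
                           key=lambda p: p[0], reverse=True)[:pop_size]
        scores = [s for s, _ in survivors]
        population = [x for _, x in survivors]

    return population[0]  # the best candidate found
```

On a toy problem (maximizing $-(x-3)^2$ over the integers, with $\pm 1$ mutations and averaging crossover), this loop converges to the optimum within a few generations.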

Let's think about how we can apply this to program synthesis, using a language model $p_\theta$. First, we could initialize the population with a set of programs $P$ drawn from $p_\theta(\cdot ; c)$, where $c$ is some description of the task we want to solve. Next, we need to evaluate the fitness of each candidate; fortunately, since the candidate is a program, we can simply execute it in some environment and measure its performance. So far so good. Where things get interesting is in the genetic operators we use to create new candidates. Because language models have such flexible interfaces, an entire design space of genetic operators is available to us. For example, we could follow the self-improvement procedure from Lecture 9, and use the language model to generate new programs that are similar to the best-performing candidates in the population. Perhaps we could even stuff the entire population into the prompt, along with all of their fitness scores, and hope the model is able to come up with some interesting hypothesis about how to combine them.

The key point is to leverage the LLM to generate new candidates which are not just random mutations, but which are informed by the existing population, their fitness scores, and the task description. This is what makes evolutionary search so powerful in this context.
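
As a concrete (if simplified) illustration, here is what such an LLM-based crossover operator might look like. The `generate` argument is a hypothetical stand-in for whatever completion API is available; none of these names come from a particular system:

```python
# An LLM-based crossover operator: show several parents (with fitness
# scores) in the prompt and ask the model for an improved program.
def llm_crossover(parents, fitnesses, task_description, generate):
    # Show parents in ascending order of fitness, so that the best
    # program sits closest to the point of generation.
    ranked = sorted(zip(fitnesses, parents), key=lambda p: p[0])
    prompt = task_description + "\n\n"
    for version, (score, program) in enumerate(ranked):
        prompt += f"# Program v{version} (fitness {score}):\n{program}\n\n"
    prompt += f"# Program v{len(ranked)} (improves on all of the above):\n"
    return generate(prompt)
```

Ordering the parents from worst to best is a deliberate choice: autoregressive models tend to continue local patterns, so placing the strongest program last frames the completion as "the next, even better version".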

Retrieval-Augmented Reinforcement Learning

If the candidates do not fit into the context window, we could store the entire history of the population in a database and use RAG to retrieve some desirable subset of it; for example, we could pair a candidate solution with high-performing but dissimilar samples from the population, in the hopes that the model will be able to combine their strengths. This can be seen as a form of retrieval-augmented reinforcement learning, where the learning happens in the form of growing the database of candidates, rather than in the model parameters.
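
One possible retrieval rule along these lines is sketched below. The token-overlap distance is a deliberately naive stand-in chosen for illustration; a real system would use something more robust (e.g. embedding distance):

```python
# Retrieve a partner for `candidate` that is high-performing but
# dissimilar, in the hope that the model can combine their strengths.
def retrieve_partner(candidate, database):
    # database: list of (program_text, fitness) pairs
    def dissimilarity(a, b):
        tokens_a, tokens_b = set(a.split()), set(b.split())
        union = tokens_a | tokens_b
        return 1.0 - len(tokens_a & tokens_b) / max(len(union), 1)

    # Rank stored programs by fitness plus dissimilarity to the candidate.
    best = max(database, key=lambda entry: entry[1] + dissimilarity(candidate, entry[0]))
    return best[0]
```

Note how this rule can prefer a slightly weaker program over the top-scoring one, precisely because it brings something different to the prompt.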

For those who are not familiar with reinforcement learning, suffice to say that it is a type of machine learning where an agent learns by interacting with an environment, receiving rewards or penalties based on its actions. We will cover (parameter-space) reinforcement learning in more detail in the next lecture.

FunSearch

Evolutionary search and retrieval-augmented reinforcement learning have been used to great effect in the FunSearch project by Google DeepMind. This project showed that it was possible, using pretty much exactly the techniques we have discussed here, to synthesize algorithms that outperform laboriously hand-tuned algorithms for problems such as online bin-packing, and to make mathematical discoveries.

In FunSearch, the authors used a pre-trained language model (that had been trained extensively on code, but not specifically on any particular domain) to generate the candidate solutions. They then used evolutionary search to iteratively improve the candidates, using a fitness function that measured the performance of the candidate solutions in the target environment, exactly as we have described above. Any sample that did not throw an error when executed was then added to the ever-growing population of candidates, which was stored in a database. Since each search targeted only one specific problem, the retrieval of relevant candidates to evolve was done by simply sampling from the database. $k$ such samples were retrieved, and put into the prompt in order of their fitness scores along with the task description.
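
Putting these pieces together, a single FunSearch-style iteration might look roughly as follows. Here `generate` and `run_in_environment` are hypothetical stand-ins for the language model and the evaluation harness; this is a sketch of the description above, not DeepMind's actual code:

```python
import random

# One iteration: sample k programs from the database, show them to the
# model in ascending order of fitness, and keep any new program that
# executes without error.
def funsearch_step(database, task_description, generate, run_in_environment, k=2):
    # database: list of (program_text, fitness) pairs
    sampled = random.sample(database, min(k, len(database)))
    sampled.sort(key=lambda entry: entry[1])  # best program last

    prompt = task_description + "\n\n"
    for program, fitness in sampled:
        prompt += f"# fitness: {fitness}\n{program}\n\n"
    candidate = generate(prompt)

    try:
        fitness = run_in_environment(candidate)
    except Exception:
        return  # discard any program that throws an error
    database.append((candidate, fitness))
```

The database only ever grows: every successfully executed program is added, and "learning" consists of the sampled prompts gradually drawing on better and better material.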

The core novel finding of the FunSearch project was that this simple approach was sufficient to surpass the best known results on the cap set problem, a well-studied problem in combinatorics: by evolving programs which search for large cap sets in $\mathbb{Z}_3^8$, they were able to identify a cap set of size 512, which surpassed the previous best known result of 496. What is remarkable about this result is that it was achieved without any domain-specific knowledge or hand-tuning of the solution space. At the same time, the result itself remains human-readable and verifiable thanks to framing the problem as a program synthesis task. This was perhaps the first time that we observed hard evidence that large-scale program synthesis could help us discover and understand new results in mathematics and science.

AlphaEvolve

Very recently, FunSearch was followed up by the AlphaEvolve project, which applied the same techniques at a much larger scale. There are two key technical differences between FunSearch and AlphaEvolve. First, AlphaEvolve mutates entire program files, rather than just single functions, as was done in FunSearch. This allows for more complex and modular solutions to be synthesized, at the cost of significantly increased computational requirements. The second change is that AlphaEvolve introduced a meta-prompting strategy, where the model also independently searched for and generated instructions for how to solve the problem, not just the solutions themselves. These instructions are then stored in and sampled from the database, just like the solutions. One thing that is actually not clear to me from the AlphaEvolve paper is how these instructions are scored. One could define the fitness of an instruction as the expected fitness of the solutions it generates, and evaluate that with a Monte Carlo estimate, but that seems like it would be incredibly expensive and far too high-variance to be practical.
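
To see why such a Monte Carlo estimate would be expensive, consider what the estimator would have to look like. Every name here is hypothetical (the AlphaEvolve paper does not spell this mechanism out), but the cost structure is clear: with $n$ samples per instruction, scoring one instruction costs $n$ full generate-and-evaluate cycles:

```python
import statistics

# Hypothetical sketch: score an instruction by the average fitness of
# the solutions generated under it.
def instruction_fitness(instruction, generate, fitness, n_samples=8):
    samples = [generate(instruction) for _ in range(n_samples)]
    return statistics.mean(fitness(s) for s in samples)
```

Each call to `generate` is an LLM sample and each call to `fitness` is a full program evaluation, so even a modest `n_samples` multiplies the cost of the search considerably.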

Behind the scenes, AlphaEvolve also benefitted from other advances made in language modeling since FunSearch. In particular, the language model itself was much larger and more powerful. From an engineering perspective, the authors also built a significant amount of infrastructure to support arbitrary fitness functions (such as, for example, training and evaluating a neural network!), which allowed them to apply the techniques from FunSearch to a much wider range of problems.

The keystone result of AlphaEvolve was that it was able to synthesize a new algorithm for 4x4 matrix multiplication that outperformed the previously best-known algorithm, reducing the number of scalar multiplications required to multiply two 4x4 (complex-valued) matrices from 49 to 48. While this may not sound like a big improvement, it is quite remarkable given that the previous best-known algorithm had been discovered in 1969, and despite decades of research in the field no one had been able to improve on it.
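
To see where these numbers come from (a back-of-the-envelope sketch, not a calculation from the paper): multiplying two $n \times n$ matrices naively takes $n^3$ scalar multiplications, while a 2x2 block scheme that uses $m$ multiplications yields $m^{\log_2 n}$ when applied recursively:

```python
# Count scalar multiplications for n x n matrix multiplication.
def mults_naive(n):
    return n ** 3

def mults_recursive_2x2(n, m):
    # Each level of recursion reduces an n x n product to m products of
    # (n/2) x (n/2) blocks.
    return m if n == 2 else m * mults_recursive_2x2(n // 2, m)

print(mults_naive(4))             # 64: schoolbook multiplication
print(mults_recursive_2x2(4, 8))  # 64: trivial 2x2 block scheme
print(mults_recursive_2x2(4, 7))  # 49: Strassen's scheme, recursively
```

Strassen's 7-multiplication 2x2 scheme thus gives $7^2 = 49$ for 4x4 matrices; that is the 1969 baseline which AlphaEvolve improved to 48.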

While FunSearch may have been a proof of concept that large-scale program synthesis could be used to discover new algorithms, DeepMind have signaled that AlphaEvolve is already being applied across several domains of Google to improve real-world systems. An early version of AlphaEvolve has already been used to improve the performance of Google's data center scheduling, recovering 0.7% of previously stranded resources by discovering a new scoring function, used to match jobs to resources, that is tailored to Google's computing workloads. What is most exciting about this result is that the resulting scoring function is remarkably simple:

    def alpha_evolve_score(required, free):
        cpu_residual = required.cpu / free.cpu
        memory_residual = required.memory / free.memory
        return -1.0 * (cpu_residual + memory_residual
                       + memory_residual / cpu_residual
                       + cpu_residual / memory_residual)

Fitting such scoring functions in an automated, data-driven way has been done before with reinforcement learning, but that resulted in black-box scoring functions that were difficult to debug, and frightening to deploy in production. By contrast, treating it as a synthesis problem allowed the authors to discover a simple, human-readable scoring function that is significantly easier to trust with real-world resources.

Another exciting result from AlphaEvolve is that the authors were able to use it to improve compiler-generated intermediate representation (IR) code. Such code is incredibly complex and difficult for humans to modify or even understand, since it is generated by compilers which themselves may make several optimizations and transformations to the code. However, DeepMind found that they were able to use AlphaEvolve to synthesize new IR code for their implementation of FlashAttention (Dao et al., 2022), yielding a 32% speedup of this already highly optimized code and potentially leading to an overall 1% improvement in Google's Gemini training platform.