Lecture 2: Introduction to Inductive Synthesis
One of the simplest interfaces for program synthesis is inductive synthesis. In inductive synthesis, the goal is to generate a function that matches a given set of input/output examples. The literature makes a distinction between Programming by Example (PBE) and Programming by Demonstration (PBD). In Programming by Example, the goal is to infer a function given only a set of inputs and outputs, whereas in Programming by Demonstration, the user also provides a trace of how the output was computed. For example, in Programming by Example, if I want to convey to the system that I want it to synthesize the factorial function, I may give it an example:
$\text{factorial}(6) = 720$
As one can see, this is a highly under-specified problem, since there is an enormous number
of possible functions that may return $720$ given the input $6$. By contrast,
with Programming by Demonstration, one may provide a more detailed trace of the computation:
$\text{factorial}(6) = 6 * (5 * (4 * (3 * (2 * 1)))) = 720$
In general, the full trace in Programming by Demonstration contains more information that
makes it easier to infer the intended computation. The line between PBD and PBE can be blurry,
however. For some domains (like string manipulation), it’s relatively easy to derive a trace from an output,
and different systems may capture the trace at different levels of fidelity.
History
The idea of directing a computer through examples dates back to the 1970s, when Patrick Winston at MIT published the seminal work “Learning Structural Descriptions from Examples”Winston:1970. This work was among the first to look into the problem of generalizing from a set of observations, although it was not really about trying to automate programming. A good candidate for “first PBD system” is PygmalionSmith:1976. Pygmalion was framed as an "interactive 'remembering' editor for iconic data structures"; the high-level idea was that a program would move icons around and establish relationships between them, but the user would not write the program itself; instead, the user would manipulate the icons directly and the editor would remember the manipulations and be able to apply them in other contexts. This was explained through their "Basic pygmalion metaphor":
a program is a series of
EDITING CHANGES to a DISPLAY DOCUMENT. Input
to a program is an initial display document, i.e. a display screen
containing images. Programming consists of editing the document.
The result of a computation is a modified document containing the desired information.
(emphasis in the original)Smith:1976
The idea of a "remembering editor" was that if you perform such a manipulation
by hand once, the editor can remember how to perform the manipulation so you can
apply a similar manipulation in other contexts. Like a lot of early AI work,
PYGMALION was very heavy on philosophy and metaphor and very weak on algorithms,
particularly around the crucial question of how to generalize from a given
demonstration so that the learned program could apply in other situations.
At around the same time, Summers looked more systematically at the question of how to
generalize from a demonstration to a program, particularly at the question
of how to derive looping structure Summers:1976Summers:1977. His algorithm
was based on pattern matching; although relatively brittle, it is still considered
an important early result in the field.
Framing the PBD/PBE Problem
What is a program?
This is a good point to consider the question of what we actually mean when we talk about a program in the context of program synthesis. A program is a description of how to perform a computation. In general, describing a program requires a notation, a programming language that allows you to describe many different computations by composing individual syntactic elements, each with a well-defined meaning. We are all familiar with popular programming languages such as Python, JavaScript or C.
At the other extreme, the notation of arithmetic, for example, can also be considered a programming language; it includes
syntactic elements such as numbers and arithmetic operators ($+$, $-$, $\times$), each with
a well-defined meaning, which can be used to describe a particular kind of computation. Unlike a language
like Python, the language of arithmetic is not universal; it can only be used to describe
a very narrow class of computations, so we can use it to compute the tip on a restaurant bill, but not
to determine the smallest element in a list of numbers.
As we will see over the rest of this course, the choice of an appropriate notation is crucial to the
success of program synthesis. Even if your end goal is to generate code in Python or C,
it is often useful to frame the program synthesis problem in terms of a narrower notation that
more precisely captures only those programs that are actually relevant to the task at hand.
We will often refer to these more specialized languages as Domain Specific Languages (DSLs).
It is usually convenient to think of them simply as subsets of more general languages, where
the programmer is prevented from using particular constructs and is provided only with a limited
set of functions or subroutines.
Throughout the rest of this course, we will be defining small DSLs as synthesis targets. Often, it will
be enough to define their semantics informally or through examples; where more precision is warranted,
we will define them in terms of a general purpose programming language. In many settings, we will be using
the notation of functional programming, which will be familiar to anyone who has programmed in Haskell
or OCaml, but may seem a bit foreign to some. We will say more about this notation when we use it,
but at a high-level, these are a few things to keep in mind about this notation, and why it makes a good
target for synthesis:
- No side effects. Computation in a functional language happens by evaluating pure functions, with no side effects and no mutation. This can be annoying when programming by hand, but it often simplifies the job of the program synthesizer. A consequence is that functions behave like mathematical functions: if you give them the same inputs, they will produce the same outputs. This can also simplify the reasoning process significantly.
- Concise and expressive. Think about a program in Java that reverses a list. It's a relatively long program that involves a class declaration, some method declarations, some loops, some constructors; maybe it would look something like this:

    import java.util.*;

    class ListReverser {
        static List<Integer> reverseList(List<Integer> myList) {
            List<Integer> output = new ArrayList<>();
            for (int i = 0; i < myList.size(); i++) {
                output.add(myList.get(myList.size() - 1 - i));
            }
            return output;
        }
    }

You can probably write code that is slightly cleaner than the code above, but not by much. From a synthesis perspective, that is a lot of code to write, with a lot of opportunities to get it wrong. By contrast, in Haskell, a function to reverse a list can be defined like this:

    reverse lst = case lst of
        []        -> []
        head:rest -> (reverse rest) ++ [head]

The whole function is a single expression that says that if the list is an empty list, you just return the empty list; if it is not empty, then it will have a head followed by the rest of the list, and you should reverse the rest of the list and concatenate that with a list containing only the head. This conciseness is very useful when synthesizing programs, because it allows you to synthesize non-trivial programs while only having to discover small amounts of code. It is important to note that while LLMs are better suited to dealing with verbose languages like Java, the need for conciseness doesn't go away when using LLMs. Longer programs require more resources and bigger attention windows, and there are more opportunities for something to go wrong.
Representing Programs
In standard programming, programs are represented as strings of text that must follow a grammar. This representation used to be frowned upon for program synthesis because it is very sparse (most strings are not valid programs), and because it is wasteful to represent programs as strings just to immediately parse those strings into a data structure. In recent years, though, string representations have gained significant popularity thanks to the success of language models, and today strings and structured representations are likely to coexist in systems that combine symbolic and neural techniques.

For techniques that predate language models, the preferred representation has traditionally been a data structure known as an Abstract Syntax Tree (AST), which is just a tree with different kinds of nodes for different kinds of constructs in the language. There is usually a very close correspondence between the structure of the AST and the structure of a parse tree of the program. What makes it abstract is that the data structure can usually ignore information about things like spacing or special characters like brackets, colons and semicolons, since the information they encode can instead be captured by the structure of the AST. As an example, consider the language of arithmetic expressions. The syntax of such a language can be represented as the following CFG:
$ \begin{array}{lcl}
expr & := & term ~ | \\
~&~& term + expr \\
term & := & ( expr ) ~ | \\
~&~& term * term ~ | \\
~&~& N \\
\end{array}
$
The grammar captures a lot of syntactic information about the language. It describes, for example, that
in an expression $5 + 3 * 2$, the multiplication takes precedence over the addition, but we can change
that by adding parentheses around $5 + 3$.
An AST, however, can afford to ignore these syntactic details. For this example, we could instead define an AST as a
data structure with three different types of nodes: Plus, Times and Num. The first expression would then be constructed as
Plus (Num 5) (Times (Num 3) (Num 2)), while the second would be constructed
as Times (Plus (Num 5) (Num 3)) (Num 2).
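To make this concrete, here is one way to write down such an AST in Haskell, together with an evaluator that defines the semantics by structural recursion (this particular encoding is just an illustrative sketch; many variations are possible):

    -- One constructor per kind of node; no trace of parentheses or spacing.
    data Expr = Num Int
              | Plus Expr Expr
              | Times Expr Expr
      deriving Show

    -- The meaning of an expression, defined by recursion on the tree.
    eval :: Expr -> Int
    eval (Num n)     = n
    eval (Plus a b)  = eval a + eval b
    eval (Times a b) = eval a * eval b

    -- eval (Plus (Num 5) (Times (Num 3) (Num 2)))   ==  11
    -- eval (Times (Plus (Num 5) (Num 3)) (Num 2))   ==  16

Note that the two expressions, which differ only in parenthesization as strings, become structurally different trees, so no parentheses need to be stored at all.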
As mentioned before, synthesis approaches based on deep learning generally prefer to represent programs as strings.
In the early days, it was not entirely clear that this was the right approach due to all the aforementioned advantages of
structured representations. But string representations made it possible to benefit from the significant engineering efforts being put into
neural architectures for natural language processing, and they also made it possible to benefit from pretrained models trained on
large amounts of data from the internet, so strings (or more precisely token sequences as we will see in Unit 2) eventually replaced all other representations when it came to working with
neural models.
Nevertheless, structured representations remain essential for symbolic approaches, and they are also useful in hybrid approaches that combine symbolic and neural techniques.
Solving the Invention Challenge in PBE
Let us now return to the problem at hand. Having decided to represent specifications as input/output pairs, we can shift our attention to the Invention challenge. This will require answering two questions: what the space of programs is, and how that space is going to be searched.

Defining the Program Space
Having decided on a program representation, the next step is to define the space of programs that the synthesizer is allowed to consider.

Option 1: Domain-Specific Languages

Perhaps the most natural way to define a space of programs is to construct a small domain-specific language (DSL), and then set the program space to be the set of all possible programs that lie within this language. A DSL will typically be defined through a context free grammar (CFG) or directly as an AST datatype, with semantics associated with each element and their composition. Both the CFG representation and ASTs have the advantage that it is easy to derive all of the possible programs in the language by simply expanding the non-terminal symbols in the grammar, which makes this representation particularly suitable for enumerative search strategies (which we will discuss shortly). In addition to a context free grammar and a semantics, it is often very useful to have a type system associated with the language. The type system provides an efficient mechanism to rule out programs that, while legal with respect to the grammar, are not well-formed with respect to the semantics of the language.

Option 2: Parametric Representations

In contrast, constraint-based approaches often rely on parametric representations of the space, where different choices of parameters correspond to different choices for what the program will look like. Parametric representations are more general than grammars; you can usually encode the space represented by a grammar with a parametric representation as long as you are willing to bound the length of programs you want to consider---this is because parametric representations assume the set of parameters is fixed. Sketches, which we will discuss in Lecture 5, are a good example of this kind of representation.

Option 3: Ad-hoc Restrictions

In some cases, it is necessary to restrict the space of programs in an ad-hoc manner, for example, ruling out programs beyond a certain length, or allowing them to use a particular construct only a few times. This is often done in order to make the search more efficient, or to avoid generating programs that are too complex to be useful. These kinds of ad-hoc restrictions are often used on top of a DSL or a parametric representation, but in principle they can be used directly on top of a general programming language such as Python or Java.

All the options above provide different ways to define a space of programs that is constrained enough to be searched efficiently. When using traditional synthesis techniques, it is relatively easy to focus the synthesis process to consider only programs in a particular space. With neural techniques, however, it can be harder to force the synthesizer to consider only programs within the desired space, but techniques such as constrained decoding and prompt engineering can be effective, as we will see in Lecture xx.
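As a toy illustration of a parametric representation (my own sketch, not an example from the lecture), suppose every program in the space is an instance of the fixed template $\lambda x.\, a*x + b$, so that a program is identified by the pair of parameters $(a, b)$. Here the parameters are found by brute-force enumeration; a constraint-based synthesizer would instead hand the same conditions to a solver:

    -- Does the choice of parameters (a, b) satisfy all the examples?
    fits :: [(Int, Int)] -> (Int, Int) -> Bool
    fits examples (a, b) = all (\(x, y) -> a * x + b == y) examples

    -- Synthesis = search over a bounded parameter space.
    synth :: [(Int, Int)] -> Maybe (Int, Int)
    synth examples =
      case [ (a, b) | a <- [-10 .. 10], b <- [-10 .. 10]
                    , fits examples (a, b) ] of
        (p:_) -> Just p
        []    -> Nothing

    -- synth [(1, 3), (2, 5)]  ==  Just (2, 1),  i.e. \x -> 2*x + 1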
Example

As a running example, consider the following language:
$
\begin{array}{rcll}
lstExpr & := & sort(lstExpr) & \mbox{sorts a list given by lstExpr.} \\
~ & ~ & lstExpr[intExpr,intExpr] & \mbox{selects the sub-list of the list between the given start and end positions}\\
~ & ~ & lstExpr + lstExpr & \mbox{concatenates two lists}\\
~ & ~ & recursive(lstExpr) & \mbox{calls the program recursively on its argument list; if the list is empty, returns empty without a recursive call} \\
~ & ~ & [0] & \mbox{a list with a single entry containing the number zero} \\
~ & ~ & in & \mbox{the input list } in \\
intExpr &:= & firstZero(lstExpr) & \mbox{position of the first zero in a list} \\
~ & ~ & len(lstExpr) & \mbox{length of a given list} \\
~ & ~ & 0 & \mbox{constant zero} \\
~ & ~ & intExpr + 1 & \mbox{adds one to a number} \\
\end{array}
$
In this language, there are two types of expressions, list expressions $lstExpr$, which evaluate to a list, and
integer expressions $intExpr$ which evaluate to an integer. Programs in this language have only one input,
a list $in$.
The language is quite rich; it includes recursion, concatenation, sorting, and search, so you can write a ton of interesting programs with it. For example, the program to reverse a list would be written as follows:
$ recursive(in[0 + 1, len(in)]) + in[0, 0] $
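To pin down the semantics, here is a sketch of an interpreter for this DSL in Haskell. The constructor names are my own, and the interpretation of $lstExpr[i,j]$ as an inclusive slice with out-of-range positions clipped is an assumption inferred from the reverse example above:

    import Data.List (elemIndex, sort)

    -- ASTs for the two kinds of expressions in the DSL.
    data LstExpr = Sort LstExpr
                 | Slice LstExpr IntExpr IntExpr   -- lstExpr[intExpr, intExpr]
                 | Concat LstExpr LstExpr          -- lstExpr + lstExpr
                 | Recursive LstExpr
                 | ZeroList                        -- the literal [0]
                 | In                              -- the input list

    data IntExpr = FirstZero LstExpr
                 | Len LstExpr
                 | Zero
                 | Plus1 IntExpr

    -- The first argument is the whole program; it is needed to
    -- interpret Recursive, which re-runs the program on a new input.
    evalL :: LstExpr -> LstExpr -> [Int] -> [Int]
    evalL prog e input = case e of
      Sort e'      -> sort (evalL prog e' input)
      Slice e' i j -> let xs = evalL prog e' input
                          lo = evalI prog i input
                          hi = evalI prog j input
                      in take (hi - lo + 1) (drop lo xs)  -- inclusive, clipped
      Concat a b   -> evalL prog a input ++ evalL prog b input
      Recursive e' -> let xs = evalL prog e' input
                      in if null xs then [] else evalL prog prog xs
      ZeroList     -> [0]
      In           -> input

    evalI :: LstExpr -> IntExpr -> [Int] -> Int
    evalI prog e input = case e of
      FirstZero e' -> case elemIndex 0 (evalL prog e' input) of
                        Just k  -> k
                        Nothing -> -1  -- assumption: the grammar leaves this case open
      Len e'       -> length (evalL prog e' input)
      Zero         -> 0
      Plus1 e'     -> evalI prog e' input + 1

    run :: LstExpr -> [Int] -> [Int]
    run p = evalL p p

    -- The reverse program:  recursive(in[0+1, len(in)]) + in[0, 0]
    reverseProg :: LstExpr
    reverseProg = Concat (Recursive (Slice In (Plus1 Zero) (Len In)))
                         (Slice In Zero Zero)

    -- run reverseProg [1, 2, 3]  ==  [3, 2, 1]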
Now, consider an input/output example such as $in = [1, 2, 3]$ with expected output $[3, 2, 1]$. The job of the synthesizer is to search this space of programs for one, like the reverse program above, that is consistent with the example.
Searching the Program Space
Explicit Enumeration
One class of search techniques is explicit enumeration. At a high level, the idea is to explicitly construct different programs until one finds a program that satisfies the observations. In general, though, the space of possible programs that one can generate to satisfy a given specification is too large to enumerate efficiently, so a key aspect of these approaches is how to avoid generating programs that have no hope of satisfying the observations, or which can be shown to be redundant with other programs we have already enumerated.

An important distinction in explicit enumeration techniques is whether they are top-down or bottom-up. In bottom-up enumeration, the idea is to start with low-level components and then discover how to assemble them together into larger programs. By contrast, top-down enumeration starts by trying to discover the high-level structure of the program first, and from there it tries to enumerate the low-level fragments. Essentially, in both cases we are explicitly constructing ASTs, but in one case we are constructing them from the root down, and in the other case we are constructing them from the leaves up. For example, suppose we want to discover a program $reduce ~ (map ~ in ~ \lambda x. x + 5) ~ 0 ~ (\lambda x. \lambda y. (x + y))$. In a bottom-up search, you start with expressions like $(x+y)$ and $(x+5)$, build from those expressions to functions such as $\lambda x. x + 5$, and from there assemble the full program. In contrast, a top-down search would start with an expression such as $reduce ~ \Box ~ \Box ~ \Box$, then discover that the first parameter to reduce is $map ~ \Box ~ \Box$, and progressively complete the program down to the low-level expressions.

While enumeration may sound a bit brutish, its power should not be underestimated. Oftentimes, it is the simplest techniques that lend themselves to the most efficient implementations, and as Richard Sutton highlighted in the Bitter Lesson, making effective use of computational resources will in time often trump any cleverness in the design of the algorithm. Furthermore, as we will see in the next few lectures, there are many ways to constrain the space we must enumerate without sacrificing much in terms of simplicity.
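To make bottom-up enumeration concrete, here is a small sketch (my own illustration, not code from the lecture) over a tiny arithmetic language with a single input variable. It grows programs from the leaves up, and prunes candidates that are observationally equivalent on the example inputs, i.e. redundant with a program we already have:

    import qualified Data.Set as Set

    -- A tiny language: an input variable, constants, addition, multiplication.
    data Expr = Var | Num Int | Plus Expr Expr | Times Expr Expr
      deriving Show

    eval :: Int -> Expr -> Int
    eval x Var         = x
    eval _ (Num n)     = n
    eval x (Plus a b)  = eval x a + eval x b
    eval x (Times a b) = eval x a * eval x b

    -- Grow: combine every pair of known programs with every operator.
    grow :: [Expr] -> [Expr]
    grow pool = [op a b | a <- pool, b <- pool, op <- [Plus, Times]]

    -- Prune: keep one representative per behavior on the example inputs.
    prune :: [Int] -> [Expr] -> [Expr] -> [Expr]
    prune ins pool new = go (Set.fromList (map sig pool)) new
      where
        sig e = map (`eval` e) ins
        go _ []     = []
        go seen (e:es)
          | Set.member (sig e) seen = go seen es
          | otherwise               = e : go (Set.insert (sig e) seen) es

    -- Bottom-up search: return the first program matching all examples.
    bottomUp :: [(Int, Int)] -> Expr
    bottomUp examples = loop [Var, Num 0, Num 1]
      where
        ins  = map fst examples
        ok e = all (\(i, o) -> eval i e == o) examples
        loop pool = case filter ok pool of
          (e:_) -> e
          []    -> loop (pool ++ prune ins pool (grow pool))

    -- bottomUp [(1, 3), (2, 5)] finds a program equivalent to 2*x + 1.

Notice how the pruning step directly implements the idea of discarding programs that are redundant with ones we have already enumerated: two programs that agree on every example input are interchangeable as far as the specification is concerned.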
Symbolic Search

In explicit search, the synthesizer always maintains one or more partially constructed programs that it is currently considering. By contrast, in symbolic search techniques, the synthesizer maintains a symbolic representation of the space of all programs that are considered valid. Different symbolic representations lead to different search algorithms. Two of the most popular symbolic representations in use today are Version Space Algebras and Constraint Systems, which we will briefly touch on in Lecture 7.

As an analogy, suppose we want to search for an integer value of $n$ such that $4*n = 28$. An enumerative search would try all the values one by one until it got to $n=7$, and then it would declare success. By contrast, a symbolic search technique may perform some algebraic manipulation to deduce that $n=28/4=7$. In this case, symbolic search is clearly better, but even for arithmetic, symbolic manipulation is not always the best choice. Binary search, for example, can be considered a form of explicit search that is actually quite effective in finding solutions to equations that may be too complicated to manipulate algebraically.
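For instance, here is a small sketch (my own illustration) of binary search used as an explicit search for a solution to $f(n) = target$ when $f$ is monotonically increasing; it never manipulates $f$ algebraically, yet it halves the candidate space at every step:

    -- Explicit search over [lo, hi] for an n with f n == target.
    solve :: (Int -> Int) -> Int -> Int -> Int -> Maybe Int
    solve f target lo hi
      | lo > hi   = Nothing
      | otherwise = case compare (f mid) target of
          EQ -> Just mid
          LT -> solve f target (mid + 1) hi
          GT -> solve f target lo (mid - 1)
      where mid = (lo + hi) `div` 2

    -- solve (\n -> 4 * n) 28 0 1000  ==  Just 7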
A Brief Note on Symmetries

One important aspect of defining the space of programs which we have so far glossed over is the question of symmetries. In program synthesis, we say that a program space has a lot of symmetries if there are many different ways of representing the same program. For example, consider the following grammar:
$ \begin{array}{lcl}expr & := & var * N ~ | \\ ~&~&expr + expr \end{array} $
Now, if we wanted to generate the expression $w*5+ x*2 + y*3 + z*2$, the grammar above
allows us to generate it in many different ways.
$ \begin{array}{c}
(w*5+ x*2) + (y*3 + z*2) \\
w*5+ (x*2 + (y*3 + z*2)) \\
w*5+ ((x*2 + y*3) + z*2) \\
((w*5+ x*2) + y*3) + z*2 \\
\ldots
\end{array} $
So the grammar above is said to have a lot of symmetries. By contrast, we can define a program
space with the grammar below.
$ \begin{array}{lcl}expr & := & var * N ~ | \\ ~&~&(var * N) + expr \end{array} $
Now, only the second expression in the list above can be generated by this grammar. This grammar
in effect forces right associativity of arithmetic expressions, significantly reducing the symmetries
in the search space. There are still symmetries due to commutativity of addition, but we have
eliminated at least one source of them.
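An alternative to restricting the grammar itself is to keep the symmetric grammar but canonicalize programs before comparing or deduplicating them. The sketch below (my own illustration) rotates sums to the right, so all the equivalent parses listed above collapse to the single right-associated form:

    -- A toy AST for the symmetric grammar: terms var * N combined with +.
    data E = Term String Int   -- the product var * N
           | Add E E
      deriving (Eq, Show)

    -- Rotate left-leaning sums rightward: (a + b) + c  becomes  a + (b + c).
    canon :: E -> E
    canon (Add (Add a b) c) = canon (Add a (Add b c))
    canon (Add a b)         = Add a (canon b)
    canon t                 = t

    -- Every parse of w*5 + x*2 + y*3 + z*2 canonicalizes to
    --   Add (Term "w" 5) (Add (Term "x" 2) (Add (Term "y" 3) (Term "z" 2)))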
Does this matter? It depends on the search technique and on the representation of the search space we are using.
Constraint-based techniques and some enumerative techniques can be extremely sensitive to symmetries, and will benefit enormously from
a representation of the space that eliminates as many of them as possible. On the other hand, there are
some techniques that we will study that are mostly oblivious to symmetries.