Lecture 1: Introduction and definitions
The dream of automating software development has been present from
the early days of the computer age. Already back in 1945, as part of his vision
for the Automatic Computing Engine, Alan Turing argued that
Instruction tables will have to be made up by mathematicians with computing
experience and perhaps a certain puzzle-solving ability…
This process of constructing instruction tables should be very fascinating.
There need be no real danger of it ever becoming a drudge, for any processes
that are quite mechanical may be turned over to the machine itself.
Traditionally, the way automation was incorporated into software development was through
the use of compilers and high-level languages.
When the first FORTRAN compiler was developed, it was
touted as "The FORTRAN Automatic Coding System"; its
goal was nothing less than to allow the IBM 704 to
code problems for itself and produce as good programs
as human coders (but without the errors).
Compilation and synthesis are very closely related in terms of their goals; they both
aim to support the generation of software from a high-level description of its
behavior. In general, though, we expect a synthesizer to do more than translate
a program from one notation to another as traditional compilers do;
we expect it to discover how to carry out
the desired task. The line can be blurry since some aggressive optimizing
compilers can be argued to actually discover how to perform a computation that was
specified at a high level of abstraction (parallelizing compilers are a good example).
One distinguishing feature between a compiler and a synthesizer is the element of search.
In a compiler, an input description of the computation is transformed into a program
by applying transformation rules according to a pre-defined schedule. By contrast,
a synthesizer is generally understood to involve a search for the program that
satisfies the stated requirements. Again, the line is blurry, because several
modern research compilers aggressively search the space of transformations to find
optimal implementations in a process known as autotuning.
Another class of techniques that is closely associated with synthesis is
declarative programming, and in particular logic programming.
The dream of logic programming was that programmers would be able to express
the requirements of their computation in a logical form, and when given an input,
the runtime system would derive an output that satisfies the logical constraints through
a combination of search and deduction. So the goals are also closely related to program synthesis,
but there are some important distinctions. First, rather than trying to discover an algorithm
to solve a particular problem, logic programming systems rely on a generic algorithm to search
for a solution to every problem. This means that for many problems, they can be dramatically slower
than a specialized program to solve a particular task. Additionally, if the problem is under-specified,
the user may get a solution at runtime that is very far from the solution that was expected
for the program.
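The two distinctions above can be illustrated with a minimal Python sketch. This is not a real logic programming system: the function names (solve, is_integer_sqrt, underspecified) and the integer square root problem are assumptions chosen only for illustration. A generic search procedure is run on every input, compared against a specialized program for the same task, and then given an under-specified constraint.

```python
import math

def solve(constraint, n, candidates):
    """Generic runtime search: return the first candidate that satisfies
    the constraint for input n, or None if no candidate does."""
    for y in candidates:
        if constraint(n, y):
            return y
    return None

def is_integer_sqrt(n, y):
    """Full specification: y is the integer square root of n."""
    return y * y <= n < (y + 1) * (y + 1)

def underspecified(n, y):
    """Partial specification: only the lower half of the requirement."""
    return y * y <= n

n = 1_000_000
# The generic search tries candidates one by one on every call.
searched = solve(is_integer_sqrt, n, range(n + 1))
# A specialized program computes the same answer directly.
direct = math.isqrt(n)
# With the partial specification, the first satisfying candidate is
# accepted, even though it is far from what the user had in mind.
surprising = solve(underspecified, n, range(n + 1))
print(searched, direct, surprising)
```

Both of the first two calls produce the same answer, but the generic search pays for the enumeration on every input, whereas the specialized program does not. The third call shows how a partial specification can be satisfied by an answer the user never intended.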
Finally, Machine Learning corresponds to a third class of approaches that are closely related to program synthesis.
The canonical problem in Machine Learning is finding a function whose behavior closely matches a given dataset.
So machine learning problems can be thought of as a special case of program synthesis problems where the specification
comes in the form of data. The biggest distinction between program synthesis and Machine Learning is that in Machine Learning,
the space of functions that the algorithm considers is very tightly prescribed. For example, linear classifiers, decision
trees and neural networks are some examples of classes of functions that have been very well studied, and each of these
classes has its own specialized set of algorithms for deriving a function that matches a dataset. By contrast,
in program synthesis we are interested in general algorithms that can work with more general classes of programs,
with a particular interest in programs that support recursion or other forms of iteration. Traditionally, there
was a second important distinction in that program synthesis generally aspired to discovering programs
that precisely matched the specification, whereas in machine learning the notion of learning from noisy data is deeply
ingrained in all algorithms. This distinction is less relevant today since there is strong interest in the synthesis
community in algorithms that are robust to noise, or that behave well in the presence of incomplete specifications.
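As a small illustration of a tightly prescribed function space, the sketch below fixes the class of functions to straight lines and uses ordinary least squares, an algorithm specialized to that class, to fit a dataset. The function name fit_line and the sample data are assumptions made for this example.

```python
def fit_line(points):
    """Ordinary least squares for the fixed function class y = a*x + b.

    Returns (a, b). Assumes at least two points with distinct x values."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Closed-form solution: no search over program structure is needed,
    # because the shape of the function was decided in advance.
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical noisy observations of roughly y = 2x + 1.
noisy = [(0, 1.1), (1, 2.9), (2, 5.2), (3, 6.8)]
print(fit_line(noisy))
```

The algorithm tolerates noise in the data, but it can only ever return a line; it has no way to discover, say, a recursive function.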
So if program synthesis is not compilation, it is not logic programming,
and it is not machine learning, then what is program synthesis?
As mentioned before, different people in the community have different working definitions
of what they would describe as program synthesis, but I believe the definition below is one
that both captures most of what today we understand as program synthesis and also
excludes some of the aforementioned classes of approaches.
Program Synthesis corresponds to a class of techniques that are able to generate a program
from a collection of artifacts that establish semantic and syntactic requirements for the generated code.
There are two elements of this definition that are important. The first is an emphasis on the generation of a program;
we expect the synthesizer to produce code that solves our problem, as opposed to relying on extensive
search at runtime to find a solution for a particular input, as logic programming systems do.
The second is the emphasis on supporting specification
of both semantic and syntactic requirements. We expect synthesis algorithms to provide us with some control
over the space of programs that are going to be considered, not just their intended behavior.
It is important to emphasize that individual synthesis systems may not themselves provide this flexibility; in fact,
the biggest successes of synthesis so far have been in specialized domains where constraints on the space of programs
have been "baked in" to the synthesis system. Nevertheless, even if the flexibility is not exposed to the users,
the underlying synthesis algorithms do have significant flexibility in how the space of programs is defined, and
this is a big differentiator both with respect to compilation and with respect to machine learning.
In general, both of these requirements imply that our synthesis procedures will rely on some form of search,
although the success of synthesis will be largely driven by our ability to avoid having to exhaustively search
the exponentially large space of programs that arises for even relatively simple synthesis problems.
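The definition can be made concrete with a minimal sketch of a naive enumerative synthesizer in Python. Everything here is an assumption made for illustration: the syntactic requirement is a tiny grammar of arithmetic expressions over x, the semantic requirement is a set of input-output examples, and the names expressions, satisfies and synthesize are hypothetical. The search is exhaustive, which is exactly what practical synthesizers try to avoid.

```python
from itertools import product

# Syntactic requirement (assumed): expressions built from the variable x,
# the constants 1 and 2, and the binary operators + and *.
LEAVES = ["x", "1", "2"]
OPERATORS = ["+", "*"]

def expressions(depth):
    """Yield every expression in the grammar up to the given nesting depth."""
    if depth == 0:
        yield from LEAVES
        return
    yield from expressions(depth - 1)
    smaller = list(expressions(depth - 1))
    for op, left, right in product(OPERATORS, smaller, smaller):
        yield f"({left} {op} {right})"

def satisfies(expr, examples):
    """Semantic requirement: the expression agrees with every example."""
    return all(eval(expr, {"x": x}) == y for x, y in examples)

def synthesize(examples, max_depth=2):
    """Return the first expression consistent with the examples, or None."""
    for depth in range(max_depth + 1):
        for expr in expressions(depth):
            if satisfies(expr, examples):
                return expr
    return None

program = synthesize([(0, 1), (1, 3), (2, 5)])
print(program)                      # the generated program, as text
print(eval(program, {"x": 10}))     # the program runs on new inputs
```

Note that the output is a program that can be run on new inputs without any further search. Even in this toy grammar, the number of candidates grows very quickly with depth.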
Program Synthesis Today
These days, program synthesis is an active area of research with research papers being published every year in all
the major programming systems conferences (PLDI, POPL, OOPSLA), as well as in formal methods (CAV, TACAS)
and machine learning (NeurIPS, ICLR, ICML). One important branch of synthesis research is focused on pushing the
envelope in terms of our ability to automatically generate non-trivial algorithms from very high-level artifacts.
Today, we are able to synthesize interesting algorithms such as
Karatsuba big-integer multiplication [sketchthesis],
Strassen's matrix multiplication [Srivastava:2010], or the functional
Cartesian product algorithm of Barron and Strachey, which is considered the first functional pearl [Feser:2015].
And when it comes to the synthesis of programs from formal specifications, we are now able to synthesize provably
correct implementations of fairly complex algorithms; in a few years, we have been able to move
from things like sorting and list reversal to algorithms and data-structure manipulations such as
insertion into red-black trees or binary heaps [Polikarpova:2016].
Another area where synthesis is fairly advanced is in the synthesis of bit-vector manipulations.
The winner of the most recent synthesis competition was able to synthesize every bit-vector manipulation that
the organizers threw at it.
Another important branch of synthesis research focuses on the application of synthesis techniques to a broad class of problems.
For example, one area where program synthesis has been particularly successful
is "Data Wrangling", the manipulation of data, especially by people with no prior programming experience.
The first commercial application of program synthesis was FlashFill [gulwani:2011:flashfill],
a feature that was first incorporated
into Excel 2013. It allows users to manipulate data by providing a few examples; it automatically derives
a small program from the examples and applies that program to the rest of the data.
On the research side, there have been significant advances in the ability to synthesize fairly complex database queries
either from examples [Wang:2017]
or from natural language queries [Yaghmazadeh:2017].
This area has proven to be a good fit for our current synthesis capabilities because on the one hand, there
is a strong need to make data analysis and cleaning accessible to non-programmers, and on the other hand, the
programs in question are generally small and easy to describe through examples or other forms of natural interaction.
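The programming-by-example interaction described above can be sketched in a few lines of Python. This is not FlashFill's algorithm or its language of string programs; it is a toy under stated assumptions: programs are limited to "split on a delimiter, pick one piece, optionally change case", and the names DELIMITERS, INDICES, CASES, make_program and learn are hypothetical.

```python
from itertools import product

# Assumed toy space of programs: split on a delimiter, pick one piece,
# then optionally change its case.
DELIMITERS = [" ", ",", "@", "."]
INDICES = [0, 1, -1]
CASES = {"same": lambda s: s, "upper": str.upper, "lower": str.lower}

def make_program(delim, index, case):
    """Build a runnable string transformation from its three parameters."""
    def program(text):
        parts = text.split(delim)
        if not -len(parts) <= index < len(parts):
            return None  # the requested piece does not exist
        return CASES[case](parts[index])
    return program

def learn(examples):
    """Return (description, program) for the first program in the space
    that reproduces every example, or None if there is none."""
    for delim, index, case in product(DELIMITERS, INDICES, CASES):
        program = make_program(delim, index, case)
        if all(program(inp) == out for inp, out in examples):
            return (delim, index, case), program
    return None

# Two examples from the user, then the learned program fills in the rest.
examples = [("Ada Lovelace", "LOVELACE"), ("Alan Turing", "TURING")]
description, program = learn(examples)
print(description)
print([program(name) for name in ["Grace Hopper", "Edsger Dijkstra"]])
```

The sketch works only because the space of programs is tiny and the examples are informative, which mirrors the observation that data-wrangling programs tend to be small and easy to describe through examples.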
Another area that has seen major interest is the reverse engineering of code.
Traditionally, we think of synthesis as starting with a specification and generating
an implementation from that. But in this case, that paradigm is flipped on its head: starting
from an implementation, the goal is to infer a specification that characterizes the behavior of the given
implementation. The idea was first proposed by Susmit Jha, Sumit Gulwani, Sanjit Seshia and Ashish Tiwari [Jha:2010].
It has most recently been popularized by Alvin Cheung in an approach known as Verified Lifting,
where the goal
is to discover a high-level representation that is provably equivalent to an implementation and that can be used
to generate a more efficient version of the code. The idea was first applied to the problem of generating SQL queries
that are equivalent to a piece of imperative code, but has most recently been applied to a variety of problems ranging
from modernizing legacy HPC applications [Kamil:2016]
to optimizing MapReduce programs [Ahmad:2018].
Another recent example of the use of synthesis for reverse engineering involves the creation of models of complex code for
the purpose of program analysis. For example, a recent paper by Jinseong Jeon et al. [Jeon:2016] showed that
it was possible to use synthesis to create models of complex reactive frameworks such as Android and Java Swing by recording
traces of the interaction between the framework and a test application and then forcing the synthesizer to produce a model
that conforms to that trace and that follows known design patterns. A similar idea was used by Heule, Sridharan and Chandra
One particularly interesting research thrust is the application of synthesis techniques for problems that
seemingly have nothing to do with automatic programming; there is a growing realization that program synthesis techniques can
actually be applied in a number of domains that have traditionally been thought of as AI. For example,
back in 2013, we demonstrated that it was possible to apply program synthesis to the problem of providing feedback
for programming assignments [Singh:2013]; similar ideas have been applied to other
forms of automated tutoring, such as teaching automata theory [D'antoni:2015].
More broadly, one of the more exciting research directions around program synthesis is the interaction between
synthesis and machine learning. On the one hand, there has been a lot of recent interest in applying machine
learning techniques to synthesis problems, for example to be able to learn how to use complex APIs [MuraliCJ17a].
But there is also significant interest in applying ideas from program synthesis to machine learning problems,
for example, in order to allow you to learn with less data or to generate interpretable models.
For example, in recent work, we showed that it was possible to learn language morphology rules from
small numbers of examples, or to learn visual concepts much more efficiently than with traditional machine learning [EllisST15].
In general, there are three major challenges one has to address when working with program synthesis.
In a recent paper, we refer to these as the Three Pillars of machine programming [GottschlichSTCR18].
The first challenge is what we have termed the Intention challenge: how do the users tell you their goals?
The definition of synthesis talks about semantic and syntactic constraints, but the exact form of these
will influence all subsequent decisions about the synthesis system. The success of the
FlashFill system [gulwani:2011:flashfill] has popularized the use of input-output examples
as a means of specification, but input-output examples are not suitable for every task.
In our own work on storyboard programming, we advocated for an approach to multi-modal synthesis,
where concrete examples were combined with abstract examples and logical specifications, so that
together they provided enough information about the intended behavior to produce a working
implementation of a data-structure manipulation [SinghS12].
One big aspect of the intention challenge is how to cope with under-specification. If there are multiple
programs that satisfy the requirements, how can we tell which one the user actually wants?
Of course, one solution is to simply ignore this problem: if the user provides a partial specification,
they have no right to complain if they get a different program from the one they wanted. In practice, though,
making a good choice can make the difference between a system that is useful, and one that is not.
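A minimal Python sketch of the under-specification problem follows. The candidate programs, the function names consistent and rank, and the ranking heuristic are all assumptions for illustration; real systems use far richer program spaces and ranking schemes.

```python
# Hypothetical candidate space; a real synthesizer would generate it.
CANDIDATES = {
    "4": lambda x: 4,
    "x + 2": lambda x: x + 2,
    "x + x": lambda x: x + x,
    "x * x": lambda x: x * x,
    "x + 1": lambda x: x + 1,
}

def consistent(examples):
    """Names of all candidates that agree with every input-output example."""
    return [name for name, program in CANDIDATES.items()
            if all(program(x) == y for x, y in examples)]

def rank(names):
    """Assumed heuristic: prefer programs that use the input, then shorter
    ones. Ties keep their original order."""
    return sorted(names, key=lambda name: ("x" not in name, len(name)))

# A single example leaves several candidates standing.
ambiguous = consistent([(2, 4)])
print(ambiguous)
# The surviving candidates disagree on an input the user did not mention.
print({name: CANDIDATES[name](3) for name in ambiguous})
# A ranking heuristic can demote implausible answers such as the constant.
print(rank(ambiguous))
# One more example removes the ambiguity in this toy space.
print(consistent([(2, 4), (3, 9)]))
```

The heuristic demotes the constant program, but it cannot tell the remaining candidates apart. That leaves the system to choose between asking the user for more information and guessing.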
The second challenge, once we know what the user wants, is to actually discover a piece of code that will
satisfy those requirements. Arguably this is the central challenge of synthesis, as it potentially involves
inventing new algorithmic solutions to a problem. One of the key topics we will be dealing with in this
course is the set of techniques that the community has developed to tackle the inherent complexity of this task.
The canonical view of synthesis is that the user is creating a brand new algorithm from scratch, and wants to leverage
a synthesizer to create a correct implementation of the desired algorithm. However, most software development involves
working in the context of existing software systems, fixing bugs, optimizing code, and performing other kinds of maintenance
tasks. This third pillar deals with the question of synthesis in a broader context, and the application of synthesis ideas
to broader software development tasks beyond green-field software creation.
As we will see later, there are a number of compelling applications of program synthesis in support of the broader
software development process.