Introduction to Program Synthesis

© Armando Solar-Lezama. 2018. All rights reserved.

Lecture 1: Introduction and definitions

The dream of automating software development has been present from the early days of the computer age. Already back in 1945, as part of his vision for the Automatic Computing Engine, Alan Turing argued that

Instruction tables will have to be made up by mathematicians with computing experience and perhaps a certain puzzle-solving ability… This process of constructing instruction tables should be very fascinating. There need be no real danger of it ever becoming a drudge, for any processes that are quite mechanical may be turned over to the machine itself. copeland2012alan

Traditionally, the way automation was incorporated into software development was through the use of compilers and high-level languages. When the first FORTRAN compiler was developed, it was touted as "The FORTRAN Automatic Coding System"; its goal was nothing less than to allow the IBM 704 to code problems for itself and produce programs as good as those of human coders (but without the errors) Backus:1957.

Compilation and synthesis are very closely related in terms of their goals: both aim to support the generation of software from a high-level description of its behavior. In general, though, we expect a synthesizer to do more than translate a program from one notation to another as traditional compilers do; we expect it to discover how to perform the desired task. The line can be blurry, since some aggressive optimizing compilers can be argued to actually discover how to perform a computation that was specified at a high level of abstraction (parallelizing compilers are a good example). One distinguishing feature between a compiler and a synthesizer is the element of search. In a compiler, an input description of the computation is transformed into a program by applying transformation rules according to a pre-defined schedule. By contrast, a synthesizer is generally understood to involve a search for the program that satisfies the stated requirements. Again, the line is blurry, because several modern research compilers aggressively search the space of transformations to find optimal implementations in a process known as autotuning.

Another class of techniques that is closely associated with synthesis is declarative programming, and in particular logic programming. The dream of logic programming was that programmers would be able to express the requirements of their computation in a logical form, and when given an input, the runtime system would derive an output that satisfies the logical constraints through a combination of search and deduction. So the goals are also closely related to program synthesis, but there are some important distinctions. First, rather than trying to discover an algorithm to solve a particular problem, logic programming systems rely on a generic algorithm to search for a solution to every problem. This means that for many problems, they can be dramatically slower than a program specialized to the task at hand. Additionally, if the problem is under-specified, the user may get a solution at runtime that is very far from the solution that was expected for the program.
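To make the first distinction concrete, here is a minimal sketch in Python. It is not how a real logic programming runtime works; it only contrasts a generic generate-and-test search that runs on every input with a specialized algorithm for the same requirement. The requirement chosen here (the output is a sorted permutation of the input) and the function names are assumptions made for illustration.

```python
from itertools import permutations


def solve_by_search(xs):
    """Generic runtime search: try candidate outputs until one satisfies
    the declarative requirement (a non-decreasing permutation of xs)."""
    for candidate in permutations(xs):
        if all(a <= b for a, b in zip(candidate, candidate[1:])):
            return list(candidate)
    return None  # unreachable: some permutation is always sorted


def solve_specialized(xs):
    """A specialized algorithm for the same requirement."""
    return sorted(xs)


data = [3, 1, 2]
print(solve_by_search(data))
print(solve_specialized(data))
```

Both functions meet the same requirement, but the search may examine a number of candidates that grows factorially with the input length, and it pays that cost again on every input. A synthesizer aims to pay the search cost once, when the program is generated.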

Finally, Machine Learning corresponds to a third class of approaches that are closely related to program synthesis. The canonical problem in Machine Learning is finding a function whose behavior closely matches a given dataset. So machine learning problems can be thought of as a special case of program synthesis problems where the specification comes in the form of data. The biggest distinction between program synthesis and Machine Learning is that in Machine Learning, the space of functions that the algorithm considers is very tightly prescribed. For example, linear classifiers, decision trees and neural networks are classes of functions that have been very well studied, and each of these classes has its own specialized set of algorithms for deriving a function that matches a dataset. By contrast, in program synthesis we are interested in general algorithms that can work with more general classes of programs, with a particular interest in programs that support recursion or other forms of iteration. Traditionally, there was a second important distinction in that program synthesis generally aspired to discovering programs that precisely matched the specification, whereas in machine learning the notion of learning from noisy data is deeply ingrained in all algorithms. This distinction is less relevant today, since there is strong interest in the synthesis community in algorithms that are robust to noise, or that behave well in the presence of incomplete specifications.

Program Synthesis

So if program synthesis is not compilation, it is not logic programming, and it is not machine learning, then what is program synthesis? As mentioned before, different people in the community have different working definitions of what they would describe as program synthesis, but I believe the definition below is one that both captures most of what today we understand as program synthesis and also excludes some of the aforementioned classes of approaches.
Program Synthesis corresponds to a class of techniques that are able to generate a program from a collection of artifacts that establish semantic and syntactic requirements for the generated code.
There are two elements of this definition that are important. The first is an emphasis on the generation of a program; we expect the synthesizer to produce code that solves our problem, as opposed to relying on extensive search at runtime to find a solution for a particular input, as logic programming systems do. The second is the emphasis on supporting specification of both semantic and syntactic requirements. We expect synthesis algorithms to provide us with some control over the space of programs that are going to be considered, not just their intended behavior. It is important to emphasize that individual synthesis systems may not themselves provide this flexibility; in fact, the biggest successes of synthesis so far have been in specialized domains where constraints on the space of programs have been "baked in" to the synthesis system. Nevertheless, even if the flexibility is not exposed to the users, the underlying synthesis algorithms do have significant flexibility in how the space of programs is defined, and this is a big differentiator both with respect to compilation and with respect to machine learning. In general, both of these requirements imply that our synthesis procedures will rely on some form of search, although the success of synthesis will be largely driven by our ability to avoid exhaustively searching the exponentially large space of programs that arises for even relatively simple synthesis problems.
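As a rough illustration of this definition, the sketch below pairs a syntactic requirement (a tiny expression grammar) with a semantic requirement (input-output examples) and finds a program by naive enumeration. The grammar, the tuple encoding of expressions, and the function names are all assumptions made for this example; it is a toy, not a description of any particular synthesis system.

```python
from itertools import product

# Syntactic requirement (hypothetical toy grammar over one integer input x):
#   E ::= x | 1 | 2 | E + E | E * E
LEAVES = ["x", "1", "2"]
BINARY_OPS = ["+", "*"]


def evaluate(expr, x):
    """Evaluate an expression tree on the input value x."""
    if expr == "x":
        return x
    if isinstance(expr, str):
        return int(expr)
    op, left, right = expr
    a, b = evaluate(left, x), evaluate(right, x)
    return a + b if op == "+" else a * b


def expressions(depth):
    """Yield every expression of at most the given depth (with repeats)."""
    if depth == 0:
        yield from LEAVES
        return
    smaller = list(expressions(depth - 1))
    yield from smaller
    for op, left, right in product(BINARY_OPS, smaller, smaller):
        yield (op, left, right)


def synthesize(examples, max_depth=2):
    """Return the first expression consistent with all (input, output) pairs."""
    for depth in range(max_depth + 1):
        for expr in expressions(depth):
            if all(evaluate(expr, x) == y for x, y in examples):
                return expr
    return None


# Semantic requirement: examples of the function f(x) = 2*x + 1.
program = synthesize([(0, 1), (1, 3), (2, 5)])
print(program)
```

Even in this toy, the number of candidate expressions grows from 3 at depth 0 to 21 at depth 1 and 903 at depth 2, which is why practical synthesizers work hard to avoid exhaustive enumeration.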

Program Synthesis Today

Lecture1:Slide14;Lecture1:Slide15 These days, program synthesis is an active area of research, with papers being published every year in all the major programming systems conferences (PLDI, POPL, OOPSLA), as well as in formal methods (CAV, TACAS) and machine learning (NeurIPS, ICLR, ICML). One important branch of synthesis research is focused on pushing the envelope in terms of our ability to automatically generate non-trivial algorithms from very high-level artifacts. Today, we are able to synthesize interesting algorithms such as Karatsuba big-integer multiplication sketchthesis, Strassen's matrix multiplication Srivastava:2010, or the functional cartesian product algorithm of Barron and Strachey, which is considered the first functional pearl Feser:2015. And when it comes to the synthesis of programs from formal specifications, we are now able to synthesize provably correct implementations of fairly complex algorithms; in a few years, we have been able to move from things like sorting and list reversal to algorithms and data-structure manipulations such as insertion into red-black trees or binary heaps Polikarpova:2016. Another area where synthesis is fairly advanced is in the synthesis of bit-vector manipulations. The winner of the most recent synthesis competition was able to synthesize every bit-vector manipulation that the organizers threw at it.

Another important branch of synthesis research focuses on the application of synthesis techniques to a broad class of problems. For example, one area where program synthesis has been particularly successful is "data wrangling": the manipulation of data, especially by people with no prior programming experience. The first commercial application of program synthesis was FlashFill gulwani:2011:flashfill, a feature first incorporated into Excel 2013 that lets users perform data manipulation by providing a few examples; it automatically derives a small program from the examples and applies that program to the rest of the data. On the research side, there have been significant advances in the ability to synthesize fairly complex database queries either from examples Wang:2017, or from natural language queries Yaghmazadeh:2017. This area has proven to be a good fit for our current synthesis capabilities because on the one hand, there is a strong need to make data analysis and cleaning accessible to non-programmers, and on the other hand, the programs in question are generally small and easy to describe through examples or other forms of natural interaction.
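The flavor of deriving a small program from a few examples can be sketched in a few lines of Python. This is not FlashFill's actual language or algorithm; the program space below (split on a delimiter, pick a field, optionally change case) and all names are hypothetical, chosen only to keep the example short.

```python
from itertools import product

# Hypothetical program space: (delimiter, field index, case transformation).
DELIMITERS = [" ", ",", "@", "-", "."]
INDICES = [0, 1, 2, -1]
CASES = {"same": str, "upper": str.upper, "lower": str.lower}


def run(program, text):
    """Apply a (delimiter, index, case) program to an input string."""
    delimiter, index, case = program
    fields = text.split(delimiter)
    if not -len(fields) <= index < len(fields):
        return None  # the program does not apply to this input
    return CASES[case](fields[index])


def learn(examples):
    """Return the first program consistent with every (input, output) pair."""
    for program in product(DELIMITERS, INDICES, CASES):
        if all(run(program, inp) == out for inp, out in examples):
            return program
    return None


# Two examples provided by the user.
examples = [("Ada Lovelace", "LOVELACE"), ("Alan Turing", "TURING")]
program = learn(examples)

# Apply the learned program to the rest of the data.
for row in ["Grace Hopper", "Edsger Dijkstra"]:
    print(run(program, row))
```

The program space here is tiny, which is what makes two examples enough; the point is only that the user never writes the transformation, just instances of its behavior.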

Lecture1:Slide27; Lecture1:Slide28; Lecture1:Slide29; Lecture1:Slide33; Another area that has seen major interest is the reverse engineering of code. Traditionally, we think of synthesis as starting with a specification and generating an implementation from that. But in this case, that paradigm is flipped on its head: starting from an implementation, the goal is to infer a specification that characterizes the behavior of the given implementation. The idea was first proposed by Susmit Jha, Sumit Gulwani, Sanjit Seshia and Ashish Tiwari Jha:2010. It has more recently been popularized by Alvin Cheung in an approach known as Verified Lifting, where the goal is to discover a high-level representation that is provably equivalent to an implementation and that can be used to generate a more efficient version of the code. The idea was first applied to the problem of generating SQL queries that are equivalent to a piece of imperative code, but has since been applied to a variety of problems ranging from modernizing legacy HPC applications Kamil:2016, to optimizing MapReduce programs Ahmad:2018.

Another recent example of the use of synthesis for reverse engineering involves the creation of models of complex code for the purpose of program analysis. For example, a recent paper by Jinseong Jeon et al. Jeon:2016 showed that it was possible to use synthesis to create models of complex reactive frameworks such as Android and Java Swing by recording traces of the interaction between the framework and a test application and then forcing the synthesizer to produce a model that conforms to those traces and that follows known design patterns. A similar idea was used by Heule, Sridharan and Chandra to synthesize models of array manipulation routines in JavaScript Heule:2015.

Lecture1:Slide41; Lecture1:Slide43; Lecture1:Slide44 One particularly interesting research thrust is the application of synthesis techniques to problems that seemingly have nothing to do with automatic programming; there is a growing realization that program synthesis techniques can actually be applied in a number of domains that have traditionally been thought of as AI. For example, back in 2013, we demonstrated that it was possible to apply program synthesis to the problem of providing feedback for programming assignments Singh:2013; similar ideas have been applied to other forms of automated tutoring, from teaching automata theory D'antoni:2015 to teaching deduction AhmedGK13.

More broadly, one of the more exciting research directions around program synthesis is the interaction between synthesis and machine learning. On the one hand, there has been a lot of recent interest in applying machine learning techniques to synthesis problems, for example to learn how to use complex APIs MuraliCJ17a. But there is also significant interest in applying ideas from program synthesis to machine learning problems, for example to learn with less data or to generate interpretable models. In recent work, we showed that it was possible to learn language morphology rules from small numbers of examples, or to learn visual concepts much more efficiently than with traditional machine learning EllisST15.


In general, there are three major challenges one has to address when working with program synthesis. In a recent paper, we refer to these as the Three Pillars of machine programming GottschlichSTCR18.

Intention. The first challenge is what we have termed the Intention challenge: how do the users tell you their goals? The definition of synthesis talks about semantic and syntactic constraints, but the exact form of these will influence all subsequent decisions about the synthesis system. The success of the FlashFill gulwani:2011:flashfill system has popularized the use of input-output examples as a means of specification, but input-output examples are not suitable for every task. In our own work on storyboard programming, we advocated for an approach to multi-modal synthesis, where concrete examples were combined with abstract examples and logical specifications, so that together they provided enough information about the intended behavior to produce a working implementation of a data-structure manipulation SinghS12.

One big aspect of the intention challenge is how to cope with under-specification. If there are multiple programs that satisfy the requirements, how can we tell which one the user actually wants? Of course, one solution is to simply ignore this problem: if the user provides a partial specification, they have no right to complain if they get a different program from the one they wanted. In practice, though, making a good choice can make the difference between a system that is useful and one that is not.
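A small Python sketch shows how quickly under-specification arises with input-output examples. The candidate programs below are a hypothetical, hand-picked set, used only to show that a single example can be consistent with several different programs and that an additional example can separate them.

```python
# Hypothetical candidate programs over one integer input x.
CANDIDATES = [
    ("4", lambda x: 4),
    ("x + 2", lambda x: x + 2),
    ("x + x", lambda x: x + x),
    ("x * x", lambda x: x * x),
    ("x + 1", lambda x: x + 1),
]


def consistent(examples):
    """Names of all candidates that agree with every (input, output) pair."""
    return [name for name, f in CANDIDATES
            if all(f(x) == y for x, y in examples)]


print(consistent([(2, 4)]))          # ['4', 'x + 2', 'x + x', 'x * x']
print(consistent([(2, 4), (3, 9)]))  # ['x * x']
```

With only the example (2, 4), four of the five candidates are equally valid, and a system that returns the first one it finds would return the constant 4. Asking for more examples is one way out; ranking the consistent candidates by some measure of plausibility is another.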

Invention. The second challenge, once we know what the user wants, is to actually discover a piece of code that will satisfy those requirements. Arguably this is the central challenge of synthesis, as it potentially involves inventing new algorithmic solutions to a problem. One of the key topics we will be dealing with in this course is the set of techniques that the community has developed to tackle the inherent complexity of this task.

Adaptation. The canonical view of synthesis is that the user is creating a brand new algorithm from scratch, and wants to leverage a synthesizer to create a correct implementation of the desired algorithm. However, most software development involves working in the context of existing software systems, fixing bugs, optimizing code, and performing other kinds of maintenance tasks. This pillar deals with the question of synthesis in a broader context, and the application of synthesis ideas to broader software development tasks beyond green-field software creation. As we will see later, there are a number of compelling applications of program synthesis in support of the broader software development process.