Lecture 22: Learning Distributions and Neural Guided Synthesis

The strategies discussed in the previous lecture work well when synthesizing under relatively simple distributions. For more complex distributions, however, more sophisticated machinery becomes necessary. In this lecture, we explore several techniques that have been borrowed from the world of natural language processing (NLP) and which have proven to be successful in the context of program synthesis.

Distributions over programs, lessons from NLP

Many of the techniques used in program synthesis to represent and learn complex distributions over programs come from the world of natural language processing. To understand the key ideas behind these models, consider the simple problem of representing a distribution over sequences of tokens. In the world of NLP, such a sequence will usually correspond to a sentence, but in the world of program synthesis, it will correspond to a program.

[Slides: Lecture 22, slides 2-4]

n-gram models. One of the simplest models for representing such a distribution over sequences is the n-gram model, where the probability of each token is assumed to depend only on a small window of preceding tokens. If a sentence is represented as a sequence of words $[w_0, w_1, \ldots, w_k]$, then an n-gram model assumes that, for every word, the probability of that word given all the preceding words depends only on the previous n words.

$$ P(w_i | w_{i-1}, w_{i-2}, \ldots, w_0) = P(w_i | w_{i-1}, w_{i-2}, \ldots, w_{i-n}) $$

For small n, n-gram models are easy to learn, but they have limited expressive power. For example, consider the sentence

The big brown bear scares the children with its roar

In the sentence above, the word 'bear' is strongly determined by the previous two words, but in order to predict the word 'roar' with any accuracy, you would have to look back at least six words to know that you are talking about a bear.
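The limitation can be made concrete with a minimal sketch of a count-based model that conditions on the previous n tokens. The function names `train_ngram` and `prob`, the `<s>` padding token, and the two-sentence corpus are assumptions made for illustration; there is no smoothing, so unseen contexts simply get probability zero.

```python
from collections import Counter, defaultdict

def train_ngram(sentences, n):
    """Count how often each token follows each window of n preceding tokens."""
    counts = defaultdict(Counter)
    for tokens in sentences:
        padded = ["<s>"] * n + tokens          # pad so early tokens have a full window
        for i in range(n, len(padded)):
            context = tuple(padded[i - n:i])
            counts[context][padded[i]] += 1
    return counts

def prob(counts, context, word):
    """Relative-frequency estimate of P(word | context); 0.0 for an unseen context."""
    seen = counts.get(tuple(context))
    if not seen:
        return 0.0
    return seen[word] / sum(seen.values())

# Hypothetical toy corpus: the two sentences differ only in the animal and its sound.
corpus = [
    "the big brown bear scares the children with its roar".split(),
    "the big brown dog scares the children with its bark".split(),
]
model = train_ngram(corpus, n=2)
print(prob(model, ["big", "brown"], "bear"))   # 0.5
print(prob(model, ["with", "its"], "roar"))    # 0.5
```

With a window of two words, the model sees the same context "with its" in both sentences, so it cannot use the earlier "bear" to prefer "roar" over "bark".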

Recurrent models. One way to get around this problem is to have a formalism that can learn its own context, instead of relying on a short window of previous tokens.

[Slides: Lecture 22, slides 4-8]

$$ P(word_i | context_{i-1}) \\ context_i = f(word_i, context_{i-1}) $$

In this case, the idea is that the context at every step is a function of the current word and the context up to that point, and the distribution over words at every step depends exclusively on this computed context.

A popular way of representing such models is by using a neural network. At its most basic level, a neural network is just a parametric function built by composing matrix multiplications with non-linear operations. Common choices for the non-linearity are the function $\max(x, 0)$, known as a rectified linear unit or ReLU, and S-shaped functions such as $\tanh$ or the sigmoid.

The simplest way to instantiate the above model with neural networks is the recurrent neural network. In this model, a vector is used to represent both the context and the current word. This vector is passed through a neural network to compute a new vector representing the new context, and a distribution over words.
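As a rough sketch of one step of such a recurrent network, the code below concatenates a one-hot word vector with the previous context, applies one matrix multiplication followed by tanh to get the new context, and applies a second matrix followed by a softmax to get a distribution over words. The parameter names `W` and `U`, the sizes, and the random (untrained) weights are all assumptions for illustration; a real model would learn these weights from data.

```python
import math
import random

def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def softmax(xs):
    """Turn arbitrary scores into a probability distribution."""
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def rnn_step(params, context, word_vec):
    """One step: new context from (word, old context), then a distribution over words."""
    joined = word_vec + context                      # single vector holding word and context
    new_context = [math.tanh(x) for x in matvec(params["W"], joined)]
    next_word_dist = softmax(matvec(params["U"], new_context))
    return new_context, next_word_dist

VOCAB, HIDDEN = 4, 3                                 # hypothetical sizes
rng = random.Random(0)

def rand_matrix(rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

# Untrained, randomly initialized parameters (illustration only).
params = {"W": rand_matrix(HIDDEN, VOCAB + HIDDEN), "U": rand_matrix(VOCAB, HIDDEN)}

context = [0.0] * HIDDEN
for token_id in [2, 0, 3]:                           # a made-up token sequence
    one_hot = [1.0 if j == token_id else 0.0 for j in range(VOCAB)]
    context, dist = rnn_step(params, context, one_hot)
print(dist)                                          # distribution over the next token
```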

The simple recurrent neural network has been superseded by more sophisticated models that are easier to train. One especially popular model is the Long Short-Term Memory (LSTM) network, illustrated in the figure. The basic principle is the same; the difference is in the way the context is propagated. Instead of simply pushing the context through the full network, the context is modified incrementally: it is first scaled down by the output of a forget gate and then additively updated by the output of an input gate. As a result, gradients can propagate better past multiple stages of the computation, allowing the networks to be trained with longer sequences of text.
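For reference, the context update in the commonly used LSTM formulation can be written as follows. The symbols are not taken from the lecture's figure and are introduced here as assumptions: $x_t$ is the current input, $h_{t-1}$ the previous output, $c_t$ the propagated context (cell state), $\sigma$ the sigmoid, and $\odot$ elementwise multiplication. The output gate that computes $h_t$ from $c_t$ is omitted.

$$ f_t = \sigma(W_f [h_{t-1}; x_t] + b_f) \\ i_t = \sigma(W_i [h_{t-1}; x_t] + b_i) \\ \tilde{c}_t = \tanh(W_c [h_{t-1}; x_t] + b_c) \\ c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t $$

Because every entry of $f_t$ lies between 0 and 1, the first term scales the old context down, and the second term is the additive update. The direct path from $c_{t-1}$ to $c_t$ is only an elementwise multiplication by $f_t$, with no matrix multiplication or squashing non-linearity in between, which is the usual explanation for why gradients survive across many steps.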

Trees vs strings.

Both n-grams and the different recurrent models we have discussed so far are defined on sequences of tokens, which is a convenient representation for natural language text. In the case of programs, however, there is some debate in the community about the relative merits of string vs. tree representations.

On the one hand, tree representations are a more natural way of representing programs, and all the sequence-based models outlined above have analogs that operate on trees. In fact, the probabilistic grammars that we saw in the previous lecture are essentially tree generalizations of the n-gram model. One big advantage of tree-based models is that every program that is sampled is guaranteed to conform to the grammar, unlike a sequence-based model, which may produce programs that do not even parse.

On the other hand, sequence-based models tend to be simpler and more efficient to train, so they are quite common in this literature.

Conditional distributions

[Slides: Lecture 22, slides 11-12]

One of the strengths of using neural networks for sampling distributions is that it is easy to generalize to conditional distributions, where the distribution depends on some evidence or observation. The evidence can be a set of input/output examples, or it can be something less precise, such as natural-language text.

$$ P(program | evidence) $$

One approach to encoding these distributions is to use an encoder-decoder model, where a recurrent network is used to encode the evidence into a vector, and then the vector is used as the initial context for a decoder network. The basic encoder-decoder model shown in the figure has major limitations that derive from the fact that all the evidence about the entire sentence must be encoded into a single vector. A better approach is to use attention.

Attention addresses a fundamental question in neural networks of how to encode an unbounded amount of information. In a traditional RNN, all the information about each of the tokens in the evidence is collected into a single vector that is then decoded into the output. The key insight behind attention is that there is some degree of locality, where different tokens of the output depend primarily on small subsets of tokens from the input, so instead of encoding the entire input into a single fixed vector, the attention mechanism allows each output token to pay attention to a different subset of input tokens. The details of this mechanism are illustrated in the figure.

Notice how instead of aggregating the evidence into a single vector, this setup computes a weight for each intermediate vector from the input and then computes a weighted sum for each token in the output. This means that each step in the output will get a different weighted sum corresponding to the input tokens that are more or less relevant for computing that output.
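The weighted sum described above can be sketched in a few lines. This sketch assumes dot-product scoring between a decoder-side query vector and each encoder vector, followed by a softmax; the figure in the lecture may use a different scoring function, and the name `attend` and the toy vectors are made up for illustration.

```python
import math

def softmax(xs):
    """Turn arbitrary scores into weights that sum to one."""
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, encoder_states):
    """Return (weights, weighted sum of encoder_states) for one output step."""
    # One relevance score per input token (dot product is an assumed choice).
    scores = [sum(q * h for q, h in zip(query, state)) for state in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    summary = [
        sum(w * state[d] for w, state in zip(weights, encoder_states))
        for d in range(dim)
    ]
    return weights, summary

# Hypothetical intermediate vectors for three input tokens.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Two different output steps issue different queries and get different summaries.
weights_a, summary_a = attend([4.0, 0.0], encoder_states)
weights_b, summary_b = attend([0.0, 4.0], encoder_states)
print(weights_a, summary_a)
print(weights_b, summary_b)
```

The first query puts almost no weight on the second input token, while the second query puts almost none on the first, so each output step sees its own view of the input.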

One of the first systems to use attention for program synthesis was RobustFill. This paper proposed and compared several different architectures for encoding inductive synthesis problems for the domain of text editing. Specifically, the goal of the paper was to produce a neural-guided synthesis version of FlashFill. One of its important observations was that an instance of a synthesis problem takes as input a set of input/output pairs, each of which involves two strings (the input and the output) of unbounded length, so there are design decisions involved in how to handle that. The most successful approach in that paper was to use attention within an individual input/output pair, and then aggregate over the distributions proposed from each of the examples.

[Slides: Lecture 22, slides 14-15]

This choice of using attention within an example and pooling across examples works well in this domain because there is actually a fair bit of locality between the programs and the input/output examples. For example, consider the following example and program.

in: "Armando Solar-Lezama"

out: "A. Solar-Lezama"

Program: Concat(SubString(in, Pos("", Word), Pos(Char, "")), ". ", SubString(in, Pos(" ", Word), Pos("", End)));

The program consists of three parts: an expression that extracts the first initial, concatenated with a constant, and then with an expression that extracts everything after the first space. Notice, however, that each of these parts can be inferred independently by focusing on particular regions of the input and output strings. By contrast, if you were trying to synthesize complicated arithmetic expressions, for example, there would not be so much locality to exploit, so the benefits of using attention would be significantly reduced.
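To make the three parts concrete, here is a rough Python analog of the program above. It follows the prose description of each part rather than the exact semantics of `SubString` and `Pos`, which these notes do not spell out, so the helper names and their behavior are assumptions.

```python
def first_initial(s):
    # Analog of the first SubString: the first character of the first word.
    return s[0]

def after_first_space(s):
    # Analog of the second SubString: everything after the first space.
    return s[s.index(" ") + 1:]

def abbreviate(s):
    # Analog of Concat: initial, then the constant ". ", then the rest of the name.
    return first_initial(s) + ". " + after_first_space(s)

print(abbreviate("Armando Solar-Lezama"))   # A. Solar-Lezama
```

Each helper only looks at one region of the input (the first character, or the text after the first space), and each accounts for one region of the output, which is the locality that attention can exploit.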

Searching with a learned distribution

One of the ways in which program synthesis differs from NLP applications is in the presence of hard constraints: at the end of the day, we want to ensure that the program we get actually satisfies the specification, whether that specification is given by a set of input/output examples or by a logic formula describing pre- and postconditions. What this means is that, in general, it is not going to be enough to simply search for the most likely program in the learned distribution; instead, we will want to use the learned distribution to guide a search for a program that actually satisfies the specification.

There are several different mechanisms that have been proposed for this in the literature. A popular one that works well with the representations described so far is to conduct a beam search. The basic idea behind beam search is that at every stage of the search process, you keep a "beam" of the $k$ most likely strings considered so far. For each of those $k$ strings, you consider their possible expansions and keep the $k$ most likely ones, as illustrated in the figure.
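The following is a minimal sketch of beam search combined with a check against input/output examples. The hand-written `toy_model` stands in for a learned distribution, and the two-operation language (`inc`, `dbl`), the `<eos>` token, and the examples are all hypothetical; in a real system each call to the model would be a neural network evaluation.

```python
import math

def beam_search(next_token_probs, k, max_len, eos="<eos>"):
    """Keep the k most likely prefixes at each step; return finished programs, best first."""
    beam = [(0.0, [])]                       # (log-probability, tokens so far)
    finished = []
    for _ in range(max_len):
        candidates = []
        for logp, prefix in beam:
            for token, p in next_token_probs(prefix).items():
                if p > 0.0:
                    candidates.append((logp + math.log(p), prefix + [token]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = []
        for logp, seq in candidates[:k]:     # everything past the top k is discarded
            if seq[-1] == eos:
                finished.append((logp, seq[:-1]))
            else:
                beam.append((logp, seq))
        if not beam:
            break
    finished.sort(key=lambda c: c[0], reverse=True)
    return finished

def toy_model(prefix):
    """Hypothetical stand-in for a learned next-token distribution."""
    if not prefix:
        return {"inc": 0.6, "dbl": 0.4}
    if len(prefix) >= 3:
        return {"<eos>": 1.0}
    if prefix[-1] == "inc":
        return {"dbl": 0.5, "inc": 0.2, "<eos>": 0.3}
    return {"inc": 0.2, "dbl": 0.1, "<eos>": 0.7}

def run(program, x):
    """Interpret a token sequence as a pipeline of integer operations."""
    for op in program:
        x = x + 1 if op == "inc" else x * 2
    return x

def satisfies(program, examples):
    return all(run(program, x) == y for x, y in examples)

examples = [(1, 4), (2, 6)]                  # the specification
ranked = beam_search(toy_model, k=3, max_len=4)
solution = next((prog for _, prog in ranked if satisfies(prog, examples)), None)
print(ranked[0][1])                          # ['dbl']: most likely program, fails the spec
print(solution)                              # ['inc', 'dbl']: most likely program that passes
```

In this toy setting the single most likely program does not satisfy the examples, so the ranked candidates have to be checked one by one until a program that meets the specification is found.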

[Slides: Lecture 22, slides 16-20]

One major limitation of beam search is that, compared with the kind of search techniques we have been exploring so far in the course, beam search is very expensive. A beam of size 100 can already be very expensive to search, because it involves evaluating an expensive neural network 100 times for every token in the program. Moreover, if you have a branching factor of $n$ in your search (i.e., you have $n$ possible next tokens at every step), you have to rank $100 \times n$ candidates at every step. In contrast, even simple implementations of bottom-up search can consider millions of semantically distinct candidates in a few seconds.

This difference in performance means that neural guided search can only compete if you can learn a sufficiently accurate distribution so you can find the target program even with very limited search.

In the rest of this lecture, we will explore two different directions to address this problem: the first tries to make the search more accurate, while the second aims to combine the benefits of neural guided search with the search strategies we have been covering in this course.

Introduction to Program Synthesis

Combining neural guided search with more classical techniques