Lecture 22: Learning Distributions and Neural Guided Synthesis
The strategies discussed in the previous lecture work well when synthesizing
under relatively simple distributions. For more complex distributions, however,
more sophisticated machinery becomes necessary.
In this lecture, we explore several techniques that have been borrowed from
the world of natural language processing (NLP) and which have proven to be
successful in the context of program synthesis.
Distributions over programs, lessons from NLP
Many of the techniques used in program synthesis to represent and learn
complex distributions over programs come from the world of natural language
processing. To understand the key ideas behind these models, consider the
simple problem of representing a distribution over sequences of tokens.
In the world of NLP, such a sequence will usually correspond to a sentence,
but in the world of program synthesis, it will correspond to a program.
Lecture22:Slide2;
Lecture22:Slide3;
Lecture22:Slide4
n-gram models. One of the simplest models for representing such a
distribution over sequences is the
n-gram model, where the probability of each token is assumed to
depend on a small window of preceding tokens.
If a sentence is represented as a sequence of words $[w_0, w_1, \ldots, w_k]$,
then an n-gram model assumes that the probability of each word
given all the preceding words depends only on the previous $n$ words.
$$
P(w_i | w_{i-1}, w_{i-2}, \ldots w_0) = P(w_i | w_{i-1}, w_{i-2}, \ldots w_{i-n})
$$
For small n, n-gram models are easy to learn, but they have limited
expressive power. For example, consider the sentence
The big brown bear scares the children with its roar
In the sentence above, the word 'bear' is strongly determined by the
previous two words, but in order to predict the word 'roar' with any
accuracy, you would have to look back at least six words to know
that you are talking about a bear.
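To make this concrete, here is a minimal sketch of a trigram model ($n=3$) estimated by counting; the toy corpus, padding tokens, and add-one smoothing are illustrative choices rather than part of any particular system.

```python
from collections import Counter, defaultdict

# Toy corpus; in practice this would be a large collection of sentences (or programs).
corpus = [
    "the big brown bear scares the children with its roar".split(),
]

# Count, for every pair of preceding words, how often each third word follows it.
trigram_counts = defaultdict(Counter)
for sentence in corpus:
    padded = ["<s>", "<s>"] + sentence + ["</s>"]
    for w1, w2, w3 in zip(padded, padded[1:], padded[2:]):
        trigram_counts[(w1, w2)][w3] += 1

vocab = {w for counts in trigram_counts.values() for w in counts}

def prob(word, context):
    """P(word | previous two words), with add-one smoothing so unseen trigrams get nonzero mass."""
    counts = trigram_counts[tuple(context[-2:])]
    return (counts[word] + 1) / (sum(counts.values()) + len(vocab))

print(prob("bear", ["big", "brown"]))   # 'bear' is well determined by the two previous words
print(prob("roar", ["with", "its"]))    # the long-range dependence on 'bear' is invisible to the model
```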
Recurrent models. One way to get around this problem is to have a formalism that can
learn its own context, instead of relying on a short window of
previous tokens.
Lecture22:Slide4;
Lecture22:Slide5;
Lecture22:Slide6;
Lecture22:Slide7;
Lecture22:Slide8
$$
P(word_i \mid context_{i-1}), \qquad context_i = f(word_i, context_{i-1})
$$
In this case, the idea is that the context at every step is a
function of the current word and the context up to that point,
and the distribution over words at every step depends exclusively
on this computed context.
A popular way of representing such models is by using a
neural network.
At its most basic level, a neural network is just a parametric function built
by composing matrix multiplications with non-linear operations---these are often
the function $max(in, 0)$, known as a rectified linear unit or ReLU, or S-shaped
functions such as $\tanh$ or the sigmoid.
The simplest way to instantiate the above model with neural networks is the
recurrent neural network.
In this model, vectors are used to represent both the context and the current word.
These vectors are passed through a neural network to compute a new vector representing the
new context, as well as a distribution over words.
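As a sketch of this recurrence, the following numpy code implements a single step of a plain recurrent network; the vocabulary size, hidden size, and random weights are arbitrary illustrative choices rather than a trained model.

```python
import numpy as np

V, H = 1000, 64                         # vocabulary size and context (hidden state) size, chosen arbitrarily
rng = np.random.default_rng(0)
W_e = 0.01 * rng.normal(size=(H, V))    # word embedding matrix
W_h = 0.01 * rng.normal(size=(H, H))    # context-to-context weights
W_o = 0.01 * rng.normal(size=(V, H))    # context-to-vocabulary weights

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def step(word_id, context):
    """One step of the recurrence: context_i = f(word_i, context_{i-1}),
    plus the distribution over the next word given the new context."""
    word_vec = W_e[:, word_id]
    new_context = np.tanh(word_vec + W_h @ context)
    next_word_dist = softmax(W_o @ new_context)
    return new_context, next_word_dist

context = np.zeros(H)
for word_id in [17, 42, 7]:             # token ids of a toy sentence
    context, dist = step(word_id, context)
```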
The simple recurrent neural network has been superseded by more
sophisticated models that are easier to train. One especially popular model is the
Long Short-Term Memory (LSTM)
LSTM, illustrated in the figure.
The basic principle is the same; the difference lies in the way the context is
propagated. Instead of simply pushing the context through the full network, the
context is incrementally modified: it is first scaled down by the result
of a
forget gate, and then additively updated by the result of an
input gate.
As a result, gradients can propagate better past multiple stages of the computation,
allowing the networks to be trained with longer sequences of text.
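Concretely, the gated update described above can be written as follows. This is the standard LSTM cell update (biases omitted for brevity), with $\sigma$ the sigmoid, $\odot$ the elementwise product, $x_t$ the current input, $h_{t-1}$ the previous hidden state, and $c_t$ the propagated context; the notation follows common presentations rather than the specific figure.
\[
\begin{aligned}
f_t &= \sigma(W_f\,[h_{t-1}, x_t]) && \text{(forget gate)}\\
i_t &= \sigma(W_i\,[h_{t-1}, x_t]) && \text{(input gate)}\\
\tilde{c}_t &= \tanh(W_c\,[h_{t-1}, x_t]) && \text{(candidate update)}\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(scale down, then add)}
\end{aligned}
\]
Because the update to $c_t$ is additive, gradients flowing backward through the context are not repeatedly squashed by nonlinearities, which is what makes longer sequences trainable.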
Trees vs strings.
Both n-grams and the different recurrent models we have discussed so far are defined on sequences of
tokens, which is a convenient representation for natural language text. In the case of programs, however,
there is some debate in the community about the relative merits of string vs. tree representations.
On the one hand, tree representations are a more natural way of representing programs, and all the
sequence-based models outlined above have analogs that operate on trees.
In fact, the probabilistic grammars that we
saw in the previous lecture are essentially tree generalizations
of the n-gram model. One big advantage of tree-based models is that every program that is sampled
is guaranteed to conform to the grammar, unlike a sequence-based model, which may produce programs
that do not even parse.
On the other hand, sequence-based models tend to be simpler and more efficient to train, so they
are quite common in this literature.
Conditional distributions
Lecture22:Slide11;
Lecture22:Slide12;
Lecture22:Slide13
One of the strengths of using neural networks to represent distributions is that it is easy to generalize
to conditional distributions, where the distribution depends on some evidence or observation.
The evidence can be a set of input/output examples, or it can be something less precise, such as natural-language
text.
$$
P(program | evidence)
$$
One approach to encoding these distributions is to use an encoder-decoder model, where a recurrent network is used to
encode the evidence into a vector, and then the vector is used as the initial context for a decoder
network. Such sequence-to-sequence models were popular for language translation starting in
2014 with the work of Sutskever, Vinyals and Le
Sutskever14, and are broadly applicable to other tasks
such as question answering, or in our context, program synthesis.
There is a broad literature of similar
models that vary in how the encoding and
decoding are done. For example, a common strategy is to encode the sequence using multi-layer bidirectional
models that propagate information about the sequence both front-to-back and back-to-front.
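The following PyTorch sketch shows the basic encoder-decoder shape; the class name, vocabulary sizes, and the use of GRUs are illustrative assumptions, and training (e.g., with teacher forcing) and decoding are omitted.

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Minimal encoder-decoder: the evidence is encoded into a single vector,
    which becomes the initial context of the decoder that emits program tokens."""
    def __init__(self, evidence_vocab, program_vocab, dim=128):
        super().__init__()
        self.src_emb = nn.Embedding(evidence_vocab, dim)
        self.tgt_emb = nn.Embedding(program_vocab, dim)
        self.encoder = nn.GRU(dim, dim, batch_first=True)
        self.decoder = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, program_vocab)

    def forward(self, evidence, program_prefix):
        _, h = self.encoder(self.src_emb(evidence))                   # h: final encoder state
        dec_states, _ = self.decoder(self.tgt_emb(program_prefix), h)
        return self.out(dec_states)                                   # logits for the next token at each position

model = Seq2Seq(evidence_vocab=500, program_vocab=200)
evidence = torch.randint(0, 500, (1, 12))       # a toy encoded evidence sequence
prefix = torch.randint(0, 200, (1, 5))          # program tokens produced so far
logits = model(evidence, prefix)                # shape (1, 5, 200)
```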
Sequence-to-sequence models with attention
The basic encoder-decoder model shown in the figure has a major limitation: all the evidence about the
entire sentence must be encoded into a single vector. For this reason, it was quickly extended with
attention-based models, first introduced in this context by Bahdanau, Cho and Bengio
BahdanauCB14.
The key insight behind attention is that
there is some degree of locality, where different tokens of the output depend primarily on small subsets of
tokens from the input, so instead of encoding the entire input into a single fixed vector, the attention mechanism
allows each output token to pay attention to a different subset of input tokens.
This means that it is no longer necessary to pack all the information from the input sequence into a single vector;
instead, the network dynamically selects the relevant information for each output token.
The details of this mechanism are
illustrated in the figure.
The input sequence is encoded by a bidirectional recurrent network, but instead of aggregating the values into a single
vector, the values from each word in the input are passed to the attention mechanism. For each token in the output,
the network computes a vector called a
query that is combined with the vector from each input token to produce a weight
for each intermediate vector from the input. The output of the attention mechanism is a vector computed as a weighted sum of the input vectors.
This means that each step in the output gets a different weighted sum, corresponding to the input tokens that are more or less
relevant for computing that output.
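A small numpy sketch of this computation for a single output step is shown below; a plain dot product is used as the scoring function for illustration, whereas models like the one described above typically use a small learned scoring function.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(6, 8))   # one vector per input token (6 tokens, dimension 8, random for illustration)
query = rng.normal(size=(8,))              # query vector produced by the decoder for the current output token

scores = encoder_states @ query            # one score per input token
weights = softmax(scores)                  # how much this output step attends to each input token
context = weights @ encoder_states         # weighted sum of input vectors: the attention output for this step
```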
One of the first systems to use attention for program synthesis was RobustFill
RobustFill.
This paper proposed and compared several different architectures for encoding inductive synthesis problems
for the domain of text editing. Specifically, the goal of the paper was to produce a neural-guided synthesis
version of
FlashFill.
One of the important observations of this paper was that an instance of a synthesis problem takes as
input a set of input/output pairs, each of which involves two strings (the input and the output) of
unbounded length, so there are several design decisions involved in how to encode them.
The most successful approach used by that paper was to use attention within an individual input/output
pair, but then aggregate over the distributions proposed from each of the examples.
Lecture22:Slide15;
Lecture22:Slide16
This choice of using attention within an example and pooling across examples works well in this domain
because there is a fair bit of locality between the programs
and the input/output examples. For example, consider the following example and program.
in: "Armando Solar-Lezama"
out: "A. Solar-Lezama"
Program: Concat(SubString(in, Pos("", Word), Pos(Char, "")), ". ", SubString(in, Pos(" ", Word), Pos("", End)))
The program consists of three parts: an expression that extracts the first initial, concatenated with
a constant, and then with an expression that extracts everything after the first space.
Notice, however, that each of these parts can be inferred independently by focusing on particular regions of the
input and output strings. By contrast, if you were trying to synthesize complicated arithmetic expressions,
for example, there would not be so much locality to exploit, so the benefits of using attention would be significantly
reduced.
Transformers: Preliminaries
The basic use of attention described earlier still suffers from a couple of important problems. First, while we are now using the attention
mechanism to avoid having to encode the entire
input as a single vector, the decoder is still a recurrent neural network that propagates
information step by step, so that on a long output, we will still have the same problem of having to encode all relevant information about the
output in a fixed-size vector. Additionally, the encoder itself still relies on a recurrent network to compute the individual vectors that will
be passed to the attention mechanism.
In an aptly titled paper in 2017
Vaswani17, Vaswani et al. argued that all of those recurrent networks could be replaced with attention
mechanisms in order to get a more scalable network. Their proposed architecture is called a Transformer and it forms the basis of all large-scale
language models including GPT-3 and Codex.
A second look at attention
To better understand transformers, we need a better understanding of the attention mechanism itself. One way to understand attention
is as a key-value store. A key-value store takes as input a database of key-value pairs and a query, and its goal is to find the key that is closest
to the query and return the value associated with that key. Now, suppose we have a collection of keys $K=[k_0, k_1, \ldots, k_m]$ and a collection of values
$V = [v_0, v_1, \ldots, v_m]$. If every key is a column vector, we can represent the collection as a matrix $K$ where each column is a key.
Multiplying $K^{T}$ by a query vector $q$ yields a vector whose $i^{th}$ entry is the dot product of $q$ and $k_i$, so higher values
correspond to a closer match between key and query. We can apply a softmax to that vector so that the smallest values are pushed towards zero and the largest values
towards one. Multiplying $V$ by that vector then provides a linear combination of the values whose keys were closest to $q$.
\[
V\cdot softmax(K^{T}\cdot q)
\]
The attention mechanism does exactly this with one small difference: the query $q$, the keys $K$, and the values $V$ are each first multiplied by a separate
weight matrix.
\[
Attention(q, K, V) = W_v \cdot V\cdot softmax\left((W_k \cdot K)^{T}\cdot W_q\cdot q\right)
\]
In the attention network presented earlier, the query corresponded to the query vector that came from the output RNN (which encodes the output produced so far),
while the Keys and Values both corresponded to the vectors coming from the input RNN.
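The following numpy sketch implements this weighted key-value lookup; the dimensions and random matrices are illustrative, and in a real model the projection matrices would be learned.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

d, m = 8, 5                       # vector dimension and number of key/value pairs (illustrative)
rng = np.random.default_rng(1)
K = rng.normal(size=(d, m))       # keys as columns
V = rng.normal(size=(d, m))       # values as columns
q = rng.normal(size=(d,))         # the query
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))   # learned projections (random here)

def attention(q, K, V):
    """Soft key-value lookup: weight each value by how well its projected key matches the projected query."""
    weights = softmax((W_k @ K).T @ (W_q @ q))    # one weight per key/value pair
    return (W_v @ V) @ weights                    # weighted combination of projected values

out = attention(q, K, V)          # a single vector of dimension d
```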
Multi-headed attention
In some cases, the same Query, Keys and Values can be passed through multiple attention functions in parallel, each with its own weights to produce a collection
of vectors.
\[
MHAttention(q,K,V) = [Attention_0(q, K, V), Attention_1(q, K, V), \ldots, Attention_k(q, K, V)]
\]
This is called multi-headed attention.
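Continuing the sketch above, multi-headed attention just runs several copies of the same computation with independent weights; the single-head function is repeated here so the sketch runs on its own, and the head count is illustrative.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def single_head(q, K, V, W_q, W_k, W_v):
    return (W_v @ V) @ softmax((W_k @ K).T @ (W_q @ q))

def multi_head(q, K, V, heads):
    """Run several attention heads in parallel, each with its own projection matrices,
    and collect their outputs into one list of vectors."""
    return [single_head(q, K, V, W_q, W_k, W_v) for (W_q, W_k, W_v) in heads]

d, m, n_heads = 8, 5, 4
rng = np.random.default_rng(2)
K, V = rng.normal(size=(d, m)), rng.normal(size=(d, m))
q = rng.normal(size=(d,))
heads = [tuple(rng.normal(size=(d, d)) for _ in range(3)) for _ in range(n_heads)]
outputs = multi_head(q, K, V, heads)    # n_heads vectors, typically concatenated downstream
```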
Self-attention
We have already seen an example where the Keys and Values are actually the same collection of vectors. We can use the same vectors as Queries as well; this is known
as
self-attention. Given a collection of vectors $H = [h_0, \ldots, h_k]$, we can compute
\[
MHAttention(h_i, H, H)
\]
The effect of self-attention is to allow each vector in $H$ to be translated into a new representation (one vector per attention head) that incorporates information about all the other vectors in $H$.
Why would we want this? Well, in the language translation problem, for example, you often want to encode each word into a vector that represents the meaning
of that word, but some words can have multiple meanings, so you need some context to decide which meaning to choose. This means that a good encoding should
not just map each word to a vector independently but should pay attention to all the other words in the context. So for example, if you have the sentence
"The sheep were happy to be back in their pen.", an encoding of the word
pen needs to pay attention to the other words in the sentence, especially the
word
sheep, because that determines how
pen should be encoded. In particular, we want an encoding that will distinguish from the use of
pen
in the sentence "For this writer, no other posession was more valuable than her pen."
Transformers: How they work
With these concepts defined, we can now describe the key ideas behind the transformer architecture.
Self-Attention for the input
The first idea is that we are going to compute a set of vectors, one for each token in the input,
but we are going to do this
using self-attention instead of the bidirectional recurrent network that was used by the earlier Bahdanau paper.
Using attention has the advantage of avoiding long chains of dependencies and is therefore
more scalable, but it introduces a small complication: the attention mechanism is insensitive
to the order of the Key/Value vectors. So when replacing the RNN with the self-attention mechanism,
we need a way to incorporate position information into the process so that the network
can take into account the actual order of words and their proximity to each other when
computing their vectors. This is generally done by adding to the vector for each word
a vector that encodes position information.
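One common choice is the sinusoidal scheme used in the original Transformer paper, sketched below in numpy; the sequence length and dimension are illustrative.

```python
import numpy as np

def positional_encoding(num_tokens, dim):
    """Sinusoidal position vectors: each position gets a distinct pattern of
    sines and cosines at geometrically spaced frequencies."""
    positions = np.arange(num_tokens)[:, None]                       # (num_tokens, 1)
    freqs = np.exp(-np.log(10000.0) * np.arange(0, dim, 2) / dim)    # (dim/2,)
    pe = np.zeros((num_tokens, dim))
    pe[:, 0::2] = np.sin(positions * freqs)
    pe[:, 1::2] = np.cos(positions * freqs)
    return pe

rng = np.random.default_rng(0)
word_vectors = rng.normal(size=(10, 16))              # 10 tokens, dimension 16 (illustrative)
inputs = word_vectors + positional_encoding(10, 16)   # position information is simply added to each token's vector
```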
Self-Attention for the output
The second idea is that instead of using a recurrent neural network to track the output
computed so far, the Transformer will use another multi-headed self-attention module to
encode the output so far. This will also require encoding position information straight
into the vectors using the same mechanism as before.
Attention for combining inputs and current output together
The transformer will then use another attention module to combine together the vectors from the
input with the vectors from the output so far. In particular, the vectors coming from the input
will work as Keys and Values, while the vectors coming from the so-far-computed output
are used as queries.
Additional machinery.
These are the key ideas behind transformer networks, but the detailed architecture has some additional complexity. For example, in the
encoder, there are multiple
layers of self-attention stacked on top of each other, with pass-through layers and small feed-forward layers in between them.
In the decoder, there are also multiple layers that combine the output so far with the Keys and Values computed from the input,
and these are likewise interleaved with pass-through layers and feed-forward layers. But the ideas outlined here are the main
ones that make the transformer unique.
Searching with a learned distribution
One of the ways in which program synthesis differs from NLP applications is in the presence of hard constraints:
at the end of the day, we want to ensure that the program we get actually satisfies
the specification, whether that specification is given by a set of input/output examples or by
a logic formula describing pre- and postconditions. This means that, in general, it is not enough to simply search
for the most likely program in the learned distribution; we want to use the learned distribution to guide a search for a program
that actually satisfies the specification.
There are several different mechanisms that have been proposed for this in the literature.
A popular one that works well with the representations described so far is to conduct a
beam search.
The basic idea behind beam search is that at every stage of the search process, you keep a "beam" of the
$k$ most likely partial strings considered so far. For each of those $k$ candidates, you consider its possible expansions
and keep the $k$ most likely ones, as illustrated in the figure.
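A minimal sketch of beam search over token sequences is shown below; next_token_probs is a hypothetical stand-in for the learned model's next-token distribution.

```python
import math

def beam_search(next_token_probs, k, max_len, end_token):
    """Keep the k most likely prefixes; expand each and keep the k best expansions.
    next_token_probs(prefix) is a hypothetical interface returning {token: probability}."""
    beam = [((), 0.0)]                                  # (prefix of tokens, log-probability)
    for _ in range(max_len):
        candidates = []
        for prefix, logp in beam:
            if prefix and prefix[-1] == end_token:
                candidates.append((prefix, logp))       # finished sequences are kept as-is
                continue
            for token, p in next_token_probs(prefix).items():
                candidates.append((prefix + (token,), logp + math.log(p)))
        beam = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    return beam

# Toy "model" with a fixed next-token distribution, just to exercise the search.
toy_model = lambda prefix: {"a": 0.6, "b": 0.3, "<end>": 0.1}
print(beam_search(toy_model, k=3, max_len=4, end_token="<end>")[0])
```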
Lecture22:Slide17;
Lecture22:Slide18;
Lecture22:Slide19;
Lecture22:Slide20;
Lecture22:Slide21
An interesting recent example of neural-guided search is AlphaCode
AlphaCode, a system that is able to
solve problems from programming competitions. AlphaCode uses a Transformer that takes as input the problem description
and produces a candidate program. The transformer was first pre-trained on a large
GitHub dataset and then fine-tuned on a dataset of programming competition problems. In its most aggressive setting,
AlphaCode gathers 1 million samples from the neural model and then filters out all the ones that fail the tests.
The system then filters out
observationally equivalent solutions in order to return to the user a
small set of distinct solutions that pass all the tests.
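Schematically, that filtering step can be sketched as follows; sample_program, run, and the probe inputs used to distinguish behaviors are hypothetical stand-ins rather than AlphaCode's actual interfaces.

```python
def filter_and_cluster(sample_program, run, tests, probe_inputs, num_samples):
    """Sample many candidate programs, keep only those that pass every test, and keep one
    representative per observational-equivalence class (judged on extra probe inputs).
    sample_program() and run(program, input) are hypothetical stand-ins for the neural
    sampler and the execution harness."""
    survivors = {}
    for _ in range(num_samples):
        program = sample_program()
        if all(run(program, inp) == expected for inp, expected in tests):
            signature = tuple(run(program, inp) for inp in probe_inputs)
            survivors.setdefault(signature, program)    # one representative per distinct behavior
    return list(survivors.values())
```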
Combining Neural-Guided and Symbolic Search
One major limitation of beam search is that, compared with the kind of search techniques we have been
exploring so far in the course, it is very expensive. Even a beam of size 100 can be costly to search: it involves
evaluating an expensive neural network 100 times for every token in the program, and if the search has a branching
factor of $n$ (i.e. there are $n$ possible next tokens at every step), you have to rank $100 \times n$ candidates at
every step. In contrast, even simple implementations
of bottom-up search can consider millions of semantically distinct candidates in a few seconds.
In 2019, Nye et al.
NyeHTS19 proposed a strategy to combine neural-guided search with a faster symbolic
search mechanism. The idea was to combine the benefits of the symbolic search strategies from Unit 1 — which can
exhaustively search the space of programs very efficiently, but are limited in how deep into the search space they can go —
with the benefits of neural guidance — which can go very deep in the search space but can miss solutions that are slightly
off from the learned distribution.
The high-level idea is to train the neural network to leave holes in place of program fragments that are difficult to predict
so that the synthesis of those fragments can be delegated to the symbolic search procedure. During training, the
network gets a maximum reward for producing the correct code, and will get a very high penalty for producing an incorrect
solution. But if the neural network chooses to produce a placeholder hole, it will receive a small penalty depending on the size
of the code replaced by the hole. So a small hard-to-get-right constant can be safely replaced by a hole without incurring a
significant penalty, but the penalty grows larger if a hole is used in place of a larger piece of code.
The experiments showed that the scheme was particularly effective when the test set was somewhat different from the training distribution, or
when the network was trained on a small training set.
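Schematically, the overall loop can be sketched as follows; propose_sketch, enumerate_completions, and satisfies are hypothetical stand-ins for the neural model, the symbolic search over hole contents, and the specification checker.

```python
def neurosymbolic_synthesize(propose_sketch, enumerate_completions, satisfies, spec, num_sketches):
    """Ask the neural model for sketches that may contain holes, then let a symbolic
    search fill the holes and check each completed program against the specification."""
    for _ in range(num_sketches):
        sketch = propose_sketch(spec)                   # may leave holes for hard-to-predict fragments
        for program in enumerate_completions(sketch):   # e.g., bottom-up enumeration of hole contents
            if satisfies(program, spec):
                return program
    return None
```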