Lecture 20: A Bayesian View of Synthesis
Up to this point, our focus with synthesis has been to find a program
that satisfies its specification. However, in many situations,
satisfying the specification is not sufficient. This is because in many
contexts, providing complete specifications that rule out every possible
undesirable program is very challenging. For example, we already discussed
the
programming-by-example setting, where
the specification consists of a series of examples or demonstrations. In general,
for any finite set of examples there will be an infinite set of programs
that can satisfy them. In Unit 1, we discussed two main approaches to avoiding this
problem: (a) restricting the language and (b) prioritizing small programs. In this
Unit, we will describe a more general approach to dealing with underspecification
by relying on
probabilities.
Throughout this section, we will be relying heavily on Bayes' theorem, which
will allow us to speak formally about our belief in the relative
likelihood that different programs are the program desired by the user, and
to update those beliefs in the face of new evidence.
For example, consider the programming-by-example setting. Given a set of
input/output examples, the goal is to find a function that matches those examples.
We can see those input/output examples as a form of
evidence, and we can
frame the synthesis problem as the problem of finding the function that
has the highest probability given the evidence. As the figure illustrates,
we can use Bayes' rule to compute the probability of a function given the evidence
in terms of the probability of the evidence given the function times a background
probability or
prior for the function itself.
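Written out, Bayes' rule says that
$$P(f | evidence) = \frac{P(evidence | f) \cdot P(f)}{P(evidence)}.$$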
A subtle but important point is that for the purpose of finding an optimal $f$, we
can ignore $P(evidence)$, since it does not depend on $f$. However, this only works if
$P(evidence) \neq 0$. In theory we can assume that any evidence for which $P(evidence)$
is zero will never be drawn (if it can be drawn, its probability cannot be zero).
Providing any evidence for which $P(evidence) = 0$ would make the problem ill-posed.
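In other words, the synthesis problem becomes the optimization
$$f^* = \arg\max_f P(f | evidence) = \arg\max_f P(evidence | f) \cdot P(f),$$
where the denominator $P(evidence)$ has been dropped because it does not depend on $f$.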
This formulation of the problem can be seen as a strict generalization of the
original program synthesis formulation. For example, suppose that all
the I/O examples were picked uniformly at random and the desired outputs
were captured precisely;
then we have that $P(evidence | f)$ is uniform among all examples that are
consistent with the function and zero for all examples inconsistent with $f$.
This means that any $f$ inconsistent with the evidence will have probability
zero. Moreover, if the
prior probability $P(f)$ is uniform over all programs in a restricted
language, then all programs inconsistent with the examples have probability zero
and all programs consistent with them are equally likely, so any program
satisfying the examples is a valid solution to the synthesis problem.
So under this prior and this $P(evidence | f)$, the problem reduces to the
standard synthesis problem we have been studying all along.
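To make the reduction concrete, the sketch below enumerates a tiny, purely
illustrative space of candidate programs and scores each by its (unnormalized)
posterior; the candidate expressions and examples are invented for the
illustration. With a 0/1 likelihood and a uniform prior, the programs with
nonzero posterior are exactly those consistent with the examples.
```python
# A minimal sketch of MAP synthesis over a tiny enumerable program space.
# With a 0/1 likelihood and a uniform prior, the posterior is nonzero exactly
# for the programs consistent with the examples, recovering ordinary synthesis.
# The candidate list below is purely illustrative.

candidates = [
    ("x + 1", lambda x: x + 1),
    ("2 * x", lambda x: 2 * x),
    ("x + x", lambda x: x + x),
    ("x * x", lambda x: x * x),
]

examples = [(1, 2), (2, 4), (3, 6)]  # (input, output) pairs

def likelihood(f, examples):
    # P(evidence | f): up to a constant factor, a consistency indicator.
    return 1.0 if all(f(i) == o for i, o in examples) else 0.0

def prior(name):
    # Uniform prior over the (finite) candidate space.
    return 1.0 / len(candidates)

posterior = {name: likelihood(f, examples) * prior(name)
             for name, f in candidates}

# Ties between consistent programs ("2 * x" and "x + x") are broken
# arbitrarily; any of them is a valid solution to the synthesis problem.
print(max(posterior, key=posterior.get))
```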
Now, in synthesis it has often been helpful to prioritize smaller programs
because they are less likely to contain
extraneous details not required to satisfy the examples.
In fact,
some of the methods we studied early in the course
organize the search to find such short programs first.
This desire for short programs can be framed in terms of a prior
by replacing the uniform probability on $f$ with one that decreases monotonically
with the length of the program. If you are looking for the most likely
program (the maximum a posteriori, or MAP, program) under the uniform
$P(evidence | f)$ we have discussed so far, it does not actually matter how
exactly $P(f)$ decreases with length, since we will still be looking for the
smallest possible program that is consistent
with the evidence. However, if you are going to sample from this distribution, it is
important to keep in mind that there are exponentially many more long programs than
short programs, so unless the probability decays exponentially with length, you will still
be more likely to sample a long program than a short program.
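To see why, suppose for concreteness that programs are strings over an alphabet
of $b$ symbols, so that there are roughly $b^n$ programs of length $n$. If the
prior assigns probability $p(n)$ to each individual program of length $n$, the
total probability mass at length $n$ is on the order of
$$b^n \cdot p(n),$$
so unless $p(n)$ decays faster than $b^{-n}$, the total mass grows with $n$ and
a random sample will likely be a long program.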
Synthesis Under Errors
The Bayesian approach can also be extended to deal with the
case of learning from noisy data. In this case, the probability $P(evidence | f)$
needs to be updated to account for the fact that
evidence inconsistent with the function no longer has probability zero.
For example, an interesting case to consider is one where
off-by-one errors are possible in the data.
The figure illustrates a possible distribution under such an assumption.
An interesting observation is that the possibility of errors
in the data introduces a necessary tradeoff between the probability
of a function and the amount of error it generates.
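As a concrete (and purely hypothetical) instance of such a distribution, the
sketch below assigns most of the probability to the exact output, a small
probability to outputs that are off by one, and a small floor to everything
else; the specific numbers are made up for the illustration and are not
normalized over all possible outputs.
```python
def example_likelihood(f, x, y):
    # P(y | f, x) under a hypothetical off-by-one noise model.
    # The probabilities are illustrative, not from any paper.
    d = abs(f(x) - y)
    if d == 0:
        return 0.90   # output recorded exactly
    elif d == 1:
        return 0.04   # off-by-one error in either direction
    else:
        return 0.02   # small floor: inconsistent evidence is no longer fatal

def evidence_likelihood(f, examples):
    # P(evidence | f), assuming the examples are independent given f.
    p = 1.0
    for x, y in examples:
        p *= example_likelihood(f, x, y)
    return p
```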
Recall that when we described priors that decrease with program length, we
mentioned that as long as the probability decreased monotonically with length,
the exact form of the distribution did not matter when looking for the most
likely function. This is no longer the case once we are in a
setting where the probability of a function must be traded off against
the probability of the evidence given the function. Specifically,
a distribution that causes the probability of $f$ to drop very precipitously with
larger programs will push the synthesizer to pick a very short program
even at the expense of failing to match some of the example outputs exactly. On
the other hand, a distribution that decreases very slowly may lead the synthesizer
to prioritize matching the examples precisely.
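The following sketch makes this tradeoff concrete. It scores each candidate by
$\log P(f) + \log P(evidence | f)$, using the off-by-one error model from the
sketch above and a length prior $P(f) \propto e^{-\lambda \cdot |f|}$; the
candidate programs, their sizes, and the data are all invented for the
illustration. With a slowly decaying prior (small $\lambda$) the longer, exact
program wins; with a sharply decaying prior (large $\lambda$) the short,
slightly wrong one does.
```python
import math

def log_likelihood(f, examples):
    # Off-by-one error model with illustrative probabilities.
    def p(x, y):
        d = abs(f(x) - y)
        return 0.90 if d == 0 else (0.04 if d == 1 else 0.02)
    return sum(math.log(p(x, y)) for x, y in examples)

# Candidates: (name, illustrative size in tokens, implementation).
candidates = [
    ("2*x", 3, lambda x: 2 * x),
    ("2*x + (1 if x == 3 else 0)", 12,
     lambda x: 2 * x + (1 if x == 3 else 0)),
]

examples = [(1, 2), (2, 4), (3, 7)]  # last pair looks off by one for 2*x

for lam in (0.05, 1.0):  # slowly vs. sharply decaying length prior
    scored = {name: -lam * size + log_likelihood(f, examples)
              for name, size, f in candidates}
    print(lam, "->", max(scored, key=scored.get))
    # lam=0.05 picks the long program that matches every example exactly;
    # lam=1.0 picks the short program despite its off-by-one mismatch.
```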
Ellis et al.
Ellis2018 explored such a tradeoff between the prior over functions and the output
error in their 2018 paper on synthesizing drawing programs from hand-drawn diagrams.
This paper uses a neural network to translate a hand-drawn diagram into straight-line
code that issues individual drawing commands, and then relies on the Sketch synthesizer
to produce a program that is semantically equivalent to the generated sequence of commands.
A prior over the space of programs favors programs with loops rather than long sequences of
individual drawing commands, and penalizes branches that special-case individual loop iterations.
This prior over programs can be traded off against a simple error model that tolerates small
errors in the drawing commands, allowing the synthesizer to make up for small errors in perception.
This is illustrated on the second panel of the figure on the right. The first column shows two hand drawings,
and the middle column shows the result of running the commands generated by the neural network.
The last column shows the result of the most likely program, which trades off accuracy against the
result of the neural network in exchange for producing a simpler (and therefore higher-likelihood) program.
Unsupervised Learning
Another generalization of the simple programming-by-example synthesis problem is
the unsupervised learning problem.
Unsupervised learning has long been studied by the machine learning community,
but it was more recently studied as a synthesis problem by Ellis, Tenenbaum and
Solar-Lezama
EllisST15. Unsupervised learning can be seen
as a programming-by-example problem where the system is given only
the outputs of the unknown function, so the goal is to find a function
and a series of inputs $[in_i]_{i \lt N}$, one for each output in $[out_i]_{i \lt N}$, such
that $f(in_i)=out_i$.
Without probabilities, this problem is hopelessly underspecified, since one can always
trivially choose $f(x)=x$ and $in_i = out_i$ as a solution that satisfies the constraints.
However, framed as a probabilistic problem, the goal is to find $f$ and $[in_i]_{i \lt N}$
that maximize the probability of the function and the inputs given the observed outputs.
Using Bayes' rule, this probability can be decomposed into a product of simpler
distributions as illustrated in the figure.
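Under the natural independence assumptions (the function is chosen independently
of the inputs, and the inputs are drawn i.i.d.), this decomposition takes the form
$$P(f, [in_i]_i | [out_i]_i) \propto P(f) \cdot \prod_i P(in_i) \cdot P(out_i | f, in_i),$$
where $P(out_i | f, in_i)$ is simply an indicator for $f(in_i)=out_i$ in the
noiseless case, or an error model like the one from the previous section.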
To Marginalize or Not to Marginalize
One feature of the unsupervised learning problem is that both the inputs and the function are unknown. However,
in many situations, it is really only the function that we are interested in. So one obvious question is whether
one should marginalize over the inputs or not. In other words, should one optimize to find the $(f, [in_i]_i)$
that maximizes $P(f, [in_i]_i | [out_i]_i)$, or instead maximize the marginal
$P(f | [out_i]_i)=\sum_{[in_i]_i} P(f, [in_i]_i | [out_i]_i)$?
Marginalization over the inputs is more expensive than simply optimizing the joint distribution, especially for symbolic
methods where the integration/summation over all possible inputs is potentially intractable.
Moreover, for many applications, the inputs themselves are valuable, as they reflect important information about the
process one wants to understand, so there is a strong incentive to optimize the joint distribution.
All the examples in the original Ellis et al. paper
EllisST15 were solved using the joint distribution.
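As a final illustration of the difference between the two objectives, the
sketch below works over a deliberately tiny discrete space where the marginal
can be computed by brute-force summation; the functions, input domain, observed
output, and noise probabilities are all invented. Note that the two objectives
can genuinely disagree: the joint MAP favors a function with one very good
input explanation, while the marginal favors a function whose output is
plausible under many different inputs.
```python
from itertools import product

# Joint MAP over (f, inputs) vs. marginalizing out the inputs.
# Everything below is illustrative.
functions = {"x+1": lambda x: x + 1, "const3": lambda x: 3}
inputs_domain = [0, 1, 2, 3]
outputs = [4]  # observed outputs; the inputs that produced them are unknown

def p_out(f, x, y):
    # P(out | f, in): exact match likely, off-by-one possible, else impossible.
    d = abs(f(x) - y)
    return 0.5 if d == 0 else (0.2 if d == 1 else 0.0)

def joint(fname, ins):
    # P(f, [in_i], [out_i]) with uniform P(f) and uniform P(in_i).
    f = functions[fname]
    p = 1.0 / len(functions)
    for x, y in zip(ins, outputs):
        p *= p_out(f, x, y) / len(inputs_domain)
    return p

# Joint MAP: the single best (f, [in_i]) pair.
best_joint = max(((fn, ins) for fn in functions
                  for ins in product(inputs_domain, repeat=len(outputs))),
                 key=lambda t: joint(*t))

# Marginal MAP: sum the joint over all input assignments for each f.
# Note the sum ranges over |domain|**N assignments, which is what makes
# marginalization expensive for symbolic methods.
marginal = {fn: sum(joint(fn, ins)
                    for ins in product(inputs_domain, repeat=len(outputs)))
            for fn in functions}
best_marginal = max(marginal, key=marginal.get)

# "x+1" wins the joint (one exact explanation, x=3), while "const3" wins
# the marginal (every input explains the output up to an off-by-one error).
print(best_joint[0], best_marginal)
```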