Lecture 21: Synthesizing Under a Distribution
[Slides 4-7]
In the previous lecture, we established that the Bayesian view of the synthesis
problem is a strict generalization of the simple view we have been studying all
along. As one would expect, such a generalization can make the synthesis problem
much more difficult to solve. In this lecture we will describe a few different
methods that can be used to synthesize programs under particular classes of program
distributions.
Length minimization
We start by considering the program distribution described earlier where
the probability of the program decays as its length increases, and we focus
for now on a uniform $P(evidence | f)$: one that assigns a probability of
zero to any evidence that is inconsistent with the function and is uniform otherwise.
As was mentioned in the last lecture, finding the most likely program
under these distributions reduces to finding the shortest program
under some definition of "short".
When representing programs as ASTs, a common way to characterize length
is in terms of the height of the AST. Some of the techniques we have discussed
so far already guarantee length minimization. For example, the
basic bottom-up search strategy described in
Lecture 3
already guarantees that the shortest program will be found first. The only
requirement is that when two observationally equivalent programs are found,
the algorithm must always keep the shortest one.
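To make this concrete, here is a minimal sketch of such a bottom-up search over a toy arithmetic grammar. Everything in it (the grammar, the sample inputs, the names) is illustrative rather than taken from Lecture 3, and for simplicity "length" is measured here as AST size rather than height; the key point is the check that discards any new program whose behavior has already been seen at an equal or smaller size.

```python
# Minimal sketch: bottom-up enumeration that returns a shortest correct
# program first. The toy grammar (x, 0, 1, e+e, e*e) is illustrative.
INPUTS = (0, 1, 2, 3)
OPS = (("+", lambda a, b: a + b), ("*", lambda a, b: a * b))

def bottom_up(goal, max_size):
    seen = set()       # behavioral signatures (outputs on INPUTS) found so far
    by_size = {1: {}}  # size -> {signature: program text}
    for text, outs in [("x", INPUTS),
                       ("0", (0,) * len(INPUTS)),
                       ("1", (1,) * len(INPUTS))]:
        if outs not in seen:
            seen.add(outs)
            by_size[1][outs] = text
            if outs == goal:
                return text
    for size in range(2, max_size + 1):
        by_size[size] = {}
        for ls in range(1, size - 1):      # split the size budget among the
            rs = size - 1 - ls             # children; 1 node for the operator
            for o1, t1 in by_size[ls].items():
                for o2, t2 in by_size[rs].items():
                    for op, f in OPS:
                        outs = tuple(f(a, b) for a, b in zip(o1, o2))
                        if outs in seen:   # an equally small or smaller
                            continue       # equivalent program already exists
                        seen.add(outs)
                        by_size[size][outs] = f"({t1} {op} {t2})"
                        if outs == goal:
                            return by_size[size][outs]
    return None

# Example: search for a program computing x*x + 1 on the sample inputs.
print(bottom_up(tuple(x * x + 1 for x in INPUTS), max_size=9))
```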
It is important to note, however, that some of the improvements to the
basic bottom-up algorithm discussed in Lecture 3 no longer guarantee
length minimization. For example, the
STUN
approach needs to trade off the size of the solution in a given branch with
the strength of the predicate guarding that branch, and that tradeoff may
lead to programs that are not minimal.
The top-down search approach from
Lecture 4 does not
immediately guarantee minimality of the solution, but it is relatively easy to
force it to do so. Recall from Lecture 4 that the algorithm works by expanding
partially constructed solutions; at each step of the algorithm, there is a
wavefront of partially constructed solutions that are ready to be expanded,
and it is up to the algorithm to decide which one to expand next. Expanding
programs in a breadth-first manner, so that smaller programs are expanded before
larger ones, guarantees that a minimal solution is found first.
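The sketch below shows one way to implement this: the wavefront is a priority queue ordered by program size, so the smallest partial program is always expanded next. The callbacks (`expand`, `is_complete`, `is_solution`, `size`) are hypothetical stand-ins for the problem-specific pieces, not a fixed API.

```python
import heapq
import itertools

def top_down(start, expand, is_complete, is_solution, size):
    # Wavefront of partial programs ordered by size. Because expanding a
    # program never shrinks it, the first complete program that passes the
    # check is a minimal solution.
    tie = itertools.count()   # tie-breaker so heapq never compares programs
    frontier = [(size(start), next(tie), start)]
    while frontier:
        _, _, partial = heapq.heappop(frontier)   # smallest partial program
        if is_complete(partial):
            if is_solution(partial):
                return partial
            continue
        for child in expand(partial):             # one-step hole expansions
            heapq.heappush(frontier, (size(child), next(tie), child))
    return None
```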
[Slides 8-16]
Incorporating a minimality constraint in the context of symbolic search is trickier.
For example, consider the constraint-based synthesis approach from
Lecture 8.
This approach, combined with CEGIS, is used as the basis of the Sketch synthesis
system, but CEGIS by itself does not impose a minimality constraint.
To understand how this approach can be generalized to
enforce minimality, consider the basic principle behind CEGIS.
The key idea behind CEGIS is that the synthesis constraint $\exists \phi. \forall in. Q(\phi, in)$
can be weakened to focus only on a small subset of inputs $E$. The weakened constraint
is more efficient for a solver, since the inner $\forall$ quantification reduces to a
simple conjunction. The weakened constraint allows some incorrect solutions, which must
be caught by the checker that is part of CEGIS.
This same basic idea was generalized in a paper by Singh, Gulwani and Solar-Lezama [Singh:2013]
to handle minimality constraints. As the figure indicates, the minimality constraint
adds additional quantifiers to the formula, which now takes roughly the form
$\exists \phi. (\forall in. Q(\phi, in)) \wedge (\forall \phi'. (\forall in. Q(\phi', in)) \Rightarrow \phi \preceq \phi')$,
where $\phi \preceq \phi'$ means $\phi$ is at least as good as $\phi'$; this ensures that
the generated solution is better than all other alternative solutions.
Just like with CEGIS, this can be weakened to consider only
alternative solutions in a carefully selected set $G$.
The key observation is that if the set $G$ contains only
solutions that are already known to satisfy the correctness
constraint $Q$, this can lead to dramatic simplification of
the formula.
This suggests a generalization of CEGIS where every time
the basic CEGIS algorithm finds a solution $P$ that satisfies
the correctness constraint for all inputs, this solution
is added to the set G, and the inductive synthesizer
is then asked to satisfy the additional constraint that
any new solution it produces must be strictly better than
$P$. This means that unlike the normal CEGIS, which
terminates when the checking phase produces UNSAT,
this algorithm must keep going. Having the checker produce
UNSAT simply gives us a new estimate for the best solution.
The algorithm only stops when the inductive synthesizer yields
UNSAT. This means that there is no solution that is strictly
better than the current best solution and that works for
even the restricted sample of inputs $E$.
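The loop below sketches that algorithm. The `synthesize` and `check` callbacks stand in for the inductive synthesizer and the checker; their interfaces are hypothetical, and the "strictly better than `best`" requirement is assumed to be encoded inside the synthesis query as described above.

```python
def cegis_min(synthesize, check):
    # Sketch of CEGIS extended with a minimality constraint.
    #   synthesize(E, best): a program consistent with the examples E and
    #       strictly better than `best` (ignored when best is None), or
    #       None when that query is UNSAT.
    #   check(P): a counterexample input on which P is wrong, or None.
    E, best = [], None
    while True:
        P = synthesize(E, best)
        if P is None:        # synthesizer UNSAT: nothing consistent with E
            return best      # beats `best`, so `best` is optimal
        cex = check(P)
        if cex is not None:
            E.append(cex)    # ordinary CEGIS step: grow the example set
        else:
            best = P         # checker says P is correct on all inputs; it
                             # becomes the new bound and the loop continues
```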
Search with probabilistic grammars
[Slides 17-20]
The approaches for length minimization we have described so
far work with relatively few changes for some
richer representations of the distribution.
For example, one kind of representation involves probabilistic
grammars. A simple form of a probabilistic grammar is one
where every production has a probability, and the probabilities
of all the productions for a given non-terminal add up to one.
Under this formalism, the probability of an expression
generated by the grammar is the product of the probabilities
of all the production rules used to create the expression.
Generating an expression that maximizes the probability
under this scheme is very similar to generating an
expression that minimizes length. For example, from
the perspective of bottom-up search, this class of distributions
has the advantage of being compositional and context-insensitive.
This means that for every equivalence class of
behaviorally equivalent expressions we only need to keep the one
with the maximum probability.
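Concretely, the only change relative to the bottom-up sketch shown earlier is the update rule for the equivalence-class table: keep the most probable expression per behavioral signature instead of the smallest one. The rule probabilities and names below are illustrative.

```python
import math

# Illustrative log-probabilities for the productions of a toy PCFG.
RULE_LOGP = {"x": math.log(0.4), "0": math.log(0.3), "1": math.log(0.3),
             "+": math.log(0.6), "*": math.log(0.4)}

def apply_op(op, a, b):
    return a + b if op == "+" else a * b

def combine(op, left, right):
    # A candidate is (log-probability, text, signature). The probability of
    # the combined expression is the rule probability times the children's
    # probabilities, i.e. a sum in log space.
    lp = RULE_LOGP[op] + left[0] + right[0]
    sig = tuple(apply_op(op, a, b) for a, b in zip(left[2], right[2]))
    return (lp, f"({left[1]} {op} {right[1]})", sig)

def update(best, cand):
    # Per signature, keep only the most probable expression seen so far.
    lp, _, sig = cand
    if sig not in best or lp > best[sig][0]:
        best[sig] = cand
```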
How do we know that by keeping only the expression with maximum
probability we are not missing out on an expression that may
prove useful later? We need to prove that
the expression $exp$ that this algorithm produces is indeed
the highest probability expression. We can prove this by
induction on the depth of expressions. We can show
that for any depth $i$, the algorithm will maintain
the highest probability expression for each equivalence
class of behaviors. For $i=1$ this is trivial, because
the algorithm is precisely keeping the highest
probability expression for each equivalence class.
For $i>1$ we can use the inductive hypothesis: any expression $exp$ we construct
will have as its sub-expressions the highest-probability
expressions in their respective equivalence classes,
and because the probability of $exp$ is just
the probability of the root production times
the probabilities of the sub-expressions,
maximizing the probability of the sub-expressions
maximizes the probability of $exp$ as a whole.
This type of probabilistic grammar also works well
with top-down search, as well as with constraint-based techniques.
In the case of constraint-based techniques, it is often useful
to take the logs of the probabilities so that they are added instead
of multiplied, and to convert them into integers through scaling
and rounding.
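For example, a probability can be converted into an additive integer cost as sketched below; the scale factor is an assumed precision knob, and minimizing the summed costs then corresponds to maximizing the product of the probabilities.

```python
import math

SCALE = 100  # illustrative trade-off between precision and constraint size

def cost(p):
    # -log turns products into sums; scaling and rounding yields an
    # integer objective a constraint solver can minimize.
    return round(-math.log(p) * SCALE)

# P(e) = 0.6 * 0.4 * 0.4 becomes cost 51 + 92 + 92 = 235.
print(sum(cost(p) for p in (0.6, 0.4, 0.4)))
```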
Context-sensitive probabilistic grammars
[Slides 21-28]
A richer kind of probabilistic grammar is one where the probabilities depend on the
context as illustrated in the figure. In this setting, the probability of a given production
depends on where the production is used. This kind of probabilistic grammar
poses a problem for bottom-up search, because given two equivalent ASTs, it may no longer
be possible to determine that one has strictly higher probability than the other.
For example, consider the case illustrated in the figure. The probability of the expressions
$x*2$ and $x+x$ depends on the context. In some contexts, the first one will have
a higher probability and in other contexts, the second one will, so if the
bottom-up algorithm wants to be able to produce the highest probability expression it needs
to keep them both.
Now, one of the advantages of eliminating expressions based on observational equivalence is that
eliminating a single expression early on can have exponential benefits, since it also eliminates
the exponentially large number of expressions that can be constructed from that expression.
So an important question is: Now that we are forced to keep multiple equivalent expressions,
will that make the algorithm exponentially more expensive? The answer is surprisingly NO!
To understand why, consider again the example with the two expressions, $x*2$ and $x+x$.
For every expression that is constructed with $x*2$, there will now be an equivalent expression
that uses $x+x$ instead. However, we will not actually be keeping all these equivalent
expressions. An expression only needs to be kept if there is some context in which it is better than
all other equivalent expressions. This means that for every equivalence class, we need to keep
at most $N$ expressions, where $N$ is the number of distinct contexts in the grammar. This
is worse than the original algorithm, which only had to keep one representative from every
equivalence class, but only by a constant factor.
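A sketch of the adjusted update rule: the table is now keyed by behavior and context, so each equivalence class keeps at most one expression per context. The context names are hypothetical.

```python
# best maps (signature, context) -> the most probable expression with that
# behavior when it appears in that context. CONTEXTS is a stand-in for the
# finite set of contexts the grammar distinguishes.
CONTEXTS = ("arg-of-plus", "arg-of-times", "top-level")

def update(best, sig, text, logp_in):
    # logp_in: context -> log-probability of this expression in that context.
    for ctx in CONTEXTS:
        key = (sig, ctx)
        if key not in best or logp_in[ctx] > best[key][0]:
            best[key] = (logp_in[ctx], text)
```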
In general, though, what we find is that once we go beyond simple distributions,
and in particular once we have context-dependent probabilities, top-down search
becomes the best option for finding expressions that maximize the probability.
Sampling from a distribution
[Slide 29]
So far we have been assuming that the goal is to find a single most likely program,
but what if we are actually interested in sampling from the distribution?
As before, we focus on the case where $P(evidence | f)$ is zero when $f$ does not
fit the evidence and uniform otherwise. In this case, a simple but inefficient way to
proceed is rejection sampling: the idea is to sample from
the prior $P(f)$, and then for every $f$ that is drawn, to check whether it
is consistent with the evidence. If it is not, it simply gets rejected and
a new sample is drawn. Rejection sampling is simple to implement and understand,
but it can be extremely inefficient in contexts where most programs are not
consistent with the evidence. For even the simplest synthesis problems, it can
take thousands of trials before a single $f$ is drawn that is consistent with
the evidence.
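A minimal sketch, assuming `sample_prior` draws programs from the prior $P(f)$ and `consistent` checks a program against the evidence; both are hypothetical stand-ins.

```python
def rejection_sample(sample_prior, consistent, max_tries=100_000):
    # Draw from the prior and keep the first program that fits the
    # evidence; every inconsistent draw is thrown away, which is why this
    # is hopeless when consistent programs are rare under the prior.
    for _ in range(max_tries):
        f = sample_prior()
        if consistent(f):
            return f
    return None  # give up after too many rejections
```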
In recent years, there has been significant progress on the problem of
sampling the solution space of a set of constraints. The basic ideas
go back to a paper by Bellare, Goldreich and Petrank [BELLARE2000510], and were further
refined by Gomes, Sabharwal and Selman [GomesNIPS2006] and most recently by Chakraborty,
Meel and Vardi [Chakraborty2013]. In a paper in 2016, Ellis, Solar-Lezama and Tenenbaum
applied these ideas to the problem of sampling the solutions to a synthesis
problem, essentially applying the ideas from these prior works in the
context of constraint-based synthesis [EllisNIPS2016].
[Slides 30-32]
The figure aims to illustrate some of the key ideas behind the approach.
The green rectangle corresponds to the space of possible programs, where
each program is represented by a set of parameters $\phi$. The light-green
regions in the figure correspond to those parameters that satisfy the correctness
constraint, which we represent as a predicate $Q(\phi)$. The set of
correct programs satisfying $Q(\phi)$ is potentially very small
relative to the overall set of possible solutions, which is why rejection
sampling is likely to fail. Moreover, there is no reason to believe that
the correct solutions would be distributed uniformly over the space;
in fact, they are most likely concentrated in small patches of programs
that differ only in small cosmetic ways.
The first important idea behind this class of approaches is to use a hash function $h$ (or
more often a set of randomized hash functions) to map from the original
space $\Phi$ to a new space $\Theta$. The effect of hashing is that
the solutions that used to be concentrated in small patches in $\Phi$
are now scattered uniformly throughout the space $\Theta$.
The second important idea is to partition the space $\Theta$ into a
set of regions of uniform size. Sampling then happens in two phases:
in the first phase, a partition is chosen at random, and in the
second phase, the solver is asked to choose a valid solution within
the chosen partition. The key observation is that in the limit
as the size of each partition becomes smaller, this scheme approaches
rejection sampling, since each partition will have only a small
number of points and most partitions are likely to contain no
valid solutions. With bigger partitions, however, the probability
that a random partition will contain a solution increases.
The sweet spot happens when each partition contains one solution on
average, because then every time a random partition is chosen,
the solver will find a solution within that partition.
If the partitions get even bigger, though, the quality of the
sampling degrades, because if a partition contains more than one
solution, the choice of which solution within the partition
gets picked becomes arbitrary; nothing prevents the solver from
always picking the same solution, for example.
The basic intuition outlined above is complicated a bit by the
need to have the constraints be tractable. In particular,
the basic scheme outlined above requires the constraint
solver to invert a hash function, which can be extremely challenging even
for state-of-the-art SAT solvers.
Practical schemes reduce the step of picking a hash function and a partition
within the hash space $\Theta$ to simply adding a carefully constructed
set of randomized XOR-constraints to the original problem.
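The toy sketch below illustrates the mechanism by brute force rather than with a real solver: programs are $N$-bit vectors, $Q$ is an arbitrary stand-in predicate, and each random XOR (parity) constraint cuts the space roughly in half, so $m$ constraints select a random cell of relative size $2^{-m}$; choosing $m \approx \log_2$ of the number of solutions leaves about one solution per cell. All names and numbers here are illustrative.

```python
import random

N = 12  # illustrative number of bits in the parameter vector phi

def Q(phi):
    # Stand-in correctness predicate; its solutions are clustered on purpose.
    return phi & 0b111000 == 0b101000

def xor_sample(m, tries=100):
    for _ in range(tries):
        # m random parity constraints: keep phi iff parity(phi & mask) == rhs.
        cons = [(random.getrandbits(N), random.getrandbits(1)) for _ in range(m)]
        cell = [phi for phi in range(2 ** N)
                if Q(phi) and all(bin(phi & mask).count("1") % 2 == rhs
                                  for mask, rhs in cons)]
        if cell:            # a real system would ask the solver for one
            return cell[0]  # element instead of enumerating the cell
    return None

n_solutions = sum(Q(phi) for phi in range(2 ** N))   # 512 in this toy example
m = n_solutions.bit_length() - 1                     # aim for ~1 solution per cell
print(n_solutions, m, xor_sample(m))
```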
Another complication involves what happens when the goal is not to
sample uniformly, but to sample according to a more complex distribution
$P(f)$. This problem was addressed by Chakraborty, Fremont, Meel, Seshia and Vardi
[ChakrabortyFMSV14] in the case of general SAT problems. In the case of programs,
the Ellis et al. paper mentioned earlier also addressed this problem
for simple classes of distributions [EllisNIPS2016].