Introduction to Program Synthesis

© Armando Solar-Lezama. 2018. All rights reserved.

Lecture 21: Synthesizing Under a Distribution

Lecture21:Slide4; Lecture21:Slide5; Lecture21:Slide6; Lecture21:Slide7 In the previous lecture, we established that the Bayesian view of the synthesis problem is a strict generalization of the simple view we have been studying all along. As one would expect, such a generalization can make the synthesis problems much more difficult to solve. In this lecture we will describe a few different methods that can be used to synthesize programs under particular classes of program distributions.

Length minimization

We start by considering the program distribution described earlier where the probability of the program decays as its length increases, and we focus for now on the $P(evidence | f)$ that assigns a probability of zero to any evidence that is inconsistent with the function and is uniform otherwise. As was mentioned in the last lecture, finding the most likely program under these distributions reduces to finding the shortest program under some definition of "short".

When representing programs as ASTs, a common way to characterize length is based on the height of the AST. Some of the techniques we discussed so far already guarantee length minimization. For example, the basic bottom-up search strategy described in Lecture 3 already guarantees that the shortest program will be found first. The only requirement is that when two observationally equivalent programs are found, the algorithm must always keep the shortest one. It is important to note, however, that some of the improvements to the basic bottom-up algorithm discussed in Lecture 3 no longer guarantee length minimization. For example, the STUN approach needs to trade off the size of the solution in a given branch with the strength of the predicate guarding that branch, and that tradeoff may lead to programs that are not minimal.
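
To make the bottom-up strategy concrete, the following is a minimal sketch of that search over a made-up arithmetic grammar; the grammar, the example inputs and the tuple encoding of ASTs are illustrative assumptions, not part of the lecture.

    # Minimal sketch of bottom-up search with observational-equivalence pruning
    # that keeps the shortest representative of each behavior. The grammar, the
    # inputs, and the tuple encoding of ASTs are illustrative assumptions.

    INPUTS = [1, 2, 3]   # example inputs that define observational equivalence

    def evaluate(expr, x):
        op = expr[0]
        if op == 'x':   return x
        if op == 'lit': return expr[1]
        if op == '+':   return evaluate(expr[1], x) + evaluate(expr[2], x)
        if op == '*':   return evaluate(expr[1], x) * evaluate(expr[2], x)

    def size(expr):
        return 1 + sum(size(e) for e in expr[1:] if isinstance(e, tuple))

    def bottom_up(rounds):
        best = {}   # behavior signature -> shortest expression with that behavior
        def consider(expr):
            sig = tuple(evaluate(expr, x) for x in INPUTS)
            if sig not in best or size(expr) < size(best[sig]):
                best[sig] = expr
        for leaf in [('x',), ('lit', 0), ('lit', 1), ('lit', 2)]:
            consider(leaf)
        for _ in range(rounds):
            pool = list(best.values())
            for a in pool:
                for b in pool:
                    consider(('+', a, b))
                    consider(('*', a, b))
        return best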

The top-down search approach from Lecture 4 does not immediately guarantee minimality of the solution, but it is relatively easy to force it to do so. Recall from Lecture 4 that the algorithm works by expanding partially constructed solutions; at each step of the algorithm, there is a wavefront of partially constructed solutions that are ready to be expanded, and it is up to the algorithm to decide which one to expand next. Expanding programs in a breadth-first manner, so that smaller programs are expanded before larger ones, guarantees that the minimal solution is found first.
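
The sketch below makes this concrete: partial programs are kept in a priority queue ordered by their current size, so the first complete program that satisfies the specification is a minimal one. The grammar, the flat token encoding of partial programs and the satisfies callback are again illustrative assumptions.

    import heapq
    import itertools

    # Sketch of top-down search that always expands the smallest partial program
    # first, so the first complete solution found is minimal. The grammar, the
    # token representation, and the `satisfies` callback are illustrative only.
    GRAMMAR = {
        'E': [('x',), ('1',), ('+', 'E', 'E'), ('*', 'E', 'E')],
    }

    def expansions(partial):
        # replace the leftmost non-terminal in every possible way
        for i, tok in enumerate(partial):
            if tok in GRAMMAR:
                for rhs in GRAMMAR[tok]:
                    yield partial[:i] + list(rhs) + partial[i + 1:]
                return

    def is_complete(partial):
        return all(tok not in GRAMMAR for tok in partial)

    def top_down_min(satisfies):
        tie = itertools.count()                       # tie-breaker for the heap
        frontier = [(1, next(tie), ['E'])]            # wavefront ordered by size
        while frontier:
            _, _, partial = heapq.heappop(frontier)   # smallest partial program first
            if is_complete(partial):
                if satisfies(partial):
                    return partial                    # minimal satisfying program
            else:
                for nxt in expansions(partial):
                    heapq.heappush(frontier, (len(nxt), next(tie), nxt))
        return None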

Lecture21:Slide8; Lecture21:Slide9; Lecture21:Slide10; Lecture21:Slide11; Lecture21:Slide12; Lecture21:Slide13; Lecture21:Slide14; Lecture21:Slide15; Lecture21:Slide16 Incorporating a minimality constraint in the context of symbolic search is trickier. For example, consider the constraint-based synthesis approach from Lecture 8. This approach, combined with CEGIS, is used as the basis of the Sketch synthesis system, but CEGIS by itself does not impose a minimality constraint. To understand how this approach can be generalized to enforce minimality, consider the basic principle behind CEGIS. The key idea behind CEGIS is that the synthesis constraint $\exists \phi. \forall in. Q(\phi, in)$ can be weakened to focus only on a small subset of inputs $E$. The weakened constraint is more efficient for a solver, since the inner $\forall$ quantification reduces to a simple conjunction. The weakened constraint allows some incorrect solutions, which must be caught by the checker that is part of CEGIS.

This same basic idea was generalized in a paper by Singh, Gulwani and Solar-Lezama Singh:2013 to handle minimality constraints. As the figure indicates, the minimality constraint adds additional quantifiers to the formula to ensure that the generated solution is better than all other alternative solutions. Just like with CEGIS, this can be weakened to consider only alternative solutions in a carefully selected set $G$. The key observation is that if the set $G$ contains only solutions that are already known to satisfy the correctness constraint $Q$, this can lead to dramatic simplification of the formula.

This suggests a generalization of CEGIS where every time the basic CEGIS algorithm finds a solution $\phi$ that satisfies the correctness constraint for all inputs, this solution is added to the set $G$, and the inductive synthesizer is then asked to satisfy the additional constraint that any new solution it produces must be strictly better than $\phi$. This means that unlike normal CEGIS, which terminates when the checking phase produces UNSAT, this algorithm must keep going. Having the checker produce UNSAT simply gives us a new estimate for the best solution. The algorithm only stops when the inductive synthesizer yields UNSAT. This means that there is no solution that is strictly better than the current best solution and that works even on the restricted sample of inputs $E$.
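
The loop below sketches this generalization; inductive_synth (which returns None when the weakened constraint plus the cost bound is unsatisfiable), check (which returns a counterexample input or None) and cost are placeholders for the solver calls, not a concrete API.

    # Schematic sketch of CEGIS extended with a minimality constraint.
    # `inductive_synth`, `check`, and `cost` stand in for solver calls; they are
    # placeholders, not the interface of any concrete system.

    def cegis_minimize(inductive_synth, check, cost):
        E = []            # counterexample inputs collected so far
        best = None       # best verified solution found so far
        while True:
            # Ask for a candidate consistent with E that is strictly cheaper
            # than the current best (no cost bound if we have no best yet).
            bound = cost(best) if best is not None else None
            phi = inductive_synth(E, bound)
            if phi is None:          # UNSAT: nothing cheaper works even on E
                return best          # current best is optimal
            cex = check(phi)         # look for an input where phi fails
            if cex is None:          # phi is correct on all inputs
                best = phi           # new best estimate; keep going
            else:
                E.append(cex)        # refine the input sample and retry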

Search with probabilistic grammars

Lecture21:Slide17; Lecture21:Slide18; Lecture21:Slide19; Lecture21:Slide20 The approaches for length minimization we have described so far work with relatively few changes for some richer representations of the distribution. For example, one kind of representation involves probabilistic grammars. A simple form of a probabilistic grammar is one where every production has a probability, and the probabilities of all the productions for a given non-terminal add up to one. Under this formalism, the probability of an expression generated by the grammar is the product of the probabilities of all the production rules used to create the expression.

Generating an expression that maximizes the probability under this scheme is very similar to generating an expression that minimizes length. For example, from the perspective of bottom-up search, distributions of this kind have the advantage that they are compositional and context insensitive. This means that for every equivalence class of behaviorally equivalent expressions we only need to keep the one with the maximum probability. How do we know that by keeping only the expression with maximum probability we are not missing out on an expression that may prove useful later? We need to prove that the expression $exp$ that this algorithm produces is indeed the highest probability expression. We can prove this by induction on the depth of expressions. We can show that for any depth $i$, the algorithm will maintain the highest probability expression for each equivalence class of behaviors. For $i=1$ this is trivial, because the algorithm is precisely keeping the highest probability expression for each equivalence class. For $i>1$ we can use the inductive hypothesis, so we know that any expression $exp$ we construct will have as its sub-expressions the highest probability expressions in their respective equivalence classes, and because the probability of $exp$ is just the probability of the root node times the probabilities of the sub-expressions, maximizing the probability of the sub-expressions maximizes the probability of the overall tree.
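
In terms of the earlier bottom-up sketch, the only change is the pruning criterion: instead of comparing sizes, compare probabilities, or equivalently sums of log-probabilities of the productions used. The production probabilities below are made-up values that sum to one for the single non-terminal.

    import math

    # Variation on the earlier bottom-up sketch: keep, for each behavior
    # signature, the expression with the highest probability under a simple
    # probabilistic grammar. The probabilities are made-up values.
    RULE_LOGP = {'x': math.log(0.4), 'lit': math.log(0.2),
                 '+': math.log(0.3), '*': math.log(0.1)}

    def logprob(expr):
        # log P(expr) = sum of the log-probabilities of the productions used
        return RULE_LOGP[expr[0]] + sum(
            logprob(e) for e in expr[1:] if isinstance(e, tuple))

    def consider(best, expr):
        sig = tuple(evaluate(expr, x) for x in INPUTS)   # as in the earlier sketch
        if sig not in best or logprob(expr) > logprob(best[sig]):
            best[sig] = expr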

This type of probabilistic grammar also works well with top-down search, as well as with constraint-based techniques. In the case of constraint-based techniques, it is often useful to take the logs of the probabilities so that they are added instead of multiplied, and to convert them into integers through scaling and rounding.
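
For instance, a probability p can be converted to an integer cost along the following lines, where the scale factor is an arbitrary choice; minimizing the sum of these costs then approximates maximizing the product of the probabilities.

    import math

    SCALE = 100   # arbitrary scaling factor; larger values lose less precision

    def int_cost(p):
        # cost = -log(p), scaled and rounded so a solver can add integer costs
        return round(-math.log(p) * SCALE)

    # probabilities 0.5, 0.3 and 0.2 become costs 69, 120 and 161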

Context sensitive probabilistic grammars

Lecture21:Slide21; Lecture21:Slide22; Lecture21:Slide23; Lecture21:Slide24; Lecture21:Slide25; Lecture21:Slide26; Lecture21:Slide27 A richer kind of probabilistic grammar is one where the probabilities depend on the context as illustrated in the figure. In this setting, the probability of a given production depends on where the production is used. This kind of probabilistic grammar poses a problem for bottom-up search, because given two equivalent ASTs, it may no longer be possible to determine that one has strictly higher probability than the other. For example, consider the case illustrated in the figure. The probability of the expressions $x*2$ and $x+x$ depends on the context. In some contexts, the first one will have a higher probability and in other contexts, the second one will, so if the bottom-up algorithm wants to be able to produce the highest probability expression it needs to keep them both.

Now, one of the advantages of eliminating expressions based on observational equivalence is that eliminating a single expression early on can have exponential benefits, since it also eliminates the exponentially large number of expressions that can be constructed from that expression. So an important question is: Now that we are forced to keep multiple equivalent expressions, will that make the algorithm exponentially more expensive? The answer is surprisingly NO!

To understand why, consider again the example with the two expressions, $x*2$ and $x+x$. For every expression that is constructed with $x*2$, there will now be an equivalent expression that uses $x+x$ instead. However, we will not actually be keeping all these equivalent expressions. An expression only needs to be kept if there is some context in which it is better than all other equivalent expressions. This means that for every equivalence class, we need to keep at most $N$ expressions, where $N$ is the number of distinct contexts in the grammar. This is worse than the original algorithm, which only had to keep one representative from every equivalence class, but only by a constant factor.
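
As a small sketch of the bookkeeping this requires, the table is now keyed by both the behavior signature and the context; contexts and logprob_in_context below are illustrative placeholders for the grammar's contexts and the context-dependent score.

    # With context-dependent probabilities, keep the best expression per
    # (behavior signature, context) pair instead of one per signature.

    def consider(best, expr, sig, contexts, logprob_in_context):
        for ctx in contexts:
            key = (sig, ctx)
            if key not in best or logprob_in_context(expr, ctx) > logprob_in_context(best[key], ctx):
                best[key] = expr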

In general, though, what we find is that once we go beyond simple distributions, and in particular once we have context dependent probabilities, top-down search becomes the best option for finding expressions that maximize the probability.

Learning probabilistic grammars

Lecture21:Slide28; Lecture21:Slide29; Lecture21:Slide30; Lecture21:Slide31; Lecture21:Slide32; Lecture21:Slide33 A probabilistic grammar is defined by a table of parameters, one for each production of a non-terminal in each context where that non-terminal may appear. Given a corpus of trees from the grammar, it is easy to set up the problem of learning these weights as an optimization problem, as illustrated in the figure. The problem actually reduces to a linear optimization problem over the logs of the probabilities in the table, under the constraint that the probabilities from each row in the table must add to one.
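
For the simple context-free case, the optimization has a closed-form solution: counting how often each production is used and normalizing maximizes the count-weighted sum of log-probabilities. The sketch below uses the same made-up tree encoding as the earlier sketches and, for brevity, treats all productions as belonging to a single non-terminal; a real implementation would keep one set of counts per (non-terminal, context) row.

    from collections import Counter

    # Sketch of maximum-likelihood estimation of production probabilities from
    # a corpus of ASTs, all productions treated as one non-terminal.

    def count_productions(tree, counts):
        counts[tree[0]] += 1
        for child in tree[1:]:
            if isinstance(child, tuple):
                count_productions(child, counts)

    def learn_weights(corpus):
        counts = Counter()
        for tree in corpus:
            count_productions(tree, counts)
        total = sum(counts.values())
        # maximizing sum_r counts[r] * log(p_r) subject to sum_r p_r = 1
        # yields the relative-frequency estimate
        return {rule: c / total for rule, c in counts.items()}

    # learn_weights([('+', ('x',), ('x',)), ('*', ('x',), ('lit', 2))])
    # -> {'+': 1/6, 'x': 1/2, '*': 1/6, 'lit': 1/6}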

More recent work has sought to learn conditional distributions for the grammar as a function of the specification. For example, the DeepCoder project of Balog et al. BalogGBNT17 uses a neural network that encodes certain features of the specification into a vector and then decodes that vector into one with an entry for each element of the grammar, corresponding to the probability that the element appears in the program. Such a network can then be trained to identify which components are more likely to appear in a program given a particular specification.

Sampling from a distribution

Lecture21:Slide35 So far we have been assuming that the goal is to find a single most likely program, but what if we are actually interested in sampling from the distribution? As before, we focus on the case where $P(evidence | f)$ is zero when $f$ does not fit the evidence and uniform otherwise. In this case, a simple but inefficient way to proceed is rejection sampling: the idea is to sample from the prior $P(f)$, and then for every $f$ that is drawn, we check to see if it is consistent with the evidence. If it is not, it simply gets rejected and a new sample is drawn. Rejection sampling is simple to implement and understand, but it can be extremely inefficient in contexts where most programs are not consistent with the evidence. For even the simplest synthesis problems, it can take millions of trials before a single $f$ is drawn that is consistent with the evidence. The sampling techniques we studied in Lecture 5 can generally do much better than simple rejection sampling, but they generally require a carefully chosen proposal distribution and can still struggle with "needle in a haystack" problems where most candidate programs have probability zero.
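
The procedure itself is trivial to write down; in the sketch below, sample_prior and consistent are placeholders for drawing a program from $P(f)$ and checking it against the evidence, and the trial bound is an arbitrary cutoff.

    # Sketch of rejection sampling from the posterior over programs.
    # `sample_prior` and `consistent` are placeholders for the prior sampler
    # and the evidence check.

    def rejection_sample(sample_prior, consistent, max_trials=10**6):
        for _ in range(max_trials):
            f = sample_prior()
            if consistent(f):
                return f      # accepted: a sample from P(f | evidence)
        return None           # gave up: consistent programs are too rare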

In recent years, there has been significant progress on the problem of sampling the solution space of a set of constraints. The basic ideas go back to a paper by Bellare, Goldreich and Petrank BELLARE2000510, and were further refined by Gomes, Sabharwal and Selman GomesNIPS2006 and most recently by Chakraborty, Meel and Vardi Chakraborty2013. In a paper in 2016, Ellis, Solar-Lezama and Tenenbaum applied the ideas from these prior works to the problem of sampling the solutions to a synthesis problem in the context of constraint-based synthesis EllisNIPS2016.

Lecture21:Slide36; Lecture21:Slide37; Lecture21:Slide38 The figure aims to illustrate some of the key ideas behind the approach. The rounded rectangle corresponds to the space of possible programs, where each program is represented by a set of parameters $\phi$. The light-green regions in the figure correspond to those parameters that satisfy the correctness constraint, which we represent as a predicate $Q(\phi)$. The set of correct programs satisfying $Q(\phi)$ is potentially very small relative to the overall set of possible solutions, which is why rejection sampling is likely to fail. Moreover, there is no reason to believe that the correct solutions would be distributed uniformly over the space; in fact, they are most likely concentrated in small patches of programs that differ only in small cosmetic ways.

The first important idea behind this class of approaches is to use a hash function $h$ (or more often a set of randomized hash functions) to map from the original space $\Phi$ to a new space $\Theta$. The effect of hashing is that the solutions that used to be concentrated in small patches in $\Phi$ are now scattered uniformly throughout the space $\Theta$.

The second important idea is to partition the space $\Theta$ into a set of regions of uniform size. Sampling then happens in two phases: in the first phase, a partition is chosen at random, and in the second phase, the solver is asked to choose a valid solution within the chosen partition. The key observation is that in the limit as the size of each partition becomes smaller, this scheme approaches rejection sampling, since each partition will have only a small number of points and most partitions are likely to contain no valid solutions. With bigger partitions, however, the probability that a random partition will contain a solution increases. The sweet spot happens when each partition contains one solution on average, because then every time a random partition is chosen, the solver will find a solution within that partition. If the partitions get even bigger, though, the quality of the sampling degrades, because if a partition contains more than one solution, the choice of which solution within the partition gets picked becomes arbitrary; nothing prevents the solver from always picking the same solution, for example.

The basic intuition outlined above is complicated a bit by the need to have the constraints be tractable. In particular, the basic scheme outlined above requires the constraint solver to invert a hash function, which can be extremely challenging even for state-of-the-art SAT solvers. Practical schemes reduce the step of picking a hash function and a partition within the hash space $\Theta$ to simply adding a carefully constructed set of randomized XOR-constraints to the original problem.
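
The sketch below illustrates the shape of these parity constraints: each one includes every boolean parameter of $\phi$ with probability one half and has a random right-hand side, so $k$ constraints select one of $2^k$ cells of $\Theta$ at random. In a real system the constraints are conjoined with $Q(\phi)$ and handed to the solver rather than checked directly as is done here.

    import random

    # Sketch: k random XOR (parity) constraints over n boolean parameters.
    # Each constraint includes each variable with probability 1/2 and has a
    # random parity; together the k constraints pick one of 2^k hash cells.

    def random_xor_constraints(n_vars, k, rng=random):
        constraints = []
        for _ in range(k):
            vars_in = [i for i in range(n_vars) if rng.random() < 0.5]
            rhs = rng.randrange(2)
            constraints.append((vars_in, rhs))
        return constraints

    def in_cell(assignment, constraints):
        # `assignment` is a 0/1 valuation of the parameters phi
        return all(sum(assignment[i] for i in vars_in) % 2 == rhs
                   for vars_in, rhs in constraints)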

Another complication involves what happens when the goal is not to sample uniformly, but to sample according to a more complex distribution $P(f)$. This problem was addressed by Chakraborty, Fremont, Meel, Seshia and Vardi ChakrabortyFMSV14 in the case of general SAT problems. In the case of programs, the Ellis et al. paper mentioned earlier also addressed this problem for simple classes of distributions EllisNIPS2016.