Lecture 20: A Bayesian View of Synthesis

Up to this point, our focus with synthesis has been to find a program that satisfies its specification. However, in many situations, satisfying the specification is not sufficient, because providing a complete specification that rules out every undesirable program is very challenging. For example, we already discussed the programming-by-example setting, where the specification consists of a series of examples or demonstrations. In general, for any finite set of examples there will be an infinite set of programs that satisfy them. In Unit 1, we discussed two main approaches to avoiding this problem: (a) restricting the language and (b) prioritizing small programs. In this Unit, we will describe a more general approach to dealing with underspecification by relying on probabilities.

Throughout this section, we will be relying heavily on Bayes' theorem, which will allow us to speak formally about our belief in the relative likelihood that different programs are the program desired by the user, and to update those beliefs in the face of new evidence.

For example, consider the programming-by-example setting. Given a set of input/output examples, the goal is to find a function $f$ that matches those examples. We can see those input/output examples as a form of evidence, and we can frame the synthesis problem as the problem of finding the function that has the highest probability given the evidence. As the figure illustrates, we can use Bayes' rule to compute the probability of a function given the evidence in terms of the probability of the evidence given the function times a background probability or prior for the function itself.

A subtle but important point is that for the purpose of finding an optimal $f$, we can ignore $P(evidence)$, since it does not depend on $f$. However, this only works if $P(evidence) \neq 0$. In theory we can assume that any evidence for which $P(evidence)$ is zero will never be drawn (if it can be drawn, its probability cannot be zero). Providing any evidence for which $P(evidence) = 0$ will make the problem ill-posed.
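Written out, the argument for dropping the normalizing term is the following chain of equalities, which holds whenever $P(evidence) > 0$:

$$
\arg\max_f P(f \mid evidence) \;=\; \arg\max_f \frac{P(evidence \mid f)\, P(f)}{P(evidence)} \;=\; \arg\max_f P(evidence \mid f)\, P(f)
$$

Dividing every candidate's score by the same positive constant does not change which candidate scores highest.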

This formulation of the problem can be seen as a strict generalization of the original program synthesis formulation. For example, suppose that all the I/O examples were picked uniformly at random and the desired outputs were captured precisely; then $P(evidence | f)$ is uniform among all examples that are consistent with the function and zero for all examples inconsistent with $f$. This means that any $f$ inconsistent with the evidence will have probability zero. Moreover, if the prior probability $P(f)$ is uniform over all programs in a restricted language, then all programs inconsistent with the examples will have probability zero, and all programs consistent with them will be equally likely, so any program satisfying the examples will be a valid solution to the synthesis problem. So under this prior and this $P(evidence | f)$, the problem reduces to the standard synthesis problem we have been studying all along.
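As a minimal sketch of this reduction, the following Python fragment computes the posterior over a tiny, made-up language of five integer functions. The candidate set and the helper names (`CANDIDATES`, `likelihood`, `posterior`) are assumptions introduced only for illustration. The likelihood is 1 for consistent programs and 0 otherwise, which is proportional to the uniform $P(evidence | f)$ described above because the probability of picking the inputs does not depend on $f$.

```python
from fractions import Fraction

# Hypothetical restricted language: a handful of named integer functions.
CANDIDATES = {
    "x + 1": lambda x: x + 1,
    "x * 2": lambda x: x * 2,
    "x * x": lambda x: x * x,
    "x + x": lambda x: x + x,
    "2": lambda x: 2,
}

def likelihood(f, examples):
    """P(evidence | f) up to a constant: 1 if f matches every example, else 0."""
    return Fraction(1) if all(f(i) == o for i, o in examples) else Fraction(0)

def posterior(examples):
    """Normalized P(f | evidence) under a uniform prior over CANDIDATES."""
    prior = Fraction(1, len(CANDIDATES))
    scores = {name: likelihood(f, examples) * prior
              for name, f in CANDIDATES.items()}
    total = sum(scores.values())  # proportional to P(evidence)
    if total == 0:
        raise ValueError("P(evidence) = 0: the problem is ill-posed")
    return {name: s / total for name, s in scores.items()}

post = posterior([(3, 6), (0, 0)])
for name, p in post.items():
    print(name, p)
```

With these two examples, only `x * 2` and `x + x` are consistent, and they split the posterior mass equally; every other candidate gets probability zero. Picking any maximizer is then the same as picking any program that satisfies the examples.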

Now, in synthesis it has often been helpful to prioritize smaller programs because they are less likely to contain extraneous details not required to satisfy the examples. In fact, some of the methods we studied early in the course organize the search to find such short programs first. This desire for short programs can be framed in terms of a prior by replacing the uniform probability on $f$ with one that decreases monotonically with the length of the program. In the case where you are looking for the most probable program (the maximum a posteriori program) under the uniform $P(evidence | f)$ we have discussed so far, it actually does not matter how exactly $P(f)$ decreases with length, since we will still be looking for the smallest possible program that is consistent with the evidence. However, if you are going to sample from this distribution, it is important to keep in mind that there are exponentially many more long programs than short programs, so unless the probability decays exponentially with length, you will still be more likely to sample a long program than a short program.
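The sampling caveat can be checked numerically. The sketch below assumes, purely for illustration, that there are `branching**n` distinct programs of length `n` and that lengths are capped at `max_len` so the distribution can be normalized. The function name `mass_by_length` and both decay schedules are hypothetical choices, not taken from the lecture.

```python
def mass_by_length(weight, branching=4, max_len=10):
    """Normalized probability that a sampled program has each length 1..max_len.

    Assumes (hypothetically) branching**n distinct programs of length n,
    each given unnormalized prior weight(n).
    """
    raw = [branching ** n * weight(n) for n in range(1, max_len + 1)]
    total = sum(raw)
    return [r / total for r in raw]

# Per-program weight decays polynomially: the program count wins.
poly = mass_by_length(lambda n: 1.0 / n ** 2)
# Per-program weight decays exponentially, faster than the count grows.
expo = mass_by_length(lambda n: (2 * 4) ** -n)

print("polynomial decay: P(len=1) =", poly[0], " P(len=10) =", poly[-1])
print("exponential decay: P(len=1) =", expo[0], " P(len=10) =", expo[-1])
```

Under the polynomial schedule, most of the probability mass sits on the longest programs even though each individual long program is less likely than each short one. Under the exponential schedule with a base larger than the branching factor, the shortest length is the most likely to be sampled.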

Synthesis under errors

The Bayesian approach can also be extended to deal with the case of learning from noisy data. In this case, the probability $P(evidence | f)$ needs to be updated to account for the fact that evidence inconsistent with the function no longer has probability zero. For example, an interesting case to consider is the case where off-by-one errors are possible in the data. The figure illustrates a possible distribution under such an assumption. An interesting observation is that the possibility of errors in the data introduces a necessary tradeoff between the probability of a function and the amount of error it generates. Recall that earlier, when we described the probabilities decreasing with length, we mentioned that as long as the probability decreased monotonically with length, the exact form of the distribution did not matter if we were looking for the most likely function. This is no longer the case once we are in a setting where the probability of a function must be traded off against the probability of the evidence given the function. Specifically, a distribution that causes the probability of $f$ to drop very precipitously with larger programs will push the synthesizer to pick a very short program even at the expense of failing to match some inputs exactly. On the other hand, a distribution that decreases very slowly may lead the synthesizer to prioritize matching the examples precisely.
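The tradeoff can be made concrete with a small sketch. Everything here is an assumption chosen for illustration: the two candidate programs, their sizes (counted by hand as AST nodes), the error model (exact output with probability 0.8, off by $+1$ or $-1$ with probability 0.1 each), and a prior proportional to `base ** -size`. It is not the distribution shown in the figure.

```python
# Hypothetical candidates: (source text, size in AST nodes, function).
CANDIDATES = [
    ("2 * x", 3, lambda x: 2 * x),
    ("x * (x + 1) // 2 + 1", 9, lambda x: x * (x + 1) // 2 + 1),
]

def noise(observed, predicted):
    """Assumed error model: exact w.p. 0.8, off by +1 or -1 w.p. 0.1 each."""
    diff = abs(observed - predicted)
    if diff == 0:
        return 0.8
    return 0.1 if diff == 1 else 0.0

def score(size, f, examples, base):
    """Unnormalized posterior P(evidence | f) * P(f), with P(f) ~ base**-size."""
    likelihood = 1.0
    for x, y in examples:
        likelihood *= noise(y, f(x))
    return likelihood * base ** -size

def most_likely(examples, base):
    """Return the source text of the highest-scoring candidate."""
    best = max(CANDIDATES, key=lambda c: score(c[1], c[2], examples, base))
    return best[0]

# Outputs of 2 * x, with an off-by-one error on the last example.
examples = [(1, 2), (2, 4), (3, 7)]
print("steep prior (base=2.0):  ", most_likely(examples, base=2.0))
print("shallow prior (base=1.2):", most_likely(examples, base=1.2))
```

With the steep prior, the short program `2 * x` wins even though it misses the last output by one. With the shallow prior, the longer program that fits all three examples exactly wins instead. The data and the error model are identical in both runs; only the rate at which the prior decays has changed.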

Ellis et al. [Ellis2018] explored such a tradeoff between the prior over functions and the output error in their 2018 paper on synthesizing drawing programs from hand-drawn diagrams. This paper uses a neural network to translate a hand-drawn diagram into straight-line code that issues individual drawing commands, and then relies on the Sketch synthesizer to produce a program that is semantically equivalent to the generated sequence of commands. A prior over the space of programs favors programs with loops rather than long sequences of individual drawing commands, and penalizes branches that special-case individual loop iterations.

This prior over programs can be traded off against a simple error model that tolerates small errors in the drawing commands, allowing the synthesizer to make up for small errors in perception. This is illustrated in the second panel of the figure on the right. The first column shows two hand drawings, and the middle column shows the result of running the commands generated by the neural network. The last column shows the result of the most likely program, which gives up some fidelity to the output of the neural network in exchange for a simpler (and therefore higher-probability) program.

Unsupervised Learning

Another generalization of the simple programming-by-example synthesis problem is the unsupervised learning problem. Unsupervised learning has long been studied by the machine learning community, but it was more recently studied as a synthesis problem by Ellis, Tenenbaum and Solar-Lezama [EllisST15]. Unsupervised learning can be seen as a programming-by-example problem where the system is given only the outputs of the unknown function, so the goal is to find a function and a series of inputs $[in_i]_{i \lt N}$, one for each output $[out_i]_{i \lt N}$, such that $f(in_i)=out_i$.

Without probabilities, this problem is hopelessly underspecified, since one can always trivially choose $f(x)=x$ and $in_i = out_i$ as a solution that satisfies the constraints. However, framed as a probabilistic problem, the goal is to find $f$ and $[in_i]_{i \lt N}$ that maximize the probability of the function and the inputs given the observed outputs. Using Bayes' rule, this probability can be decomposed into a product of simpler distributions as illustrated in the figure.
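One plausible form of such a decomposition is sketched below. It rests on two assumptions made here for illustration, which may differ from the figure: the function and the inputs are independent a priori, and each output depends only on $f$ and its own input.

$$
P(f, [in_i]_{i \lt N} \mid [out_i]_{i \lt N}) \;\propto\; P(f) \prod_{i \lt N} P(in_i)\, P(out_i \mid f, in_i)
$$

Under a decomposition of this shape, the trivial solution $f(x)=x$, $in_i = out_i$ is no longer free: it pays the full cost $\prod_{i \lt N} P(out_i)$ under the input prior, so it tends to score poorly whenever that prior assigns low probability to inputs as complex as the outputs themselves.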

One feature of the unsupervised learning problem is that both the inputs and the function are unknown. However, in many situations, it is really only the function that we are interested in. So one obvious question is whether one should marginalize over the inputs or not. In other words, should one optimize to find the $(f, [in_i]_i)$ that maximize $P(f, [in_i]_i | [out_i]_i)$, or to find the $f$ that maximizes $P(f | [out_i]_i)=\sum_{[in_i]_i} P(f, [in_i]_i | [out_i]_i)$?

Marginalization over the inputs is more expensive than simply optimizing the joint distribution, especially for symbolic methods where the integration/summation over all possible inputs is potentially intractable. Moreover, for many applications, the inputs themselves are valuable, as they reflect important information about the process one wants to understand, so there is a strong incentive to optimize the joint distribution. All the examples in the original Ellis et al. paper [EllisST15] were solved using the joint distribution.
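The two objectives can pick different functions. The toy sketch below enumerates every input tuple over a small domain, which is only feasible because the domain is tiny. The input domain, the two candidate functions, and their priors are all hypothetical, as are the helper names. This brute-force enumeration is not the method of the Ellis et al. paper.

```python
from itertools import product

INPUTS = range(8)                 # hypothetical input domain, uniform prior
P_INPUT = 1.0 / len(INPUTS)
FUNCTIONS = {                     # name -> (assumed prior P(f), function)
    "x": (0.6, lambda x: x),
    "4 * (x // 4)": (0.4, lambda x: 4 * (x // 4)),
}

def joint_scores(outputs):
    """Unnormalized P(f, inputs | outputs) for each consistent (f, inputs)."""
    scores = {}
    for name, (p_f, f) in FUNCTIONS.items():
        for ins in product(INPUTS, repeat=len(outputs)):
            if all(f(i) == o for i, o in zip(ins, outputs)):
                scores[(name, ins)] = p_f * P_INPUT ** len(ins)
    return scores

def best_joint(outputs):
    """Maximize over (function, inputs) pairs."""
    scores = joint_scores(outputs)
    return max(scores, key=scores.get)

def best_marginal(outputs):
    """Sum the joint scores over inputs, then maximize over functions."""
    totals = {}
    for (name, _), s in joint_scores(outputs).items():
        totals[name] = totals.get(name, 0.0) + s
    return max(totals, key=totals.get)

outputs = [4, 0]
print("joint:   ", best_joint(outputs))
print("marginal:", best_marginal(outputs))
```

Here the joint objective selects the identity function with inputs `(4, 0)`, because that single pair has the highest score. The marginal objective selects `4 * (x // 4)`, because sixteen different input tuples explain the same outputs and their scores add up. The loop over `product(INPUTS, repeat=len(outputs))` grows exponentially with the number of outputs, which is one way to see why marginalization is the more expensive option.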