Introduction to Program Synthesis

© Armando Solar-Lezama. 2018. All rights reserved.

Lecture 20: A Bayesian View of Synthesis

Lecture20:Slide2 Up to this point, our focus with synthesis has been to find a program that satisfies its specification. However, in many situations, satisfying the specification is not sufficient. This is because in many contexts, providing complete specifications that rule out every possible undesirable program is very challenging. For example, we already discussed the programming-by-example setting, where the specification consists of a series of examples or demonstrations. In general, for any finite set of examples there will be an infinite set of programs that can satisfy them. In Unit 1, we discussed two main approaches to avoiding this problem: (a) restricting the language and (b) prioritizing small programs. In this Unit, we will describe a more general approach to dealing with underspecification by relying on probabilities.

Throughout this section, we will be relying heavily on Bayes' theorem, which will allow us to speak formally about our beliefs about the relative likelihood that different programs are the program desired by the user, and to update those beliefs in the face of new evidence.

Lecture20:Slide3; Lecture20:Slide4; Lecture20:Slide5; Lecture20:Slide6; Lecture20:Slide7 For example, consider the programming-by-example setting. Given a set of input/output examples, the goal is to find a function that matches those examples. We can see those input/output examples as a form of evidence, and we can frame the synthesis problem as the problem of finding the function that has the highest probability given the evidence. As the figure illustrates, we can use Bayes rule to compute the probability of a function given the evidence in terms of the probability of the evidence given the function times a background probability or prior for the function itself.

A subtle but important point is that for the purpose of finding an optimal $f$, we can ignore $P(evidence)$, since it does not depend on $f$. However, this only works if $P(evidence) \neq 0$. In theory we can assume that any evidence for which $P(evidence)$ is zero will never be drawn (if it can be drawn, its probability cannot be zero). Providing any evidence for which $P(evidence) = 0$ will make the problem ill posed.
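The scoring described above can be sketched in a few lines of code. The snippet below is an illustrative toy, not part of the lecture: it assumes a tiny hypothesis space of three hand-picked candidate functions and the noiseless likelihood discussed above, where $P(evidence \mid f)$ is a nonzero constant for consistent evidence and zero otherwise. Normalizing the scores divides out $P(evidence)$, and evidence no candidate can explain surfaces as the ill-posed case $P(evidence) = 0$.

```python
from fractions import Fraction

# A toy hypothesis space: each candidate program is a named Python function.
# The three candidates here are illustrative, not from the lecture.
candidates = {
    "identity": lambda x: x,
    "double":   lambda x: 2 * x,
    "square":   lambda x: x * x,
}

def posterior(evidence, prior):
    """Score each candidate f by P(evidence | f) * P(f), then normalize.

    Assumes a noiseless likelihood: P(evidence | f) is a constant when f
    matches every example and 0 otherwise, so the constant cancels.
    """
    scores = {}
    for name, f in candidates.items():
        consistent = all(f(i) == o for i, o in evidence)
        scores[name] = prior[name] if consistent else Fraction(0)
    total = sum(scores.values())  # proportional to P(evidence)
    if total == 0:
        raise ValueError("P(evidence) = 0: the problem is ill posed")
    return {name: s / total for name, s in scores.items()}

uniform = {name: Fraction(1, 3) for name in candidates}
print(posterior([(2, 4)], uniform))          # double and square both fit
print(posterior([(2, 4), (3, 6)], uniform))  # only double fits
```

With a single example $(2, 4)$, both `double` and `square` are consistent and split the posterior mass evenly; adding $(3, 6)$ concentrates all the mass on `double`.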

This formulation of the problem can be seen as a strict generalization of the original program synthesis formulation. For example, suppose that all the I/O examples were picked uniformly at random and the desired outputs were captured precisely; then we have that $P(evidence | f)$ is uniform among all examples that are consistent with the function and zero for all examples inconsistent with $f$. This means that any $f$ inconsistent with the evidence will have probability zero. Moreover, if the prior probability $P(f)$ is uniform over all programs in a restricted language, then what we have is that all programs inconsistent with the examples will have probability zero, and all programs consistent with them will be equally likely, so any program satisfying the examples will be a valid solution to the synthesis problem. So under this prior and this $P(evidence | f)$, the problem reduces to the standard synthesis problem we have been studying all along.

Now, in synthesis it has often been helpful to prioritize smaller programs because they are less likely to contain extraneous details not required to satisfy the examples. In fact, some of the methods we studied early in the course organize the search to find such short programs first. This desire for short programs can be framed in terms of a prior by replacing the uniform probability on $f$ with one that decreases monotonically with the length of the program. For the uniform $P(evidence | f)$ we have discussed so far, it actually does not matter how exactly $P(f)$ decreases with length, since we will still be looking for the smallest possible program that is consistent with the evidence.
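The claim that the exact rate of decrease does not matter can be checked concretely. In this sketch (the programs and token lengths are invented for illustration), the likelihood is noiseless, so inconsistent programs are filtered out first; among the survivors, any prior that decreases monotonically with length assigns its maximum to the shortest program, so two very different priors pick the same winner.

```python
# Illustrative programs as (name, length, function) triples; the names
# and lengths are made up for this sketch.
programs = [
    ("2 * x",          5, lambda x: 2 * x),
    ("x + x + x - x", 13, lambda x: 2 * x),  # consistent but longer
]

def best_program(evidence, prior_of_length):
    # Noiseless likelihood: discard anything inconsistent with the evidence.
    consistent = [(name, length) for name, length, f in programs
                  if all(f(i) == o for i, o in evidence)]
    # For any monotonically decreasing prior, the argmax of the prior
    # is the argmin of the length.
    return max(consistent, key=lambda p: prior_of_length(p[1]))

evidence = [(2, 4), (3, 6)]
# Two very different monotone priors select the same (shortest) program:
print(best_program(evidence, lambda n: 2.0 ** -n))   # steep, exponential
print(best_program(evidence, lambda n: 1.0 / (n + 1)))  # gentle, polynomial
```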

Synthesis Under Errors

Lecture20:Slide10; Lecture20:Slide11; Lecture20:Slide12; Lecture20:Slide13 The Bayesian approach can also be extended to deal with the case of learning from noisy data. In this case, the probability $P(evidence | f)$ needs to be updated to account for the fact that evidence inconsistent with the function no longer has probability zero. For example, an interesting case to consider is the case where off-by-one errors are possible in the data. The figure illustrates a possible distribution under such an assumption. An interesting observation is that the possibility of errors in the data introduces a necessary tradeoff between the probability of a function and the amount of error it generates. Recall that earlier, when we described the probabilities decreasing with length, we mentioned that as long as the probability decreased monotonically with length, the exact form of the distribution did not matter. This is no longer the case once we are in a setting where the probability of a function must be traded off against the probability of the evidence given the function. Specifically, a distribution that causes the probability of $f$ to drop very precipitously for larger programs will push the synthesizer to pick a very short program even at the expense of failing to match some of the examples exactly. On the other hand, a distribution that decreases very slowly may lead the synthesizer to prioritize matching the examples precisely.
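The trade-off can be made concrete by scoring each candidate with $\log P(f) + \sum_i \log P(out_i \mid f(in_i))$ and varying how steeply the prior falls with length. Everything below is illustrative: the noise model (exact match with probability 0.8, off by one with probability 0.1 each way) stands in for the off-by-one distribution in the figure, and the two candidate programs and their lengths are invented.

```python
import math

# Assumed noise model: the observed output matches the prediction with
# prob 0.8, is off by one with prob 0.1 each way; anything else gets a
# tiny probability so log-scores stay finite.
def log_likelihood(pred, observed):
    return math.log({0: 0.8, 1: 0.1}.get(abs(pred - observed), 1e-6))

# Hypothetical candidates: a short program that is off by one on some
# examples, and a longer one that matches every example exactly.
programs = [
    ("2 * x",            5, lambda x: 2 * x),
    ("2 * x + (x > 2)", 15, lambda x: 2 * x + (x > 2)),
]

def map_program(evidence, log_prior_of_length):
    def score(p):
        name, length, f = p
        return (log_prior_of_length(length)
                + sum(log_likelihood(f(i), o) for i, o in evidence))
    return max(programs, key=score)[0]

# Data generated by the longer program, so "2 * x" is off by one when x > 2.
evidence = [(1, 2), (2, 4), (3, 7), (4, 9)]

# A steep prior (halving per length unit) tolerates the off-by-one errors...
print(map_program(evidence, lambda n: -n * math.log(2)))
# ...while a gentle prior prefers the longer program that fits exactly.
print(map_program(evidence, lambda n: -n * 0.01))
```

The same evidence yields different winners under the two priors, which is exactly the sensitivity to the shape of $P(f)$ described above.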

Unsupervised Learning

Lecture20:Slide17; Lecture20:Slide18; Lecture20:Slide19 Another generalization of the simple programming-by-example synthesis problem is the unsupervised learning problem. Unsupervised learning has long been studied by the machine learning community, but it was recently studied as a synthesis problem by Ellis, Tenenbaum, and Solar-Lezama [EllisST15]. Unsupervised learning can be seen as a programming-by-example problem where the system is given only the outputs of the unknown function, so the goal is to find a function and a series of inputs $[in_i]_{i \lt N}$, one for each output $[out_i]_{i \lt N}$, such that $f(in_i)=out_i$.

Without probabilities, this problem is hopelessly underspecified, since one can always trivially choose $f(x)=x$ and $in_i = out_i$ as a solution that satisfies the constraints. However, framed as a probabilistic problem, the goal is to find $f$ and $[in_i]_{i \lt N}$ that maximize the probability of the function and the inputs given the observed outputs. Using Bayes rule, this probability can be decomposed into a product of simpler distributions.
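A minimal sketch of this decomposition, maximizing $\log P(f) + \sum_i \log P(in_i)$ subject to $f(in_i) = out_i$, shows why the trivial identity solution loses. All the specifics below are assumptions for illustration: the two candidate functions and their lengths, a geometric prior over nonnegative-integer inputs that favors small inputs, and a brute-force bounded inverse.

```python
import math

# Illustrative candidates as (name, length, function); the lengths are made up.
candidates = [
    ("x",     1, lambda x: x),       # the trivial identity solution
    ("x * x", 5, lambda x: x * x),
]

def log_input_prior(x):
    # Assumed geometric prior over nonnegative integers: small inputs are
    # more likely, so explanations that compress the outputs are rewarded.
    return -(x + 1) * math.log(2)

def invert(f, out, bound=100):
    # Brute-force bounded inverse: all inputs that map to the given output.
    return [x for x in range(bound) if f(x) == out]

def best_explanation(outputs, log_fn_prior):
    best, best_score = None, -math.inf
    for name, length, f in candidates:
        inputs = []
        for out in outputs:
            pre = invert(f, out)
            if not pre:
                break  # f cannot explain this output at all
            # Pick the most likely input among those mapping to out.
            inputs.append(max(pre, key=log_input_prior))
        else:
            score = log_fn_prior(length) + sum(map(log_input_prior, inputs))
            if score > best_score:
                best, best_score = (name, inputs), score
    return best

# The identity always "explains" the outputs, but squaring compresses them:
# inputs [2, 3, 4] are far more probable than inputs [4, 9, 16].
print(best_explanation([4, 9, 16], lambda n: -n * math.log(2)))
```

Even though the identity has the higher prior $P(f)$, the input prior makes $[2, 3, 4]$ much more probable than $[4, 9, 16]$, so the squaring explanation wins the joint objective.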