Lecture 6: Version Space Algebras from SMARTedit to FlashFill.

Up to this point, we have been exploring a series of enumerative search techniques; what made these techniques enumerative was that we were explicitly constructing ASTs one-by-one in the process of exploring the space. For the rest of this unit, we will be switching gears to a different class of search techniques based on symbolic representations of program spaces. The key idea behind these techniques is that instead of enumerating ASTs one-by-one we have some data-structures (or symbols) that concisely represent entire sets of programs. By manipulating these symbols, we can efficiently eliminate large sets of possible programs.

[Figure: Lecture 6, slides 3-6]

It is worth emphasizing that the distinction between enumerative and symbolic search techniques is not as crisp as it may appear at first sight. In fact, while we have been focusing so far on enumerative techniques, the reader may have noted that some of the techniques we have covered already contain some symbolic aspects. The basic bottom-up search and the stochastic search strategies were very clearly enumerative. Both techniques construct and evaluate complete ASTs one at a time before accepting or rejecting them. However, the hierarchical bottom-up techniques and the top-down search do have some symbolic aspects to them. In the hierarchical search we described in Lecture 3, we were performing enumeration over a language of queries with holes, where each query with a hole actually represented an entire space of queries. Similarly, in the top-down search, each query with an unexpanded non-terminal actually represented the whole set of possible expansions.

In this lecture, we will be covering a class of techniques that also sit close to the boundary, but which rely heavily on the ability to compactly represent very large sets of programs and to manipulate those representations in order to get a concrete program.

Version Spaces

[Figure: Lecture 6, slide 9]

We start the lecture with a discussion of the Version Space approach to inductive synthesis as described by Lau, Wolfman, Domingos, and Weld in their paper on programming by demonstration [Lau2003]. In this paper, the space of programs is called a Hypothesis Space ($H$), and given a dataset of inputs and outputs $D=\{(in_i, out_i)\}$, a Version Space $VS_{H,D}$ is the subset of $H$ corresponding to programs that satisfy the examples in the given dataset.

A Version Space Algebra (VSA) defines a set of operations that allow us to manipulate and compose version spaces. What really characterizes the VSA-based approaches is the use of compact symbolic representations for the version spaces, which allow us to manipulate them efficiently.

Lattice-based version spaces

One important class of representations is the class of lattice-based representations. At a high level, a lattice is a partially ordered set where we can define a unique least upper bound (lub) and a unique greatest lower bound (glb) for any pair of elements $x$ and $y$, where the least upper bound is the smallest element that is greater than or equal to both $x$ and $y$, and the greatest lower bound is the largest element that is less than or equal to both $x$ and $y$. For example, the integers form a lattice, where $lub(x,y) = max(x,y)$ and $glb(x,y)=min(x,y)$. Another example of a lattice is the set of all predicates over some set of variables $p(x)$. We can define a partial order over this set by saying that $p \leq q \mbox{ iff } p \Rightarrow q$. Then, given two predicates $p$ and $q$, we can see that $lub(p, q) = p \vee q$ and $glb(p,q) = p \wedge q$.

Now, one common way of defining a range of integers is by its endpoints, so we can use the notation $[a, b]$ to represent the set of integers $\{x | a \leq x \leq b\}$, assuming that $a \leq b$. It turns out the same idea can apply to other lattices; given two elements $p, q \in H$ belonging to some lattice $H$, if $p \leq q$, then we can define the set $[p, q] = \{x \in H | p \leq x \mbox{ and } x \leq q \} $. In the version space literature, it is said that a version space is Boundary Set Representable if the hypothesis space is a lattice and the version space can be represented as a range in terms of its two endpoints $[S, G]$. The letter $S$ is used to refer to the most specific hypothesis, and the letter $G$ to the most general. This idea of using endpoints in a lattice to represent a set of concepts goes back to Mitchell's seminal paper on version spaces in 1982 [Mitchell82].

Example: An example from Lau et al. is a class of functions FindSuffix(T). This class of functions is parameterized by a string $T$, and is used in the text editing domain to move the cursor right before the next occurrence of the string $T$. Now, consider the following piece of text from Churchill's famous speech.

We shall go on to the end. We ↓1shall fight in France, we ↓2shall fight on the seas and oceans, we ↓3shall fight with growing confidence and growing strength in the air.

Suppose we move the cursor from the position marked as 1, right before the word shall, to position 2. The movement can be implemented as an application of FindSuffix(T) for many different strings T, including T="s", T="shall", T="shall fight on the seas and oceans" and more.

The important thing to note is that we can define a lattice over the strings $T$ based on the following operations. \[ \begin{array}{ll} T_1 \leq T_2 & \mbox{iff } T_1 \mbox{ prefix } T_2 \\ glb(T_1, T_2) =& \mbox{longest common prefix of }T_1 \mbox{ and } T_2\\ lub(T_1, T_2) = & \mbox{shortest string that has } T_1 \mbox{ and } T_2 \mbox{ as prefix} \end{array} \] The reader may notice that the structure above is not quite a lattice, because the $lub$ operation is not well defined for all inputs. For example, $lub($"hello","world"$)$ is not well defined because there is no string that has both words as a prefix. A common trick to address this problem for lattices is to define a special value $\top$ (pronounced "top") that is defined to be greater than all other values. In this case, $\top$ would be a special string with the property that $\forall T. T \mbox{ prefix } \top$. Then it is clear that $lub($"hello","world"$)=\top$.

With this lattice, we can see that the space of functions FindSuffix(T) is boundary set representable. The set of functions consistent with moving the cursor from position 1 to position 2 is concisely represented by the range ["s", "shall fight on the seas and oceans...in the air."].

Now, suppose we then observe the cursor moving from position 2 to position 3. This second movement is also consistent with a set of FindSuffix functions which can be represented with the range ["sh", "shall fight with growing confidence and growing strength in the air."]. In this case, note that the set does not include "s", because the function FindSuffix("s") would have stopped right before "seas" instead of jumping all the way to position 3, so it is not consistent with the example.

Now we mentioned before that one important aspect of symbolic representations is the ability to manipulate them efficiently. In general, for boundary set representable version spaces, we can easily compute intersections as shown below: \[ [a_l, a_h] \cap [b_l, b_h] = [lub(a_l, b_l), glb(a_h, b_h)] \] So in this case, if we want the set of programs that are consistent with both demonstrations, we can represent that set concisely as ["sh", "shall fight "], since "sh" is the shortest string that has both "s" and "sh" as a prefix, and "shall fight " is the longest common prefix of the two upper bounds.
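The running example can be replayed with a minimal Python sketch. Several things here are assumptions made for illustration rather than details from Lau et al.: the helper names (`leq`, `lub`, `glb`, `intersect`, `find_suffix`, `learn`), the use of `None` to stand for $\top$, and the convention that FindSuffix searches strictly after the current cursor position.

```python
from os.path import commonprefix

TOP = None  # artificial top element: every string is treated as a prefix of TOP

def leq(t1, t2):
    """t1 <= t2 iff t1 is a prefix of t2 (TOP is above everything)."""
    return t2 is TOP or (t1 is not TOP and t2.startswith(t1))

def lub(t1, t2):
    """Shortest string having both arguments as prefixes, or TOP if none exists."""
    if leq(t1, t2):
        return t2
    if leq(t2, t1):
        return t1
    return TOP

def glb(t1, t2):
    """Longest common prefix of the two arguments."""
    if t1 is TOP:
        return t2
    if t2 is TOP:
        return t1
    return commonprefix([t1, t2])

def intersect(a, b):
    """[a_l, a_h] intersected with [b_l, b_h]; None encodes the empty version space."""
    lo, hi = lub(a[0], b[0]), glb(a[1], b[1])
    return (lo, hi) if leq(lo, hi) else None

def find_suffix(text, cursor, t):
    """Assumed semantics: index right before the next occurrence of t
    that starts strictly after the cursor (-1 if there is none)."""
    return text.find(t, cursor + 1)

def learn(text, start, end):
    """Boundary-set range [S, G] of strings T with FindSuffix(T): start -> end."""
    target = text[end:]
    for n in range(1, len(target) + 1):
        if find_suffix(text, start, target[:n]) == end:
            return (target[:n], target)
    return None

TEXT = ("We shall go on to the end. We shall fight in France, "
        "we shall fight on the seas and oceans, we shall fight with growing "
        "confidence and growing strength in the air.")
p1 = TEXT.find("shall fight in")    # position 1
p2 = TEXT.find("shall fight on")    # position 2
p3 = TEXT.find("shall fight with")  # position 3

vs12 = learn(TEXT, p1, p2)   # lower bound "s"
vs23 = learn(TEXT, p2, p3)   # lower bound "sh"
both = intersect(vs12, vs23)
```

Under these assumptions `both` is the pair of endpoints for the range of strings consistent with both demonstrations, and no individual program is ever enumerated.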

Version Space Algebra

Lattice-based version spaces are great if you have them, but can also be very restrictive. The real power of the Version Space Algebra approach is the ability to symbolically represent complex compositions of simpler version spaces. For example, two powerful operations for combining version spaces are Union and Join explained below.

Union. The first form of composition is union $VS_{H_1, D} \cup VS_{H_2, D}$. As an example, consider the version space for FindSuffix presented earlier. There is an analogous FindPrefix(T) that finds the position in the text after the next occurrence of the string $T$. FindPrefix also has a lattice associated with it, except it is defined in terms of suffixes instead of prefixes. Now, we saw earlier that given two examples, one moving the cursor from position 1 to 2, and one from 2 to 3, you could get the space FindSuffix(["sh", "shall fight "]). By a similar process, we can see that the two examples are consistent with the space FindPrefix(["we ", ", we "]). So the union of these two version spaces can be represented simply as

FindSuffix(["sh", "shall fight "]) ∪ FindPrefix(["we ", ", we "])

If we added an additional example of jumping from the beginning of the text to position 1, the added example would have a version space of the form

FindSuffix(["shall f", "shall fight in France...air"]) ∪ FindPrefix(["We ", "We shall go on to the end. We "])

When we take the intersection of the two version spaces, we can take the intersection independently for the two kinds of version spaces and we get

FindSuffix(["shall f", "shall fight "]) ∪ ∅

We now have an empty set for FindPrefix, because there is no string that has both "we " and "We " as suffixes, so the lub of the two lower bounds is $\top$ and the resulting range is empty.
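A union of boundary-set version spaces can be intersected one component at a time. The sketch below is only an illustration: the dictionary encoding, the helper names, and the use of `None` for both $\top$ and the empty set are assumptions, the trailing spaces in the FindPrefix strings reflect that the cursor sits right after the space, and the long upper bound for FindSuffix is abbreviated because only its common prefix matters.

```python
from os.path import commonprefix

TOP = None  # artificial top element of either lattice

def prefix_leq(a, b):
    return b is TOP or (a is not TOP and b.startswith(a))

def suffix_leq(a, b):
    return b is TOP or (a is not TOP and b.endswith(a))

def common_prefix(a, b):
    return commonprefix([a, b])

def common_suffix(a, b):
    # Longest common suffix, computed as the common prefix of the reversed strings.
    return commonprefix([a[::-1], b[::-1]])[::-1]

def intersect_range(r1, r2, leq, meet):
    """Intersect two ranges [lo, hi]; None encodes the empty version space."""
    if r1 is None or r2 is None:
        return None
    (lo1, hi1), (lo2, hi2) = r1, r2
    lo = lo2 if leq(lo1, lo2) else lo1 if leq(lo2, lo1) else TOP  # lub
    hi = meet(hi1, hi2)                                           # glb
    return (lo, hi) if leq(lo, hi) else None

# Each family of programs keeps its own lattice operations.
LATTICES = {
    "FindSuffix": (prefix_leq, common_prefix),
    "FindPrefix": (suffix_leq, common_suffix),
}

def intersect_union(vs1, vs2):
    """Intersect two unions family by family."""
    return {name: intersect_range(vs1[name], vs2[name], *LATTICES[name])
            for name in LATTICES}

# Version space for the jumps 1 -> 2 and 2 -> 3.
jumps = {"FindSuffix": ("sh", "shall fight "),
         "FindPrefix": ("we ", ", we ")}
# Version space for the jump from the start of the text to position 1
# (the FindSuffix upper bound is abbreviated).
first = {"FindSuffix": ("shall f", "shall fight in France"),
         "FindPrefix": ("We ", "We shall go on to the end. We ")}

result = intersect_union(jumps, first)
```

The FindPrefix component comes out empty because neither of "we " and "We " is a suffix of the other, so their least upper bound is `TOP`.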

Join. This is a form of relational join for version spaces defined as follows:

\[ \begin{array}{c} VS_{H_1, D_1} \bowtie VS_{H_2, D_2} = \{ (h_1, h_2) | h_1 \in VS_{H_1, D_1}, h_2 \in VS_{H_2, D_2}, C((h_1, h_2), D) \}\\ \mbox{where} \\ D=\{(d_1^i, d_2^i)\} \mbox{ given } D_1 = \{d_1^i\}\mbox{ and } D_2=\{d_2^i\}. \end{array} \]

The function $C$ in the definition is meant to stand for a consistency check that can be used to select which pairs in the cross product can actually be combined together.

In the case of programming by demonstration (PBD), Join can be used to describe sequences of operations, especially when the demonstration allows us to separately provide evidence for the different actions in the sequence. Continuing with our running example, suppose that each of the jumps of the cursor, from position 1 to position 2 and then to position 3, were treated as part of a sequence of actions, rather than as independent examples corresponding to the same program. In that case, we could model this as a three-way join of actions, and maintain a separate version space for each of the actions being joined.
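The definition of Join can be read directly as a filtered cross product. The sketch below enumerates small version spaces explicitly, which a real VSA implementation would avoid by keeping the components in factored form; the function name `join`, the candidate lists, and the "same search string" check are all hypothetical choices made for illustration.

```python
from itertools import product

def join(vs1, vs2, consistent):
    """All pairs (h1, h2) from the two version spaces that pass the check C."""
    return [(h1, h2) for h1, h2 in product(vs1, vs2) if consistent(h1, h2)]

# Illustrative subsets of the FindSuffix search strings for each jump.
step1 = ["s", "sh", "shall"]               # consistent with the jump 1 -> 2
step2 = ["sh", "shall", "shall fight w"]   # consistent with the jump 2 -> 3

# Independent join: the check C accepts every pair, so the result is the
# full cross product and can stay factored as (step1, step2).
independent = join(step1, step2, lambda h1, h2: True)

# A hypothetical dependent check: both steps must use the same search string.
same_string = join(step1, step2, lambda h1, h2: h1 == h2)
```

When the check accepts every pair, the joined space never needs to be materialized, which is where the compactness of the representation comes from.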

Overall, the key idea in the version space algebra framework is that we have symbolic representations of sets of programs. The representations may look like ASTs, but they are not ASTs representing programs. Instead, they are ASTs representing the algebraic operations over sets of programs that led to the current set.

Version Spaces meet E-Graphs in FlashFill

The general framework of version space algebras is ultimately only as powerful as the underlying representations used for the individual version spaces. The work of Lau and her collaborators built a complex space of text manipulation functions from lattice-based representations and compositions thereof, but ultimately they were not very successful in providing sufficiently reliable automation. Around 2010, though, Sumit Gulwani and his collaborators revisited some of these ideas armed with more powerful program representations. These enabled a careful language design aimed at a narrower class of text manipulations, and led to the FlashFill system, which was eventually incorporated into Excel [gulwani:2011:flashfill, SinghG12].

A language for text manipulation.

The first key ingredient in the FlashFill system was the design of the language. The language is described in terms of two levels of abstraction. The first level consists of Trace Expressions defined according to the grammar below:

\[ \newcommand{\Loop}{\mbox{Loop}} \newcommand{\SubStr}{\mbox{SubStr}} \newcommand{\Pos}{\mbox{Pos}} \newcommand{\Cat}{\mbox{Concat}} \begin{array}{rcl} \mbox{Trace expression} & e :=& \Cat(f_1, \ldots, f_n) ~ | ~ f \\ \mbox{Atomic expression} & f :=& \mbox{ConstStr}(s) ~~~ \mbox{String constant } s \\ ~ & ~ & |~ \SubStr(v_i, p_1, p_2)~~~ \mbox{Produces a substring of } v_i \mbox{ that lies between the positions indicated by } p_1 \mbox{ and } p_2 \\ ~ & ~ & |~ \Loop(\lambda w. e) ~~~ \mbox{Looping construct where } e \mbox{ is the loop body and } w \mbox{ the loop index}.\\ \mbox{Position} & p := & \mbox{CPos}(k) ~~~\mbox{Constant position } k\\ ~ & ~ & |~ \Pos(r_1, r_2, c) ~~~\mbox{The } c^{th} \mbox{ position that is immediately preceded by a match of } r_1 \mbox{ and immediately followed by a match of } r_2\\ \mbox{Integer expr} & c := & k ~|~ k_1 * w + k_2 ~~~~\mbox{where } w \mbox{ is a loop index} \\ \mbox{Regular expr} & r := & T ~|~ \mbox{TokenSeq}(T_1, \ldots, T_k) ~|~ \epsilon \end{array} \]

The language above is very expressive. For example, in our running example, suppose we want to collect all phrases that appear between a "shall" and a punctuation mark, including the word shall. We could write such an expression as

\[ \Loop(\lambda w. \SubStr(in, \Pos(``", ``shall", w), \Pos(``",PunctuationTok, w) ) ) \]

The Loop construct iterates the expression until one of the terms becomes invalid. The expression itself finds the $w^{th}$ substring that starts with "shall" and ends right before a punctuation token. If we wanted to separate those with commas, we could use the program below instead:

\[ \Loop(\lambda w. \Cat(\SubStr(in, \Pos(``", ``shall", w), \Pos(``",PunctuationTok, w) ), ``, " ) ) \]
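To make the semantics of these two programs concrete, here is a small Python sketch of an evaluator. It rests on simplifying assumptions that are not taken from the FlashFill paper: tokens are written directly as Python regular expressions, `PUNCT` is an assumed definition of PunctuationTok, a failed sub-expression is encoded as `None`, the loop index starts at 1, and the `limit` guard on the loop is added only to keep the sketch from running forever.

```python
import re

PUNCT = r"[.,;:!?]"  # assumed definition of PunctuationTok

TEXT = ("We shall go on to the end. We shall fight in France, "
        "we shall fight on the seas and oceans, we shall fight with growing "
        "confidence and growing strength in the air.")

def pos(s, r1, r2, c):
    """c-th index t (counting from 1) such that r1 matches right before t
    and r2 matches right after t; None if there is no such index."""
    left = re.compile("(?:" + r1 + r")\Z")
    right = re.compile(r2)
    hits = [t for t in range(len(s) + 1)
            if left.search(s[:t]) and right.match(s, t)]
    return hits[c - 1] if 1 <= c <= len(hits) else None

def substr(s, p1, p2):
    """Substring between two positions; None if either position is invalid."""
    if p1 is None or p2 is None or p1 > p2:
        return None
    return s[p1:p2]

def concat(*parts):
    """Concatenation that fails as soon as any part fails."""
    return None if any(p is None for p in parts) else "".join(parts)

def loop(body, limit=1000):
    """Concatenate body(1), body(2), ... until the body becomes invalid."""
    pieces = []
    for w in range(1, limit + 1):
        piece = body(w)
        if piece is None:
            break
        pieces.append(piece)
    return "".join(pieces)

# Loop(\w. SubStr(in, Pos("", "shall", w), Pos("", PunctuationTok, w)))
plain = loop(lambda w: substr(TEXT,
                              pos(TEXT, "", "shall", w),
                              pos(TEXT, "", PUNCT, w)))

# Same body wrapped in Concat(..., ", ")
listed = loop(lambda w: concat(substr(TEXT,
                                      pos(TEXT, "", "shall", w),
                                      pos(TEXT, "", PUNCT, w)),
                               ", "))
```

On the Churchill text the loop body is valid for $w = 1, \ldots, 4$ and becomes invalid at $w = 5$, when there is no fifth occurrence of "shall".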

On top of these trace expressions, FlashFill allows some control structure.

\[ \newcommand{\Switch}{\mbox{Switch}} \newcommand{\Match}{\mbox{Match}} \begin{array}{rcl} \mbox{String program} & P :=& \Switch((b_1, e_1), \ldots, (b_n, e_n)) ~ | ~ e \\ \mbox{Boolean condition} & b :=& d_1 \vee \ldots \vee d_n \\ \mbox{Conjunction} & d := & \pi_1 \wedge \ldots \wedge \pi_n\\ \mbox{Predicate} & \pi := & \Match(v_i, r, k) ~|~ \neg \Match(v_i, r, k) \\ \end{array} \]

The $\Switch$ expression chooses the first expression $e_i$ whose corresponding condition $b_i$ evaluates to true. The $\Match$ predicate is true if there are $k$ or more occurrences of pattern $r$ in string $v_i$.
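The control structure can be sketched in a few lines of Python. This is an illustration under stated assumptions, not FlashFill's implementation: there is a single input string instead of a tuple of inputs $v_i$, patterns are plain Python regular expressions, a condition is encoded as a list of conjunctions of `(positive, r, k)` literals, and the phone-number branches are hypothetical.

```python
import re

def match(v, r, k):
    """True if pattern r occurs at least k times (non-overlapping) in v."""
    return len(re.findall(r, v)) >= k

def holds(b, v):
    """Evaluate a condition in disjunctive normal form.
    b is a list of conjunctions; each literal is (positive, r, k),
    where positive=False stands for a negated Match."""
    return any(all(match(v, r, k) == positive for positive, r, k in d)
               for d in b)

def switch(branches, v):
    """Run the first expression whose condition holds; None if none does."""
    for b, e in branches:
        if holds(b, v):
            return e(v)
    return None

# Hypothetical program: numbers with two dashes already carry an area code,
# the others get a constant area code prepended.
branches = [
    ([[(True, "-", 2)]], lambda v: v),
    ([[(False, "-", 2)]], lambda v: "425-" + v),
]

with_code = switch(branches, "206-555-0123")
without_code = switch(branches, "555-0123")
```

Here each branch expression is an ordinary Python function standing in for a trace expression $e_i$.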

[Figure: Lecture 6, slide 33]

The key idea in the approach is similar to the STUN approach we studied for bottom-up explicit search: the goal is first to compute functions that work for some of the inputs, and then to discover the conditional structure that combines programs that work for different subsets into a program that works for the whole input set. For the example in the figure, this involves independently discovering the programs that work for the cases with and without area code, and then discovering the condition that switches between the two. Note that the grammar is explicitly designed to support this approach and therefore only allows branches at the top level.

Beyond the choice of language, however, the major innovation in FlashFill is in the representation of the version spaces for the trace expressions. By moving beyond lattice-based Boundary Set representations, FlashFill was able to efficiently support the language described above. Before we dive into the details of the representation, though, we introduce an idea which is important for that representation, the idea of graph-based representations of sets of programs.

Graph representations for program spaces

[Figure: Lecture 6, slides 35-41]

One of the key ingredients of FlashFill was the idea of using graphs to represent exponential spaces of programs. This idea traces its roots to a set of representations called E-graphs, first introduced by Downey, Sethi and Tarjan [Downey:1980] and popularized by Nelson through applications ranging from theorem proving [DetlefsNS05] to superoptimization [Joshi:2002]. The key idea in an E-graph is to represent an exponential set of possible programs with a compact graph, with different paths through the graph corresponding to different programs. For example, the animation in the slides shows the construction of an E-graph that represents an exponential set of programs that are equivalent to the expression $(x+2*y)+4*y$. The representation starts with an AST for the original expression, but is augmented by special edges that link nodes to equivalent nodes; for example, the node $2*y$ is equivalent to the node $y+y$. Every combination of choices from the different equivalence classes corresponds to a distinct expression in the set. Just as with E-graphs, the key data structure in FlashFill is a graph that concisely represents an exponential set of possible expressions.

Representing Trace Expressions

[Figure: Lecture 6, slides 45-50]

To understand the core idea in the representation, consider the example in the figure. The first key insight is that for every possible Trace Expression that could have generated that string, there is a partition of the string, where each substring in the partition was generated by its own atomic expression. For example, in one partition, each character was generated by an atomic expression $\gamma_i$. But in a different partition, maybe the entire string Rob was generated by an atomic expression $\gamma_{50}$. Just like in the E-graph example, we can add edges to the graph corresponding to possible atomic expressions that could have generated different substrings. And just like in the E-graph case, every edge we add leads to an exponential growth in the programs we are representing, because in this case, every path through this graph corresponds to a different program that could have generated the string.
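The following Python sketch builds such a graph for a single example and counts the programs it represents without enumerating them. It is deliberately simplified, and the simplifications are assumptions rather than FlashFill's design: the input string "Robert" is hypothetical, the only atomic expressions are ConstStr and SubStr with constant positions, and the names `build_dag` and `count_programs` are made up for this illustration.

```python
def occurrences(s, sub):
    """Start indices of all (possibly overlapping) occurrences of sub in s."""
    return [i for i in range(len(s) - len(sub) + 1) if s.startswith(sub, i)]

def build_dag(inp, out):
    """Nodes are the positions 0..len(out) in the output string. The edge
    (i, j) is labeled with the set of atomic expressions producing out[i:j]."""
    edges = {}
    for i in range(len(out)):
        for j in range(i + 1, len(out) + 1):
            piece = out[i:j]
            exprs = {("ConstStr", piece)}
            exprs |= {("SubStr", a, a + len(piece))
                      for a in occurrences(inp, piece)}
            edges[(i, j)] = exprs
    return edges

def count_programs(edges, n):
    """Number of trace expressions represented: every path from node 0 to
    node n contributes the product of the sizes of its edge labels."""
    ways = [1] + [0] * n
    for j in range(1, n + 1):
        ways[j] = sum(ways[i] * len(edges[(i, j)]) for i in range(j))
    return ways[n]

dag = build_dag("Robert", "Rob")
total = count_programs(dag, len("Rob"))
```

In this tiny instance six labeled edges stand for eighteen distinct programs, and the gap widens quickly as the output string grows.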

This leaves us with two big challenges: how to efficiently represent the programs $\gamma_i$ corresponding to each edge, and how to perform intersection between two such graphs efficiently.

Learning atomic expressions

[Figure: Lecture 6, slide 52]

The problem of learning atomic expressions is simpler than it looks thanks to the structure of the language. The first thing to note is that there are only three possibilities for an atomic expression: it may be a constant, it may be a substring, or it may be a loop expression. Constants are trivial to deal with; the more interesting ones are substring expressions. The key observation is that the two position expressions in a $\SubStr$ expression are independent of each other. For example, in the example from the figure, any pattern that matches at position 1 can be combined with any pattern that matches at position 7. The patterns themselves can actually be enumerated exhaustively given the limited scope of the pattern language. Identifying loops simply boils down to identifying repeating patterns in the $\gamma_i$ that can be merged into a common expression using anti-unification.
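The independence of the two positions means that the set of SubStr expressions for one substring can be stored as a pair of sets instead of a set of pairs. The sketch below illustrates this with a tiny, assumed token vocabulary (`TOKENS`), a hypothetical input string, and a simplified reading of Pos in which a position matches when the left pattern ends there and the right pattern starts there.

```python
import re
from itertools import product

# Assumed token vocabulary; "eps" is the empty pattern.
TOKENS = {"eps": "", "Upper": "[A-Z]+", "Lower": "[a-z]+", "Space": " +"}

def hits(s, r1, r2):
    """All indices t where r1 matches right before t and r2 right after t."""
    left = re.compile("(?:" + r1 + r")\Z")
    right = re.compile(r2)
    return [t for t in range(len(s) + 1)
            if left.search(s[:t]) and right.match(s, t)]

def learn_pos(s, t):
    """Exhaustively enumerate position expressions that evaluate to t on s."""
    exprs = [("CPos", t)]
    for (n1, r1), (n2, r2) in product(TOKENS.items(), repeat=2):
        found = hits(s, r1, r2)
        if t in found:
            exprs.append(("Pos", n1, n2, found.index(t) + 1))
    return exprs

def eval_pos(s, e):
    """Evaluate a position expression produced by learn_pos."""
    if e[0] == "CPos":
        return e[1]
    _, n1, n2, c = e
    return hits(s, TOKENS[n1], TOKENS[n2])[c - 1]

def learn_substr(s, a, b):
    """Factored version space for SubStr(s, p1, p2) == s[a:b]: any left
    position expression can be paired with any right one."""
    return learn_pos(s, a), learn_pos(s, b)

S = "Rob Smith"               # hypothetical input
lefts, rights = learn_substr(S, 0, 3)
represented = len(lefts) * len(rights)   # programs covered by the pair of sets
```

The pair `(lefts, rights)` stores `len(lefts) + len(rights)` position expressions while standing for their full product of SubStr programs.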

Intersection

Given two graphs, each representing a set of trace expressions, it is relatively easy to compute a new graph that represents their intersection. At a high level, given two graphs $G_1 = (N_1, E_1)$ and $G_2 = (N_2, E_2)$ where $N_i$ are the sets of nodes and $E_i = \{(s^t_i, d^t_i, \gamma^t_i)\}$ are sets of edges from a source node $s^t_i$ to a destination node $d^t_i$ labeled with a set of atomic expressions $\gamma^t_i$, we can compute a graph for the intersection as follows: \[ \begin{array} {rcl} G_1 \cap G_2 &=& (N, E) \mbox{ where} \\ N &=& N_1 \times N_2\\ E &=& \{ ((s^t_1, s^v_2) , (d^t_1, d^v_2), \gamma^t_1 \cap \gamma^v_2 ) | (s^t_1, d^t_1, \gamma^t_1) \in E_1 \wedge (s^v_2, d^v_2, \gamma^v_2) \in E_2 \} \end{array} \] In other words, the graph representing the intersection will have a node for each pair of nodes from the two graphs, and for every pair of edges, there will be an edge labeled with the intersection of their label sets. An edge whose intersected label set is empty represents no programs and can be dropped.
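The product construction can be sketched in Python as follows. As before, this is a simplified illustration and not FlashFill's implementation: the two input-output examples are hypothetical, the only atomic expressions are ConstStr and SubStr with constant positions, label sets are plain Python sets so that their intersection is the built-in `&`, and edges with an empty label are dropped.

```python
from collections import defaultdict
from itertools import product

def occurrences(s, sub):
    return [i for i in range(len(s) - len(sub) + 1) if s.startswith(sub, i)]

def build_dag(inp, out):
    """Edge (i, j) is labeled with the atomic expressions producing out[i:j]."""
    edges = {}
    for i in range(len(out)):
        for j in range(i + 1, len(out) + 1):
            piece = out[i:j]
            edges[(i, j)] = {("ConstStr", piece)} | {
                ("SubStr", a, a + len(piece)) for a in occurrences(inp, piece)}
    return edges

def intersect(e1, e2):
    """Product graph: nodes are pairs of nodes, and each pair of edges yields
    an edge labeled with the intersection of the two label sets."""
    out = {}
    for ((s1, d1), x1), ((s2, d2), x2) in product(e1.items(), e2.items()):
        common = x1 & x2
        if common:                      # drop edges that represent no program
            out[((s1, s2), (d1, d2))] = common
    return out

def count_programs(edges, start, goal):
    """Count programs along all paths from start to goal. Every edge strictly
    increases both coordinates, so sorted order is a valid topological order."""
    ways = defaultdict(int)
    ways[start] = 1
    for src, dst in sorted(edges):
        ways[dst] += ways[src] * len(edges[(src, dst)])
    return ways[goal]

g1 = build_dag("Robert", "Rob")   # hypothetical example 1
g2 = build_dag("Alice", "Ali")    # hypothetical example 2
both = intersect(g1, g2)
total = count_programs(both, (0, 0), (3, 3))
```

The constant-string labels disappear in the intersection because the two outputs differ, and only the substring expressions shared by both examples survive.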

[Figure: Lecture 6, slides 53-56]

The figure illustrates this intersection process. In the figure we can see how select pairs of edges from the two original graphs get intersected. For example, there is an edge in one graph corresponding to the whole output being a single constant string, but its intersection with an edge from the other graph that also involves a constant string leads to an empty set.

On the other hand, an edge that includes a substring running from before the first word to after the first character, when intersected with a similar edge from another example, leads to an edge with the expression SubStr(in, Pos("", Word, 1), Pos(Char, "", 1)). The indices in the Pos expression are omitted in the figure for clarity.

This basic approach has been extended to a number of different domains. More recently, Alex Polozov and Sumit Gulwani have developed a framework called PROSE (also called FlashMeta in the literature) that makes it possible to easily build these kinds of representations for other domains [PolozovG15].

Introduction to Program Synthesis