Lecture 6: Version Space Algebras from SMARTedit to FlashFill.
Up to this point, we have been exploring a series of enumerative search techniques; what made these techniques enumerative was that we were explicitly constructing ASTs one-by-one in the process of exploring the space. For the rest of this unit, we will be switching gears to a different class of search techniques based on symbolic representations of program spaces. The key idea behind these techniques is that instead of enumerating ASTs one-by-one we have some data structures (or symbols) that concisely represent entire sets of programs. By manipulating these symbols, we can efficiently eliminate large sets of possible programs.
Version Spaces
Lattice-based version spaces
One important class of representations is lattice-based representations. At a high level, a lattice is a partially ordered set in which every pair of elements $x$ and $y$ has a unique least upper bound (lub) and a unique greatest lower bound (glb). The least upper bound is the smallest element that is greater than or equal to both $x$ and $y$, and the greatest lower bound is the largest element that is less than or equal to both $x$ and $y$. For example, the integers form a lattice, where $lub(x,y) = max(x,y)$ and $glb(x,y)=min(x,y)$. Another example of a lattice is the set of all predicates $p(x)$ over some set of variables. We can define a partial order over this set by saying that $p \leq q \mbox{ iff } p \Rightarrow q$. Then, given two predicates $p$ and $q$, we can see that $lub(p, q) = p \vee q$ and $glb(p,q) = p \wedge q$.
Now, one common way of defining a range of integers is by its endpoints, so we can use the notation $[a, b]$ to represent the set of integers $\{x | a \leq x \leq b\}$, assuming that $a \leq b$. It turns out the same idea can apply to other lattices; given two elements $p, q \in H$ belonging to some lattice $H$, if $p \leq q$, then we can define the set $[p, q] = \{x \in H | p \leq x \mbox{ and } x \leq q \} $. In the version space literature, a version space is said to be Boundary Set Representable if the hypothesis space is a lattice and the version space can be represented as a range in terms of its two endpoints $[S, G]$. The letter $S$ is used to refer to the most specific hypothesis, and the letter $G$ to the most general. This idea of using endpoints in a lattice to represent a set of concepts goes back to Mitchell's seminal 1982 paper on version spaces [Mitchell82].
Example: An example from Lau et al. is a class of functions FindSuffix(T). This class of functions is parameterized by a string $T$, and is used in the text editing domain to move the cursor right before the next occurrence of the string $T$.
Now, consider the following piece of text from Churchill's famous speech.
We shall go on to the end. We ↓1shall fight in France, we ↓2shall fight on the seas and oceans, we ↓3shall fight with growing confidence and growing strength in the air.
The movement of the cursor from position 1 to position 2 is consistent with FindSuffix(T) for many different strings T, including T="s", T="shall", T="shall fight on the seas and oceans", and more.
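The consistency claim above can be checked mechanically. The following is a minimal sketch, assuming that "next occurrence" means the first occurrence starting strictly after the cursor; the names `find_suffix`, `consistent`, and `TEXT` are illustrative and do not come from Lau et al.

```python
from typing import Optional

# The passage from the text, without the cursor markers.
TEXT = ("We shall go on to the end. We shall fight in France, "
        "we shall fight on the seas and oceans, "
        "we shall fight with growing confidence and growing strength in the air.")

def find_suffix(text: str, cursor: int, t: str) -> Optional[int]:
    """Position right before the next occurrence of t after the cursor."""
    # Assumption: an occurrence starting exactly at the cursor does not count.
    hit = text.find(t, cursor + 1)
    return None if hit == -1 else hit

def consistent(t: str, start: int, end: int) -> bool:
    """True if FindSuffix(t) moves the cursor from start to end."""
    return find_suffix(TEXT, start, t) == end

# Cursor positions 1, 2, 3 sit before the 2nd, 3rd and 4th "shall".
shall_starts = [i for i in range(len(TEXT)) if TEXT.startswith("shall", i)]
_, pos1, pos2, pos3 = shall_starts

for t in ["s", "shall", "shall fight on the seas and oceans", "we"]:
    print(repr(t), consistent(t, pos1, pos2))
```

Under these assumptions the first three strings are reported as consistent with the move from position 1 to position 2, while "we" is not, since it would stop three characters too early.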
The important thing to note is that we can define a lattice over the strings $T$ based on the following operations.
\[
\begin{array}{rcl}
T_1 \leq T_2 & \mbox{iff} & T_1 \mbox{ is a prefix of } T_2 \\
glb(T_1, T_2) & = & \mbox{longest common prefix of }T_1 \mbox{ and } T_2\\
lub(T_1, T_2) & = & \mbox{shortest string that has both } T_1 \mbox{ and } T_2 \mbox{ as prefixes}
\end{array}
\]
The reader may notice that the structure above is not quite a lattice, because the $lub$ operation
is not well defined for all inputs. For example, $lub($"hello","world"$)$ is not well
defined because there is no string that has both words as a prefix.
A common trick to address this problem for lattices is to define a special value $\top$ (pronounced "top")
that is defined to be greater than all other values. In this case, $\top$ would
be a special string with the property that $\forall T. T \mbox{ prefix } \top$.
Then it is clear that $lub($"hello","world"$)=\top$.
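To make the prefix lattice concrete, here is a small sketch of the three operations with an explicit $\top$ element. The sentinel object `TOP` and the function names are assumptions made for illustration only.

```python
import os

class _Top:
    """Sentinel for the top element: every string is a prefix of it."""
    def __repr__(self) -> str:
        return "TOP"

TOP = _Top()

def leq(t1, t2) -> bool:
    """t1 <= t2 iff t1 is a prefix of t2; everything is <= TOP."""
    if t2 is TOP:
        return True
    if t1 is TOP:
        return False
    return t2.startswith(t1)

def glb(t1, t2):
    """Greatest lower bound: the longest common prefix."""
    if t1 is TOP:
        return t2
    if t2 is TOP:
        return t1
    # commonprefix compares character by character, not by path component.
    return os.path.commonprefix([t1, t2])

def lub(t1, t2):
    """Least upper bound: the longer string if one extends the other, else TOP."""
    if leq(t1, t2):
        return t2
    if leq(t2, t1):
        return t1
    return TOP

print(lub("hello", "world"))  # no common extension, so this is TOP
print(lub("s", "sh"))
print(repr(glb("shall fight on", "shall fight with")))
```

Note that the least upper bound of two ordinary strings is either one of the two strings or $\top$, because two strings have a common extension only when one is a prefix of the other.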
With this lattice, we can see that the space of functions FindSuffix(T) is boundary set representable. The set of functions consistent with moving the cursor from position 1 to position 2 is concisely represented by the range ["s", "shall fight on the seas and oceans...in the air."].
Now, suppose we then observe the cursor moving from position 2 to position 3. This second movement is also consistent with a set of FindSuffix functions, which can be represented with the range ["sh", "shall fight with growing confidence and growing strength in the air."]. In this case, note that the set does not include "s", because the function FindSuffix("s") would have stopped right before "seas" instead of jumping all the way to position 3, so it is not consistent with the example.
Now we mentioned before that one important aspect of symbolic representations is the ability to
manipulate them efficiently. In general, for boundary set representable version spaces,
we can easily compute intersections as shown below:
\[
[a_l, a_h] \cap [b_l, b_h] = [lub(a_l, b_l), glb(a_h, b_h)]
\]
So in this case, if we want the set of programs that are consistent with both demonstrations, we can represent that set concisely as ["sh", "shall fight "], since "sh" is the shortest string that has both "s" and "sh" as a prefix, and "shall fight " is the longest common prefix of the two upper bounds.
Version Space Algebra
Lattice-based version spaces are great if you have them, but they can also be very restrictive. The real power of the Version Space Algebra approach is the ability to symbolically represent complex compositions of simpler version spaces. For example, two powerful operations for combining version spaces are Union and Join, explained below.
Union. The first form of composition is union $VS_{H_1, D} \cup VS_{H_2, D}$. As an example, consider the version space for FindSuffix presented earlier. There is an analogous function FindPrefix(T) that finds the position in the text right after the next occurrence of the string $T$. FindPrefix also has a lattice associated with it, except it is defined in terms of suffixes instead of prefixes.
Now, we saw earlier that given two examples, one moving the cursor from position 1 to 2 and one moving it from position 2 to 3, you could get the space FindSuffix(["sh", "shall fight "]). By a similar process, we can see that the two examples are consistent with the space FindPrefix(["we", ", we"]). So the union of these two version spaces can be represented simply as FindSuffix(["sh", "shall fight "]) $\cup$ FindPrefix(["we", ", we"]).
Join. The second form of composition is the join of two version spaces, defined as
\[
\begin{array}{c}
VS_{H_1, D_1} \bowtie VS_{H_2, D_2} = \{ (h_1, h_2) | h_1 \in VS_{H_1, D_1}, h_2 \in VS_{H_2, D_2}, C((h_1, h_2), D) \}\\
\mbox{where} \\
D=\{(d_1^i, d_2^i)\} \mbox{ given } D_1 = \{d_1^i\}\mbox{ and } D_2=\{d_2^i\}.
\end{array}
\]
The function $C$ in the definition is meant to stand for a consistency
check that can be used to select which pairs in the cross product
can actually be combined together.
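The join definition can be read directly as a filtered cross product. The sketch below enumerates hypotheses explicitly, which a real version space algebra would avoid; it is meant only to show the role of the consistency check $C$. Hypotheses are FindSuffix strings, and `join`, `replays`, and the candidate lists are illustrative assumptions.

```python
from itertools import product

TEXT = ("We shall go on to the end. We shall fight in France, "
        "we shall fight on the seas and oceans, "
        "we shall fight with growing confidence and growing strength in the air.")

def find_suffix(text, cursor, t):
    """Position right before the next occurrence of t after the cursor."""
    hit = text.find(t, cursor + 1)
    return None if hit == -1 else hit

def join(vs1, vs2, data, consistent):
    """All pairs (h1, h2) from the cross product that pass the check C."""
    return [(h1, h2) for h1, h2 in product(vs1, vs2)
            if consistent((h1, h2), data)]

def replays(pair, data):
    """C: each component reproduces its half of every paired example."""
    h1, h2 = pair
    return all(find_suffix(TEXT, s1, h1) == e1 and
               find_suffix(TEXT, s2, h2) == e2
               for (s1, e1), (s2, e2) in data)

shall_starts = [i for i in range(len(TEXT)) if TEXT.startswith("shall", i)]
_, pos1, pos2, pos3 = shall_starts

# One paired example: the first action moves 1 -> 2, the second moves 2 -> 3.
D = [((pos1, pos2), (pos2, pos3))]

print(join(["s", "sh"], ["sh", "shall"], D, replays))
# Passing an unfiltered candidate list shows the check rejecting pairs.
print(join(["s", "sh"], ["s", "sh"], D, replays))
```

In the second call, the pairs whose second component is "s" are dropped, because FindSuffix("s") does not reproduce the move from position 2 to position 3.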
In the case of programming by demonstration (PBD), Join can be used to describe sequences of operations, especially when the demonstration allows us to separately provide evidence for the different actions in the sequence. Continuing with our running example, suppose that the jumps of the cursor, from position 1 to position 2 and then to position 3, were treated as parts of a sequence of actions, rather than as independent examples corresponding to the same program. In that case, we could model this as a three-way join of actions, and maintain a version space for each action in the join.
Overall, the key idea in the version space algebra framework is that
we have symbolic representations of sets of programs. The representations may look
like ASTs, but they are not ASTs representing programs. Instead,
they are ASTs representing the algebraic operations over sets of programs
that led to the current set.
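One way to picture such a representation is as a small tree whose leaves are boundary-set ranges and whose internal nodes are algebraic operations. The sketch below is hypothetical: the class names are invented, the union is assumed to combine disjoint function families, and the join is assumed to be independent (its consistency check accepts every pair). Under those assumptions, the number of represented programs can be computed without enumerating any of them.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RangeVS:
    """Leaf: a function family plus a range [low, high] in its lattice."""
    family: str   # e.g. "FindSuffix" (prefix order) or "FindPrefix" (suffix order)
    low: str
    high: str

    def size(self) -> int:
        # In a prefix or suffix lattice with low <= high, there is exactly
        # one string in the range for each length from len(low) to len(high).
        return len(self.high) - len(self.low) + 1

@dataclass(frozen=True)
class UnionVS:
    """Union of two version spaces (assumed disjoint)."""
    left: object
    right: object

    def size(self) -> int:
        return self.left.size() + self.right.size()

@dataclass(frozen=True)
class JoinVS:
    """Join of two version spaces (assumed independent: C accepts all pairs)."""
    left: object
    right: object

    def size(self) -> int:
        return self.left.size() * self.right.size()

move = UnionVS(RangeVS("FindSuffix", "sh", "shall fight "),
               RangeVS("FindPrefix", "we", ", we"))
two_moves = JoinVS(move, move)

print(move.size(), two_moves.size())
```

The tree `two_moves` has a handful of nodes, yet it stands for every pair of cursor-movement functions drawn from the two ranges.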
Version Spaces meet E-Graphs in FlashFill
The general framework of version space algebras is ultimately only as powerful as the underlying representations used for the individual version spaces. The work of Lau and her collaborators built a complex space of text manipulation functions from lattice-based representations and compositions thereof, but ultimately they were not very successful in providing sufficiently reliable automation. Around 2010, though, Sumit Gulwani and his collaborators revisited some of these ideas armed with more powerful program representations that enabled a careful language design aimed at a narrower class of text manipulations, and developed the FlashFill system, which was eventually incorporated into Microsoft Excel [gulwani:2011:flashfill, SinghG12].
A language for text manipulation. The first key ingredient in the FlashFill system was the design of the language. The language is described in terms of two levels of abstraction. The first level consists of Trace Expressions defined according to the grammar below:
\[
\newcommand{\Loop}{\mbox{Loop}}
\newcommand{\SubStr}{\mbox{SubStr}}
\newcommand{\Pos}{\mbox{Pos}}
\newcommand{\Cat}{\mbox{Concat}}
\begin{array}{rcl}
\mbox{Trace expression} & e :=& \Cat(f_1, \ldots, f_n) ~ | ~ f \\
\mbox{Atomic expression} & f :=& \mbox{ConstStr}(s) ~~~ \mbox{String constant } s \\
~ & ~ & |~ \mbox{SubStr}(v_i, p_1, p_2)~~~ \mbox{Produces a substring of } v_i \mbox{ that lies between the positions indicated by } p_1 \mbox{ and } p_2 \\
~ & ~ & |~ \mbox{Loop}(\lambda w. e) ~~~ \mbox{Looping construct where } e \mbox{ is the loop body and } w \mbox{ the loop index}.\\
\mbox{Position} & p := & CPos(k) ~~~\mbox{Constant position } k\\
~ & ~ & |~ Pos(r_1, r_2, c) ~~~\mbox{The } c^{th} \mbox{ position with a match of } r_1 \mbox{ ending immediately to its left and a match of } r_2 \mbox{ starting immediately to its right}\\
\mbox{Integer expr} & c := & k ~|~ k_1 * w + k_2 ~~~~\mbox{where } w \mbox{ is a loop index} \\
\mbox{Regular expr} & r := & T ~|~ TokenSeq(T_1, \ldots, T_k) | \epsilon
\end{array}
\]
The language above is very expressive. For example, in our running
example, suppose we want to collect all phrases that appear between a
"shall" and a punctuation mark, including the word "shall" itself. We could write such an expression as
\[
\Loop(\lambda w. \SubStr(in, \Pos(``", ``shall", w), \Pos(``",PunctuationTok, w) ) )
\]
The Loop construct iterates the expression until one of the terms becomes invalid. The expression itself
finds the $w^{th}$ substring that starts with "shall" and ends right before a punctuation token.
If we wanted to separate those with commas, we could use the program below instead:
\[
\Loop(\lambda w. \Cat(\SubStr(in, \Pos(``", ``shall", w), \Pos(``",PunctuationTok, w) ), ``, " ) )
\]
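To see how these two programs behave on the running example, here is a small interpreter sketch for the fragment they use. It is not FlashFill's implementation: tokens are modeled as Python regular expressions, `PUNCT` is an assumed stand-in for PunctuationTok, an invalid term is modeled as `None`, and only positive occurrence indices are handled.

```python
import re
from typing import Callable, Optional

TEXT = ("We shall go on to the end. We shall fight in France, "
        "we shall fight on the seas and oceans, "
        "we shall fight with growing confidence and growing strength in the air.")

PUNCT = r"[.,;:!?]"  # assumed stand-in for PunctuationTok

def pos(s: str, r1: str, r2: str, c: int) -> Optional[int]:
    """c-th index t where r1 matches a suffix of s[:t] and r2 a prefix of s[t:]."""
    hits = [t for t in range(len(s) + 1)
            if re.search(rf"(?:{r1})\Z", s[:t]) and re.match(r2, s[t:])]
    return hits[c - 1] if 1 <= c <= len(hits) else None

def substr(s: str, p1: Optional[int], p2: Optional[int]) -> Optional[str]:
    """Substring between two positions, or None if either position is invalid."""
    if p1 is None or p2 is None or p1 > p2:
        return None
    return s[p1:p2]

def concat(*parts: Optional[str]) -> Optional[str]:
    """Concatenation; invalid if any part is invalid."""
    return None if any(p is None for p in parts) else "".join(parts)

def loop(body: Callable[[int], Optional[str]]) -> str:
    """Evaluate body for w = 1, 2, ... and concatenate until it becomes invalid."""
    pieces = []
    w = 1
    while (piece := body(w)) is not None:
        pieces.append(piece)
        w += 1
    return "".join(pieces)

def phrase(w: int) -> Optional[str]:
    return substr(TEXT, pos(TEXT, "", "shall", w), pos(TEXT, "", PUNCT, w))

print(loop(phrase))
print(loop(lambda w: concat(phrase(w), ", ")))
```

On this passage the loop body is valid for $w = 1, \ldots, 4$ and becomes invalid at $w = 5$, when there is no fifth "shall". The example relies on each "shall" being followed by exactly one punctuation mark before the next "shall", which happens to hold here.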
On top of these trace expressions, FlashFill allows some control structure.
\[
\newcommand{\Switch}{\mbox{Switch}}
\newcommand{\Match}{\mbox{Match}}
\begin{array}{rcl}
\mbox{String program} & P :=& \Switch((b_1, e_1), \ldots, (b_n, e_n)) ~ | ~ f \\
\mbox{Boolean condition} & b :=& d_1 \vee \ldots \vee d_n \\
\mbox{Conjunction} & d := & \pi_1 \wedge \ldots \wedge \pi_n\\
\mbox{Predicate} & \pi := & \Match(v_i, r, k) ~|~ \neg \mbox{Match}(v_i, r, k) \\
\end{array}
\]
The $\Switch$ expression chooses the first expression $e_i$ whose
corresponding condition $b_i$ evaluates to true. The $\Match$
predicate is true if there are $k$ or more occurrences of pattern
$r$ in string $v_i$.
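The control structure can be sketched as follows. This is a simplified illustration under several assumptions: there is a single input string, patterns are Python regular expressions counted as non-overlapping occurrences, a condition is a list of conjunctions of `(negated, pattern, k)` triples, and the branch bodies are plain Python functions standing in for trace expressions.

```python
import re

def match(v: str, r: str, k: int) -> bool:
    """Match(v, r, k): true if pattern r occurs at least k times in v."""
    return len(re.findall(r, v)) >= k

def holds(condition, v: str) -> bool:
    """Evaluate a disjunction of conjunctions of (possibly negated) Match predicates."""
    return any(all(match(v, r, k) != negated for (negated, r, k) in conjunction)
               for conjunction in condition)

def switch(cases, v: str):
    """Return the value of the first branch whose condition holds, else None."""
    for condition, expr in cases:
        if holds(condition, v):
            return expr(v)
    return None

# Hypothetical program: keep the text before the comma if there is one,
# otherwise return the input unchanged.
cases = [
    ([[(False, ",", 1)]], lambda v: v.split(",")[0]),
    ([[(True, ",", 1)]], lambda v: v),
]

print(switch(cases, "Churchill, Winston"))
print(switch(cases, "Winston"))
```

Because the first matching branch wins, the order of the cases matters whenever conditions overlap; in this example the two conditions are mutually exclusive.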
Graph representations for program spaces
Representing Trace Expressions
Learning atomic expressions
Intersection
Given two graphs, each representing a set of trace expressions, it is relatively easy to compute a new graph that represents their intersection. At a high level, suppose we are given two graphs $G_1 = (N_1, E_1)$ and $G_2 = (N_2, E_2)$, where $N_i$ are the sets of nodes and $E_i = \{(s^t_i, d^t_i, \gamma^t_i)\}$ are sets of edges from a source node $s^t_i$ to a destination node $d^t_i$ labeled with a set of atomic expressions $\gamma^t_i$. We can compute a graph for the intersection as follows:
\[
\begin{array}{rcl}
G_1 \cap G_2 &=& (N, E) \mbox{ where} \\
N &=& N_1 \times N_2\\
E &=& \{ ((s^t_1, s^v_2) , (d^t_1, d^v_2), \gamma^t_1 \cap \gamma^v_2 ) ~|~ (s^t_1, d^t_1, \gamma^t_1) \in E_1 \wedge (s^v_2, d^v_2, \gamma^v_2) \in E_2 \}
\end{array}
\]
In other words, the graph representing the intersection will have a node for each pair of nodes from the two graphs, and for every pair of edges, there will be an edge labeled with the intersection of their sets of atomic expressions.
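The product construction described in prose above can be sketched as follows, with each product edge labeled by the intersection of the two label sets. The representation is an assumption made for illustration: a graph is a `(nodes, edges)` pair, each edge is a `(source, destination, label)` triple, and atomic expressions are plain strings compared by equality, which is much cruder than the symbolic sets FlashFill manipulates.

```python
from itertools import product

def intersect_graphs(g1, g2):
    """Product graph whose edge labels are intersections of the input labels."""
    nodes1, edges1 = g1
    nodes2, edges2 = g2
    nodes = set(product(nodes1, nodes2))
    edges = {((s1, s2), (d1, d2), l1 & l2)
             for (s1, d1, l1), (s2, d2, l2) in product(edges1, edges2)}
    return nodes, edges

fs = frozenset  # labels must be hashable to live inside a set of edges

g1 = ({0, 1, 2}, {
    (0, 1, fs({"ConstStr('a')", "SubStr(v1,0,1)"})),
    (1, 2, fs({"ConstStr('b')"})),
    (0, 2, fs({"SubStr(v1,0,2)"})),
})
g2 = ({0, 1, 2}, {
    (0, 1, fs({"SubStr(v1,0,1)"})),
    (1, 2, fs({"ConstStr('b')", "SubStr(v1,1,2)"})),
    (0, 2, fs({"SubStr(v1,0,2)"})),
})

nodes, edges = intersect_graphs(g1, g2)
# Edges with an empty label represent no expressions. Dropping them is an
# extra clean-up step that is not part of the construction itself.
nonempty = {e for e in edges if e[2]}
for edge in sorted(nonempty, key=lambda e: (e[0], e[1])):
    print(edge)
```

The construction is quadratic in the number of edges, but each product edge can stand for many expressions at once, so the set of programs is never enumerated.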
[Figure: graph representation; the accompanying text refers to the expression SubStr(in, Pos("", Word, 1), Pos(Char, "", 1)). The indices in the Pos expressions are omitted in the figure for clarity.]
This basic approach has been extended to a number of different domains. More recently, Alex Polozov and Sumit Gulwani have developed a framework called PROSE (also called FlashMeta in the literature) that makes it possible to easily build these kinds of representations for other domains [PolozovG15].