Lecture 19: Synthesis with abstract interpretation
Abstract interpretation is a verification technique that was
first formalized by Patrick Cousot and Radhia Cousot in a
POPL 1977 paper
[CousotC77]. Abstract interpretation is
an extensive field with entire conferences dedicated to
advancing the state-of-the-art. In this section, we give a
very brief introduction to some of the key ideas in abstract
interpretation and abstraction-based analysis and then talk
about how they can be leveraged in the context of program synthesis.
For an imperative programming language, the semantics of a command
describe how that command can transform an initial state into a final state.
More generally, we can use the semantics of the language to infer a set
of possible final states of a program given a set of initial states
for that program.
For example, the figure shows how a single statement
transforms a set of possible initial states into a set of possible final states.
The key idea in abstract interpretation is to define abstract states,
which are symbols that represent sets of concrete states, and then
to define the semantics of the program in terms of these abstract states.
For example, in the figure, we have defined two abstract states: one representing
all the states where the variable is even, and one representing
all the states where it is odd. It is very common in
abstract interpretation to define abstract values as sets of possible
values, and then to define the abstract state as simply a mapping from
variables to abstract values. Given these abstract values, the figure shows
how the operation of addition can be defined for all the different
combinations of abstract values, so for example, adding two odd numbers
results in an even number.
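The addition table described above can be written down directly. The following is a minimal sketch in Python; the names `EVEN`, `ODD`, `abstract_add` and `parity` are illustrative and do not come from the lecture.

```python
# Abstract values for the parity domain (names are illustrative).
EVEN, ODD = "even", "odd"

def abstract_add(a, b):
    """Abstract addition: the sum is even exactly when both parities agree."""
    return EVEN if a == b else ODD

def parity(n):
    """Abstract value that represents the concrete integer n."""
    return EVEN if n % 2 == 0 else ODD

# Soundness check on a few concrete values: the parity of the concrete
# sum always matches the abstract sum of the parities.
for x in range(-3, 4):
    for y in range(-3, 4):
        assert parity(x + y) == abstract_add(parity(x), parity(y))
```

The loop at the end is only a spot check on a small range, not a proof, but it shows what it means for the abstract operation to agree with the concrete one.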
Another key idea in abstract interpretation is to define the abstract
values so that they form a lattice. In the case of our example,
we need two additional abstract values to form a lattice.
The two additional values will correspond to the set of all numbers (Anything, or $\top$),
and the empty set (Nothing or $\bot$).
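As a rough illustration, the four-element lattice and its least upper bound can be sketched as follows. The encoding of abstract values as strings and the function names `leq`, `lub` and `abstract_add` are assumptions made for this example.

```python
# Four-element parity lattice (names are illustrative):
#        TOP
#       /   \
#    EVEN   ODD
#       \   /
#        BOT
BOT, EVEN, ODD, TOP = "bot", "even", "odd", "top"

def leq(a, b):
    """Lattice order: a is below b if a's set of numbers is contained in b's."""
    return a == b or a == BOT or b == TOP

def lub(a, b):
    """Least upper bound (join) of two abstract values."""
    if leq(a, b):
        return b
    if leq(b, a):
        return a
    return TOP  # EVEN and ODD are incomparable; only TOP is above both

def abstract_add(a, b):
    """Addition extended to the full lattice."""
    if BOT in (a, b):
        return BOT          # no concrete values on one side, so no results
    if TOP in (a, b):
        return TOP          # parity of the result is unknown
    return EVEN if a == b else ODD
```

With $\top$ and $\bot$ included, every pair of abstract values has a least upper bound, which is what the analysis needs when control-flow paths merge.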
The abstract domain is related to the concrete domain by
abstraction and concretization functions.
The concretization function maps an abstract state to the set of concrete states
that it represents, while the abstraction
function maps elements or sets of elements to the best abstract
state that contains them, where best means lowest in
the lattice. For example, in the figure we illustrate
how different subsets of concrete values have a corresponding
best abstraction.
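For the parity domain, the two functions can be sketched as follows. Because the concretization of $even$ is an infinite set, this sketch represents it by a membership test; the names `alpha` and `in_gamma` are assumptions made for the example.

```python
BOT, EVEN, ODD, TOP = "bot", "even", "odd", "top"

def alpha(values):
    """Abstraction: the lowest lattice element that covers every value in the set."""
    parities = {v % 2 for v in values}
    if not parities:
        return BOT
    if parities == {0}:
        return EVEN
    if parities == {1}:
        return ODD
    return TOP

def in_gamma(abstract, n):
    """Membership test for the concretization, which is an infinite set in general."""
    if abstract == BOT:
        return False
    if abstract == TOP:
        return True
    return (n % 2 == 0) == (abstract == EVEN)

print(alpha({2, 4, 8}))   # even
print(alpha({1, 2}))      # top
print(alpha(set()))       # bot
```

Every value in a set belongs to the concretization of that set's abstraction, which is the sense in which the abstraction contains the set.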
The central observation in abstract interpretation is that for each program
point, we can compute an abstract state whose concretization
overapproximates the set of states that could occur at that
program point. In principle, assigning $\top$ to every program
point is one such overapproximation, but not a very useful one.
So an additional goal is to find as precise an overapproximation
as possible.
The basic algorithm for computing these approximations is as follows.
First, all program points are initialized to $\bot$, except for the
program entry, where the abstract state is initialized to approximate
the set of possible initial states for the program.
For every basic block, the algorithm then computes an output
state given its input state. At program points where the outputs of
two different basic blocks converge, the analysis computes the least
upper bound of the two incoming states. Every time a basic block
is executed, its output is potentially updated, and every time the
input to a basic block changes, the basic block must be reexecuted.
If the lattice of abstract states has bounded height and the transfer functions are monotone,
the iteration process will eventually converge.
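The iteration described above is commonly implemented with a worklist. The following is a minimal sketch over the parity lattice, tracking a single variable `x`; the control-flow graph, the block names and the function `analyze` are hypothetical and chosen only for illustration.

```python
BOT, EVEN, ODD, TOP = "bot", "even", "odd", "top"

def lub(a, b):
    if a == b or a == BOT:
        return b
    if b == BOT:
        return a
    return TOP

def add(a, b):
    if BOT in (a, b):
        return BOT
    if TOP in (a, b):
        return TOP
    return EVEN if a == b else ODD

def analyze(transfer, preds, entry, entry_state):
    """Worklist iteration; returns the output state of every block."""
    succs = {b: [s for s in preds if b in preds[s]] for b in preds}
    out = {b: BOT for b in preds}          # everything starts at bottom
    worklist = list(preds)
    while worklist:
        block = worklist.pop()
        # The input is the join of the predecessors' outputs.
        state = entry_state if block == entry else BOT
        for p in preds[block]:
            state = lub(state, out[p])
        new_out = transfer[block](state)
        if new_out != out[block]:
            out[block] = new_out
            worklist.extend(succs[block])  # successors must be reexecuted
    return out

# Hypothetical program:  x = 0; while (*) { x = x + 2; }
preds = {"entry": [], "head": ["entry", "body"], "body": ["head"], "exit": ["head"]}
transfer = {
    "entry": lambda s: EVEN,            # x = 0
    "head": lambda s: s,                # loop test does not change x
    "body": lambda s: add(s, EVEN),     # x = x + 2
    "exit": lambda s: s,
}
result = analyze(transfer, preds, "entry", TOP)
print(result["exit"])  # even
```

The loop terminates here because each block's output can only move up the lattice, and the lattice has only four elements.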
Synthesis with abstract interpretation
Consider the program in the figure. The program has two unknowns, $??_1$ and $??_2$, and
it has an assertion that asks us to prove that $x$ is even
at the end of every iteration.
It would be tempting to try to solve for them using the CEGIS approach
(Lecture 10), simply using
abstract interpretation as our checking procedure. Unfortunately that
would not work because when the program fails to verify, abstract
interpretation is unable to provide us with a counterexample input.
On the other hand, abstract interpretation for the program
reduces to a series of equations, where $lub$ is the least
upper bound in the lattice.
$$
\begin{array}{ll}
x_{l1} = \top * ??_1 & y_{l1} = even \\
x_{l2} = lub(x_{l1}, x_{l3}) & y_{l2} = lub(y_{l1}, y_{l3}) \\
x_{l3} = x_{l2} - y_{l2} & y_{l3} = ??_2 + x_{l3} \\
x_{l3} = even &
\end{array}
$$
Given the definitions of $+$, $-$ and $*$ in the abstract domain,
the equations above can be solved directly to find that
$??_1=even$ and $??_2=even$. This tells us that
any values of $??_1$ and $??_2$ that abstract to
$even$ are guaranteed to satisfy the assertion.
Unlike CEGIS, which needs to iterate between verification
and synthesis, in this case a single constraint guarantees
that the resulting program will be correct for any input.
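Because the domain is so small, one way to solve the equations is to try every abstraction of the two unknowns and compute the least solution of the remaining equations by iteration. The sketch below does this by brute force; it is not necessarily how an actual synthesizer would solve the constraints, and the names `verifies`, `h1` and `h2` are assumptions made for the example.

```python
from itertools import product

BOT, EVEN, ODD, TOP = "bot", "even", "odd", "top"

def lub(a, b):
    if a == b or a == BOT:
        return b
    if b == BOT:
        return a
    return TOP

def add(a, b):
    """Abstract + ; abstract - has the same table, since parity ignores sign."""
    if BOT in (a, b):
        return BOT
    if TOP in (a, b):
        return TOP
    return EVEN if a == b else ODD

def mul(a, b):
    if BOT in (a, b):
        return BOT
    if EVEN in (a, b):
        return EVEN        # an even factor makes the product even
    if TOP in (a, b):
        return TOP
    return ODD             # odd * odd

def verifies(h1, h2):
    """Least solution of the equations for a given abstraction of the holes."""
    x1, y1 = mul(TOP, h1), EVEN
    x3 = y3 = BOT
    while True:
        x2, y2 = lub(x1, x3), lub(y1, y3)
        new_x3 = add(x2, y2)          # x_{l3} = x_{l2} - y_{l2}
        new_y3 = add(h2, new_x3)      # y_{l3} = ??_2 + x_{l3}
        if (new_x3, new_y3) == (x3, y3):
            return x3 == EVEN         # the assertion: x_{l3} = even
        x3, y3 = new_x3, new_y3

solutions = [(h1, h2) for h1, h2 in product([EVEN, ODD], repeat=2) if verifies(h1, h2)]
print(solutions)  # [('even', 'even')]
```

Only $even$ and $odd$ are tried for the unknowns because any concrete constant abstracts to one of the two.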
Storyboard Programming: Background
The example above illustrates a particularly simple case of
synthesis with abstract interpretation using an extremely simple
domain. At the other end of the spectrum, in 2011 we published
a paper that used a complex abstract domain for heap-based
data structures to synthesize data-structure manipulations.
The high-level idea of the paper was to allow the programmer
to describe the behavior of a data-structure manipulation
by using abstract shapes as input/output examples.
For example, the figure shows the input and output abstract shapes
for a list reversal.
The initial shape has a head pointer pointing to a concrete node $a$,
which then points to an abstract node $mid$ that represents
an undetermined number of nodes, one of which is labeled $e$
and points to the $b$ node, which is the last node in the list.
The final shape is similar, but with the roles of the $a$ and
$b$ nodes reversed.
The $fold$ and $unfold$
operations show how a single $mid$ node can be expanded
into either a single node, or a single node connected to a similar
$mid$ node. They are used to describe the recursive structure
of the set of nodes represented by $mid$.
In order to understand how the abstract domain works, it is
important to first understand the concrete domain. The concrete
state of the program consists of a set of memory locations and
a set of variables and fields. Variables are represented
as predicates on memory locations. $v_i(l)=true$ indicates
that the variable $v_i$ points to the location $l$.
Fields connect two memory locations and are also represented
by a predicate. For example, in the figure, $Next(a, b)$
indicates that the field $Next$ of $a$ points to $b$.
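To make the predicate view concrete, here is a minimal sketch of a concrete state for a hypothetical four-node list. The node names `m1` and `m2`, the variable name `head` and the helper functions are assumptions made for the example.

```python
# Concrete state of a hypothetical four-node list: a -> m1 -> m2 -> b.
locations = {"a", "m1", "m2", "b"}

# Each variable is a unary predicate, stored as the set of locations where it holds.
variables = {"head": {"a"}}

# Each field is a binary predicate, stored as a set of (source, target) pairs.
fields = {"Next": {("a", "m1"), ("m1", "m2"), ("m2", "b")}}

def var_holds(v, l):
    """True when variable v points to location l."""
    return l in variables[v]

def field_holds(f, src, dst):
    """True when field f of location src points to location dst."""
    return (src, dst) in fields[f]

print(var_holds("head", "a"), field_holds("Next", "a", "b"))  # True False
```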
The abstract domain is similarly represented with
predicates, but the predicates take values in a
three-valued logic, so they can be either $True$,
$False$ or $Unknown$, often represented as $1/2$.
In addition to the predicates
representing fields and variables, there is also
a predicate $sm$ that indicates whether
the node is a
summary node like $mid$
that represents multiple nodes or whether
it represents a single node.
Note that in the example in the figure, the
$Next$ predicate contains a number of $Unknown$ values.
For example, $Next(a, mid)$ is $1/2$ because
$mid$ represents multiple nodes, some of which
are pointed to by the $Next$ field of $a$ and some of which are not.
$Next(mid, b)$ is $1/2$ for the same reason: some
nodes represented by $mid$ may point to $b$, and
some may not.
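The way $Unknown$ values arise can be seen by abstracting a small concrete list. The sketch below merges the two interior nodes of a hypothetical list `a -> m1 -> m2 -> b` into a summary node `mid`. The function `summarize` and the encoding of $1/2$ as a `Fraction` are assumptions made for the example, and `sm` is simplified to a Boolean.

```python
from fractions import Fraction

HALF = Fraction(1, 2)   # the Unknown truth value

def summarize(next_edges, groups):
    """Three-valued Next predicate over abstract nodes.

    groups maps each abstract node to the set of concrete nodes it represents.
    """
    abstract_next = {}
    for u, us in groups.items():
        for v, vs in groups.items():
            truths = {(x, y) in next_edges for x in us for y in vs}
            if truths == {True}:
                abstract_next[u, v] = 1
            elif truths == {False}:
                abstract_next[u, v] = 0
            else:
                abstract_next[u, v] = HALF   # some pairs are linked, some are not
    return abstract_next

next_edges = {("a", "m1"), ("m1", "m2"), ("m2", "b")}
groups = {"a": {"a"}, "mid": {"m1", "m2"}, "b": {"b"}}
sm = {node: len(members) > 1 for node, members in groups.items()}
abstract_next = summarize(next_edges, groups)
print(abstract_next["a", "mid"], abstract_next["mid", "b"], abstract_next["a", "b"])
# 1/2 1/2 0
```

In this example $Next(a, mid)$ is $1/2$ because `a` points to `m1` but not to `m2`, and $Next(mid, b)$ is $1/2$ because only `m2` points to `b`.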
The figure also illustrates how the $Unfold$ operation
modifies the abstract state, either collapsing
$mid$ to a single node, or expanding it into a stand-alone
node and a new $mid$ node.
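The two outcomes of unfolding can be sketched on a heavily simplified representation, in which a shape is just the sequence of node labels along the list. The function `unfold` and the fresh label `n` are assumptions made for the example; the actual abstract state also carries the three-valued predicates.

```python
def unfold(shape, summary="mid", fresh="n"):
    """Return the two shapes obtained by unfolding the first summary node.

    A shape is simplified here to the sequence of node labels along the list.
    """
    i = shape.index(summary)
    collapsed = shape[:i] + [fresh] + shape[i + 1:]            # mid was one node
    expanded = shape[:i] + [fresh, summary] + shape[i + 1:]    # one node, then a new mid
    return [collapsed, expanded]

print(unfold(["a", "mid", "b"]))
# [['a', 'n', 'b'], ['a', 'n', 'mid', 'b']]
```

Because a single abstract shape produces two successors, an analysis that uses this operation has to track a set of shapes rather than a single one.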
As an example of how the abstract domain is used
to verify a program, consider the example in the
figure. The abstract domain as we have defined
it is not sufficiently precise to verify
the code as written. However, it can be modified
by adding an auxiliary variable and explicit fold and
unfold operations. Then, it will be able
to reach a fixpoint.
This work illustrates a point that was first
observed by Vechev, Yahav and Yorsh in 2010
about the interplay between synthesis and abstraction-based verification.
In principle, synthesis of verified code would seem to be
strictly harder than verification. However, verification
is difficult because a good verifier based on abstract interpretation
needs to be robust to all the different ways in which a
programmer may write a given piece of code. On the other hand,
when we synthesize using abstract interpretation as the correctness
criterion, the synthesizer can work around the limitations of
the abstract interpreter. In the example above, our abstract
domain is too simplistic to be able to verify the simple version
of the code, so a synthesizer would not generate that version;
instead, the synthesizer would generate the version with the
auxiliary variable and the explicit fold and unfold
operations, because that is the version that is easy to verify
with the given abstract domain.
Storyboard Programming: Synthesis
The storyboard programming algorithm takes
as input a series of scenarios corresponding
to input and output shapes, as well
as a skeleton like the one in the figure above.
From the skeleton, the system generates a set
of dataflow equations. Normally, the dataflow
equations would involve a set of transition
functions, and the goal would be to
find the abstract states at every program point.
In our case, however, the transition
functions are unknown, because the code is unknown.
So the goal is to discover both the states
and the transition functions to ensure
that the constraints imposed by the user-provided
diagrams are satisfied.
One important aspect of this system is that, as in
the verification example above,
the abstract state involves not just one shape but a set
of abstract shapes, potentially complicating the
representation. Rather than explicitly modeling
the sets, the system uses non-determinism
to represent the possible choices of operations
like $Unfold$, which generate sets of shapes,
so the analysis needs to ensure that the
specification is satisfied for all possible non-deterministic