Introduction to Program Synthesis

© Armando Solar-Lezama. 2018. All rights reserved.

Lecture 19: Synthesis with abstract interpretation

Abstract interpretation is a verification technique that was first formalized by Patrick Cousot and Radhia Cousot in a POPL 1977 paper [CousotC77]. It is an extensive field, with entire conferences dedicated to advancing the state of the art. In this section, we give a very brief introduction to some of the key ideas in abstract interpretation and abstraction-based analysis, and then discuss how they can be leveraged in the context of program synthesis.

Lecture19:Slide3; Lecture19:Slide4; Lecture19:Slide5 For an imperative programming language, the semantics of a command describe how that command can transform an initial state into a final state. More generally, we can use the semantics of the language to infer a set of possible final states of a program given a set of initial states for that program. For example, in the figure, we can see how the statement x:=x+2 transforms a set of possible initial states into a set of possible final states.
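As a minimal sketch of this set-level view, assume a state consists only of the integer value of x. The function name `post_add2` is made up for illustration. Under those assumptions, the effect of x:=x+2 on a set of initial states can be written in Python as:

```python
def post_add2(initial_states):
    """Set of final values of x after executing x := x + 2 from each initial value."""
    return {x + 2 for x in initial_states}


final_states = post_add2({1, 2, 3})
```

Each initial state is transformed independently, so the final set is just the image of the initial set under the statement.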

The key idea in abstract interpretation is to define abstract states, which are symbols that represent sets of concrete states, and then to define the semantics of the program in terms of these abstract states. For example, in the figure, we have defined abstract states Even and Odd, where Even represents all the states where x is even, and Odd represents all the states where x is odd. It is very common in abstract interpretation to define abstract values that each represent a set of possible values, and then to define the abstract state as simply a mapping from variables to abstract values. Given these abstract values, the figure shows how the operation of addition can be defined for all the different combinations of abstract values; for example, adding two odd numbers results in an even number.
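The addition table for Even and Odd can be sketched in a few lines of Python. The names `abstract_add` and `alpha` are illustrative choices, not taken from the lecture, and the sketch covers only the two abstract values introduced so far.

```python
EVEN, ODD = "Even", "Odd"


def abstract_add(a, b):
    """Abstract addition on parities: equal parities give Even, different ones give Odd."""
    return EVEN if a == b else ODD


def alpha(n):
    """Parity abstraction of a single concrete integer."""
    return EVEN if n % 2 == 0 else ODD
```

The abstract operation agrees with the concrete one: for any integers m and n, `abstract_add(alpha(m), alpha(n))` equals `alpha(m + n)`.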

Lecture19:Slide6; Lecture19:Slide7; Lecture19:Slide8 Another key idea in abstract interpretation is to define the abstract values so that they form a lattice. In the case of our example, we need two additional abstract values to form a lattice. The two additional values will correspond to the set of all numbers (Anything, or $\top$), and the empty set (Nothing or $\bot$).

The abstract domain is related to the concrete domain by abstraction and concretization functions. The concretization function maps an abstract state to the set of concrete states that it represents, while the abstraction function maps elements or sets of elements to the best abstract state that contains them, where best means lowest in the lattice. For example, in the figure we illustrate how different subsets of concrete values have a corresponding best abstraction.
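A small Python sketch of the four-element parity lattice may help make these definitions concrete. The names `lub`, `alpha` and `in_gamma` are assumptions made for illustration. Because the concretization of Even is an infinite set, the sketch represents it as a membership test rather than as an explicit set.

```python
BOT, EVEN, ODD, TOP = "Nothing", "Even", "Odd", "Anything"


def lub(a, b):
    """Least upper bound in the lattice Nothing < {Even, Odd} < Anything."""
    if a == b:
        return a
    if a == BOT:
        return b
    if b == BOT:
        return a
    return TOP


def alpha(values):
    """Abstraction: the lowest lattice element that contains every given integer."""
    result = BOT
    for v in values:
        result = lub(result, EVEN if v % 2 == 0 else ODD)
    return result


def in_gamma(v, a):
    """Concretization as a membership test: is integer v represented by abstract value a?"""
    if a == TOP:
        return True
    if a == BOT:
        return False
    return (v % 2 == 0) == (a == EVEN)
```

For instance, `alpha([2, 4])` is Even, while `alpha([1, 2])` is Anything, because no lower element of the lattice contains both an odd and an even number.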

The goal of the analysis is to compute, for each program point, an abstract state whose concretization overapproximates the set of states that could occur at that program point. In principle, assigning $\top$ to every program point is one such overapproximation, but not a very useful one. So an additional goal is to find as precise an overapproximation as possible.

Lecture19:Slide9; Lecture19:Slide10; Lecture19:Slide11 The basic algorithm for computing these approximations is as follows. First, all program points are initialized to $\bot$, except for the program entry, where the abstract state is initialized to approximate the set of possible initial states for the program. For every basic block, the algorithm then computes an output state given its input state. At program points where the outputs of two different basic blocks converge, the analysis computes the least upper bound of the two incoming states. Every time a basic block is executed, its output is potentially updated, and every time the input to a basic block changes, the basic block must be re-executed.

If the lattice corresponding to the abstract state has bounded height and the abstract semantics of every basic block is monotone, the iteration process will eventually converge.
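The iteration can be sketched as a worklist loop. The Python below is a minimal illustration under simplifying assumptions, not the algorithm of any particular tool. The abstract state is a single parity value for one variable x, the control-flow graph is given by a hypothetical `preds` map, and each basic block is a Python function from input state to output state. The example program (x := 0 followed by a loop that repeats x := x + 2) is made up.

```python
BOT, EVEN, ODD, TOP = "Nothing", "Even", "Odd", "Anything"


def lub(a, b):
    """Least upper bound in the parity lattice."""
    if a == b:
        return a
    if a == BOT:
        return b
    if b == BOT:
        return a
    return TOP


def add(a, b):
    """Abstract addition, extended to Nothing and Anything."""
    if BOT in (a, b):
        return BOT
    if TOP in (a, b):
        return TOP
    return EVEN if a == b else ODD


def analyze(blocks, preds, entry, entry_state):
    """blocks: name -> transfer function; preds: name -> list of predecessor names."""
    out = {b: BOT for b in blocks}           # every output starts at bottom
    worklist = list(blocks)
    while worklist:
        b = worklist.pop()
        incoming = entry_state if b == entry else BOT
        for p in preds[b]:                   # join the outputs of all predecessors
            incoming = lub(incoming, out[p])
        new_out = blocks[b](incoming)
        if new_out != out[b]:                # output changed: revisit the successors
            out[b] = new_out
            worklist.extend(s for s in blocks if b in preds[s])
    return out


# Hypothetical program: init is "x := 0"; body is "x := x + 2" and loops back to itself.
blocks = {
    "init": lambda s: BOT if s == BOT else EVEN,
    "body": lambda s: add(s, EVEN),
}
preds = {"init": [], "body": ["init", "body"]}
result = analyze(blocks, preds, "init", TOP)
```

On this example the loop stabilizes with x known to be Even at the output of both blocks, even though the entry state says nothing about x.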

Synthesis with abstract interpretation

Lecture19:Slide15 Consider the program in the figure. The program has two unknowns, and it has an assertion that asks us to prove that y is even at the end of every iteration. It would be tempting to try to solve for the unknowns using the CEGIS approach from Lecture 10, simply using abstract interpretation as our checking procedure. Unfortunately, that would not work, because when the program fails to verify, abstract interpretation is unable to provide us with a counterexample input. On the other hand, abstract interpretation for the program reduces to a series of equations, where $lub$ is the least upper bound in the lattice.
\[
\begin{array}{cc}
x_{l1} = \top * ??_1 & y_{l1} = even \\
x_{l2} = lub(x_{l1}, x_{l3}) & y_{l2} = lub(y_{l1}, y_{l3}) \\
x_{l3} = x_{l2} - y_{l2} & y_{l3} = ??_2 + x_{l3} \\
x_{l3} = even & \\
\end{array}
\]
Given the definitions of $+$ and $-$ in the abstract domain, the equations above can be solved directly to find that $??_1=even$ and $??_2=even$. This tells us that any values of $??_1$ and $??_2$ that can be abstracted to $even$ are guaranteed to satisfy the assertion. Unlike CEGIS, which needs to iterate between verification and synthesis, in this case a single set of constraints guarantees that the resulting program will be correct for any input.
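Because the domain is finite, the equations can also be solved by brute force. The sketch below is one hypothetical way to do that in Python, not the solving procedure used in the lecture. It tries each of even and odd for the two unknowns, iterates the equations to a fixpoint starting from $\bot$, and keeps the assignments for which both x and y are even at l3. That check covers both the constraint in the last row of the system and the assertion on y, and for this system the two single out the same assignment. The definition of `mul`, in particular that an even value times anything is even, is an assumption about the abstract multiplication.

```python
from itertools import product

BOT, EVEN, ODD, TOP = "bot", "even", "odd", "top"


def lub(a, b):
    """Least upper bound in the parity lattice."""
    if a == b:
        return a
    if a == BOT:
        return b
    if b == BOT:
        return a
    return TOP


def add(a, b):
    if BOT in (a, b):
        return BOT
    if TOP in (a, b):
        return TOP
    return EVEN if a == b else ODD


sub = add  # a difference has the same parity as the corresponding sum


def mul(a, b):
    if BOT in (a, b):
        return BOT
    if EVEN in (a, b):
        return EVEN  # an even factor makes the product even, even against top
    if TOP in (a, b):
        return TOP
    return ODD


def solve_equations(h1, h2):
    """Least solution (x_l3, y_l3) of the system for fixed abstract values of ??_1 and ??_2."""
    x1, y1 = mul(TOP, h1), EVEN
    x3 = y3 = BOT
    while True:
        x2, y2 = lub(x1, x3), lub(y1, y3)
        new_x3 = sub(x2, y2)
        new_y3 = add(h2, new_x3)
        if (new_x3, new_y3) == (x3, y3):
            return x3, y3
        x3, y3 = new_x3, new_y3


solutions = [(h1, h2) for h1, h2 in product([EVEN, ODD], repeat=2)
             if solve_equations(h1, h2) == (EVEN, EVEN)]
```

Only the assignment in which both unknowns are even survives. Every other choice drives x or y at l3 up to $\top$.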

Storyboard Programming: Background

Lecture19:Slide16; Lecture19:Slide17; Lecture19:Slide18; Lecture19:Slide19 The example above illustrates a particularly simple case of synthesis with abstract interpretation using an extremely simple domain. At the other end of the spectrum, in 2011 we published a paper that used a complex abstract domain for heap-based data structures to synthesize data-structure manipulations [SinghS11].

The high-level idea of the paper was to allow the programmer to describe the behavior of a data-structure manipulation by using abstract shapes as input/output examples.

For example, the figure shows the input and output abstract shapes for a list reversal. The initial shape has a head pointer pointing to a concrete node $a$, which then points to an abstract node $mid$ that represents an undetermined number of nodes, one of which is labeled $e$ and points to the $b$ node, the last node in the list. The final shape is similar, but with the roles of the $a$ and $b$ nodes reversed. The $fold$ and $unfold$ operations show how a single $mid$ node can be expanded into either a single node, or a single node connected to a similar $mid$ node. They are used to describe the recursive structure of the set of nodes represented by $mid$.

Lecture19:Slide22; Lecture19:Slide23; Lecture19:Slide24; Lecture19:Slide25; Lecture19:Slide26 In order to understand how the abstract domain works, it is important to first understand the concrete domain. The concrete state of the program consists of a set of memory locations and a set of variables and fields. Variables are represented as predicates on memory locations. $v_i(l)=true$ indicates that the variable $v_i$ points to the location $l$. Fields connect two memory locations and are also represented by a predicate. For example, in the figure, $Next(a, b)$ indicates that the field $Next$ of $a$ points to $b$.
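As an illustration of this encoding, the following Python sketch represents a concrete three-node list. The location names `l0`, `l1`, `l2` and the helper names are made up for the example. Variables and fields are stored as finite tables and exposed as predicates.

```python
# Hypothetical concrete heap: head -> l0 -> l1 -> l2
locations = {"l0", "l1", "l2"}
variables = {"head": "l0"}                    # head(l0) = true
next_edges = {("l0", "l1"), ("l1", "l2")}     # Next(l0, l1) = Next(l1, l2) = true


def var_pred(v, loc):
    """v(loc): does variable v point to location loc?"""
    return variables.get(v) == loc


def Next(src, dst):
    """Next(src, dst): does the Next field of src point to dst?"""
    return (src, dst) in next_edges
```

In the concrete domain every predicate is either true or false. For example, `Next("l0", "l1")` holds and `Next("l0", "l2")` does not.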

The abstract domain is similarly represented with predicates, but the predicates take values in a three-valued logic, so they can be either $True$, $False$ or $Unknown$, often represented as $1/2$. In addition to the predicates representing fields and variables, there is also a predicate $sm$ that indicates whether the node is a summary node like $mid$ that represents multiple nodes or whether it represents a single node.

Note that in the example in the figure, the $Next$ field contains a number of $Unknown$ values. For example, $Next(a, mid)$ is $1/2$ because $mid$ represents multiple nodes, some of which are pointed to by $a.next$ and some of which are not. $Next(mid, b)$ is $1/2$ for the same reason: some nodes represented by $mid$ may point to $b$, and some may not.
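A common way to compute with such predicates is Kleene's three-valued logic, where conjunction is a minimum, disjunction is a maximum, and negation is $1 - p$. The Python sketch below assumes that interpretation, and the names `k_and`, `k_or` and `k_not` are illustrative. The table records only the two $Next$ entries discussed above, not the full predicate from the figure.

```python
from fractions import Fraction

FALSE, UNKNOWN, TRUE = Fraction(0), Fraction(1, 2), Fraction(1)


def k_and(p, q):
    """Kleene conjunction: the less certain-true operand wins."""
    return min(p, q)


def k_or(p, q):
    """Kleene disjunction."""
    return max(p, q)


def k_not(p):
    """Kleene negation; Unknown stays Unknown."""
    return 1 - p


# Only the two entries of Next mentioned in the text.
next_pred = {("a", "mid"): UNKNOWN, ("mid", "b"): UNKNOWN}

# "a.next is some node of mid, and that node's next is b": still Unknown.
two_step = k_and(next_pred[("a", "mid")], next_pred[("mid", "b")])
```

Any formula evaluated this way yields $1/2$ whenever the abstract state does not carry enough information to settle it either way.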

The figure also illustrates how the $Unfold$ operation modifies the abstract state, either collapsing $mid$ to a single node, or expanding it into a stand-alone node and a new $mid$ node.
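The two outcomes of unfolding can be sketched with a drastically simplified model. In it, a shape is just a Python list of node names along the $Next$ chain, and the three-valued predicates are ignored entirely. This is an assumption made for illustration and not the representation used in the paper. The fresh node name `n` is also hypothetical.

```python
def unfold(shape, summary="mid", fresh="n"):
    """Return both possible results of unfolding the summary node in a chain of nodes."""
    i = shape.index(summary)
    # Case 1: the summary node stood for exactly one node.
    single = shape[:i] + [fresh] + shape[i + 1:]
    # Case 2: one node is split off, and the rest remains summarized.
    split = shape[:i] + [fresh, summary] + shape[i + 1:]
    return [single, split]


shapes = unfold(["a", "mid", "b"])
```

An analysis that unfolds a shape therefore has to continue with both resulting shapes, since either one may describe the actual heap.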

Lecture19:Slide27; Lecture19:Slide28; Lecture19:Slide29; Lecture19:Slide30 As an example of how the abstract domain is used to verify a program, consider the example in the figure. The abstract domain as we have defined it is not sufficiently precise to verify the code as written. However, it can be modified by adding an auxiliary variable and explicit fold and unfold operations. Then, it will be able to reach a fixpoint.

This work illustrates a point that was first observed by Vechev, Yahav and Yorsh in 2010 [VechevYY10] about the interplay between synthesis and abstraction-based verification. In principle, synthesis of verified code would seem to be strictly harder than verification. However, verification is difficult because a good verifier based on abstract interpretation needs to be robust to all the different ways in which a programmer may write a given piece of code. On the other hand, when we synthesize using abstract interpretation as the correctness criterion, the synthesizer can work around the limitations of the abstract interpreter. In the example above, our abstract domain is too simplistic to verify the simple version of the code, so a synthesizer would not generate that version. Instead, it would generate the version with the auxiliary variable and the explicit fold/unfold operations, because that is the version that is easy to verify with the given abstract domain.

Storyboard Programming: Synthesis

Lecture19:Slide31; Lecture19:Slide32; Lecture19:Slide33; Lecture19:Slide34 The storyboard programming algorithm takes as input a series of scenarios corresponding to input and output shapes, as well as a skeleton like the one in the figure above.

From the skeleton, the system generates a set of dataflow equations. Normally, the dataflow equations would involve a set of transition functions, and the goal would be to find the abstract states at every program point. In our case, however, the transition functions are unknown, because the code is unknown. So the goal is to discover both the states and the transition functions to ensure that the constraints imposed by the user-provided diagrams are satisfied.

One important aspect of this system is how it represents abstract states. Recall from the verification example above that the abstract state at a program point involves not just one abstract shape but a set of them, which could complicate the representation. Rather than modeling the sets explicitly, the system uses non-determinism to represent the possible outcomes of operations like $Unfold$ that generate sets of shapes. The analysis then needs to ensure that the specification is satisfied for all possible non-deterministic choices.