Data structure consistency is a core program correctness property. Errors that corrupt data structures can cause programs to produce incorrect results, crash, or loop forever. Motivated by these issues, my research group pioneered two orthogonal data structure consistency techniques: data structure repair, which detects and repairs inconsistent data structures, and data structure verification, which formally verifies that data structure implementations satisfy their specifications.
Our data structure repair research produced the first specification-driven data structure repair algorithms and systems. Our data structure verification research produced the first verification of full functional correctness for linked data structures such as mutable lists, trees, graphs, and hash tables.
Software errors (such as data structure implementation defects or memory errors in other components) and hardware errors can cause data structures in running programs to violate key data structure consistency properties. This violation can cause the program to produce unacceptable results or even fail. Our IEEE TSE 2006 paper presents techniques that repair the data structure, eliminate inconsistencies, and enable the system to continue to execute successfully. We have used our data structure repair system to repair inconsistent data structures and eliminate failures in a variety of applications, including an air-traffic control system, an open-source word processing system, and an interactive multiplayer game.
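To make the repair idea concrete, here is a minimal sketch in plain Java (an illustration only: the Node type is hypothetical, and our systems derive repair actions from declarative specifications rather than hard-coding them). It detects violations of the doubly-linked-list consistency property n.next.prev == n and patches them so that execution can continue:

```java
final class Node {
    Node next, prev;
}

final class ListRepair {
    // Returns the number of repairs performed (assumes an acyclic list).
    static int repair(Node head) {
        int repairs = 0;
        for (Node n = head; n != null && n.next != null; n = n.next) {
            if (n.next.prev != n) {   // consistency property violated
                n.next.prev = n;      // repair: restore the back pointer
                repairs++;
            }
        }
        return repairs;
    }
}
```

A deployed repair system must also handle cyclic and unreachable structures and choose among alternative repair actions; in our systems these choices are driven by the specification.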
Earlier versions of this research appeared in our ICSE 2005 paper, our ISSRE 2003 paper, our OOPSLA 2003 paper, and our SMS 2003 paper.
Our data structure verification research pioneered foundational techniques (such as integrated reasoning) for specifying and verifying data structure correctness, including the first verification of full functional correctness for a range of linked data structures such as mutable lists, trees, graphs, and hash tables. In general, we are interested in two kinds of data structure consistency:
Local Consistency: Each individual data structure (such as a tree, list, or hash table) maintains its representation invariant. For example, the representation invariant for a doubly-linked list states that following the next pointer and then the prev pointer produces a pointer to the original list element. The representation invariant for a hash table states that each entry is correctly indexed. Representation invariants are typically quite detailed and hold universally for all instantiations of the given data structure.

Global Consistency: Invariants that involve multiple data structures. For example, a system may require the set of objects that participate in one data structure to be a subset of the set of objects in another, or may require two data structures to contain disjoint sets of objects. Unlike representation invariants, these properties cross data structure (and often module) boundaries.
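As a concrete illustration of local consistency, the sketch below (plain Java, illustrative only; the Entry type is hypothetical) checks the hash table representation invariant described above: every entry must reside in the bucket that its key hashes to.

```java
import java.util.Objects;

final class HashTableRepCheck {
    static final class Entry {
        final Object key; Object value; Entry next;
        Entry(Object key, Object value, Entry next) {
            this.key = key; this.value = value; this.next = next;
        }
    }

    // The representation invariant: every entry sits in the bucket
    // that its key's hash code selects.
    static boolean repOk(Entry[] buckets) {
        for (int i = 0; i < buckets.length; i++) {
            for (Entry e = buckets[i]; e != null; e = e.next) {
                if (Math.floorMod(Objects.hashCode(e.key), buckets.length) != i) {
                    return false;  // entry is incorrectly indexed
                }
            }
        }
        return true;
    }
}
```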
Two problems complicate the realization of this goal. The first problem is scalability - analyses capable of verifying our target class of sophisticated consistency properties simply do not scale to sizable programs, and there is little hope that they ever will.
The second problem is diversity. There is an enormous range of data structures with very different consistency properties. Consider, for example, a list and a hash table. The list consistency properties involve multiple objects linked together with pointers, while the hash table consistency properties involve arrays and array indices. Verifying these different kinds of properties requires analyses built around very different underlying abstractions.
We adopt a new perspective on the data structure consistency problem: we propose a technique that developers can use to apply multiple pluggable analyses to the same program, with each analysis applied to the modules for which it is appropriate. The key features of this technique include:
Modular Analysis and Shared Objects: In our approach, the program contains a collection of modules, each of which encapsulates the implementation of some of the data structures. Instead of attempting to analyze the entire program, each analysis operates on a single module. By focusing each analysis on only those parts of the program that are relevant for the properties it is designed to verify, we enable the application of precise analyses to programs composed of multiple modules.
One factor that complicates this approach is the need for objects to participate in multiple data structures and therefore the need to share objects between modules analyzed by different algorithms. To eliminate the possibility that one module may corrupt another's data structures (and to ensure that each algorithm analyzes all of the relevant code), modules encapsulate fields (and not objects): the underlying language prevents one module from accessing the fields of another module. Each module therefore encapsulates all of the fields required to implement its data structure; objects that participate in multiple data structures from multiple modules contain fields from each of these modules.
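The sketch below illustrates this idea in schematic form (plain Java; the comments stand in for the language-level field ownership that our approach enforces, and the module names are hypothetical): a single Process object carries fields owned by two different modules, so it can participate in both data structures without either module being able to corrupt the other's links.

```java
final class Process {
    // Owned by the run-queue module: only run-queue code may access these.
    Process runNext, runPrev;

    // Owned by the sleep-queue module: only sleep-queue code may access this.
    Process sleepNext;

    final int pid;   // shared, immutable identity visible to all modules
    Process(int pid) { this.pid = pid; }
}
```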
Sets and Interfaces: Each module has an implementation and an interface. It is the responsibility of the analysis to verify that the implementation correctly implements the interface. Instead of exposing the implementation details of the encapsulated data structure, the interface uses a collection of abstract sets to summarize the effect of each procedure. This collection of sets characterizes how objects participate in various data structures. For example, the interface for a linked list might have an abstract set that contains all of the objects in the linked list. The interface for the insert procedure would indicate that the procedure adds the inserted object into the set; the interface for the remove procedure would indicate a corresponding removal from the set.
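Here is a minimal sketch of such an interface, with contracts written as comments in a schematic set notation (not the concrete Hob specification language), reusing the hypothetical Process type from the sketch above. The abstract set Content summarizes the data structure without exposing how the links are represented:

```java
interface RunQueue {
    // requires: p !in Content
    // ensures:  Content' == Content + {p}
    void insert(Process p);

    // requires: p in Content
    // ensures:  Content' == Content - {p}
    void remove(Process p);

    // ensures:  result == (p in Content)
    boolean contains(Process p);
}
```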
Abstraction Modules and Data Structure Consistency: Each analysis uses an abstraction module to establish the connection between the module's implementation and its interface. This abstraction module enables each analysis to translate the set membership properties of those objects that cross module boundaries back into concrete data structure properties. These concrete properties are often crucial for preserving the internal consistency of the data structures.
For example, a list insert procedure may require that an object to be inserted must not already be a member of the abstract set containing all of the objects in the list. This set membership property, in turn, may enable the analysis of the insert procedure to recognize that the object is not already present in the list, which may be the precondition required to preserve the internal consistency of the list data structure.
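Continuing the running sketch (again in schematic notation, not the concrete abstraction-module syntax), the abstraction definition below ties the abstract set Content to the concrete links and shows why the translated precondition matters for internal consistency:

```java
final class RunQueueImpl {
    // Abstraction definition (schematic):
    //   Content == { n | n is reachable from head via runNext }
    private Process head;

    // The interface precondition "p !in Content" translates, through the
    // abstraction definition, into the concrete fact that p is not
    // reachable from head; this is exactly what makes the two writes
    // below safe, since inserting an element that is already in the
    // list would create a cycle and corrupt the queue.
    void insert(Process p) {
        p.runNext = head;
        head = p;
    }
}
```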
Properties Involving Multiple Modules: Systems often maintain important data structure consistency invariants that involve multiple data structures and cross encapsulation boundaries. For example, some systems may require the sets of objects that participate in two data structures to be disjoint; others may require every object present in one data structure to also be present in another. Such invariants are typically violated (temporarily and legitimately) as modules execute a coordinated sequence of data structure updates. Because these invariants involve objects that participate in different data structures encapsulated in different modules analyzed by different analyses, the analyses must somehow interoperate if they are to successfully verify the invariant.
Our approach uses analysis scopes to identify invariants that involve sets from multiple modules. Some of these modules are exported and can be invoked from outside the scope; the other modules may be invoked only from within the scope. When our analysis processes an exported module, it ensures that if the invariant holds upon entry to the exported module, then it is correctly restored upon exit.
For invariants that involve multiple modules, the set abstraction enables different analyses to combine their results to verify that the program properly coordinates its data structure operations to preserve the invariant. This approach therefore enables the application of arbitrarily sophisticated analyses to the modules that require the precision such analyses can bring, while supporting the use of simpler, more efficient analyses in the remainder of the program.
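The sketch below shows the entry/exit pattern the analysis checks for an exported procedure (schematic: the Scheduler and SleepQueue names are hypothetical, RunQueue and Process come from the earlier sketches, and SleepQueue is assumed analogous to RunQueue with its own set Sleeping). The scope invariant is that the Running and Sleeping sets are disjoint:

```java
interface SleepQueue {
    void insert(Process p);   // ensures: Sleeping' == Sleeping + {p}
    void remove(Process p);   // ensures: Sleeping' == Sleeping - {p}
}

final class Scheduler {
    private final RunQueue runQueue;
    private final SleepQueue sleepQueue;
    Scheduler(RunQueue r, SleepQueue s) { runQueue = r; sleepQueue = s; }

    // Exported procedure: the disjointness invariant must hold on entry
    // and be restored by exit; in between it may be (and is) violated.
    void sleep(Process p) {
        sleepQueue.insert(p);  // p now in both sets: invariant violated,
                               // but only at a point internal to the scope
        runQueue.remove(p);    // p leaves Running: disjointness restored
    }
}
```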
Together, these features enable the focused application of a full range of precise analyses to programs that contain multiple data structures encapsulated in multiple modules. They promote the development of a range of pluggable analyses that developers can deploy as necessary to verify important data structure consistency properties. Abstract sets enable different analyses to communicate and interoperate to verify properties that cross module boundaries and involve multiple data structures. Our approach supports the appropriately verified participation of objects in multiple data structures, including patterns in which objects migrate sequentially through different data structures and patterns in which objects participate in multiple data structures simultaneously.
This research produced the following systems:
The Hob System: Hob focuses on both the internal consistency properties of linked data structures (such as the consistency of the links in a tree or list) and consistency properties that involve multiple data structures (for example, that the set of objects in one data structure must be a subset of the set of objects in another). Hob uses sets of abstract objects in its specifications. It can therefore model sophisticated typestate properties of objects as well as important data structure consistency constraints. One particularly noteworthy aspect of the Hob system is its modular design: it includes a variety of techniques that enable modular specification and verification of data structures, including the use of pluggable, arbitrarily sophisticated program analysis and verification systems.
One of the most significant results of the Hob project is that its specification techniques enable the verification of externally relevant design properties. The specification for our Hob implementation of the Minesweeper game, for example, expresses the rules of the game in a way that is directly relevant to users of the game. These specifications stand in stark contrast to standard software specifications, which are largely incomprehensible to users of the system since they focus on the internal implementation details of the program rather than on properties that are directly relevant to users.
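To give the flavor of such a user-level rule (this is an illustration, not the actual Hob Minesweeper specification): while the game is in play, the set of exposed cells and the set of mined cells must be disjoint. A runtime check of that set property might read:

```java
import java.util.Collections;
import java.util.Set;

final class MinesweeperRule {
    // While the game is in play, no exposed cell may be mined.
    static <C> boolean ruleHolds(Set<C> exposed, Set<C> mined, boolean gameOver) {
        return gameOver || Collections.disjoint(exposed, mined);
    }
}
```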
Patrick Lam's PhD thesis presents a comprehensive overview of the Hob system. Our AOSD 2005 paper shows how we use aspect-oriented techniques to enable the modular specification and verification of programs that manipulate sophisticated linked data structures. Our Hob IEEE Transactions on Software Engineering paper focuses on how Hob enables multiple modular pluggable program analyses to work together to prove sophisticated correctness properties. See my publications page for more Hob publications.
The Jahob System: Jahob is a comprehensive program specification and verification system. It uses integrated reasoning to solve verification problems - it combines a broad collection of automated reasoning systems (with varying performance, reasoning power, and scope characteristics) that work together within the Jahob framework to verify the automatically generated verification conditions. Jahob has been used to verify the full functional correctness of a collection of linked data structures. By full functional correctness we mean that the system verifies every property (with the exception of resource consumption properties such as running time and memory consumption) that the client relies on to use the data structure correctly.
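The sketch below conveys the integrated-reasoning idea with a hypothetical Prover interface (Jahob's actual architecture is more sophisticated): each verification condition is offered to a portfolio of reasoning systems with different power and performance trade-offs, and it suffices for any one of them to discharge the condition.

```java
import java.util.List;

interface Prover {
    // Returns true if this reasoning system can show the condition valid.
    boolean prove(String verificationCondition);
}

final class IntegratedReasoner {
    // A portfolio with varying performance, power, and scope: for
    // example, a fast syntactic rewriter, an SMT solver, a decision
    // procedure for a specialized logic, an interactive prover.
    private final List<Prover> provers;

    IntegratedReasoner(List<Prover> provers) { this.provers = provers; }

    boolean discharge(String vc) {
        for (Prover p : provers) {
            if (p.prove(vc)) return true;  // any single success suffices
        }
        return false;  // condition remains unproved
    }
}
```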
Our PLDI 2008 paper describes how we used Jahob to perform the first verification of full functional correctness for a range of linked data structures, including mutable lists, trees, graphs, and hash tables. Our PLDI 2009 paper presents our integrated proof language, which enables developers to guide the correctness proofs while staying within the conceptual framework of the underlying imperative programming language. In this approach, the developer simply controls key proof choice points, with the powerful Jahob integrated reasoning system performing the vast majority of the verification.
Our PLDI 2011 paper presents a technique for proving that operations on linked data structures commute - a key property for reasoning about many parallel and distributed computations. See my publications page for more Jahob publications, including a broad range of publications on the logical techniques behind Jahob.
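To illustrate the property itself (the paper verifies such commutativity conditions statically; this runnable check is only an illustration): two operations commute when executing them in either order produces the same return values and the same final data structure state. For a set, for example, add(x) and contains(y) commute exactly when x != y or y is already in the set:

```java
import java.util.HashSet;
import java.util.Set;

final class CommutativityDemo {
    public static void main(String[] args) {
        Set<Integer> s1 = new HashSet<>(Set.of(1, 2));
        Set<Integer> s2 = new HashSet<>(Set.of(1, 2));

        // Order 1: add(3), then contains(4).
        s1.add(3);
        boolean r1 = s1.contains(4);

        // Order 2: contains(4), then add(3).
        boolean r2 = s2.contains(4);
        s2.add(3);

        // Same return values and same final states: the two operations
        // commute on this state (here because 3 != 4).
        System.out.println(r1 == r2 && s1.equals(s2));  // prints true
    }
}
```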
Viktor Kuncak's PhD thesis describes the techniques that form the foundation of the Jahob system. Karen Zee's PhD thesis presents our experience using Jahob to verify the full functional correctness of a range of linked data structures.