Performance, Power, and Accuracy Trade-Offs
Many computations have an inherent trade-off between accuracy,
performance, and power - if the client of the computation is willing
to accept a slightly less accurate result, it is often possible to
adapt the computation to deliver that result while consuming
substantially less power or compute time. But most implementations of
these computations operate at only one point in this underlying
trade-off space. In many cases the implementor of the computation is
not even aware that this trade-off space exists and that the
computation has the ability to operate successfully at a variety of
points in this space.
My research group has developed a variety of techniques that
automatically discover and exploit tradeoffs between accuracy,
performance, and power. These techniques use rich program
transformations (program transformations that may change the
semantics of the program) to induce a search space of programs
surrounding the original program. They then search this space to find
programs with desirable accuracy, performance, and power consumption
characteristics. Specific techniques include:
- Rely and Chisel:
These two systems are designed to support computations that execute
on approximate and/or unreliable hardware systems. The goal is to
provide programming support that enables developers to reason about
the reliability and accuracy of computations that execute on such
platforms. Rely enables developers to
specify reliability requirements (i.e., minimum probabilities
with which the program must produce a correct result), which
the Rely compiler then checks. Chisel
works with combined reliability and accuracy requirements. Starting
with a specification of these requirements, the Chisel compiler
automatically chooses which operations must execute precisely and
which can execute unreliably/approximately while still satisfying
the reliability and accuracy requirements. Chisel also determines
which variables can be stored in reliable memory and which can
be stored in approximate memory. The selection of unreliable/approximate
operations and memory is performed to maximize a resource objective
such as energy savings.
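As a rough illustration of the reasoning these systems are built around, the sketch below computes the reliability of a result that flows through a chain of unreliable operations and compares it against a specified minimum. This is a minimal, hypothetical sketch in C, not Rely or Chisel syntax; the per-operation reliabilities and the threshold are assumed values.

```c
/*
 * Hypothetical sketch of reliability reasoning: if each unreliable
 * operation produces a correct result with probability r_i
 * (independently), a value that flows through all of them is correct
 * with probability prod(r_i).  A checker can compare that product
 * against the developer's specified minimum reliability.
 * This is NOT Rely syntax.
 */
#include <stdio.h>

int main(void) {
    /* per-operation reliabilities of three unreliable operations (assumed) */
    double op_reliability[] = { 0.99999, 0.99999, 0.9999 };
    double spec_min = 0.9997;      /* developer's reliability requirement (assumed) */

    double result_reliability = 1.0;
    for (int i = 0; i < 3; i++)
        result_reliability *= op_reliability[i];

    printf("computed reliability %.6f, required %.6f -> %s\n",
           result_reliability, spec_min,
           result_reliability >= spec_min ? "accepted" : "rejected");
    return 0;
}
```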
- Statistical Accuracy and Performance Models: My ICS 2006 paper presents techniques for
automatically deriving statistical performance and accuracy models
that characterize the effect of discarding tasks in Jade
computations. It is then possible to use these models to purposefully
discard tasks to improve performance or reduce the power consumption
of the computation while preserving acceptably accurate computation.
My OOPSLA 2007 paper
demonstrates how to use these techniques to eliminate barrier idling
(idle time incurred when processors wait at a barrier for other
processors to complete tasks) in parallel computations.
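The sketch below illustrates the underlying idea under simplifying assumptions: a computation that averages the results of many independent tasks executes only a sampled subset of them and uses the sample mean, trading a small accuracy loss for proportionally less work. The task function and sampling stride are hypothetical; the sketch does not reproduce the statistical models from the papers.

```c
/*
 * Hypothetical task-discarding sketch (not the Jade implementation).
 * The computation averages N task results; executing only a sampled
 * subset and averaging those results gives a nearby answer with a
 * fraction of the work.
 */
#include <stdio.h>

double task(int i) { return 50.0 + (i % 7); }      /* stand-in task result */

int main(void) {
    int n = 10000, stride = 4;                      /* execute 1 of every 4 tasks (assumed) */
    double full = 0.0, sampled = 0.0;
    int executed = 0;

    for (int i = 0; i < n; i++) full += task(i);
    full /= n;

    for (int i = 0; i < n; i += stride) { sampled += task(i); executed++; }
    sampled /= executed;

    printf("full mean %.3f, sampled mean %.3f (%d of %d tasks)\n",
           full, sampled, executed, n);
    return 0;
}
```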
- Loop Perforation: Loop perforation automatically
transforms loops to skip some of the iterations that they would
otherwise perform. Our
MIT 2009 Technical
Report shows how to use loop perforation to automatically augment
computations with the ability to operate successfully at a variety of
points in their underlying performance, accuracy, and power
consumption trade-off space. It also shows how to use loop
perforation to enable computations to dynamically adapt to
fluctuations in the amount of delivered computational resources. Such
fluctuations can be caused, for example, by dynamic voltage/frequency
scaling, the loss of processors, or varying load.
Our ICSE 2010 paper shows how to use
loop perforation as a profiling technique that helps developers find
computations that are amenable to optimizations that trade accuracy in
return for increased performance or reduced power consumption.
Our ESEC/FSE 2011 paper shows how to use
loop perforation to automatically discover and exploit performance vs.
accuracy tradeoffs that are inherently present (but, in the absence
of loop perforation, hidden) in many applications.
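A minimal sketch of the transformation, assuming a simple accumulation loop: the perforated loop advances its induction variable by a perforation factor and therefore performs only a fraction of the iterations. The loop body and the factor are illustrative, not taken from the papers.

```c
/*
 * Minimal loop perforation sketch (illustrative; not the compiler
 * transformation itself).  The perforated loop skips iterations,
 * trading accuracy of the accumulated result for proportionally
 * less work.
 */
#include <stdio.h>

#define N 1000000

double work(int i) { return 1.0 / (i + 1.0); }     /* stand-in loop body */

int main(void) {
    int factor = 2;                                 /* perforation factor (assumed) */
    double exact = 0.0, perforated = 0.0;

    for (int i = 0; i < N; i++)                     /* original loop */
        exact += work(i);

    for (int i = 0; i < N; i += factor)             /* perforated loop: skips iterations */
        perforated += work(i);

    printf("exact %.4f vs perforated %.4f\n", exact, perforated);
    return 0;
}
```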
- Dynamic Knobs: Programs sometimes come with static
configuration parameters that control the performance, power
consumption, and accuracy of various algorithms within the
program. Our ASPLOS 2011 paper and technical report
show how to automatically find such parameters,
convert them into dynamic knobs, and use the dynamic knobs to
dynamically adapt the computation to respond to changes in the
characteristics of the underlying computational platform.
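The sketch below shows the general idea in hypothetical form: a formerly static configuration parameter becomes a runtime variable that a simple controller adjusts to meet a per-frame time budget. The knob name, budget, and control policy are assumptions for illustration, not the ASPLOS 2011 system.

```c
/*
 * Hypothetical dynamic-knob sketch.  A formerly static configuration
 * parameter ("quality") becomes a variable that a simple controller
 * adjusts at run time: if recent frames missed their time budget,
 * lower the knob (less accuracy, less work); if they had slack,
 * raise it.
 */
#include <stdio.h>
#include <time.h>

static int quality = 8;                      /* dynamic knob: 1 (fast) .. 10 (accurate) */

static void process_frame(int q) {           /* stand-in knob-controlled computation */
    volatile double acc = 0.0;
    for (long i = 0; i < 100000L * q; i++) acc += 1e-9 * i;
}

int main(void) {
    double budget = 0.01;                    /* seconds per frame (assumed target) */
    for (int frame = 0; frame < 20; frame++) {
        clock_t t0 = clock();
        process_frame(quality);
        double elapsed = (double)(clock() - t0) / CLOCKS_PER_SEC;
        if (elapsed > budget && quality > 1)
            quality--;                        /* behind: trade accuracy for speed */
        else if (elapsed < 0.8 * budget && quality < 10)
            quality++;                        /* slack: regain accuracy */
        printf("frame %d: %.4fs, quality knob now %d\n", frame, elapsed, quality);
    }
    return 0;
}
```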
- Automatic Parallelization with Statistical Accuracy Tests:
Our MIT 2010 Technical Report
presents the QuickStep compiler, which contains a collection of transformations
(specifically, parallelism introduction transformations, accuracy
enhancing transformations, and performance enhancing transformations).
Starting with a sequential program,
QuickStep automatically applies these transformations to generate a
space of candidate parallel programs, using statistical accuracy
tests to evaluate the acceptability of each candidate parallel program.
QuickStep can automatically generate parallel computations (such as
efficient computations that contain acceptably infrequent data races)
that are inherently beyond the reach of previously proposed automatic
parallelization techniques.
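The following sketch suggests, under simplifying assumptions, how a statistical accuracy test might work: a candidate parallel version is run repeatedly, each output is compared against the sequential output with a relative-difference distortion metric, and the candidate is accepted only if the distortion stays within a bound. The stand-in result functions, the bound, and the number of runs are hypothetical; QuickStep's actual test is more sophisticated.

```c
/*
 * Hypothetical statistical accuracy test in the spirit of QuickStep
 * (not its actual implementation).  The candidate is accepted only if
 * its output stays within an acceptable distortion of the sequential
 * output on every observed run.
 */
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

double sequential_result(void) { return 100.0; }           /* reference output (stand-in) */
double candidate_result(void) {                             /* parallel candidate (stand-in): */
    return 100.0 + ((rand() % 100) - 50) / 1000.0;          /* small nondeterministic variation */
}

int main(void) {
    double bound = 0.01;          /* acceptable distortion (assumed) */
    int runs = 30, accepted = 1;
    double ref = sequential_result();
    for (int i = 0; i < runs; i++) {
        double d = fabs(candidate_result() - ref) / fabs(ref);
        if (d > bound) { accepted = 0; break; }
    }
    printf("candidate %s\n", accepted ? "accepted" : "rejected");
    return 0;
}
```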
Robust, Secure Software
Software systems are often perceived to be fragile and brittle,
vulnerable to security exploits and errors that cause the system to
crash. My research group has developed a set of techniques that
make software systems more robust, resilient, reliable, and secure by enabling
them to survive a range of otherwise fatal errors. The overall
goal is to make the software survive any error,
continue to execute, and provide acceptable service to its users. One particularly
important application of this technology is to eliminate
security vulnerabilities such as errors that enable
remote software injection attacks.
My Onward! 2003 paper
on Acceptability-Oriented Computing presents the conceptual foundations
behind this approach to robust software systems.
Our Onward! 2005 paper
on Exploring the Acceptability Envelope further explores the implications
of acceptability-oriented computing. Specific
acceptability-oriented computing techniques include:
- Eliminating Infinite Loops:
My research group has developed a system, Jolt, for
automatically eliminating infinite loops. When activated,
Jolt observes loop executions to determine if the state is
the same on successive iterations. If so, it exits the loop
by jumping to a location outside the loop. Our
ECOOP 2011 paper
describes the technique in more detail and presents experimental
results that demonstrate its effectiveness in eliminating
infinite loops in a set of benchmark applications.
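A minimal sketch of the detection idea, assuming the loop's live state fits in a small struct: snapshot the state at the top of each iteration and exit the loop if two successive snapshots are identical. The monitored loop and its state are hypothetical; this is not the Jolt implementation.

```c
/*
 * Hypothetical infinite-loop escape sketch in the Jolt style: if the
 * loop's live state is unchanged between successive iterations, the
 * loop can make no further progress, so exit it.
 */
#include <stdio.h>
#include <string.h>

struct state { int remaining; };        /* live state of the monitored loop (assumed) */

int main(void) {
    struct state cur = { 7 }, prev = { -1 };
    int first = 1;
    while (cur.remaining != 0) {        /* buggy loop: remaining never reaches 0 */
        if (!first && memcmp(&cur, &prev, sizeof cur) == 0) {
            printf("state unchanged across iterations: exiting loop\n");
            break;                       /* jump to a location outside the loop */
        }
        prev = cur;
        first = 0;
        cur.remaining = cur.remaining | 1;   /* bug: intended update has no effect */
    }
    printf("execution continues after the loop\n");
    return 0;
}
```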
- Input Rectification:
Our Onward! 2007 paper
shows how to manually
build input rectifiers that automatically transform inputs
to eliminate security exploits and features that exercise untested or
poorly tested parts of the software system.
Our ICSE 2012 paper
shows how to automatically construct rectifiers by learning
characteristics of typical inputs that the system
processes correctly. When placed between the
system and its input sources, input rectifiers can increase
security and robustness by keeping the system within a tested and
reliable region of its execution envelope. A key concept is
the comfort zone of a system - the region of
the input space that the system can process without failure,
typically because the system has been tested extensively
on this region.
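The sketch below gives a hypothetical flavor of a learned rectifier: a maximum observed value for a length field is learned from inputs the system handles correctly, and inputs whose field exceeds that bound are altered to fall back inside the comfort zone before reaching the application. The input format, field, and bound are illustrative assumptions.

```c
/*
 * Hypothetical input-rectifier sketch.  The rectifier learns, from
 * inputs the system processes correctly, a maximum observed value for
 * a length field; inputs whose field exceeds that bound are altered
 * before being passed to the application.
 */
#include <stdio.h>

struct packet { unsigned length; unsigned char payload[256]; };

static unsigned learned_max_length = 200;   /* learned from training inputs (assumed) */

void rectify(struct packet *p) {
    if (p->length > learned_max_length) {
        printf("rectifier: clamping length %u to %u\n", p->length, learned_max_length);
        p->length = learned_max_length;      /* force the input back into the tested region */
    }
}

int main(void) {
    struct packet p = { 5000, {0} };         /* over-long length field: potential exploit */
    rectify(&p);
    /* the application now sees a length within its comfort zone */
    return 0;
}
```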
- Automatic Inference and Enforcement of Observed Invariants:
In collaboration with
Michael Ernst's research group, we have developed
techniques for automatically inferring and enforcing observed invariants
such as data structure consistency properties and properties of
compiled binaries. The goal is to develop patches that correct the effects of errors,
enabling the software to continue to execute and provide service to its
users in spite of the errors.
Our ClearView system uses this technique to automatically
patch otherwise fatal errors in compiled binaries with no
access to source code or debugging information. It observes normal,
error-free executions of the program to derive a model (in the
form of invariants observed to be true in all observed executions)
of the normal behavior of the program. It responds to errors that
cause the program to crash by finding
observed invariants that are violated when the program crashes,
developing patches that change the state of the system
(typically by changing the values in memory locations or registers)
to enforce violated invariants, then observing executions of the
system to find patches that successfully eliminate the effect
of the error to enable the program to continue to execute.
This technique can be especially effective in the context of
a community of machines running similar software, since it
can enable ClearView to obtain patches that make all
machines invulnerable to an attack even though only a
few have actually observed the attack.
See our SOSP 2009 paper on this topic for
more information on ClearView. An earlier system automatically
inferred and enforced data structure consistency constraints
to successfully correct errors that violate data structure
consistency properties and continue on to provide successful service
to users. See our
ISSTA 2006 paper on this topic for more information.
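A minimal, hypothetical sketch of the enforcement step: an invariant observed to hold in error-free runs (here, that an index stays below a buffer length) is re-established by a patch that changes the offending value before it is used, allowing execution to continue. The invariant, program point, and patch are illustrative; ClearView itself works on compiled binaries and evaluates candidate patches by observing subsequent executions.

```c
/*
 * Hypothetical invariant-enforcement sketch in the ClearView style.
 * During error-free runs the invariant "index < BUFFER_LEN" was always
 * observed at this program point; after a crash that violates it, a
 * patch enforces the invariant by changing the value before use.
 */
#include <stdio.h>

#define BUFFER_LEN 16
static int buffer[BUFFER_LEN];

/* patch installed after the learned invariant was seen violated */
static int enforce_index_invariant(int index) {
    if (index >= BUFFER_LEN) {
        fprintf(stderr, "patch: enforcing learned invariant, index %d -> %d\n",
                index, BUFFER_LEN - 1);
        index = BUFFER_LEN - 1;              /* change state to re-establish the invariant */
    }
    return index;
}

int main(void) {
    int index = 1000;                        /* corrupted value that previously caused a crash */
    index = enforce_index_invariant(index);
    buffer[index] = 42;                      /* execution continues instead of crashing */
    printf("stored at %d\n", index);
    return 0;
}
```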
- Failure-Oblivious Computing: Out of bounds accesses
can cause fatal memory corruption or cause the program to
throw exceptions that terminate the computation. Failure-oblivious
computing is a technique for enabling the
program to survive otherwise fatal out of bounds memory accesses.
This technique performs bounds checks on all memory accesses.
It discards out of bounds writes and manufactures values to
return as the result of out of bounds reads. Our results show
that this technique enables a set of open-source server programs
(Pine, Apache, Sendmail, Midnight Commander, and Mutt)
to survive otherwise fatal buffer overflow attacks.
See our OSDI 2004 paper on this topic for more information.
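A minimal sketch of the checked-access behavior, assuming a single fixed-size array: out of bounds writes are discarded and out of bounds reads return a manufactured value. The real technique is applied by the compiler to all memory accesses and uses a value-manufacturing policy designed to help the program continue.

```c
/*
 * Hypothetical failure-oblivious access sketch (illustrative, not the
 * compiler-inserted code).  Every access is bounds checked; out of
 * bounds writes are discarded and out of bounds reads return a
 * manufactured value so the program can keep executing.
 */
#include <stdio.h>

#define LEN 8
static int data[LEN];

void fo_write(int i, int v) {
    if (i >= 0 && i < LEN) data[i] = v;       /* in bounds: normal write */
    /* out of bounds: silently discard the write */
}

int fo_read(int i) {
    if (i >= 0 && i < LEN) return data[i];    /* in bounds: normal read */
    return 0;                                  /* out of bounds: manufacture a value */
}

int main(void) {
    fo_write(1000, 7);                         /* would corrupt memory; discarded instead */
    printf("out-of-bounds read yields %d\n", fo_read(1000));
    return 0;
}
```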
- Boundless Memory Blocks: Instead of discarding out of
bounds writes, boundless memory blocks store out of bounds
writes in a hash table, enabling the system to return the
stored values on corresponding out of bounds reads. Conceptually,
this technique gives each allocated block of memory unbounded
size. Our results show that this technique enables a set
of server programs to survive otherwise fatal buffer overflow
attacks. See our ACSAC 2004 paper on this topic for more information.
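The sketch below illustrates the idea with a single array and a small hash table keyed by the out of bounds offset; a real implementation tracks per-block offsets, handles collisions, and bounds the table size. All names and sizes are assumptions.

```c
/*
 * Hypothetical boundless-memory-block sketch.  Out of bounds writes
 * are stored in a hash table; a later out of bounds read of the same
 * location returns the stored value, so the block behaves as if it
 * had unbounded size.
 */
#include <stdio.h>

#define LEN 8
#define TABLE 64
static int data[LEN];
static struct { long off; int val; int used; } overflow[TABLE];

static int slot(long off) { return (int)(off % TABLE); }

void bmb_write(long i, int v) {
    if (i >= 0 && i < LEN) { data[i] = v; return; }
    int s = slot(i);                /* single-slot hashing; a real table handles collisions */
    overflow[s].off = i; overflow[s].val = v; overflow[s].used = 1;
}

int bmb_read(long i) {
    if (i >= 0 && i < LEN) return data[i];
    int s = slot(i);
    return (overflow[s].used && overflow[s].off == i) ? overflow[s].val : 0;
}

int main(void) {
    bmb_write(1000, 7);                        /* out of bounds write, stored in the table */
    printf("read back %d\n", bmb_read(1000));  /* returns 7 */
    return 0;
}
```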
- Cyclic Memory Allocation: Memory leaks can cause a system
to exhaust its address space and fail. Cyclic memory allocation
eliminates memory leaks by allocating a single fixed-size buffer
for a given allocation site, then cyclically allocating objects
from that site in that buffer. Our results show that this technique
can eliminate remotely-exploitable errors that otherwise cause the
system to leak memory.
See our ISMM 2007 paper on this topic for more information.
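A minimal sketch for one allocation site, assuming the site is given a fixed buffer of k slots: allocations cycle through the slots, so unfreed objects cannot accumulate without bound; the accuracy risk is that a still-live object may be overwritten. The slot count and object type are illustrative.

```c
/*
 * Hypothetical cyclic-allocation sketch for one allocation site.  The
 * site gets a fixed buffer of SLOTS objects and hands them out
 * cyclically, so objects the program never frees cannot exhaust the
 * address space.
 */
#include <stdio.h>

#define SLOTS 4
struct request { int id; char body[60]; };

static struct request pool[SLOTS];
static int next_slot = 0;

struct request *alloc_request(void) {          /* replaces malloc at a leaking site */
    struct request *r = &pool[next_slot];
    next_slot = (next_slot + 1) % SLOTS;        /* cycle through the fixed buffer */
    return r;
}

int main(void) {
    for (int i = 0; i < 10; i++) {              /* 10 "leaked" allocations, bounded memory */
        struct request *r = alloc_request();
        r->id = i;
    }
    printf("memory use stays bounded at %zu bytes\n", sizeof pool);
    return 0;
}
```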
- Data Structure Repair:
Software errors and hardware failures can cause data structures in
running programs to violate key data structure consistency properties.
This violation can cause the program to produce unacceptable results
or even fail.
Our IEEE TSE 2006 paper
presents techniques that repair the data structure,
eliminate inconsistencies, and enable the system to continue
to execute successfully. We have used our data structure
repair system to repair inconsistent data structures
and eliminate failures in a variety of applications,
including an air-traffic control
system, an open-source word processing system, and an interactive game.
Earlier versions of this research appeared in
our ICSE 2005 paper,
our ISSRE 2003 paper,
our OOPSLA 2003 paper,
and our SMS 2003 paper.
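The sketch below shows the flavor of repair on a deliberately simple consistency property: a list's stored length must equal the number of reachable nodes, and the repair routine re-establishes the property when it finds a violation. The actual system works from specifications of consistency properties over much richer data structures; this example is only illustrative.

```c
/*
 * Hypothetical data structure repair sketch.  The consistency property
 * is that the list's stored length matches the number of reachable
 * nodes; if an error corrupts the length field, the repair routine
 * re-establishes the property so execution can continue.
 */
#include <stdio.h>

struct node { int v; struct node *next; };
struct list { struct node *head; int length; };

void repair_length(struct list *l) {
    int n = 0;
    for (struct node *p = l->head; p; p = p->next) n++;
    if (l->length != n) {
        fprintf(stderr, "repair: length %d inconsistent, resetting to %d\n", l->length, n);
        l->length = n;                         /* enforce the consistency property */
    }
}

int main(void) {
    struct node c = {3, 0}, b = {2, &c}, a = {1, &b};
    struct list l = { &a, 99 };                /* corrupted length field */
    repair_length(&l);
    printf("list length now %d\n", l.length);
    return 0;
}
```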
My research group has developed two systems for verifying sophisticated
properties of linked data structures.
- The Hob System:
Hob focuses both on internal consistency properties of linked data structures
(such as the consistency of the links in a tree or list) and
consistency properties that involve multiple data structures (for example,
the set of objects in one data structure must be a subset of the set of
objects in another data structure). Hob uses sets of abstract objects
in its specifications. It can therefore model sophisticated typestate
properties of objects as well as important data structure consistency
constraints. One particularly noteworthy aspect of the Hob system is
its modular design: it includes a variety of techniques
that enable modular specification and verification of data
structures, including the use of pluggable, arbitrarily sophisticated
program analysis and verification systems.
One of the most significant
results of the Hob project was that its specification techniques
enable the verification of externally relevant design properties.
The specification for our Hob implementation of the Minesweeper
game, for example, expresses the rules of the game in a way
that is directly relevant to users of the game.
These specifications stand in stark contrast to standard software
specifications, which are largely incomprehensible to users of the
system since they focus on the internal implementation
details of the program rather than properties that are directly
relevant to users.
Patrick Lam's PhD thesis presents a comprehensive overview of the
Hob system. Our AOSD 2005 paper shows
how we use aspect-oriented techniques to enable the modular specification
and verification of programs that manipulate sophisticated linked
data structures. Our Hob IEEE Transactions
on Software Engineering
paper focuses on how Hob enables multiple modular pluggable analyses
to work together to prove sophisticated correctness properties.
See my publications page for more Hob publications.
- The Jahob System:
Jahob is a comprehensive program specification and verification system.
It uses integrated reasoning to solve verification problems -
it combines a broad collection of automated reasoning systems (with varying
performance, reasoning power, and scope characteristics)
that work together within the Jahob framework
to verify the automatically generated verification conditions. Jahob has been
used to verify the full functional correctness of a collection of linked
data structures. By full functional correctness we mean that the system
verifies every property (with the exception of resource consumption properties
such as running time and memory consumption) that the client relies on to
use the data structure correctly.
Our PLDI 2008 paper
describes how we used Jahob to perform the first verification of
full functional correctness for a range of linked data structures, including
mutable lists, trees, graphs, and hash tables.
Our PLDI 2009 paper presents our
integrated proof language, which enables developers to guide the correctness
proofs while staying within the conceptual framework of the underlying
imperative programming language. In this approach, the developer simply
controls key proof choice points, with the powerful Jahob integrated
reasoning system performing the vast majority of the verification.
Our PLDI 2011 paper presents
a technique for proving that operations on linked data structures
commute - a key property for reasoning about many parallel and
concurrent programs.
See my publications page for more Jahob publications,
including a broad range of publications on the logical techniques behind Jahob.
Viktor Kuncak's PhD thesis describes the techniques that form the
foundation of the Jahob system.
Karen Zee's PhD thesis presents our experience using Jahob to verify
the full functional correctness of a range of linked data structures.
Analysis and Optimization of Multithreaded Programs
The key problem with analyzing multithreaded programs is the need to
characterize the effect of all potential interleavings of statements
from parallel threads. My research group pioneered a compositional
approach to this problem in which the algorithm analyzes each thread
once to obtain a result that characterizes all potential
interactions between that thread and other parallel threads.
We have used this approach to develop a variety of analyses for different
kinds of multithreaded programs.
All of these algorithms have been implemented, either in the Stanford SUIF
compiler infrastructure (for C) or the MIT Flex compiler infrastructure
(for Java), and have been used to analyze
and perform a range of optimizations on complete benchmark
programs. See my home page for pointers
to pages that discuss the research in more detail. Here are several
specific analyses:
- The first flow-sensitive pointer analysis algorithm for programs
with structured multithreading. The results have been used by a variety
of research groups to enable optimizations and further analyses.
We have a Web page that
discusses this topic in more depth.
- The first combined pointer and escape analysis for programs with
the unstructured form of multithreading present in Java. See our
PPoPP 2001 paper on this topic for more details.
- The first incrementalized pointer and escape analysis.
Instead of analyzing the whole program, this algorithm builds on our compositional
analysis to analyze only those parts of the program that may deliver useful
optimization opportunities. An
analysis policy monitors the analysis results to direct the
incremental investment of analysis resources to those parts of the
program that offer the highest expected optimization return.
See our PLDI 2001 paper on this
topic for details.
- A new symbolic analysis for characterizing regions of allocated
memory blocks. In addition to loops and array accesses, the analysis
is designed for programs that use pointers, pointer
arithmetic, and recursion. The results have been used for
static race detection for parallel divide and conquer
programs, automatic parallelization of sequential divide and conquer
programs, static detection
of array bounds violations, and elimination of array bounds checks.
The analysis can verify that the program is completely free
of a wide range of memory access anomalies that can cause
serious and difficult-to-diagnose errors.
I have also prepared an invited paper
and an invited talk
for SAS 2001 on the analysis of multithreaded programs.
Analysis and Optimization of Divide and Conquer Programs
Divide and conquer programs present an imposing set of analysis challenges.
The natural formulation of these programs is recursive.
For efficiency reasons, programs often use dynamic memory allocation
to match the sizes of the data structures
to the problem size, and often use
pointers into arrays and pointer arithmetic to identify subproblems.
Traditional analyses from the field of parallelizing compilers
are of little or no
help for this class of programs --- they are
designed to analyze loop nests that access
dense matrices using affine array index expressions, not
recursive procedures that use pointers and pointer arithmetic.
We have developed the following analyses and optimizations:
See my home page for pointers
to pages that discuss this research in more detail.
- A new analysis that symbolically characterizes accessed regions
of memory blocks. Because the analysis computes symbolic bounds
for both pointers and array indices, it can be used to solve a wide
variety of program analysis problems. To avoid problems associated
with the use of traditional fixed-point techniques, the analysis
formulates the problem as a system of symbolic constraints. It then
reduces the system to a linear program; the solution to the linear
program yields a solution to the original symbolic constraint system.
We used this algorithm to solve a range of analysis problems,
including static race detection for parallel divide and conquer
programs, automatic parallelization of sequential divide and conquer
programs, static detection
of array bounds violations, and elimination of array bounds checks.
See our PLDI 2000 paper on this
topic for more details.
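The routine below is the kind of divide and conquer code this analysis targets (the analysis itself is not shown). For a call dc_inc(p, n) the analysis would derive that the accessed region is [p, p + n - 1], so the two recursive calls touch disjoint halves of the array and can safely run in parallel, and no access falls outside the allocated block. The function is a hypothetical example.

```c
/*
 * Illustrative divide-and-conquer routine for symbolic bounds
 * analysis.  The accessed region of dc_inc(p, n) is [p, p + n - 1];
 * the recursive calls access disjoint halves of that region.
 */
#include <stdio.h>

void dc_inc(int *p, int n) {
    if (n <= 1) {                      /* base case accesses [p, p] when n == 1 */
        if (n == 1) p[0] += 1;
        return;
    }
    int half = n / 2;
    dc_inc(p, half);                   /* accesses [p, p + half - 1]        */
    dc_inc(p + half, n - half);        /* accesses [p + half, p + n - 1]    */
}

int main(void) {
    int a[8] = {0};
    dc_inc(a, 8);
    printf("a[0]=%d a[7]=%d\n", a[0], a[7]);
    return 0;
}
```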
- A new algorithm for increasing the granularity of the base case
in divide and conquer programs. This algorithm unrolls recursive calls
to reduce control flow overhead and increase the size of basic blocks
in the computation. The result is a significant increase in performance.
See our LCPC 2000 paper on this topic
for more details.
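A simplified sketch of the effect of the transformation, written by hand for illustration: the transformed version stops dividing once the subproblem is small and handles it directly, reducing call and control flow overhead. The paper's algorithm achieves this by unrolling the recursive calls themselves; the cutoff value here is an assumption.

```c
/*
 * Hypothetical sketch of increasing base-case granularity in a
 * divide-and-conquer computation.  The coarsened version handles
 * small subproblems with a loop instead of recursing all the way
 * down to single elements.
 */
#include <stdio.h>

void dc_sum_orig(int *p, int n, long *out) {       /* original: recurses down to n == 1 */
    if (n == 1) { *out += p[0]; return; }
    dc_sum_orig(p, n / 2, out);
    dc_sum_orig(p + n / 2, n - n / 2, out);
}

void dc_sum_coarse(int *p, int n, long *out) {     /* transformed: coarsened base case */
    if (n <= 32) {                                  /* cutoff value is an assumption */
        for (int i = 0; i < n; i++) *out += p[i];
        return;
    }
    dc_sum_coarse(p, n / 2, out);
    dc_sum_coarse(p + n / 2, n - n / 2, out);
}

int main(void) {
    int a[1024];
    for (int i = 0; i < 1024; i++) a[i] = 1;
    long s1 = 0, s2 = 0;
    dc_sum_orig(a, 1024, &s1);
    dc_sum_coarse(a, 1024, &s2);
    printf("both versions compute %ld and %ld\n", s1, s2);
    return 0;
}
```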
- The first published
analysis for automatically parallelizing divide and conquer programs.
This analysis has since been superseded by the symbolic bounds
analysis listed above. Nevertheless, you can
see our PPoPP 1999 paper on this
topic for more details.
My research group has investigated a variety of ways of eliminating
synchronization overhead in multithreaded and automatically parallelized
programs. These optimizations
eliminate mutual exclusion synchronization, reduce its
frequency, replace it with more efficient optimistic
synchronization primitives, and replicate objects to
eliminate the need for synchronization. Specific techniques include:
- Synchronization Elimination:
We have developed several
combined pointer and escape analyses
that are designed, in part, to find unnecessary synchronization operations.
Our OOPSLA 1999 paper
presents an escape analysis that finds objects that do not escape
their allocating thread. The compiler uses the results of this
analysis to generate customized, synchronization-free versions of
the operations that execute on such objects.
Our PPoPP 2001 paper presents an
algorithm that analyzes interactions between threads to find
and eliminate synchronization operations that are temporally separated by
thread creation actions.
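The sketch below shows the intended effect in hypothetical form: when the analysis proves an object never escapes its allocating thread, the operation can dispatch to a specialized version that omits the lock acquire and release. The counter type and both versions are illustrative; the compiler generates such specialized versions automatically.

```c
/*
 * Hypothetical sketch of the effect of synchronization elimination.
 * When escape analysis proves an object never escapes its allocating
 * thread, a customized, synchronization-free version of the operation
 * can be used for it.
 */
#include <pthread.h>
#include <stdio.h>

struct counter { pthread_mutex_t lock; long value; };

void inc_sync(struct counter *c) {          /* general version: object may be shared */
    pthread_mutex_lock(&c->lock);
    c->value++;
    pthread_mutex_unlock(&c->lock);
}

void inc_nosync(struct counter *c) {        /* specialized version: proven thread-local */
    c->value++;                              /* no synchronization needed */
}

int main(void) {
    struct counter local = { PTHREAD_MUTEX_INITIALIZER, 0 };  /* never escapes main */
    for (int i = 0; i < 1000; i++)
        inc_nosync(&local);                  /* compiler would select this version */
    printf("%ld\n", local.value);
    return 0;
}
```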
- Lock Coarsening:
Lock coarsening reduces the frequency
with which the computation acquires and releases locks.
We have developed two approaches: data lock coarsening,
which associates one lock with multiple objects that tend
to be accessed together, and computation lock coarsening, which
coalesces multiple critical sections that acquire and release the same lock
multiple times into a single critical section that acquires and releases the lock
only once. Our experimental results show that lock coarsening can significantly
reduce the synchronization overhead in programs that use mutual exclusion synchronization.
Our JPDC 1998
paper presents an algorithm that performs both data and
computation lock coarsening for object-based programs.
Our POPL 1997 paper presents
a set of general transformations that move lock constructs
both within and between procedures to coarsen the lock granularity.
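A minimal before/after sketch of computation lock coarsening, with hypothetical code: two adjacent critical sections on the same lock are merged into one, so the lock is acquired and released once instead of twice, at the cost of holding it longer.

```c
/*
 * Hypothetical computation lock coarsening sketch: two adjacent
 * critical sections on the same lock are merged into a single
 * critical section that acquires and releases the lock only once.
 */
#include <pthread.h>

static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static long a, b;

void before(void) {                 /* original code: two acquire/release pairs */
    pthread_mutex_lock(&m);   a++;  pthread_mutex_unlock(&m);
    pthread_mutex_lock(&m);   b++;  pthread_mutex_unlock(&m);
}

void after(void) {                  /* coarsened code: one acquire/release pair */
    pthread_mutex_lock(&m);
    a++;
    b++;
    pthread_mutex_unlock(&m);
}

int main(void) { before(); after(); return 0; }
```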
- Dynamic Feedback:
There is a trade-off associated with lock coarsening:
increasing the granularity reduces the synchronization overhead,
but also increases the size of the critical regions, which may
reduce the amount of available concurrency. We have developed
an algorithm that generates several different versions of the
program; the versions differ in how aggressively they apply
the lock coarsening transformation. When the program runs, the
generated code samples the performance of each version, then
uses the version with the best performance. The code periodically
resamples to adapt to changes in the best version.
Our TOCS 1999 paper presents the
algorithm and experimental results. We also published a
PLDI 1997 paper on this topic.
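The sketch below shows the execution model in miniature, with two stand-in versions of a computation: a sampling phase times each version, a production phase runs the faster one for a while, and the cycle repeats so the choice can adapt. The version bodies, epoch length, and sampling policy are assumptions.

```c
/*
 * Hypothetical dynamic feedback sketch.  A sampling phase measures
 * each generated version; a production phase uses the best one, then
 * the program periodically resamples to adapt.
 */
#include <stdio.h>
#include <time.h>

static void version_a(void) { for (volatile long i = 0; i < 200000; i++); }
static void version_b(void) { for (volatile long i = 0; i < 400000; i++); }

static double time_it(void (*f)(void)) {
    clock_t t0 = clock();
    f();
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}

int main(void) {
    for (int epoch = 0; epoch < 3; epoch++) {
        /* sampling phase: measure each version once */
        void (*best)(void) =
            time_it(version_a) <= time_it(version_b) ? version_a : version_b;
        /* production phase: use the best version for a while, then resample */
        for (int i = 0; i < 100; i++) best();
        printf("epoch %d: used %s\n", epoch, best == version_a ? "version_a" : "version_b");
    }
    return 0;
}
```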
- Optimistic Synchronization:
Optimistic synchronization primitives
such as load linked/store conditional provide an attractive
alternative to mutual exclusion locks. Our
PPoPP 1997 paper
presents our experience using optimistic
synchronization to implement atomic operations
in automatically parallelized
object-based programs. Our results show that
using optimistic synchronization can improve the performance
and significantly reduce the amount of memory
required to execute the computation.
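A minimal sketch of an optimistic atomic update using a C11 compare-and-swap loop as an analogue of load linked/store conditional: the new value is computed from a privately loaded copy and installed only if the location has not changed in the meantime, otherwise the operation retries. The C11 atomics here are a stand-in for the primitives discussed in the paper.

```c
/*
 * Hypothetical optimistic synchronization sketch using a C11
 * compare-and-swap loop.  No mutual exclusion lock is involved; the
 * update simply retries if another thread changed the location.
 */
#include <stdatomic.h>
#include <stdio.h>

static _Atomic long total;

void atomic_add(long delta) {
    long old = atomic_load(&total);
    /* retry until the optimistic update succeeds */
    while (!atomic_compare_exchange_weak(&total, &old, old + delta)) {
        /* old now holds the current value; recompute and retry */
    }
}

int main(void) {
    atomic_add(5);
    atomic_add(7);
    printf("%ld\n", atomic_load(&total));
    return 0;
}
```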
- Adaptive Replication:
In many programs it is possible to eliminate synchronization
bottlenecks by replicating frequently accessed objects that cause
these bottlenecks. Each thread that updates such an
object creates its own local replica and performs all updates locally
on that replica with no synchronization and no interaction with other threads.
When the computation needs to access the final value of the object, it combines
the values stored in the replicas to generate the final value.
A key question is determining which objects to replicate -
replicating all objects can cause significant and unnecessary memory
and computation overheads. We have developed a technique that
dynamically measures the contention at objects to replicate only
those objects that would otherwise cause synchronization bottlenecks.
Our ICS 1999 paper
presents experimental results from an implementation of this technique.
The results show that,
for our set of benchmark programs, adaptive replication
can improve the overall performance by up to a factor of
three over versions that use no replication, and can
reduce the memory consumption by up to a factor of four over
versions that fully replicate updated objects. Furthermore,
adaptive replication never significantly
harms the performance and never increases the memory usage
without a corresponding increase in the performance.
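The sketch below shows the replication part of the idea (the adaptive, contention-measuring part is omitted): each thread updates its own replica of a shared accumulator without synchronization, and the replicas are combined after the threads finish. Thread count and workload are illustrative.

```c
/*
 * Hypothetical replication sketch (the adaptive, contention-driven
 * part is omitted).  Each thread updates its own replica with no
 * synchronization; the replicas are combined to produce the final
 * value.
 */
#include <pthread.h>
#include <stdio.h>

#define THREADS 4
static long replica[THREADS];               /* one replica of the shared accumulator per thread */

static void *worker(void *arg) {
    int id = (int)(long)arg;
    for (int i = 0; i < 100000; i++)
        replica[id] += 1;                   /* local update: no lock, no contention */
    return NULL;
}

int main(void) {
    pthread_t t[THREADS];
    for (long i = 0; i < THREADS; i++) pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < THREADS; i++) pthread_join(t[i], NULL);

    long total = 0;
    for (int i = 0; i < THREADS; i++) total += replica[i];   /* combine replicas */
    printf("total = %ld\n", total);
    return 0;
}
```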
- Data Race Freedom:
A multithreaded program has a data race if two threads
concurrently access the same location without synchronization,
and at least one of the accesses is a write. Data races
typically reflect the failure of the programmer to correctly
synchronize an atomic operation, and are viewed as one of the
primary complications associated with developing multithreaded programs.
We have recently developed an extended type system for Java;
this system guarantees that any well-typed program is free
of data races. It enables the programmer to specify the
protection mechanism for each object as part of the object's type.
The type can specify either the mutual exclusion
lock that protects the object from unsynchronized concurrent accesses, or
that threads can safely access the object without synchronization because
either 1) the object is immutable, 2) the object is accessible to only the
current thread and is not shared between threads, or 3) the
current thread holds the unique
reference to the object. The type checker uses the type
declarations to verify that the program uses each object only in
accordance with its declared protection mechanisms.
The result is that programmers can aggressively remove superfluous
synchronization without introducing data races.
See our OOPSLA 2001 paper on this
topic for more details.
Traditional parallelizing compilers use data dependence analysis
to parallelize loop nests that access dense matrices using affine
access functions. While this approach works well in its target
domain, it is not designed for irregular, object-based programs
whose parallel tasks perform many different operations on dynamically allocated objects.
We have developed a new analysis approach, commutativity
analysis, that exploits the structure of the object-oriented
paradigm to view the computation as consisting of
operations on objects.
It then analyzes the computation at this
granularity to determine if operations commute (i.e. generate the
same result regardless of the order in which they execute).
If all of the operations in a given computation commute, the compiler
can automatically generate parallel code.
We have used this technique to successfully parallelize a variety
of complete application programs, with the automatically
parallelized versions running between 14 and 19 times faster
(on a 24 processor machine) than the original sequential versions.
See our TOPLAS 1997 paper on this
topic for more details.
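The hypothetical example below shows the kind of operations commutativity analysis is designed to recognize: two insertions into a histogram object update it only by adding to counters, so applying them in either order produces the same final state, and the enclosing computation can execute them in parallel (with appropriately atomic individual updates).

```c
/*
 * Hypothetical example of commuting operations.  Applying the two
 * histogram insertions in either order leaves the object in the same
 * final state, which is the property commutativity analysis checks.
 */
#include <stdio.h>

struct histogram { long bins[4]; };

void histogram_insert(struct histogram *h, int bin) {
    h->bins[bin] += 1;                 /* commutative update: order does not matter */
}

int main(void) {
    struct histogram h1 = {{0}}, h2 = {{0}};

    histogram_insert(&h1, 2); histogram_insert(&h1, 3);   /* one order   */
    histogram_insert(&h2, 3); histogram_insert(&h2, 2);   /* other order */

    printf("same final state: %d\n",
           h1.bins[2] == h2.bins[2] && h1.bins[3] == h2.bins[3]);
    return 0;
}
```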
I am currently exploring ways to relate the referencing properties of objects
to the changing roles that they play in the computation. Conceptually, each
object's role depends, in part, on the data structures in which it
participates. As the object moves between data structures, its role often changes
to reflect its new relationships. We are developing dynamic analyses to
discover these roles, and static analyses that verify that the program
uses each object correctly in its current role. We expect to use the
results to enhance the programmer's understanding of large programs
and to optimize various aspects of distributed programs such as communication
and failure propagation. One early result is a new type system for ensuring
that multithreaded programs are free of data races. This system allows the
programmer to identify how threads obtain and relinquish control of
different kinds of objects, ensuring that the program observes the
correct acquisition protocol for each object.
See our OOPSLA 2001 paper on this
topic for more details.
In collaboration with Daniel Jackson, I am investigating ways to
check that a program correctly implements high-level design
properties expressed using object models. My contribution focuses
on new, very precise, program analyses that leverage the design information to
focus the analysis on properties of interest to the programmer. The National
Science Foundation is currently supporting this research through an ITR grant.
Finally, I have a range of Java analysis and optimization projects underway.
These include the exploration of techniques for implementing Real-Time Java,
space optimizations for Java programs, techniques that leverage program
analysis information to provide very fast atomic transactions, and a range
of memory system optimizations such as region-based allocation, elimination
of write barriers, and
(leveraging program analysis information about object lifetimes) explicit object deallocation.
I have also done research in
hardware synthesis and
credible compilation.