Motivation
The relentless quest for higher performance has made current
computer systems exceedingly complex. This complexity lies in both hardware and
software systems, making it difficult, if not impossible, to further improve
total system performance. Furthermore, overall complexity is driving up design,
development, implementation, and testing costs while decreasing system
reliability. The primary motivation of my research is to discover novel
approaches to improve performance and increase the capabilities of modern
computer systems without unduly increasing the complexity faced by application
developers, compiler writers, or computer architects.
My approach to research is first to understand and predict the changes in user requirements and underlying technology, extrapolate their stresses on the current evolutionary path, identify possible paradigm shifts that can drastically simplify, improve and expand the system, and test these hypotheses by developing a working prototype within the current constraints. I also believe that building real systems is critical in conducting credible systems research. In many instances, the critical flaws of a paper design or a contrived implementation are only revealed under the stresses of the real world. Thus, from the Raw microprocessor, which has working silicon, to the DynamoRIO dynamic code modification system, which seamlessly works on large applications such as Microsoft IIS and Word, I strive to build systems that are as close to the real world as possible. This approach has grounded our work in reality, provided us with many critical insights and produced many important and high-impact results.
A
Language and Compiler for the Streaming Application Domain:
Although streaming data is the dominant data type and more than
90% of processor cycles today are used in handling streaming data, there is no
established programming language or compiler support for streams. We are
developing StreamIt, a high-level stream language and an optimizing compiler
system. The Streaming language introduces some novel features. StreamIt is the
first Synchronous Data-Flow (SDF) language to support structured streams,
leading to hierarchical and parameterized construction of stream graphs.
StreamIt also introduces a message construct for precise event handling as well
as a formal notion of time for distributed stream programs [CC
2002].
By implementing a Stream-aware compiler for the Raw architecture, we have
demonstrated the ability to automatically partition, lay out and schedule an
architecture-independent stream graph onto a communication exposed architecture.
Building the StreamIt compiler required us to innovate many new analyses and
transformations. For example, the StreamIt language provides a peek construct so
that the programmers can expose the reuse of data and transfer the burden of
buffer management to the compiler. Peeking is a useful and powerful feature.
However, it introduces new circular buffers and startup phases that complicate
the compilation process; StreamIt is the first compiler to optimize and schedule
a stream graph that includes peeking. Our optimizations include filter fusion
and filter fission, which are critical for load balancing and are robust in the
presence of peeking [ASPLOS
2002]. We have developed phased scheduling, a stream scheduling
algorithm that not only accommodates peeking but also provides a flexible
tradeoff between code size and buffer size. Another unique feature of phased
scheduling is that it enables separate compilation of hierarchical graphs [LCTES
2003].
We believe that the stream programming paradigm will be at the heart of the most
influential movement in programming since object oriented programming.
Furthermore, StreamIt represents the first step towards establishing a portable
programming paradigm for the emerging class of communication exposed
architectures.
[GOMACTech
2003, SIGARCH
news 2002]
Domain
Specific Optimization of Stream Programs:
Although
automatic optimization of DSP algorithms has been done in the context of matrix
descriptors, we are the first to show that a high-level linear representation
can be extracted from general-purpose code, and applied for global optimization
of stream programs. We developed the first compiler that automatically analyzes
the profitability and transforms a program from the time domain to the frequency
domain. Our compiler is able to obtain an average of 400% performance
improvement on a set of DSP benchmarks. These domain specific optimizations
perform the tasks of an expert DSP programmer and can drastically reduce the
complexity faced by application programmers. [PLDI
2003a]
Modifying
Code Dynamically
for
Introspection, Optimization and Security:
In the new world of software, which heavily utilizes dynamic class loading, DLLs
and interconnected components, the power and reach of static analysis is
diminishing. An exciting new paradigm of dynamic program optimization, improving
the performance of a program while it is being executed, is emerging. DynamoRIO,
a joint development between HP Labs and my group, is a powerful dynamic code
modification infrastructure capable of running existing binaries such as
Microsoft Office Suite. It runs on both Windows and Linux environments. This
project was the first to demonstrate that creating and managing software traces
and executing the entire application using the traces is efficient, practical
and universally applicable.
[CGO
2003, FDDO
2001]
We are offering a free release of DynamoRIO for
non-commercial use. ( http://cag.lcs.mit.edu/dynamorio
) So far more than 300 researchers have downloaded the system and are using it
for host of projects in program analyses, optimization, introspection,
simulation and security.
Secure
Execution via Program Shepherding:
In the Program Shepherding project, we developed a unique and
effective approach to securing computer systems. The greatest threat to our
modern information infrastructure is the remote exploitation of program
vulnerabilities. The most serious and prevalent attacks - 31 out of 34 CERT
advisories last year and the SQL slammer worm this year - gained unauthorized
access to a computer system by exploiting a bug in a program that allowed
insertion and execution of malicious code. To date there are no complete
solutions to this problem other than reactively fixing after each and every
attack or attempting to eliminate all the bugs from software.
In Program Shepherding we take a radically different approach that restricts the
damage that can be done in an exploitation of a program vulnerability by
monitoring all control flow in order to identify and prevent execution of
malicious code. We have implemented Program Shepherding within the DynamoRIO system. The
caching of code in DynamoRIO is the key to efficient, secure execution, because
it allows many security checks to be performed only once, when the code is
copied to the cache. We have shown that Program Shepherding is capable of
efficient secure execution of any application, without modifications to the
existing binary, the operating system or the hardware. More importantly, unlike
all previously known techniques, Program Shepherding is able to stop all remote execution
attacks identified by CERT to date as well as many other possible attack vectors
[Usenix
Security 2002].
Our solution is simple, effective, practical and universal. It has the potential
to revolutionize how computers are secured against attacks. Program Shepherding
is currently being commercialized.
Optimizing
the Compiler using Machine Learning:
In the Meta Optimization project we pioneered the use of machine learning
to optimize the compiler itself. Compilers are typically crafted with many
heuristics to solve NP-hard problems approximately and efficiently. Finding a
heuristic that performs well on a broad range of applications is a tedious and
difficult process, one of the most complex tasks faced by the compiler writer.
Meta Optimization uses machine-learning techniques for automatically generate
compiler heuristics. For example, we tested the efficacy of Meta Optimization by
'evolving' the hyperblock selection optimization. Evolving for a particular
benchmark, our system builds an application-specific compiler that achieves
average speedups of 23% (up to 43%). More importantly, by evolving a compiler's
heuristic over several benchmarks, we have shown that it is possible to create
an optimization that is effective and general-purpose. For the above
optimization, the performance of a completely unrelated set of programs improved
by 9%
[EuroGP
2003, PLDI
2003b]
We have also shown that this technique can completely eliminate the human from
heuristic design. We show that for priority-based register allocation, a
heuristic that has been tuned for over 20 years with dozens of publications
introducing small variations, we can toss out the state-of-the-art heuristic,
start with a set of random priority functions and by evolution arrive at a
solution as good as the human-created solution.
Unifying
Parallelism and Storage Optimization:
After more than three decades of intense research, one of the most important
unsolved problems in compilers is how to automatically find an optimal mapping
of a sequential program onto a parallel architecture.
Though there are many heuristic algorithms in practical systems and
partial or suboptimal solutions in the literature, a theoretical framework that
can fully describe the entire problem and find the optimal solution is lacking.
The difficulty stems from the fact that multiple inter-related costs and
constraints must be considered simultaneously to obtain an efficient executable.
In particular, with the widening gap between clock speed and memory
latency, and with modern memory systems becoming increasingly hierarchical, the
amount of storage space required by a program can have a drastic effect on its
performance. Today this task is left to the programmer, who has to understand
both the application and the memory system behavior and experiment with many
different configurations of the application before finding a suitable solution.
We have developed the first unified mathematical framework for
representing the tradeoffs between parallelism and storage.
Our technique combines affine scheduling with occupancy vector analysis
and incorporates general affine dependences across statements and loop nests.
Using this framework, we have shown how to find a good storage mapping
for a given schedule, a good schedule for a given storage mapping, and a good
storage mapping that is valid for all legal schedules.
Our method is currently the most powerful technique available for
considering storage optimization across a range of schedules.
This flexibility will be the cornerstone of any technique that
simultaneously optimizes for parallelism and storage space. [PLDI
2001]
A
Simple Scheduling Methodology for Complex Architectures:
Instruction scheduling on microprocessors is becoming an extraordinarily
difficult problem. In almost all
practical situations it is NP hard and often involves multiple intertwined
constraints. Traditional scheduling frameworks handle conflicting goals and
heuristics in an ad hoc manner. They either address the constraints one at a
time in a sequence of phases, which introduces phase ordering problems, or use a
unified algorithm that can only handle a subset of the constraints. Thus, the
current state-of-the-art instruction schedulers are exceedingly complex while
code generated for many programs is clearly inferior. Convergent scheduling is a
new approach where a general scheduling framework, based of scheduling
preferences instead of hard decisions, makes it simple to specify many arbitrary
constraints and scheduling heuristics. Our critical insight is that the
interaction of these simple passes, each modifying the scheduling preferences of
the instructions to suite its own constraint, converges on a schedule that
satisfies most of the important constraints for that problem. For example,
Convergent scheduling was able to obtain an average performance improvement of
21% over the best known algorithm implemented on the Raw processor [Micro
2002]
Optimizing
for Multimedia Instructions:
In the Smash project, we developed compiler technology that takes advantage of
wider processor data paths, which are now prevalent in modern microprocessors.
In the Smash project, we debunked the conventional wisdom in using vectorization
for compiling to short SIMD parallelism by introducing Superword Level
Parallelism (SLP). The SLP paradigm is particularly well suited for compiling to
multimedia instruction sets, and is fundamentally different from loop-level
vector parallelism developed for vector supercomputers.
SLP leads to simple, robust, and general compiler techniques that can
greatly improve program performance. In fact, our initial experiments show that
SLP compilation provides performance improvements that average 84%, and range as
high as 253%. Currently the SLP techniques are used in industry such as in the
IBM blue gene compiler. [PLDI
2000a, PACT
2002]
In the Bitwise project we developed compiler techniques that minimize the number
of bits used to represent each operand in a program. Bitwise frees the
programmer from declaring bitwidth invariants. There are many uses of bitwidth
information. It increases the
opportunities for superword level parallelism. It provides energy savings in
power-exposed processor architectures. It
is extremely useful when compiling to custom hardware. For example, by taking
advantage of bitwidth information during architectural synthesis, we were able
to reduce the silicon real estate by 15% - 86%, improve clock speed by 3% -
249%, and reduce power by 46% - 73%. We have found a rich opportunity for
bitwidth reduction in modern multimedia and streaming application workloads.
This successful demonstration will have a great impact in next-generation
reconfigurable as well as embedded architectures. [PLDI
2000b]
Tolerating
Wire-Delay in Microprocessors:
Six years ago, Prof. Anant Agarwal and I started the Raw project to
change the design philosophy of modern microprocessors fundamentally. We
predicted that, in the era of billion-transistor chips running at
multi-gigahertz clock speeds, the speed of light will make it impossible to
build a single monolithic device by scaling the current superscalar
architectures. Thus we designed the Raw processor with the memory distributed
and control decentralized. Since there are no centralized hardware units making
execution decisions, most of it has to be done by the compiler. We developed
novel compiler technology, aware of the spatial distribution of resources
required in mapping applications to such a processor. We have a working compiler
that is quite successful at achieving scalable performance on Raw for a
collection of benchmark applications. We have shown that instruction level
parallelism (ILP) can be successfully obtained by statically orchestrating the
execution across multiple tiles at compile-time. We have also shown that
compile-time analysis can effectively utilize a distributed memory system. By
fabricating the Raw processor we demonstrated that a university can build a
complex processor. So far we haven't discovered a single defect or bug with the
chip, which is currently operating at 430 MHz, exceeding the design
specifications.
[ISSCC 2003, HPCA
2003, IEEE
Micro 2002, IEEE
TOC 2001, ISCA
1999, ASPLOS
1998, IEEE
Computer 1997]
Appropriate
Information Technologies for the Developing World:
We at MIT are at the forefront of the information revolution. These
innovations were driven by the problems faced by us, which where mostly very
specific to the developed world. Thus, the solutions developed to date, while
sometimes applicable, have not adapted well to the developing world environment.
One of my deepest desires is to address fundamental problems facing the
developing world as they seek to participate in the information economy. To that
end we have developed the TEK email-based search engine designed to deliver
information to low-connectivity communities . Currently, we are beta testing
this system in a few places including in Solomon Islands over low-bandwidth
radio-based e-mail links and the Media Lab's DakNet project which delivers
e-mail using transponders attached to vehicles. Recently, TEK has been receiving
a lot of public attention, including on the BBC world service, Voice of America,
and local newspapers and radio stations in a few developing nations. We are
excited about this opportunity to provide a valuable service to these
underserved communities. [ISTAS 2002,
WWW 2002]