The relentless quest for higher performance has made current computer systems exceedingly complex. This complexity lies in both hardware and software systems, making it difficult, if not impossible, to further improve total system performance. Furthermore, overall complexity is driving up design, development, implementation, and testing costs while decreasing system reliability. The primary motivation of my research is to discover novel approaches to improve performance and increase the capabilities of modern computer systems without unduly increasing the complexity faced by application developers, compiler writers, or computer architects.
My approach to research is first to understand and predict changes in user requirements and underlying technology, extrapolate the stresses they place on the current evolutionary path, identify possible paradigm shifts that can drastically simplify, improve, and expand the system, and then test these hypotheses by developing a working prototype within current constraints. I also believe that building real systems is critical to conducting credible systems research. In many instances, the critical flaws of a paper design or a contrived implementation are revealed only under the stresses of the real world. Thus, from the Raw microprocessor, which has working silicon, to the DynamoRIO dynamic code modification system, which works seamlessly on large applications such as Microsoft IIS and Word, I strive to build systems that are as close to the real world as possible. This approach has grounded our work in reality, provided us with many critical insights, and produced many important, high-impact results.
Language and Compiler for the Streaming Application Domain:
Although streaming data is the dominant data type and more than 90% of processor cycles today are used in handling streaming data, there is no established programming language or compiler support for streams. We are developing StreamIt, a high-level stream language and an optimizing compiler system. The StreamIt language introduces several novel features. StreamIt is the first Synchronous Data-Flow (SDF) language to support structured streams, leading to hierarchical and parameterized construction of stream graphs. StreamIt also introduces a message construct for precise event handling as well as a formal notion of time for distributed stream programs [CC 2002].
By implementing a stream-aware compiler for the Raw architecture, we have demonstrated the ability to automatically partition, lay out, and schedule an architecture-independent stream graph onto a communication-exposed architecture. Building the StreamIt compiler required us to develop many new analyses and transformations. For example, the StreamIt language provides a peek construct so that programmers can expose the reuse of data and transfer the burden of buffer management to the compiler. Peeking is a useful and powerful feature, but it introduces new circular buffers and startup phases that complicate the compilation process; ours is the first compiler to optimize and schedule a stream graph that includes peeking. Our optimizations include filter fusion and filter fission, which are critical for load balancing and are robust in the presence of peeking [ASPLOS 2002]. We have developed phased scheduling, a stream scheduling algorithm that not only accommodates peeking but also provides a flexible tradeoff between code size and buffer size. Another unique feature of phased scheduling is that it enables separate compilation of hierarchical graphs [LCTES 2003].
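The buffering burden that peeking shifts to the compiler can be illustrated with a small Python sketch (a hypothetical model, not StreamIt syntax): a filter that peeks at a window of three items but consumes only one per firing forces the runtime to keep a circular buffer of live items.

```python
from collections import deque

def moving_average_filter(stream, window=3):
    """Toy model of a StreamIt-style filter that peeks at `window`
    items but pops only one per firing; the deque plays the role of
    the circular buffer the compiler must manage for the programmer."""
    buf = deque(maxlen=window)   # window-1 items stay live across firings
    out = []
    for item in stream:
        buf.append(item)
        if len(buf) == window:   # startup phase: wait until enough data is peeked
            out.append(sum(buf) / window)
    return out
```

For the input `[1, 2, 3, 4, 5]` this produces `[2.0, 3.0, 4.0]`; note the two-item startup phase before the first firing, which is exactly the kind of schedule irregularity that peeking introduces.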
We believe that the stream programming paradigm will be at the heart of the most influential movement in programming since object-oriented programming. Furthermore, StreamIt represents a first step towards establishing a portable programming paradigm for the emerging class of communication-exposed architectures. [GOMACTech 2003, SIGARCH News 2002]
Domain-Specific Optimization of Stream Programs:
Although automatic optimization of DSP algorithms has been done in the context of matrix descriptors, we are the first to show that a high-level linear representation can be extracted from general-purpose code and applied to the global optimization of stream programs. We developed the first compiler that automatically analyzes when it is profitable to transform a program from the time domain to the frequency domain and then performs the transformation. Our compiler obtains an average of 400% performance improvement on a set of DSP benchmarks. These domain-specific optimizations perform the tasks of an expert DSP programmer and can drastically reduce the complexity faced by application programmers. [PLDI 2003a]
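The power of a linear representation can be conveyed with a toy Python example (names and filters invented for illustration): once two pipelined filters are recognized as linear, meaning each output is a weighted sum of the inputs, the pair collapses into a single filter by multiplying their matrices, one of the global optimizations this representation enables.

```python
def matmul(A, B):
    """Multiply two small matrices represented as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def apply_filter(M, x):
    """A linear filter: output = M @ input."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

# Two cascaded linear filters: scale both channels by 2, then swap them.
scale = [[2, 0], [0, 2]]
swap = [[0, 1], [1, 0]]
combined = matmul(swap, scale)   # one filter replacing the whole pipeline
```

For the input `[3, 4]`, both the two-stage pipeline and the combined filter produce `[8, 6]`, but the combined version does half the work per item.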
Introspection, Optimization and Security:
In the new world of software, which heavily utilizes dynamic class loading, DLLs and interconnected components, the power and reach of static analysis is diminishing. An exciting new paradigm of dynamic program optimization, improving the performance of a program while it is being executed, is emerging. DynamoRIO, a joint development between HP Labs and my group, is a powerful dynamic code modification infrastructure capable of running existing binaries such as Microsoft Office Suite. It runs on both Windows and Linux environments. This project was the first to demonstrate that creating and managing software traces and executing the entire application using the traces is efficient, practical and universally applicable. [CGO 2003, FDDO 2001]
We are offering a free release of DynamoRIO for non-commercial use (http://cag.lcs.mit.edu/dynamorio). So far more than 300 researchers have downloaded the system and are using it for a host of projects in program analysis, optimization, introspection, simulation, and security.
Secure Execution via Program Shepherding:
In the Program Shepherding project, we developed a unique and effective approach to securing computer systems. The greatest threat to our modern information infrastructure is the remote exploitation of program vulnerabilities. The most serious and prevalent attacks - 31 out of 34 CERT advisories last year and the SQL Slammer worm this year - gained unauthorized access to a computer system by exploiting a bug in a program that allowed the insertion and execution of malicious code. To date, there are no complete solutions to this problem other than reactively patching after each and every attack or attempting to eliminate all bugs from software.
In Program Shepherding we take a radically different approach that restricts the damage that can be done when a program vulnerability is exploited, by monitoring all control flow in order to identify and prevent execution of malicious code. We have implemented Program Shepherding within the DynamoRIO system. The caching of code in DynamoRIO is the key to efficient, secure execution, because it allows many security checks to be performed only once, when the code is copied to the cache. We have shown that Program Shepherding is capable of efficient, secure execution of any application, without modifications to the existing binary, the operating system, or the hardware. More importantly, unlike all previously known techniques, Program Shepherding is able to stop all remote execution attacks identified by CERT to date, as well as many other possible attack vectors [Usenix Security 2002].
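The check-once-at-copy-time idea can be sketched in plain Python (the policy, block identifiers, and counter are all invented; real blocks are machine code, not callables): a block is validated only when it first enters the cache, and every later execution skips the check.

```python
# Toy model of check-once security via a code cache: each "basic block"
# (here just a Python callable) is validated when first copied into the
# cache; subsequent executions are served from the cache with no re-check.
ALLOWED_ORIGINS = {"disk"}    # hypothetical policy: only code loaded from disk
code_cache = {}
checks_performed = 0

def execute(block_id, block_fn, origin):
    global checks_performed
    if block_id not in code_cache:
        checks_performed += 1
        if origin not in ALLOWED_ORIGINS:   # e.g. injected code on the stack
            raise PermissionError(f"blocked execution of {block_id}")
        code_cache[block_id] = block_fn      # check paid once, at copy time
    return code_cache[block_id]()

execute("b1", lambda: 1 + 1, "disk")
execute("b1", lambda: 1 + 1, "disk")         # cache hit: no second check
```

After the two calls above, only one security check has run, which is why amortizing checks over cached code keeps the overhead low.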
Our solution is simple, effective, practical and universal. It has the potential to revolutionize how computers are secured against attacks. Program Shepherding is currently being commercialized.
Optimizing the Compiler using Machine Learning:
In the Meta Optimization project we pioneered the use of machine learning to optimize the compiler itself. Compilers are typically crafted with many heuristics that solve NP-hard problems approximately and efficiently. Finding a heuristic that performs well on a broad range of applications is a tedious and difficult process, one of the most complex tasks faced by the compiler writer. Meta Optimization uses machine-learning techniques to automatically generate compiler heuristics. For example, we tested the efficacy of Meta Optimization by 'evolving' the hyperblock selection optimization. Evolving for a particular benchmark, our system builds an application-specific compiler that achieves average speedups of 23% (up to 43%). More importantly, by evolving a compiler's heuristic over several benchmarks, we have shown that it is possible to create an optimization that is both effective and general-purpose. For the above optimization, the performance of a completely unrelated set of programs improved by 9% [EuroGP 2003, PLDI 2003b].
We have also shown that this technique can eliminate the human from heuristic design entirely. For priority-based register allocation, a heuristic that has been tuned for over 20 years with dozens of publications introducing small variations, we can discard the state-of-the-art heuristic, start with a set of random priority functions, and arrive by evolution at a solution as good as the human-created one.
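The flavor of this search can be conveyed with a toy sketch (a simple hill climber over a linear priority function for a made-up scheduling problem, far simpler than the genetic programming used in the actual work): the priority function's coefficients are mutated at random, and improvements are kept.

```python
import random

def schedule_cost(tasks, weights):
    """Greedy scheduler: run (duration, importance) tasks in ascending
    order of a linear priority over simple features; the cost is the
    total importance-weighted completion time (lower is better)."""
    def priority(task):
        dur, imp = task
        feats = (dur, imp, dur / imp)
        return sum(w * f for w, f in zip(weights, feats))
    now = cost = 0
    for dur, imp in sorted(tasks, key=priority):
        now += dur
        cost += imp * now
    return cost

def evolve_priority(tasks, generations=300, seed=0):
    """Random-mutation hill climbing over the priority coefficients,
    a bare-bones stand-in for evolving a compiler heuristic.
    Returns the (initial, best) costs found."""
    rng = random.Random(seed)
    best = [rng.uniform(-1.0, 1.0) for _ in range(3)]
    best_cost = initial_cost = schedule_cost(tasks, best)
    for _ in range(generations):
        cand = [w + rng.gauss(0.0, 0.5) for w in best]
        cand_cost = schedule_cost(tasks, cand)
        if cand_cost < best_cost:   # keep only improving mutations
            best, best_cost = cand, cand_cost
    return initial_cost, best_cost
```

The evolved priority function can never be worse than the random starting point, and on small problems it typically approaches the known-optimal ordering without any human insight encoded in it.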
Parallelism and Storage Optimization:
After more than three decades of intense research, one of the most important unsolved problems in compilers is how to automatically find an optimal mapping of a sequential program onto a parallel architecture. Though there are many heuristic algorithms in practical systems and partial or suboptimal solutions in the literature, a theoretical framework that can fully describe the entire problem and find the optimal solution is lacking. The difficulty stems from the fact that multiple inter-related costs and constraints must be considered simultaneously to obtain an efficient executable. In particular, with the widening gap between clock speed and memory latency, and with modern memory systems becoming increasingly hierarchical, the amount of storage space required by a program can have a drastic effect on its performance. Today this task is left to the programmer, who has to understand both the application and the memory system behavior and experiment with many different configurations of the application before finding a suitable solution. We have developed the first unified mathematical framework for representing the tradeoffs between parallelism and storage. Our technique combines affine scheduling with occupancy vector analysis and incorporates general affine dependences across statements and loop nests. Using this framework, we have shown how to find a good storage mapping for a given schedule, a good schedule for a given storage mapping, and a good storage mapping that is valid for all legal schedules. Our method is currently the most powerful technique available for considering storage optimization across a range of schedules. This flexibility will be the cornerstone of any technique that simultaneously optimizes for parallelism and storage space. [PLDI 2001]
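A one-dimensional analogue of the storage question can be sketched in Python (an illustrative toy, not our framework): if each intermediate value is dead after a fixed window of uses, the mapping i -> i mod window collapses an unbounded stream into a fixed-size buffer, which is what an occupancy vector expresses in higher dimensions.

```python
def stencil_with_occupancy_vector(src, window=3):
    """Compute out[i] = src[i] + src[i-1] + src[i-2] while storing only
    `window` live values of the input stream: the 1-D analogue of
    collapsing storage along an occupancy vector (i) -> (i mod window)."""
    buf = [0] * window            # fixed storage, regardless of len(src)
    out = []
    for i, v in enumerate(src):
        buf[i % window] = v       # overwrite a location whose value is dead
        if i >= window - 1:
            out.append(sum(buf))
    return out
```

For `[1, 2, 3, 4]` this yields `[6, 9]` using three cells of storage instead of four; the validity of the mod-window mapping depends on the schedule, which is precisely the schedule/storage interplay the framework reasons about.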
Simple Scheduling Methodology for Complex Architectures:
Instruction scheduling on modern microprocessors is an extraordinarily difficult problem. In almost all practical situations it is NP-hard and often involves multiple intertwined constraints. Traditional scheduling frameworks handle conflicting goals and heuristics in an ad hoc manner. They either address the constraints one at a time in a sequence of phases, which introduces phase-ordering problems, or use a unified algorithm that can handle only a subset of the constraints. Thus, current state-of-the-art instruction schedulers are exceedingly complex while the code they generate for many programs is clearly inferior. Convergent scheduling is a new approach in which a general scheduling framework, based on scheduling preferences instead of hard decisions, makes it simple to specify many arbitrary constraints and scheduling heuristics. Our critical insight is that the interaction of these simple passes, each modifying the scheduling preferences of the instructions to suit its own constraint, converges on a schedule that satisfies most of the important constraints of the problem. For example, convergent scheduling obtained an average performance improvement of 21% over the best known algorithm implemented on the Raw processor [Micro 2002].
We have also used the meta optimization technique to find the most effective sequence of passes for a given architecture [LCPC 03].
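The spirit of convergent scheduling can be illustrated with a toy Python sketch (the instructions, tiles, passes, and weights are all invented): each pass only scales per-instruction preferences rather than making hard decisions, and the placement emerges from their combined effect.

```python
# Toy convergent-scheduling sketch: each pass nudges a preference table
# pref[instr][tile] multiplicatively; the final placement is whichever
# tile each instruction ends up preferring most.
INSTRS = ["load_a", "add", "load_b", "mul"]
TILES = [0, 1]
pref = {i: {t: 1.0 for t in TILES} for i in INSTRS}

def locality_pass(pref):
    """Prefer placing each load near its (assumed) memory bank."""
    pref["load_a"][0] *= 2.0
    pref["load_b"][1] *= 2.0

def dependence_pass(pref):
    """Pull consumers toward their producers to reduce communication."""
    pref["add"][0] *= 1.5   # add consumes load_a
    pref["mul"][1] *= 1.5   # mul consumes load_b

locality_pass(pref)
dependence_pass(pref)
placement = {i: max(TILES, key=lambda t: pref[i][t]) for i in INSTRS}
```

Neither pass dictated the outcome alone: the loads land near their banks and their consumers follow, a placement that satisfies both constraints at once.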
Compiling for Multimedia Instructions:
In the Smash project, we developed compiler technology that takes advantage of the wide processor data paths now prevalent in modern microprocessors. We debunked the conventional wisdom of using vectorization to compile for short SIMD parallelism by introducing Superword Level Parallelism (SLP). The SLP paradigm is particularly well suited to compiling for multimedia instruction sets and is fundamentally different from the loop-level vector parallelism developed for vector supercomputers. SLP leads to simple, robust, and general compiler techniques that can greatly improve program performance. In fact, our initial experiments show that SLP compilation provides performance improvements that average 84% and range as high as 253%. SLP techniques are now used in industry, for example in the IBM Blue Gene compiler. [PLDI 2000a, PACT 2002]
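The essence of SLP, packing isomorphic statements that touch adjacent memory into one wide operation, can be sketched in a few lines of Python (a toy packer over a made-up statement representation of (opcode, array-index) pairs):

```python
def pack_superwords(stmts, width=4):
    """Toy SLP packer: group isomorphic statements (same opcode) that
    touch adjacent array elements into SIMD-style packs of up to
    `width` lanes; anything else breaks the current pack."""
    packs, current = [], []
    for op, idx in stmts:
        if (current and op == current[-1][0]
                and idx == current[-1][1] + 1 and len(current) < width):
            current.append((op, idx))     # extend the superword
        else:
            if current:
                packs.append(current)     # emit the finished pack
            current = [(op, idx)]
    if current:
        packs.append(current)
    return packs
```

Four adjacent `add` statements become one 4-lane pack while a following `mul` starts a new one, which is how statement-level (rather than loop-level) parallelism is harvested.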
Minimizing the Bits Used:
In the Bitwise project we developed compiler techniques that minimize the number of bits used to represent each operand in a program. Bitwise frees the programmer from declaring bitwidth invariants. There are many uses of bitwidth information. It increases the opportunities for superword level parallelism. It provides energy savings in power-exposed processor architectures. It is extremely useful when compiling to custom hardware. For example, by taking advantage of bitwidth information during architectural synthesis, we were able to reduce the silicon real estate by 15% - 86%, improve clock speed by 3% - 249%, and reduce power by 46% - 73%. We have found a rich opportunity for bitwidth reduction in modern multimedia and streaming application workloads. This successful demonstration will have a great impact in next-generation reconfigurable as well as embedded architectures. [PLDI 2000b]
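The kind of reasoning Bitwise performs can be sketched as a tiny range analysis in Python (over a hypothetical three-operation straight-line IR, far simpler than the real analysis): value ranges are propagated forward through the program, and each variable's bit width follows from its range.

```python
def bits_needed(lo, hi):
    """Unsigned bit width needed for every value in [lo, hi]
    (this sketch assumes lo >= 0)."""
    return max(1, hi.bit_length())

def infer_widths(defs):
    """Toy forward range analysis over a straight-line program given as
    (var, (op, *args)) pairs; returns the inferred width per variable."""
    ranges = {}
    for var, (op, *args) in defs:
        if op == "const":
            ranges[var] = (args[0], args[0])
        elif op == "and":                 # masking tightens the upper bound
            a, mask = args
            ranges[var] = (0, min(ranges[a][1], mask))
        elif op == "add":                 # ranges add componentwise
            a, b = args
            ranges[var] = (ranges[a][0] + ranges[b][0],
                           ranges[a][1] + ranges[b][1])
    return {v: bits_needed(lo, hi) for v, (lo, hi) in ranges.items()}
```

For `x = 200; y = x & 15; z = y + y`, the analysis infers 8, 4, and 5 bits respectively, so `y` and `z` need far less silicon than a declared 32-bit integer would suggest.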
Wire-Delay in Microprocessors:
Six years ago, Prof. Anant Agarwal and I started the Raw project to fundamentally change the design philosophy of modern microprocessors. We predicted that, in the era of billion-transistor chips running at multi-gigahertz clock speeds, the speed of light will make it impossible to build a single monolithic device by scaling current superscalar architectures. Thus we designed the Raw processor with distributed memory and decentralized control. Since there are no centralized hardware units making execution decisions, most of them must be made by the compiler. We developed novel compiler technology that is aware of the spatial distribution of resources when mapping applications to such a processor. We have a working compiler that is quite successful at achieving scalable performance on Raw for a collection of benchmark applications. We have shown that instruction-level parallelism (ILP) can be successfully obtained by statically orchestrating execution across multiple tiles at compile time. We have also shown that compile-time analysis can effectively utilize a distributed memory system. By fabricating the Raw processor, we demonstrated that a university can build a complex processor. So far we have not discovered a single defect or bug in the chip, which is currently operating at 430 MHz, exceeding its design specifications. [ISSCC 2003, HPCA 2003, IEEE Micro 2002, IEEE TOC 2001, ISCA 1999, ASPLOS 1998, IEEE Computer 1997]
Information Technologies for the Developing World:
We at MIT are at the forefront of the information revolution. However, these innovations were driven by the problems we ourselves faced, problems mostly specific to the developed world. Thus, the solutions developed to date, while sometimes applicable, have not adapted well to developing-world environments. One of my deepest desires is to address the fundamental problems facing the developing world as it seeks to participate in the information economy. To that end, we have developed TEK, an email-based search engine designed to deliver information to low-connectivity communities. Currently, we are beta testing the system in a few places, including in the Solomon Islands over low-bandwidth radio-based e-mail links and in the Media Lab's DakNet project, which delivers e-mail using transponders attached to vehicles. Recently, TEK has received considerable public attention, including coverage on the BBC World Service, Voice of America, and local newspapers and radio stations in several developing nations. We are excited about this opportunity to provide a valuable service to these underserved communities. [ISTAS 2002, WWW 2002]