A key problem that complicates the automatic parallelization of irregular, object-oriented computations written in languages such as Java and C++ is the pervasive use of irregular linked data structures such as lists, trees and graphs. Our research attacks this problem using a new analysis technique called commutativity analysis.
Commutativity analysis exploits the structure of the object-oriented paradigm to view the computation as consisting of operations on objects. It then analyzes the computation at this granularity to determine if operations commute (i.e. generate the same result regardless of the order in which they execute). If all of the operations in a given computation commute, the compiler can automatically generate parallel code. The generated code uses mutual exclusion locks to make operations in parallel phases execute atomically.
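To make the granularity concrete, here is a small hypothetical C++ sketch (not taken from any of the benchmark codes): the two calls below leave the object in the same state regardless of the order in which they execute, so they commute, and a parallel phase may execute them concurrently. The generated code would additionally guard each operation with the object's mutual exclusion lock so that it executes atomically.

    class Accumulator {
    public:
        // add(a); add(b); and add(b); add(a); leave the object in the same state,
        // so the two operations commute.
        void add(double v) { sum_ += v; }
        double sum() const { return sum_; }
    private:
        double sum_ = 0.0;
    };

    // Either order produces sum() == 8.0:
    //   a.add(3.0); a.add(5.0);
    //   a.add(5.0); a.add(3.0);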
Our experimental results demonstrate the potential of this approach. We have built a prototype compiler that uses commutativity analysis as its primary analysis technique. This compiler accepts an unannotated program written in a subset of C++ and automatically generates a parallel C++ program that performs the same computation. In addition to the parallelization transformation, we implemented a variety of synchronization optimizations. We used the compiler to automatically parallelize several complete computations:
Barnes-Hut: Barnes-Hut is a hierarchical N-body solver. It consists of approximately 1500 lines of serial C++ code. It performs well, in part, because it employs a sophisticated pointer-based data structure: a space subdivision tree that dramatically improves the efficiency of a key phase in the algorithm. Although it is considered to be an important, widely studied computation, all previously existing parallel versions were parallelized by hand using low-level, explicitly parallel programming systems. We are aware of no other compiler that is capable of automatically parallelizing this computation. On a 24 processor SGI Challenge XL, the automatically generated parallel code runs over 14 times faster than the original serial program.
Water: Water simulates a collection of water molecules in the liquid state. It consists of approximately 1850 lines of serial C++ code. On a 12 processor SGI Challenge XL, the automatically generated parallel code runs over 18 times faster than the original serial program.
String: String is a seismic computation that generates a velocity model of the geology between two oil wells. It consists of approximately 2050 lines of serial C++ code. On a 24 processor SGI Challenge XL, the automatically generated parallel code runs over 19 times faster than the original serial program.
The following papers present research related to commutativity analysis:
Commutativity Analysis: A New Analysis Technique for Parallelizing Compilers
Martin C. Rinard and Pedro C. Diniz
ACM Transactions on Programming Languages and Systems
Volume 19, Number 6 (November 1997), pp. 942-991.
Copyright 1997 by ACM, Inc.
Commutativity Analysis: A New Analysis Framework for Parallelizing Compilers
Martin C. Rinard and Pedro C. Diniz
Proceedings of the ACM SIGPLAN 1996 Conference on Programming Language Design and Implementation
Philadelphia, Pennsylvania, May 1996
Verification of Semantic Commutativity Conditions and Inverse Operations on Linked Data Structures
Deokhwan Kim and Martin Rinard
Proceedings of the 32nd ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI 2011)
San Jose, California June 2011
The following papers present research on related optimization techniques:
Eliminating Synchronization Bottlenecks in Object-Based Programs Using Adaptive Replication
Martin C. Rinard and Pedro C. Diniz
1999 ACM International Conference on Supercomputing
Rhodes, Greece June 1999
Lock Coarsening: Eliminating Lock Overhead in Automatically Parallelized Object-Based Programs
Pedro C. Diniz and Martin C. Rinard
Journal of Parallel and Distributed Computing
Volume 49, Number 2 (March 1998), pp. 218-244.
Effective Fine-Grain Synchronization for Automatically Parallelized Programs Using Optimistic Synchronization Primitives
Martin C. Rinard
Proceedings of the Sixth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
Las Vegas, Nevada June 1997
Dynamic Feedback: An Effective Technique for Adaptive Computing
Pedro C. Diniz and Martin C. Rinard
Proceedings of the ACM SIGPLAN 1997 Conference on Programming Language Design and Implementation
Las Vegas, Nevada June 1997
Synchronization Transformations for Parallel Computing
Pedro C. Diniz and Martin C. Rinard
Proceedings of the Twenty-Fourth Annual ACM Symposium on Principles of Programming Languages
Paris, France January 1997
Lock Coarsening: Eliminating Lock Overhead in Automatically Parallelized Object-Based Programs
Pedro C. Diniz and Martin C. Rinard
Languages and Compilers for Parallel Computing, Ninth International Workshop
San Jose, California August 1996
For more information, see my list of papers.
Efficient mutual exclusion synchronization is one of the fundamental requirements of effective parallel computing. The tasks in fine-grain parallel computations, for example, need fast synchronization for efficient control of their frequent interactions. Efficient synchronization also promotes the development of reliable parallel software because it allows programmers to structure programs as a set of synchronized operations on fine-grain objects. This development methodology helps programmers overcome the challenging problems (nondeterministic behavior, deadlock, etc.) that complicate the development of parallel software.
Efficient mutual exclusion synchronization is also important for modern multithreaded languages such as Java. In this context, a primary issue is the need to make class libraries thread-safe. The standard mechanism is to augment each object with a mutual exclusion lock. Each operation on the object acquires the lock, accesses the object, then releases the lock. This mechanism ensures that the operation executes atomically even in the face of concurrent accesses by multiple parallel threads.
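The following minimal sketch illustrates this standard mechanism (the class and its members are illustrative, and C++ with std::mutex stands in for a Java synchronized method): the object is augmented with a lock, and each operation acquires the lock, accesses the object, then releases the lock.

    #include <mutex>

    class Counter {
    public:
        void increment() {
            lock_.lock();      // acquire the object's lock
            ++count_;          // access the object
            lock_.unlock();    // release the lock
        }
    private:
        std::mutex lock_;      // per-object mutual exclusion lock
        long count_ = 0;
    };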
We have developed several optimizations for parallel and multithreaded computations that contain mutual exclusion synchronizations. These optimizations eliminate mutual exclusion synchronization, reduce its frequency, replace it with more efficient optimistic synchronization primitives, and replicate objects to eliminate the need to synchronize.
Synchronization Elimination: If an object is never accessed by more than one thread, there is no need to synchronize its operations. We have developed a combined pointer and escape analysis that is designed, in part, to find objects that do not escape from their allocating thread. The compiler can then generate customized, synchronization-free versions of the operations that execute on such objects. Our OOPSLA '99 paper on this topic presents the algorithm and provides experimental results from an implementation in a dynamic compiler for Java. For our benchmark programs, the algorithms eliminate between 24% and 67% of the synchronization operations.
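A hedged sketch of the effect of this optimization, reusing the Counter pattern from the sketch above (the _unsync variant is purely illustrative of the customized code the compiler would generate, not an API from the paper): because the object allocated in local_sum never escapes the allocating thread, the compiler can invoke a synchronization-free version of the operation.

    #include <mutex>

    class Counter {
    public:
        void increment() {                 // general version: acquires the per-object lock
            std::lock_guard<std::mutex> guard(lock_);
            ++count_;
        }
        void increment_unsync() {          // specialized version for thread-local objects
            ++count_;
        }
        long value() const { return count_; }
    private:
        std::mutex lock_;
        long count_ = 0;
    };

    long local_sum(int n) {
        Counter c;                         // never escapes this function or its thread
        for (int i = 0; i < n; ++i)
            c.increment_unsync();          // compiler substitutes the synchronization-free version
        return c.value();
    }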
Lock Coarsening: The goal of lock coarsening is to reduce the frequency with which the computation acquires and releases locks. We have developed two approaches: data lock coarsening, which associates one lock with multiple objects that tend to be accessed together, and computation lock coarsening, which coalesces multiple critical sections that acquire and release the same lock multiple times into a single critical section that acquires and releases the lock only once. Our experimental results show that lock coarsening can significantly reduce the synchronization overhead in programs that use mutual exclusion synchronization.
Our JPDC paper presents an algorithm that performs both data and computation lock coarsening for object-based programs. Our POPL '97 paper presents a set of general transformations that move lock constructs both within and between procedures to coarsen the lock granularity.
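As a concrete illustration, here is a hypothetical before/after sketch of computation lock coarsening (the code is not from the papers): the original loop acquires and releases the same lock once per update, while the coarsened version acquires the lock once around the entire loop, trading some potential concurrency for much lower locking overhead.

    #include <mutex>

    struct Histogram {
        std::mutex lock;
        long bins[64] = {0};
    };

    // Before coarsening: one acquire/release per element (assumes non-negative keys).
    void count_fine(Histogram& h, const int* keys, int n) {
        for (int i = 0; i < n; ++i) {
            std::lock_guard<std::mutex> guard(h.lock);
            ++h.bins[keys[i] % 64];
        }
    }

    // After coarsening: a single critical section covers the whole loop.
    void count_coarse(Histogram& h, const int* keys, int n) {
        std::lock_guard<std::mutex> guard(h.lock);
        for (int i = 0; i < n; ++i)
            ++h.bins[keys[i] % 64];
    }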
Dynamic Feedback: There is a trade-off associated with lock coarsening: increasing the granularity reduces the synchronization overhead, but also increases the size of the critical regions, which may reduce the amount of available concurrency. We have developed an algorithm that generates several different versions of the program; the versions differ in how aggressively they apply the lock coarsening transformation. When the program runs, the generated code samples the performance of each version, then uses the version with the best performance. The code periodically resamples to adapt to changes in the best version. Our TOCS paper presents the algorithm and experimental results. We also published a PLDI '97 paper on this topic.
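The following sketch shows the overall structure of this execution strategy under some simplifying assumptions (the driver, its parameters, and the work-unit interface are illustrative, not the code generated by the compiler described in the papers): the program alternates sampling phases, which time each generated version on a small slice of the work, with longer production phases that run the version with the best measured performance, then resamples.

    #include <algorithm>
    #include <chrono>
    #include <functional>
    #include <vector>

    using Version = std::function<void(int)>;   // executes a given number of work units

    void run_with_dynamic_feedback(const std::vector<Version>& versions, int total_units,
                                   int sample_units, int production_units) {
        int done = 0;
        while (done < total_units) {
            // Sampling phase: measure the time per work unit of each version.
            int best = 0;
            double best_cost = 1e300;
            for (int v = 0; v < static_cast<int>(versions.size()) && done < total_units; ++v) {
                int units = std::min(sample_units, total_units - done);
                auto start = std::chrono::steady_clock::now();
                versions[v](units);
                double cost = std::chrono::duration<double>(
                                  std::chrono::steady_clock::now() - start).count() / units;
                done += units;
                if (cost < best_cost) { best_cost = cost; best = v; }
            }
            // Production phase: run the currently best version for a longer interval,
            // then loop back and resample to adapt to changes.
            int units = std::min(production_units, total_units - done);
            if (units > 0) { versions[best](units); done += units; }
        }
    }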
Optimistic Synchronization: Another way to reduce the synchronization overhead is to replace mutual exclusion locks with efficient optimistic synchronization primitives such as load linked/store conditional. Our PPOPP '97 paper on this topic presents our experience using optimistic synchronization in the context of a parallelizing compiler for object-based programs. Our results show that using optimistic synchronization can improve the performance and significantly reduce the amount of memory required to execute the computation.
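The sketch below shows the flavor of this replacement under the assumption that a compare-and-swap loop (std::atomic in C++) stands in for load linked/store conditional, which is not portably exposed: the update reads the current value, computes the new value, and retries if another thread modified the object in the meantime, so no lock is ever held.

    #include <atomic>

    // Optimistic update of a shared accumulator: no mutual exclusion lock is acquired.
    void optimistic_add(std::atomic<double>& sum, double v) {
        double old_val = sum.load();
        double new_val;
        do {
            new_val = old_val + v;                              // compute the update optimistically
        } while (!sum.compare_exchange_weak(old_val, new_val)); // retry if another thread interfered
    }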
Adaptive Replication: In many programs it is possible to eliminate synchronization bottlenecks by replicating frequently accessed objects that cause these bottlenecks. Each thread that updates such an object creates its own local replica and performs all updates locally on that replica with no synchronization and no interaction with other threads. When the computation needs to access the final value of the object, it combines the values stored in the replicas to generate the final value. A key question is determining which objects to replicate - replicating all objects can cause significant and unnecessary memory and computation overheads. We have developed a technique that dynamically measures the contention at objects to replicate only those objects that would otherwise cause synchronization bottlenecks.
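Here is a minimal sketch of the replication idea, under the assumption of a simple parallel reduction (thread management and names are illustrative, and the adaptive policy that replicates only contended objects is not shown): each thread updates its own local replica without synchronization, and the replicas are combined when the computation needs the final value.

    #include <cstddef>
    #include <numeric>
    #include <thread>
    #include <vector>

    double replicated_sum(const std::vector<double>& data, int num_threads) {
        std::vector<double> replicas(num_threads, 0.0);    // one local replica per thread
        std::vector<std::thread> workers;
        for (int t = 0; t < num_threads; ++t) {
            workers.emplace_back([&, t] {
                for (std::size_t i = t; i < data.size(); i += num_threads)
                    replicas[t] += data[i];                // local update: no locks, no interaction
            });
        }
        for (auto& w : workers) w.join();
        // Combine the values stored in the replicas to generate the final value.
        return std::accumulate(replicas.begin(), replicas.end(), 0.0);
    }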
Our ICS '99 paper presents experimental results from an implementation of this technique. The results show that, for our set of benchmark programs, adaptive replication can improve the overall performance by up to a factor of three over versions that use no replication, and can reduce the memory consumption by up to a factor of four over versions that fully replicate updated objects. Furthermore, adaptive replication never significantly harms the performance and never increases the memory usage without a corresponding increase in the performance.
Here are several papers that present research related to synchronization optimizations:
Compositional Pointer and Escape Analysis for Java Programs
John Whaley and Martin Rinard
Proceedings of the 14th Annual ACM SIGPLAN Conference on Object-Oriented Programming Systems, Languages, and Applications
Denver, CO November 1999
Lock Coarsening: Eliminating Lock Overhead in Automatically Parallelized Object-Based Programs
Pedro C. Diniz and Martin C. Rinard
Journal of Parallel and Distributed Computing
Volume 49, Number 2 (March 1998), pp. 218-244.
Synchronization Transformations for Parallel Computing
Pedro C. Diniz and Martin C. Rinard
Proceedings of the Twenty-Fourth Annual ACM Symposium on Principles of Programming Languages
Paris, France January 1997
Lock Coarsening: Eliminating Lock Overhead in Automatically Parallelized Object-Based Programs
Pedro C. Diniz and Martin C. Rinard
Languages and Compilers for Parallel Computing, Ninth International Workshop
San Jose, California August 1996
Eliminating Synchronization Overhead in Automatically Parallelized Programs Using Dynamic Feedback
Pedro C. Diniz and Martin C. Rinard
ACM Transactions on Computer Systems
To Appear
Copyright 1999 by ACM, Inc.
Dynamic Feedback: An Effective Technique for Adaptive Computing
Pedro C. Diniz and Martin C. Rinard
Proceedings of the ACM SIGPLAN 1997 Conference on Programming Language Design and Implementation
Las Vegas, Nevada June 1997
Effective Fine-Grain Synchronization for Automatically Parallelized Programs Using Optimistic Synchronization Primitives
Martin C. Rinard
Proceedings of the Sixth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
Las Vegas, Nevada June 1997
Eliminating Synchronization Bottlenecks in Object-Based Programs Using Adaptive Replication
Martin C. Rinard and Pedro C. Diniz
1999 ACM International Conference on Supercomputing
Rhodes, Greece June 1999
For more information, see my list of papers.
Divide and conquer algorithms solve problems by breaking them up into smaller subproblems, recursively solving the subproblems, then combining the results to generate a solution to the original problem. A simple algorithm that works well for small problem sizes terminates the recursion. Good divide and conquer algorithms exist for a large variety of problems, including sorting, matrix manipulation, and many dynamic programming problems. Divide and conquer algorithms have several appealing properties that make them a good match for modern parallel machines.
Parallelism: Divide and conquer algorithms tend to have a lot of inherent parallelism. Once the division phase is complete, the subproblems are usually independent and can therefore be solved in parallel. Moreover, the recursive structure of the algorithm naturally leads to recursively generated concurrency, which means that even the divide and combine phases execute in parallel with divide and combine phases from other subproblems. This approach typically generates more than enough concurrency to keep the machine busy.
Cache Performance: Divide and conquer algorithms also tend to have good cache performance. Once a subproblem fits in the cache, the standard recursive solution reuses the cached data until the subproblem has been completely solved. Because most of the work takes place deep in the recursive call tree, the algorithm usually spends most of its execution time running out of the cache. Furthermore, divide and conquer algorithms naturally work well with a range of cache sizes and at all levels of the memory hierarchy. As soon as a subproblem fits into one level of the memory hierarchy, the algorithm runs out of that level (or below) until the subproblem has been solved. Divide and conquer programs therefore automatically adapt to different cache hierarchies, and tend to run well without modification on whatever machine is available.
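A small example in the spirit of this discussion (the function and the cutoff constant are illustrative): the two recursive calls solve independent subproblems and are therefore candidates for parallel execution, and once a subrange fits in the cache the recursion operates entirely on cached data.

    // Divide and conquer sum over an array region identified by a pointer and a length.
    double dc_sum(const double* p, int n) {
        if (n <= 32) {                              // simple base case terminates the recursion
            double s = 0.0;
            for (int i = 0; i < n; ++i) s += p[i];
            return s;
        }
        int half = n / 2;
        double left  = dc_sum(p, half);             // independent subproblems: may run in parallel
        double right = dc_sum(p + half, n - half);
        return left + right;                        // combine the results
    }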
In spite of these advantages, divide and conquer algorithms present an imposing set of analysis challenges. The natural formulation of these algorithms is recursive. For efficiency reasons, programs often use dynamic memory allocation to match the sizes of the data structures to the problem size, and often use pointers into arrays and pointer arithmetic to identify subproblems. Moreover, traditional analyses for parallelizing compilers are of little or no help for this class of programs --- they are designed to analyze loop nests that access arrays using affine array index expressions, not recursive procedures that use pointers and pointer arithmetic.
The key analysis problem for divide and conquer programs is characterizing how procedures access subregions of allocated memory blocks. This analysis must be symbolic, extracting bounds for the accessed regions in terms of the parameters of the procedure. So, for example, the analysis must extract facts of the form "the procedure f(double *p, int n) accesses the region of memory from p to p+n-1". In general, procedures may use pointers and pointer arithmetic to access multiple potentially overlapping regions, significantly complicating the analysis.
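The following hypothetical procedures illustrate the kind of symbolic facts involved: the analysis must determine that scale(p, n) accesses the region from p to p+n-1, and that the two recursive calls in dc_scale access the disjoint subregions from p to p+n/2-1 and from p+n/2 to p+n-1, which is what makes them independent.

    void scale(double* p, int n) {        // accesses the region [p, p + n - 1]
        for (int i = 0; i < n; ++i)
            p[i] *= 2.0;
    }

    void dc_scale(double* p, int n) {
        if (n <= 32) { scale(p, n); return; }
        dc_scale(p, n / 2);               // accesses [p, p + n/2 - 1]
        dc_scale(p + n / 2, n - n / 2);   // accesses [p + n/2, p + n - 1]
    }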
We have developed the following analyses and optimizations:
A new analysis that symbolically characterizes accessed regions of memory blocks in divide and conquer programs. Because the analysis computes symbolic bounds for both pointers and array indices, it can be used to solve a wide variety of program analysis problems. To avoid problems associated with the use of traditional fixed-point techniques, the analysis formulates the problem as a system of symbolic constraints between multivariate polynomials in the program variables. It then reduces the constraint system to a linear program; the solution to the linear program yields a solution to the original symbolic constraint system.
We used this algorithm to solve a range of analysis problems, including static race detection for parallel divide and conquer programs, automatic parallelization of sequential divide and conquer programs, static detection of array bounds violations, and elimination of array bounds checks. In effect, the analysis can verify that the program is completely free of a wide range of memory access anomalies that would otherwise block solutions to these analysis problems. See our PLDI 2000 paper on this topic for more details.
A new algorithm for increasing the granularity of the base case in divide and conquer programs. This algorithm unrolls recursive calls to reduce control flow overhead and increase the size of basic blocks in the computation. The result is a significant increase in performance; a sketch of the transformation appears below. See our LCPC 2000 paper on this topic for more details.
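Here is an illustrative before/after sketch of this transformation (the code and the unrolling depth are hypothetical, not taken from the paper): inlining the recursive calls near the leaves coarsens the base case, replacing several levels of call and branch overhead with straight-line code and larger basic blocks.

    // Original: base case of size one.
    void dc_inc(double* p, int n) {
        if (n <= 1) { if (n == 1) p[0] += 1.0; return; }
        dc_inc(p, n / 2);
        dc_inc(p + n / 2, n - n / 2);
    }

    // After unrolling the recursion near the leaves: base case of size up to four.
    void dc_inc_unrolled(double* p, int n) {
        if (n <= 4) {
            if (n > 0) p[0] += 1.0;
            if (n > 1) p[1] += 1.0;
            if (n > 2) p[2] += 1.0;
            if (n > 3) p[3] += 1.0;
            return;
        }
        dc_inc_unrolled(p, n / 2);
        dc_inc_unrolled(p + n / 2, n - n / 2);
    }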
The following papers present research related to this topic:
Symbolic Bounds Analysis of Pointers, Array Indices, and Accessed Memory Regions
Radu Rugina and Martin C. Rinard
Proceedings of the ACM SIGPLAN '00 Conference on Programming Language Design and Implementation
Vancouver, British Columbia June 2000
Automatic Parallelization of Divide and Conquer Algorithms
Radu Rugina and Martin C. Rinard
Proceedings of the Seventh ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
Atlanta, GA May 1999
Automatic Parallelization of Divide and Conquer Algorithms
This set of slides presents a new approach that formulates both the intraprocedural and interprocedural analyses as symbolic constraint systems, then solves the constraint systems by reducing each one to a linear program.
For more information, see my list of papers.
Jade is an implicitly parallel programming language for irregular, coarse-grain parallelism.