
1 Introduction

As application and data demands increase, parallel computing is becoming increasingly pervasive. Despite this growing demand for parallel applications, however, programming parallel computers remains difficult and error-prone.

There are two main paradigms for programming parallel systems: message passing and shared address space. Message-passing paradigms, such as MPI, have the advantage of portability. They also give the application programmer more control over optimizing communication patterns, since all communication operations are explicit calls to the message-passing library.
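For example, in MPI every transfer appears in the source as an explicit call to the library. The sketch below is only illustrative; the buffer length, tag, and ranks are arbitrary choices, not taken from any particular application.

    /* Explicit communication in MPI: every transfer is a visible library call.
     * Illustrative sketch only; buffer length, tag, and ranks are arbitrary. */
    #include <mpi.h>

    void exchange(double *buf, int n, int rank)
    {
        if (rank == 0)
            MPI_Send(buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }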

Shared address space programming is conceptually simpler and hides much of the detail of communication from the programmer. For precisely this reason, however, it is difficult for the programmer to examine and optimize the communication patterns. It is thus very important for the system to execute the program efficiently, either through compiler or runtime system optimizations.

One approach to providing an efficient shared address space has been to build shared-memory multiprocessors. These systems, usually bus- or directory-based, use caching to provide efficient access to memory. All addresses are treated as shared, and hardware protocols keep the cache contents coherent.

However, this approach to parallel computing requires large investments in expensive SMPs. An alternative approach is to link together commodity workstations in a network of workstations (NOW). A NOW is a natural fit for parallel computing using message passing. To support a shared address space programming model on a system with distributed memory, some software layer must provide the illusion of a shared address space.

There are two main approaches to providing a shared address space in a distributed memory environment. The first approach is to simulate the hardware solution, caching remote values and using some sort of directory protocol to maintain coherence across processors. This approach has been implemented in several systems such as Shasta [11] and Tempest [9].

The Split-C [3,4] programming environment takes a different approach and provides no caching. Each variable is identified as local or global by the programmer, and a global variable consists of a processor number and a local address on that processor. Accesses to global variables in the user program are translated by the Split-C compiler into function calls to the Split-C library. At run time, the Split-C library function checks whether the global variable is local to the processor; if so, a simple memory access is performed. If the access is to a remote memory location, a message is sent to the owning processor to complete the transaction. This makes the Split-C system much simpler than one with automatic and coherent replication. Unfortunately, it places a much greater burden on the application programmer to provide efficient data distribution and access.
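The following sketch illustrates the idea of such a run-time check; the type and function names (gptr_t, net_read_int) are hypothetical and do not reflect the actual Split-C library interface.

    /* Hypothetical sketch of a Split-C-style global read.
     * gptr_t and net_read_int are illustrative names, not the real API. */
    extern int MYPROC;                               /* this processor's number     */
    extern int net_read_int(int proc, void *addr);   /* remote read via a message   */

    typedef struct {
        int   proc;   /* owning processor number         */
        void *addr;   /* local address on that processor */
    } gptr_t;

    int read_global_int(gptr_t p)
    {
        if (p.proc == MYPROC)
            return *(int *)p.addr;                    /* local: plain memory access   */
        else
            return net_read_int(p.proc, p.addr);      /* remote: message to the owner */
    }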

It is unclear which type of software distributed shared memory (DSM) system is more appropriate. If redundant remote accesses are uncommon, or if the programmer can easily "cache" the data within the application, the extra overhead of maintaining caches and directories for coherence may outweigh the benefit of caching remote accesses.

To examine this issue, we modified Split-C to provide coherent caching of remote data. We implemented a directory structure that tracks the usage of blocks of memory and maintains coherence. We then compare the performance of regular Split-C with SWCC-Split-C, our Software Cache Coherent version of Split-C.
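As a rough sketch, a per-block directory entry might record the block's coherence state, the set of processors holding copies, and the current owner. The field names and the 32-processor sharer bitmask below are assumptions made for illustration, not the actual SWCC-Split-C data layout.

    /* Illustrative directory entry for block-level coherence tracking.
     * Field names and the 32-bit sharer mask are assumptions, not the
     * actual SWCC-Split-C data structures. */
    #include <stdint.h>

    typedef enum { BLOCK_INVALID, BLOCK_SHARED, BLOCK_EXCLUSIVE } block_state_t;

    typedef struct {
        block_state_t state;    /* coherence state of this memory block   */
        uint32_t      sharers;  /* bit i set => processor i has a copy    */
        int           owner;    /* holder of the exclusive copy, if any   */
    } dir_entry_t;

    /* Record that processor p obtained a read-only copy of the block. */
    static void add_sharer(dir_entry_t *e, int p)
    {
        e->sharers |= (uint32_t)1u << p;
        e->state    = BLOCK_SHARED;
    }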

The next section discusses the overall framework and design of our shared memory system. Section 3 describes the applications we used to measure the performance of the various Split-C implementations and presents the results of our experiments. In Section 4, we examine the results further in relation to other systems and suggest areas for future exploration.

