5 Conclusions

We have successfully implemented a software cache-coherent Split-C compiler. Our SWCC-Split-C compiler uses a directory-based scheme to automatically and coherently replicate blocks of shared data.

We have demonstrated significant performance improvements for programs with redundant remote accesses or spatial locality in remote accesses. For highly localized data access patterns, however, the overhead of maintaining the directory outweighs the benefit of fewer network transactions.

The SWCC method of directory-based software cache coherence is well suited to irregular applications whose access patterns are unpredictable. Caching simplifies the application programmer's job because it reduces the importance of the initial data layout. The matrix multiply kernel shows that caching allows the simplest implementation to be efficient, which reduces complexity and therefore development and debugging time. Likewise, in EM3D, automatic cache coherence removes the burden of managing caching from the programmer.

Because the protocol overhead is so critical, it may be worthwhile to let the application programmer, or preferably the compiler, identify which blocks would benefit from caching and apply cache coherence only to those blocks. This would provide the advantage of caching without incurring unnecessary overhead.

The performance gains are also highly dependent on the size of the cache blocks, and different applications perform best with different block sizes. For example, the matrix multiply kernel performs best with the largest block size while the EM3D application performs best with the smallest block size. For this reason, the granularity should be configurable, either at compile time or dynamically. Ideally, the granularity should even be variable within an application.
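As an illustration of what configurable granularity could look like, the following is a minimal sketch in plain C and is not the SWCC-Split-C implementation. The macro `SWCC_BLOCK_BYTES`, the `region_t` descriptor, and the helpers `block_index` and `block_offset` are hypothetical names introduced here. The macro stands for a compile-time default, while the per-region `block_bytes` field shows how the block size might vary within one application.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical compile-time knob: default bytes per cache block.
   Could be overridden on the command line, e.g. -DSWCC_BLOCK_BYTES=256. */
#ifndef SWCC_BLOCK_BYTES
#define SWCC_BLOCK_BYTES 64u
#endif

/* Hypothetical descriptor for a region of shared data. Giving each
   region its own block size allows the granularity to differ within
   a single application. */
typedef struct {
    uintptr_t base;        /* start address of the shared region */
    size_t    block_bytes; /* cache block size used for this region */
} region_t;

/* Index of the cache block that contains addr (addr >= r->base). */
static size_t block_index(const region_t *r, uintptr_t addr)
{
    return (size_t)(addr - r->base) / r->block_bytes;
}

/* Byte offset of addr within its cache block. */
static size_t block_offset(const region_t *r, uintptr_t addr)
{
    return (size_t)(addr - r->base) % r->block_bytes;
}
```

Under these assumptions, a dense kernel such as matrix multiply could place its data in a region with a large `block_bytes`, while an irregular application such as EM3D could use a small one.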

Parallel programming remains a challenging task. We feel that automatic cache coherence is a useful tool in lessening the burden on the parallel programmer.
