The exact dimensions of a transformed array often depend on the number
of processors, which may not known at compile time. For example, if
P is the number of processors and d is the size of the dimension,
the strip sizes used in CYCLIC and BLOCK distributions are
P and
, respectively. Since our compiler
outputs C code, and C does not support general multi-dimensional
arrays with dynamic sizes, our compiler declares the array as a linear
array and uses linearized addresses to access the array elements.
As discussed above,
strip-mining a d-element array dimension with strip size b
produces a subarray of size
. This
total size can be greater than d, but is always less than d + b -1.
We can still allocate the array statically provided that we can
bound the value of the block size. If
is the largest possible block
size, we simply need to add
elements to the original
dimension.
Producing the correct array index functions for transformed arrays is rather straightforward. However, the modified index functions now contain modulo and division operations; if these operations are performed on every array access, the overhead will be much greater than any performance gained by improved cache behavior. Simple extensions to standard compiler techniques such as loop invariant removal and induction variable recognition can move some of the division and modulo operators out of inner loops[3]. We have developed an additional set of optimizations that exploit the fundamental properties of these operations[14], as well as the specialized knowledge the compiler has about these address calculations. The optimizations, described below, have proved to be important and effective.
Our first optimization takes advantage of the fact that a processor often addresses only elements within a single strip-mined partition of the array. For example, the parallelized SPMD code for the second loop nest in Figure 1(a) is shown below.
b = ceiling(N/P)
C distribute A(block,*)
REAL A(0:b-1,N,0:P-1)
DO 20 J = 2, 99
DO 21 I = b*myid+1, min(b*myid+b, 100)
A(mod(I-1,b),J,(I-1)/b) = ...
21 CONTINUE
20 CONTINUE
The compiler can determine that within the range b*myid+1
I
min(b*myid+b,100), the expression (I-1)/b
is always equal to myid. Also, within this range, the
expression mod(I-1,b) is a linear expression. This information
allows the compiler to produce the following optimized code:
idiv = myid
DO 20 J = 2, 99
imod = 0
DO 22 I = b*myid+1, min(b*myid+b, 100)
A(imod,J,idiv) = ...
imod = imod + 1
22 CONTINUE
20 CONTINUE
It is more difficult to eliminate modulo and division operations when the data accessed in a loop cross the boundaries of strip-mined partitions. In the case where only the first or last few iterations cross such a boundary, we simply peel off those iterations and apply the above optimization on the rest of the loop.
Finally, we have also developed a technique to optimize modulo and division operations that is akin to strength reduction. This optimization is applicable when we apply the modulo operation to affine expressions of the loop index; divisions sharing the same operands can also be optimized along with the modulo operations. In each iteration through the loop, we increment the modulo operand. Only when the result is found to exceed the modulus must we perform the modulo and the corresponding division operations. Consider the following example:
DO 20 J = a, b
x = mod(4*J+c, 64)
y = (4*J+c)/64
...
20 CONTINUE
Combining the optimization described with the additional information in this example that the modulus is a multiple of the stride, we obtain the following efficient code:
xst = mod(c, 4)
x = mod(4*a+c, 64)
y = (4*a+c)/64
DO 20 J = a, b
...
x = x+4
IF (x.ge.64) THEN
x = xst
y = y + 1
ENDIF
20 CONTINUE