Lazy Containers:

Efficient Compile-time Abstraction for Manipulating Collections of Objects

Using Expression Templates

Ali Rahimi () (last modified Jan 3, 2000)


1.0 Introduction

Programmatic implementations of a mathematical concepts rarely reflect the underlying mathematical concept. The following C++ loop is far cry from the concise mathematical expression :

// Container a;
Container::value_type g = 1;
for(Container::const_iterator it=a.begin(); it!=a.end(); ++it)
    g *= (*it);

Thankfully, modern math packages provide convenience routines for most common operations so that loops like these never need to be written. Instead, we can write:

Container::value_type g = product(a);

which captures the mathematical concept nicely.

But we often pay a cost for such conveniences. Many of these math packages impose a barrier on optimization. For optimal performance, sequences of operations need to be unrolled and flattened, but complex interactions between convenience routines confuse even modern compilers, resulting in suboptimal code.

Here's another example of a convenience routine. The expression

computes the square of the Euclidean distance between vectors a and b. In the C++ world, we would write:

// Container1 a; Container2 b;
Container1::value_type s = 0;
Container1::const_iterator ait;
Container2::const_iterator bit;
for(ait=a.begin(), bit=b.begin(); ait!=a.end() && bit!=b.end();
    ++ait, ++bit) {
    s += square(*ait-*bit);
}

but with a good math package, we could just say:

ssd = sum(square(a-b));

and be done with it. Through expression templates, Lazy Containers allow programmers to write convenience routines such as sum(), square() and operator- without forcing the user to incur a loss of performance.

2.0 The Concept

The main problem with expressions such as ssd=sum(square(a1-a2)) is that when naively implemented, they result in extremely inefficient programs.

In a naively implemented package, the expression a1-a2 returns a new array whose elements are the differences between elements of a1 and a2. Then square(a1-a2) squares the elements of this array, resulting in a new array, which is finally summed over. With this methodology, every array operation fully computes a temporary array which is passed on to the next operation. This not only results in a lot of overhead due to calls to memory allocation routines, but also causes a horrendous number of cache misses, takes very poor advantage of modern pipelines, and results in unnecessary instructions for writing to the temporary arrays.

In contrast, the brute-force, hand-written C++ loop above is very cache efficient, does not produce temporaries, and is easily pipelined and unrolled.

Lazy Containers allows programmers to write simple to understand expression such as ssd = sum(square(a1-a2)) without incurring the cost hidden costs of of conceptual elegance.

As is explained later, the trick is to defer operations until their result is needed, and only then to generate machine code. So instead of creating and filling in a temporary array with differences of elements, the expression a1-a2 returns a promise to compute the differences between elements of the arrays. The square() function in turn modifies this promise by squaring the promise to return differences. Finally, the sum() function forces the promise, resulting in the sum of the squared differences being computed.

By making promises to compute, each sub-expression only had to return a small promise object instead of a large array object. No arrays needed to be allocated or filled in, and the computation happens efficiently because the compiler can reorder all the operations and place them in a single loop when sum() requests actual values. In fact, the promise object itself is never actually created: optimizers are smart enough to figure out what's going on and optimize it away.

Here's an example which explains how Lazy Containers are different from traditional math packages. The factorial of a number N is the product of all numbers from 1 to N. A C++ routine for computing the factorial of N is:

int f = 1;
for(;N;--N) f *= N;

Using Lazy Containers, we could create a promise to generate all numbers from 1 to N, then multiply all of these numbers together:

int f = product(NTPMai::counter(1,N));

product() is the same product function we used earlier, and just as before, we are passing it a promise. But instead of promising elements of an array in memory, we are promising it numbers ranging from 1 to N. sequence() doesn't actually return an array of N elements: it returns an 8 byte entity which is optimized away at compilation. The resulting machine code looks identical to the one produced by the loop above.

3.0 Lazy Containers are FASTER!

Here are timings for computing the euclidian distance between two 10000 dimensional vectors. Each version of the computation is run 1000 times. The results are the average number of nanoseconds required for each pass.

Name% timesecondsbody ns/calltotal ns/call
hand 5.9 0.11 109890.11 109890.11
lazy 11.8 0.22 219780.22 219780.22
hand2 19.8 0.37 369630.37 369630.37
traditional2 19.8 0.37 369630.37 369630.37
traditional 42.8 0.54 539460.54 798941.58
Here is the source code used for the experiment.

3.1 Details

The "self ns" column shows the number of nanoseconds it takes to execute each call, excluding calls to auxilliary functions. "total ns" is the running time of the function as is, including other function calls it might make. "% time" is based on "total ns."

The routine hand is a handcoded loop whose inner statement is ssd += square(*a1it-*a2it). It does not create any temporary arrays and is the most efficient thing I could think of doing.

lazy uses lazy containers. The function simply calls: return sum(square(x - y)).

hand2 is one attempt at identifying a slowdown in traditional packages. It uses a temporary array which is created outside the function (ie, the temporaries are not recreated on each execution). hand2 first stores the differences between the two vectors in one temporary array, then squares these and stores the result in a second temporary array. This array is finally summed up.

The traditional2 run was to make sure that the traditional method didn't suffer from poor inlining and that the compiler's inliner was doing its job. This routine uses external temporaries as hand2 did (that is, temporaries are allocated once outside the function and never inside the function). This function calls a routine to subtract the arrays, storing the result in a temporary. Then calls another function to square the elements of this temporary, storing the result in another array. The result is the sum of all these elements. hand2 is basically a hand-inlined version of this function. Most math packages provide an interface like this (BLAS, LAPACK, Intel Math Library, etc).

traditional uses simple routines that mimic what a traditional package might do. It simply executes return sum(square(subtract(x,y))), where square(), subtract(), and sum() are implemented in the old-school fashion: subtract() returns a temporary array containing the difference of its arguments. square() squares this array, returning a new array, and sum() finally sums these up.

Timings are based on the output of gprof on a Pentium III. The compile command was: g++ -DNTPM_INCLUDE_TEMPLATES -I.. -Wall -ggdb -ftemplate-depth-30 -pipe -O3 -fstrict-aliasing -finline-functions -finline-limit-40000 -funroll-loops -frerun-cse-after-loop -frerun-loop-opt -pg -c -o speed.o speed.cc.

3.2 Discussion

The hand-coded loop is twice as fast as Lazy Containers, and 7 times faster than a naive implementation of a traditional package.

Since hand2 and traditional2 do not create temporaries on each call and traditional does, these results indicate that creating of temporaries reduces performance by about a factor of 2. However, writing to temporaries (whether allocated once or always) requires extra cycles, causes cache misses, and forces poor instruction scheduling, which would account for the factor of 3.3 difference between the fast hand-coded loop and the alternatively hand-coded loop.

Lazy Containers suffers form none of the issues raised by the creation of temporaries and runs 3.6 times faster than the naive traditional code. I consider hand2 and traditional2 to be best performance any traditional package could ever provide. Yet Lazy Containers shaves 40% off the execution time of these.

4.0 Implementation

Lazy evaluation system such as this one are typically implemented with runtime laziness in mind. However, in this case, we know that all expressions will be used at runtime, so runtime laziness is unnecessary. The intent behind the laziness in Lazy Containers is to provide hints to the compiler by breaking abstraction barriers through the use of promises. Promises are actually expression templates.

Specifically, the expression a1-a2 returns a promise of type

Promise<binary_op_iterator<Container1::const_iterator,Container2::const_iterator,minus<Container1::value_type>>>

The Promise template class behaves like a standard STL container and defines iterators. Iterators of a Promise maintain iterators to a1 and a2 and return the difference of these dereferenced iterators when themselves dereferenced.

The square function further processes the promise to compute differences. It returns a promise whose iterators square the iterators of the promise passed to it. More details follow.

4.1 How they work

The expression -array returns a promise to compute the negation of every eleemnt of array. The expression returns an object of type Promise<unary_op_iterator<Container::const_iterator, negate<Container::value_type>>. As mentioned before, a promise is a container whose iterators perform the promised computation. Promise<...>::const_iterator for a negation promise defines the dereference operator operator*() as:

   value_type operator*() {
       return -*it;
   }

where it is an iterator into array. The promise iterator for operator- dereferences an iterator into its argument, and returns its negation.

Binary operators are implemented in a similar way. operator*() for the addition promise (the result of calling ar1+ar2) is:

    value_type operator*() {
	return *it1+*it2;
    }

where it1 and it2 are iterators into ar1 and ar2.

Since the Promise template promises operations on any container, in the above examples, array, ar1 and ar2 could all be promises themselves. The type of the expression square(a1-a2) is


typedef
  Promise<binary_op_iterator<Container1::const_iterator,Container2::const_iterator,minus<Container1::value_type>>>
  difference_promise;

Promise<unary_op_iterator<difference_promise::const_iterator,squareop<difference_promise::value_type>>>

This type encodes in it the type of its operands, which are promises. The first statement declares a promise to return the differences between objects contained by containers of type Container1 and Container2. The second line is the type of a promise which applies the elementwise squareop() function to elements from the container difference_promise. Hence a history of promises is built up as more operations are applied to the initial operands. Finally, when it comes time to force a promise, all the type information encoded in the template name results in a sequence of derferencing of iterators, all of which can be inlined together.

5.0 Using Lazy Containers

All lazy container functions are declared in the NTPM or NTPMai namespaces. Before processing any data, the data must first be converted into a promise which returns the data when forced. To create a simple promise, call promise(), passing it a beginning and an end iterator. For example, to turn an STL vector or a C array into a promise, you would call:

vector<float> avec;
process(NTPMai::promise(avec.begin(), avec.end()));

char data[400];
process(NTPMai::promise(data,data+400));

Promises on data are easy to define, but you would normally simply pass them to a function which takes an unspecific template argument. If you really need to name your promises when you create them, you could say:

Container avec;
Promise<Container::const_iterator> prom1 = promise(avec.begin(),
                                                         avec.end());
char data[400];
Promise<char*> prom2 = promise(data,data+400);

As before, Promise's are containers, so you could call prom.begin() to get a lazy iterator into the promise.

The library is still fledgling. So far, it defines the following operations:
NameDescription
promise(container)Promises the elements of the container container.
promise(beg,end)Promises the elements of the container [beg,end).
make_matrix_promise(...)Promises a matrix. These are a special kind of promise which maintain information about how the data should be organized in a matrix. Support for matrix promises is incomplete, so the documentation is intentionally sparse.
constant(Z, [count])Promises to return the value Z. The promise is a lazy container which contains count virtual instances of Z. If count is omitted or -1, the container is of infinite size. Note that only one instance of Z is stored, regardless of the value of count.
counter(min,[max,step])Promises increasing values ranging frommin to maxstep. If max is omitted, returns an infinite sequence. The step size defaults to 1.
cast(Z,a)Promises to cast every element of promise a to an object which has the same type as object Z.
-aPromises the negation of all elements of a.
square(a)Promises the square of all elements of a.
a+bReturns a promise to add every element of promises a and b.
a-bReturns a promise to subtract every element of promises a and b.
a>bPromises a container of booleans. Each of these will be true if the corresponding element in a is greater than the corresponding element in b.
a<bPromises a container of booleans. Each of these will be true if the corresponding element in a is less than the corresponding element in b.
weight(a,b)Promises the product of a and b. This is an elementwise product, not a dot-product. To avoid ambiguity this function is not called *.
multiply_matrix_vector(M,V)Promises the product of matrix promise M and vector promise V. The results promise is a simple promise (see below for how to create matrix promises).
sum, mean, product, cross, dot, normThese routines don't actually return a promise. They are routines which operate on arbitrary STL containers. Since Lazy Containers behave like STL containers, you can use these functions on them.
See arithmetic_iterators.cc for an example of how to use some of these routines.

Grab a targz or browse the directory.