) (last modified Jan 3, 2000)
:
|
Thankfully, modern math packages provide convenience routines for most common operations so that loops like these never need to be written. Instead, we can write:
|
which captures the mathematical concept nicely.
But we often pay a cost for such conveniences. Many of these math packages impose a barrier on optimization. For optimal performance, sequences of operations need to be unrolled and flattened, but complex interactions between convenience routines confuse even modern compilers, resulting in suboptimal code.
Here's another example of a convenience routine. The expression
a and b. In the C++ world, we would write:
|
but with a good math package, we could just say:
|
and be done with it. Through expression templates, Lazy Containers
allow programmers to write convenience routines such as
sum(), square() and operator-
without forcing the user to incur a loss of performance.
ssd=sum(square(a1-a2)) is that when naively implemented,
they result in extremely inefficient programs.
In a naively implemented package, the expression a1-a2
returns a new array whose elements are the differences between
elements of a1 and a2. Then
square(a1-a2) squares the elements of this array,
resulting in a new array, which is finally summed over. With this
methodology, every array operation fully computes a temporary
array which is passed on to the next operation.
This not only results in a lot of overhead due to calls to memory
allocation routines, but also causes a horrendous number of cache
misses, takes very poor advantage of modern pipelines, and results in
unnecessary instructions for writing to the temporary arrays.
In contrast, the brute-force, hand-written C++ loop above is very cache efficient, does not produce temporaries, and is easily pipelined and unrolled.
Lazy Containers allows programmers to write simple to understand
expression such as ssd = sum(square(a1-a2)) without
incurring the cost hidden costs of of conceptual elegance.
As is explained later, the trick is to defer operations until their
result is needed, and only then to generate machine code. So instead
of creating and filling in a temporary array with differences of
elements, the expression a1-a2 returns a promise
to compute the differences between elements of the arrays. The
square() function in turn modifies this promise by
squaring the promise to return differences. Finally, the
sum() function forces the promise, resulting in
the sum of the squared differences being computed.
By making promises to compute, each sub-expression only had to return
a small promise object instead of a large array object. No arrays
needed to be allocated or filled in, and the computation happens
efficiently because the compiler can reorder all the operations and
place them in a single loop when sum() requests actual
values. In fact, the promise object itself is never actually created:
optimizers are smart enough to figure out what's going on and optimize
it away.
Here's an example which explains how Lazy Containers are different from traditional math packages. The factorial of a number N is the product of all numbers from 1 to N. A C++ routine for computing the factorial of N is:
|
Using Lazy Containers, we could create a promise to generate all numbers from 1 to N, then multiply all of these numbers together:
|
product() is the same product function we used earlier,
and just as before, we are passing it a promise. But instead of
promising elements of an array in memory, we are promising it numbers
ranging from 1 to N. sequence() doesn't actually return
an array of N elements: it returns an 8 byte entity which is optimized
away at compilation. The resulting machine code looks identical to the
one produced by the loop above.
Here are timings for computing the euclidian distance between two 10000 dimensional vectors. Each version of the computation is run 1000 times. The results are the average number of nanoseconds required for each pass.
| Name | % time | seconds | body ns/call | total ns/call |
| hand | 5.9 | 0.11 | 109890.11 | 109890.11 |
| lazy | 11.8 | 0.22 | 219780.22 | 219780.22 |
| hand2 | 19.8 | 0.37 | 369630.37 | 369630.37 |
| traditional2 | 19.8 | 0.37 | 369630.37 | 369630.37 |
| traditional | 42.8 | 0.54 | 539460.54 | 798941.58 |
The "self ns" column shows the number of nanoseconds it takes to execute each call, excluding calls to auxilliary functions. "total ns" is the running time of the function as is, including other function calls it might make. "% time" is based on "total ns."
The routine hand is a
handcoded loop whose inner statement is ssd +=
square(*a1it-*a2it). It does not create any temporary arrays
and is the most efficient thing I could think of doing.
lazy uses lazy containers. The function simply calls:
return sum(square(x - y)).
hand2
is one attempt at identifying a slowdown in traditional packages. It
uses a temporary array which is created outside the function
(ie, the temporaries are not recreated on each
execution). hand2 first stores the differences between
the two vectors in one temporary array, then squares these and stores
the result in a second temporary array. This array is finally summed
up.
The traditional2
run was to make sure that the traditional method didn't suffer from
poor inlining and that the compiler's inliner was doing its job. This
routine uses external temporaries as hand2 did (that is,
temporaries are allocated once outside the function and never inside
the function). This function calls a routine to subtract the arrays,
storing the result in a temporary. Then calls another function to
square the elements of this temporary, storing the result in another
array. The result is the sum of all these elements. hand2
is basically a hand-inlined version of this function. Most math
packages provide an interface like this (BLAS, LAPACK, Intel Math
Library, etc).
traditional uses simple routines that mimic what a
traditional package might do. It simply executes return
sum(square(subtract(x,y))), where square(),
subtract(), and sum() are implemented in the
old-school fashion: subtract() returns a temporary array
containing the difference of its arguments. square()
squares this array, returning a new array, and sum()
finally sums these up.
Timings are based on the output of gprof on a Pentium III. The compile
command was: g++ -DNTPM_INCLUDE_TEMPLATES -I.. -Wall -ggdb
-ftemplate-depth-30 -pipe -O3 -fstrict-aliasing -finline-functions
-finline-limit-40000 -funroll-loops -frerun-cse-after-loop
-frerun-loop-opt -pg -c -o speed.o speed.cc.
The hand-coded loop is twice as fast as Lazy Containers, and 7 times faster than a naive implementation of a traditional package.
Since hand2 and traditional2 do not create
temporaries on each call and traditional does, these
results indicate that creating of temporaries reduces performance by
about a factor of 2. However, writing to temporaries (whether
allocated once or always) requires extra cycles, causes cache misses,
and forces poor instruction scheduling, which would account for the
factor of 3.3 difference between the fast hand-coded loop and the
alternatively hand-coded loop.
Lazy Containers suffers form none of the issues raised by the creation
of temporaries and runs 3.6 times faster than the naive traditional
code. I consider hand2 and traditional2 to
be best performance any traditional package could ever provide. Yet
Lazy Containers shaves 40% off the execution time of these.
Lazy evaluation system such as this one are typically implemented with runtime laziness in mind. However, in this case, we know that all expressions will be used at runtime, so runtime laziness is unnecessary. The intent behind the laziness in Lazy Containers is to provide hints to the compiler by breaking abstraction barriers through the use of promises. Promises are actually expression templates.
Specifically, the expression a1-a2 returns a promise of type
Promise<binary_op_iterator<Container1::const_iterator,Container2::const_iterator,minus<Container1::value_type>>>
Promise template class behaves like a standard STL
container and defines iterators. Iterators of a Promise
maintain iterators to a1 and a2 and return
the difference of these dereferenced iterators when themselves
dereferenced.
The square function further processes the promise to
compute differences. It returns a promise whose iterators square the
iterators of the promise passed to it. More details follow.
The expression -array returns a promise to compute the
negation of every eleemnt of array. The expression
returns an object of type
Promise<unary_op_iterator<Container::const_iterator,
negate<Container::value_type>>. As mentioned before, a
promise is a container whose iterators perform the promised
computation. Promise<...>::const_iterator for a negation
promise defines the dereference operator operator*() as:
|
where it is an iterator into array. The
promise iterator for operator- dereferences an iterator
into its argument, and returns its negation.
Binary operators are implemented in a similar
way. operator*() for the addition promise (the result of
calling ar1+ar2) is:
|
where it1 and it2 are iterators into
ar1 and ar2.
Since the Promise template promises operations on any
container, in the above examples, array, ar1 and
ar2 could all be promises themselves. The type of the
expression square(a1-a2) is
typedef
Promise<binary_op_iterator<Container1::const_iterator,Container2::const_iterator,minus<Container1::value_type>>>
difference_promise;
Promise<unary_op_iterator<difference_promise::const_iterator,squareop<difference_promise::value_type>>>
This type encodes in it the type of its operands, which are
promises. The first statement declares a promise to return the
differences between objects contained by containers of type
Container1 and Container2. The second line
is the type of a promise which applies the elementwise
squareop() function to elements from the container
difference_promise. Hence a history of promises is built
up as more operations are applied to the initial operands. Finally,
when it comes time to force a promise, all the type information
encoded in the template name results in a sequence of derferencing of
iterators, all of which can be inlined together.
NTPM or
NTPMai namespaces. Before processing any data, the data
must first be converted into a promise which returns the data when
forced. To create a simple promise, call promise(),
passing it a beginning and an end iterator. For example, to turn an
STL vector or a C array into a promise, you would call:
|
Promises on data are easy to define, but you would normally simply pass them to a function which takes an unspecific template argument. If you really need to name your promises when you create them, you could say:
|
As before, Promise's are containers, so you could call
prom.begin() to get a lazy iterator into the promise.
The library is still fledgling. So far, it defines the following operations:
| Name | Description |
| promise(container) | Promises the elements of the container container. |
| promise(beg,end) | Promises the elements of the
container [beg,end). |
| make_matrix_promise(...) | Promises a matrix. These are a special kind of promise which maintain information about how the data should be organized in a matrix. Support for matrix promises is incomplete, so the documentation is intentionally sparse. |
| constant(Z, [count]) | Promises to return the value Z. The promise is a lazy container which contains count virtual instances of Z. If count is omitted or -1, the container is of infinite size. Note that only one instance of Z is stored, regardless of the value of count. |
| counter(min,[max,step]) | Promises increasing values
ranging frommin to maxstep. If max is omitted, returns an infinite
sequence. The step size defaults to 1. |
| cast(Z,a) | Promises to cast every element of promise a to an object which has the same type as object Z. |
| -a | Promises the negation of all elements of a. |
| square(a) | Promises the square of all elements of a. |
| a+b | Returns a promise to add every element of promises a and b. |
| a-b | Returns a promise to subtract every element of promises a and b. |
| a>b | Promises a container of booleans. Each of these will be true if the corresponding element in a is greater than the corresponding element in b. |
| a<b | Promises a container of booleans. Each of these will be true if the corresponding element in a is less than the corresponding element in b. |
| weight(a,b) | Promises the product of
a and b. This is an elementwise
product, not a dot-product. To avoid ambiguity this function is not
called *. |
| multiply_matrix_vector(M,V) | Promises the
product of matrix promise M and vector promise
V. The results promise is a simple promise (see
below for how to create matrix promises). |
| sum, mean, product, cross, dot, norm | These routines don't actually return a promise. They are routines which operate on arbitrary STL containers. Since Lazy Containers behave like STL containers, you can use these functions on them. |