
Benchmarks


A common approach to running Scheme benchmarks involves timing a program calculating Fibonacci numbers or tower-of-Hanoi in several Scheme implementations and comparing the results. These functions are written in an exponentially recursive style which the tester hopes compilers won't optimize:
(define (Hanoi n)
  (letrec ((move-them
            (lambda (n from to helper)
              (if (> n 1)
                  (begin
                    (move-them (- n 1) from helper to)
                    (move-them (- n 1) helper to from))))))
    (move-them n 0 1 2)))
Having no side-effects and an unspecified return value, this Hanoi procedure can be reduced to:
(define (Hanoi n) (if #f #f))
Even if this shortcoming is corrected, why is interpreter performance on exponentially exploding computations interesting? Having to resort to such algorithms very likely means your program will never complete when given other than contrived inputs.

And why are - and > the only arithmetic procedures tested? The usual explanation is that Hanoi tests "interpreter speed" independent of arithmetic routines.

Nearly all production Scheme code uses integers. Interpreter support for fast arithmetic is one of the most fruitful areas to optimize; yet these benchmarks can penalize such an interpreter.

Ratiometric Performance

Assuming we have a useful benchmark task, what does comparing Scheme implementations to each other tell us? I might be the only person running SCM on a certain processor and particular Linux distribution; how do I factor out processor and memory speed differences from times reported for implementations running on entirely different architectures?

The CPU, memory speed and size, cache speed and size, operating-system, and compiler all affect the speed of a Scheme implementation. The only hope of isolating the performance of the Scheme implementation is to hold all the other factors constant; more specifically, calculate the ratio of the Scheme implementation's running time to that of the equivalent program written in C.
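In outline, the Scheme side of such a measurement might look like the following sketch. It assumes a run-time clock such as SCM's get-internal-run-time and internal-time-units-per-second; the benchmark thunk and the separately measured C time are stand-ins, not part of the benchmark suite:
(define (run-time-seconds thunk)
  ;; Time THUNK with the interpreter's run-time clock (assumed here to
  ;; be SCM's get-internal-run-time), returning seconds of CPU time.
  (let ((start (get-internal-run-time)))
    (thunk)
    (/ (- (get-internal-run-time) start)
       internal-time-units-per-second)))
;; The ratio of interest: interpreted run time divided by the run time
;; of the equivalent C program measured on the same machine.
;; BENCHMARK-THUNK and C-SECONDS are hypothetical placeholders.
;; (/ (run-time-seconds benchmark-thunk) c-seconds)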

Scheme is run on platforms spanning an enormous range of speeds, from handheld computers to Crays. We want the C benchmark program to run long enough to amortize its startup time. Considering that interpreters run 15 to 150 times slower than C, a test taking a second on a Cray might take more than a day running interpreted (Scheme) on a PalmPilot.

One could compensate for speed differences by running tests repetitively, but there is the risk that an optimizer will figure that out. And times from repetitive runs can be less informative, as they multiply the effects of an algorithm with a particular input happening to fit, or not fit, into a certain number of registers or cache lines.

Practice

In Benchmarking Scheme (1991-05-01) my approach was to choose a benchmark algorithm with known, scalable (asymptotic) run times. Such a benchmark measures the performance of a program someone might actually want to run; on faster machines it can expose instruction or data-flow bottlenecks which don't occur with smaller datasets.

As I write this, I realize that repetitive generation from a high-quality pseudo-random number generator is also immune to compiler spoofing. Computing the mean and variance of the number stream exercises some additional arithmetic and also scales linearly.
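A linear-time sketch of that idea (not one of the prng-*.scm benchmarks below) might look like this, assuming SLIB's (require 'random) and its (random n) procedure returning an integer in [0, n):
(require 'random)
(define (prng-mean-variance cnt range)
  ;; Draw CNT integers in [0, RANGE) and return their mean and
  ;; variance.  Every draw feeds the accumulators, so the generator
  ;; calls cannot be optimized away, and the loop scales linearly.
  (let loop ((i 0) (sum 0) (sum-sq 0))
    (if (< i cnt)
        (let ((x (random range)))
          (loop (+ i 1) (+ sum x) (+ sum-sq (* x x))))
        (let ((mean (/ sum cnt)))
          (list mean (- (/ sum-sq cnt) (* mean mean)))))))
For a uniform generator, (prng-mean-variance 100000 1000) should return a mean near 500 and a variance near 1000^2/12, roughly 83333.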

Here are Scheme Pseudo-Random Number Generating Benchmarks and their translations to C by schlep. The performance of SCM is discussed in Interpreter Speed.

PRNG Benchmarks
State Structure   Scheme Source   C Translation   SCM/C Time Ratio
Array             prng-a.scm      prng-a.c        69.4 = 16250/234
Bytes             prng-b.scm      prng-b.c        85.6 = 20200/236
Vector            prng-v.scm      prng-v.c        54.8 = 12830/234

Copyright © 2002 Aubrey Jaffer
