wild hare http://people.csail.mit.edu/jaffer/CNS/interpreter-speed

Interpreter Speed


Origins of SCM

Trying to write and debug the JACAL (Symbolic Math System) over a 1200.Bd modem was trial-by-patience. So in 1990 I acquired a Pixel 100/AP with the goal of running (the Algorithmic Language) Scheme for JACAL development.

I spent two weeks porting MITScheme to the Pixel's System 7 Unix, but after resolving all the C issues, the program was just too large for the Pixel's 4.MB(?) RAM.

Being fluent in Pascal and FORTRAN, I toyed with the idea of writing a small Scheme interpreter. The Pixel had only a C compiler. C's lack of bounds (or other) checking made it practically unusable for someone (me) who frequently codes off-by-one errors.

George J. Carrette's tiny SIOD: Scheme in One Defun compiled and ran easily on the Pixel, but wasn't a full Scheme implementation. So I set about the converting SIOD into the last C program I would ever need to write: a R4RS-compliant Scheme interpreter.

History of Speed

Running on the Pixel's m68000 CPU, SCM's speed varied inversely with the number of instructions executed. The small caches on more advanced platforms of the day looked ominous for interpreters: 1991-05-01 Benchmarking Scheme.

But 1993 saw widespread deployment of 486 processors integrating 8.kB L1 cache. Through a series of experiments on these and other processors, I learned that interpreter speed is inversely proportional to the size of ceval(); and largely insensitive to everything else. That finding, of course, strictly applies only to the i486 and HP-700 platforms I tested. And my theory explaining this behavior may or may not have been correct then; and may or may not be correct now:

Starting with this empirical result, my theory assumes the processor can execute many instructions in the time it takes to read one cache-line.

If the whole program could fit entirely in the cache and the processor were much faster than the cache, then the only limitation on speed of interpretation would be the movement of data through the cache. Of course, there are some paths through the interpreter which are executed more than others; and many subroutines which are executed much less frequently than the interpreter. So the size of the interpreter is more like a probability distribution than a hard number.

But filling the cache is a bottleneck. Data movement will consume as much bandwidth as is available. The smaller the interpreter, the smaller its footprint in the cache, and the less time wasted re-fetching its code.

The size of the interpreter core is not easily determined from source code or even from object-file sizes. Little used paths through ceval() waste cache space or not depending on how they lie with respect to cache-line boundaries. The era of straightforward yet fast interpreter architectures was over.

Speed improvements of a few percent were found (some accidentally) and integrated into SCM over the years. On 486 platforms, SCM was between 10 and 15 times slower than C. Later processors had similar ratios when compiled with the Gnu C compiler. SCM speed tracked the Linux BogoMIPS measure fairly well.

The last big jump in SCM interpreter performance was Radey Shouman's Ecache: Memory Management for Environments, which netted nearly 5% by reducing the environment stack's cache footprint. After 6 years of tuning, opportunities for speed improvements were diminishing. Testing changes to the SCM interpreter required measuring speed differences of 1% in the presence of quantization and other variations many times larger.

While SCM speed has tracked BogoMIPS, pi.c has surged ahead. When compiled with a recently updated Gnu C compiler it runs faster than before. Perhaps the Gnu C compiler is generating better vectorized instructions. Without optimization, pi.c is now 20 times faster than SCM on Pentium II and Pentium III processors; with optimization it is 50 times faster than SCM.

My new benchmark program generates an array of (SLIB) pseudo-random numbers; then calculates their mean and root-mean-squared variance. Pentium II prng-v.c runs 35 times faster than SCM; 53 times faster when compiled "gcc -O".

From the Intel processor specifications it looks like processor speed has grown several times faster than has cache size. This would tend to increase the interpreter penalty because the C versions of the benchmark programs are small enough to be L1 cache resident.

Fast from the Past

There is a venerable computing paradigm whose SIMD (Single Instruction Multiple Data) primitives natively optimize execution on vector processors without sophisticated analysis (of contrived nested iteration constructs):

APL

Instead of iterating a complicated calculation over each element of an array, an APL interpreter can apply primitive calculations to a succession of (temporary) arrays. For programs operating on regular data structures of more than a few elements, the overhead of interpretation will then be negligible. Although this paradigm uses more memory than serial element replacement, the arrays will usually be compact in the cache.

Radey Shouman has implemented some whole-array arithmetic operations in SCM. I hope the full APL set will eventually be supported by SCM and SLIB. APL computation in Scheme would reduce the interpreter penalty to the point that compilation would be unneeded for many programs.

APL, like Scheme, is a functional language. Unlike Scheme, most of its operators are inherently parallel. Incorporating APL primitives would seem the straighforward way to break Scheme's serial straitjacket.

Copyright © 2002 Aubrey Jaffer

I am a guest and not a member of the MIT Computer Science and Artificial Intelligence Laboratory.  My actions and comments do not reflect in any way on MIT.
Computer Natural Science
agj @ alum.mit.edu
Go Figure!