
Branch Prediction and Interpreter Speed


Abstract

Modern CPUs can execute many instructions in the time it takes to fetch one cache line. Thus, in a well-written Scheme (or other) interpreter having immediate fixnums and boxed bignums and/or flonums, generic operations on those numbers can run as fast as type-restricted arithmetic operations.

To achieve this fast generic arithmetic, branch predictions for the type-dispatching code must be set so that speculative fetches are not initiated.

These principles are applied to the SCM Scheme interpreter compiled by GCC, yielding a 10% speed improvement on the mix of arithmetic and symbolic operations exercised by the JACAL symbolic mathematics program.

Introduction

Interpreter Speed [1] focuses on program and data movement into the cache as the primary bottleneck for interpreters. In the discussion of Preliminary Proposal for R6RS Arithmetic [2], which considers type-specific duplicates of the arithmetic primitives, Alan Watson points out in Re: arithmetic issues [3] that mispredicted branches have effects beyond the cache:
Date: Fri, 21 Oct 2005 19:26:19 -0500
From: Alan Watson
. . .
Aubrey Jaffer wrote:

In well-written interpreted implementations, fixnum operations would provide no speed increase. SCM's type dispatch for immediate numbers is just 3 instructions: mask, test, and conditional jump. With modern processors executing scores of instructions in the time it takes for a single cache line to be retrieved from memory, most data type dispatches use time which would otherwise be spent waiting for program or data caches.

What you write is correct as far as it goes, but imagine having two number types (say fixnums and flonums) and comparing specific operators for each type with a generic operator. On at least some processors, the generic operator will mispredict the branch for one type or the other. Mispredicted branches are often more painful than L1 cache misses. Specific operators can almost always manage to avoid misdirected branches.
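
For concreteness, here is a minimal sketch of the kind of dispatch under discussion (the tag layout and helper below are illustrative, not SCM's actual source): an immediate fixnum carries its value in the word itself behind a low-bit tag, so the type test compiles to a mask, a test, and a conditional jump, while the boxed case requires a memory fetch.

/* Illustrative tagged-word representation; not SCM's exact encoding. */
typedef long scm_word;

#define IS_IMMEDIATE_FIXNUM(x)  (((x) & 2L) != 0)   /* mask and test ... */
#define FIXNUM_VALUE(x)         ((x) >> 2)          /* untag the immediate value */

long boxed_value(scm_word x);        /* hypothetical bignum/flonum accessor */

long numeric_value(scm_word x)
{
  if (IS_IMMEDIATE_FIXNUM(x))        /* ... and conditional jump */
    return FIXNUM_VALUE(x);          /* immediate: no memory access needed */
  return boxed_value(x);             /* boxed case: fetch the heap cell */
}

long boxed_value(scm_word x)         /* stub so the sketch compiles */
{
  (void)x;
  return 0;
}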

Methodology

To explore these issues I performed a systematic series of experiments. A shell script compiled SCM executables (tscm) and timed each executable computing digits of pi: 2000 digits computed four at a time, and 2000 digits computed sixteen at a time to exercise bignums. The script ran four trials with each base (on those executables supporting it). The CPU ("user") times are listed, followed by their average and RMS deviation.

Results from multiple runs of the suite were not combined because baseline shifts of several percent were observed between suite runs. The final test of each suite compiled and ran the baseline to detect whether it shifted during the suite. In the datasets presented here, the durations of the two baseline tests differ by less than the RMS deviations.
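
For reference, a minimal sketch of how the average and RMS deviation reported under each group of four "user" times are computed (assuming the population form of the RMS deviation, which reproduces the printed figures):

#include <math.h>
#include <stdio.h>

/* Mean and RMS deviation of four trial times, as reported below. */
int main(void)
{
  double t[4] = {3.019, 3.036, 3.018, 3.046};  /* the first baseline run below */
  double sum = 0.0, sumsq = 0.0;
  double mean;
  int i, n = 4;

  for (i = 0; i < n; i++) sum += t[i];
  mean = sum / n;
  for (i = 0; i < n; i++) sumsq += (t[i] - mean) * (t[i] - mean);
  printf("%.3f +/- %.3f\n", mean, sqrt(sumsq / n));  /* prints 3.030 +/- 0.012 */
  return 0;
}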

Baseline

The SCM interpreter always provides immediate-fixnums. With its system of compile-time conditionals, the SCM interpreter can be compiled with any combination of bignum and flonum support (including none).

Without branch-prediction modifications, SCM interpreters providing bignums and/or flonums ran 2% to 3% slower than SCM interpreters providing only immediate-fixnums (see the first entry in each section of Boxed Numerical Datatypes).

The first baseline candidate for integer experiments was:
$ ./build -h system -o tscm -F reckless --compiler-options=-O3
$ size ./tscm
   text	   data	    bss	    dec	    hex	filename
 143562	   6632	  18512	 168706	  29302	./tscm
$ (time ./tscm -l pi.scm -e"(pi 2000 4)" > /dev/null) | grep user
user	3.019
user	3.036
user	3.018
user	3.046
3.030 +/- 0.012
The option `-F reckless' disables nearly all error checking.

Using the GNU Compiler Collection (GCC) [4] states:

-fno-guess-branch-probability
Do not guess branch probabilities using a randomized model.

Sometimes gcc will opt to use a randomized model to guess branch probabilities, when none are available from either profiling feedback (-fprofile-arcs) or __builtin_expect. This means that different runs of the compiler on the same program may produce different object code.

In a hard real-time system, people don't want different runs of the compiler to produce code that has different behavior; minimizing non-determinism is of paramount import. This switch allows users to reduce non-determinism, possibly at the expense of inferior optimization.

Wanting to remove all possible sources of variation, I compiled with -fno-guess-branch-probability:
$ ./build -h system -o tscm -F reckless --compiler-options=-O3 -fno-guess-branch-probability
$ size ./tscm
   text	   data	    bss	    dec	    hex	filename
 140298	   6632	  18512	 165442	  28642	./tscm
$ (time ./tscm -l pi.scm -e"(pi 2000 4)" > /dev/null) | grep user
user	2.728
user	2.745
user	2.727
user	2.759
2.740 +/- 0.013
Executables compiled with -fno-guess-branch-probability are repeatably faster than the baseline on the pi benchmark. A floating-point benchmark is also faster with -fno-guess-branch-probability. So this is a better baseline.

Error Checking

Replacing feature `reckless' by `none' enables widespread error checking in "eval.c", "scl.c", and "subr.c". The executable is larger, but runs at the same speed as the `reckless' build.
$ ./build -h system -o tscm -F none --compiler-options=-O3 -fno-guess-branch-probability
$ size ./tscm
   text	   data	    bss	    dec	    hex	filename
 154992	   6664	  18512	 180168	  2bfc8	./tscm
$ (time ./tscm -l pi.scm -e"(pi 2000 4)" > /dev/null) | grep user
user	2.750
user	2.740
user	2.775
user	2.751
2.754 +/- 0.013
The time difference between the `-F reckless' and `-F none' runs is only 0.014, comparable to the RMS deviation. With the runs this close, is the default branch prediction already correct for the ASRTER and ASRTGO macros enabled by the absence of `-F reckless'?
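
For reference, GCC's __builtin_expect(expr, c) evaluates to expr while telling the compiler that its value is usually c, so the unlikely arm of a conditional can be laid out away from the straight-line path. A minimal illustration with a hypothetical error handler; the experiments below substitute 0 or 1 (via BRANCH_PREDICT) as the expected value of the error test.

#include <stdlib.h>

void report_error(void);                     /* hypothetical error handler */

long checked_car(long *pair)
{
  /* The error test is expected to be false (0), so GCC treats the call
     to report_error as the unlikely, out-of-line path. */
  if (__builtin_expect(pair == NULL, 0))
    report_error();
  return pair[0];
}

void report_error(void) { abort(); }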

Replacing
# define ASRTER(_cond, _arg, _pos, _subr) if (!(_cond))wta(_arg, (char *)(_pos), _subr);
# define ASRTGO(_cond, _label) if (!(_cond)) goto _label;
by
# define ASRTER(_cond, _arg, _pos, _subr) if (__builtin_expect(!(_cond),BRANCH_PREDICT)) wta(_arg, (char *)(_pos), _subr);
# define ASRTGO(_cond, _label) if (__builtin_expect(!(_cond),BRANCH_PREDICT)) goto _label;
in scm/scm.h results in an identical executable size and a time difference less than the RMS deviation:
$ ./build -h system -o tscm -F none --compiler-options=-O3 -fno-guess-branch-probability --defines=BRANCH_PREDICT 0
$ size ./tscm
   text	   data	    bss	    dec	    hex	filename
 154992	   6664	  18512	 180168	  2bfc8	./tscm
$ (time ./tscm -l pi.scm -e"(pi 2000 4)" > /dev/null) | grep user
user	2.758
user	2.735
user	2.780
user	2.728
2.750 +/- 0.020

But compiling with false predictions:
$ ./build -h system -o tscm -F none --compiler-options=-O3 -fno-guess-branch-probability --defines=BRANCH_PREDICT 1
$ size ./tscm
   text	   data	    bss	    dec	    hex	filename
 154992	   6664	  18512	 180168	  2bfc8	./tscm
$ (time ./tscm -l pi.scm -e"(pi 2000 4)" > /dev/null) | grep user
user	2.773
user	2.759
user	2.792
user	2.736
2.765 +/- 0.020
results in the same size executable and essentially the same run time as the 2.750 s measured above. So either __builtin_expect is being ignored, or the "pi.scm" benchmark exercises the ASRTER and ASRTGO macros only infrequently.

$ ./build -h system -o tscm -F none --compiler-options=-O3 --defines=BRANCH_PREDICT 0
$ size ./tscm
   text	   data	    bss	    dec	    hex	filename
 159480	   6664	  18512	 184656	  2d150	./tscm
$ (time ./tscm -l pi.scm -e"(pi 2000 4)" > /dev/null) | grep user
user	3.036
user	2.995
user	3.020
user	3.004
3.014 +/- 0.016

$ ./build -h system -o tscm -F none --compiler-options=-O3 --defines=BRANCH_PREDICT 1
$ size ./tscm
   text	   data	    bss	    dec	    hex	filename
 158592	   6664	  18512	 183768	  2cdd8	./tscm
$ (time ./tscm -l pi.scm -e"(pi 2000 4)" > /dev/null) | grep user
user	3.040
user	3.089
user	3.045
user	3.056
3.058 +/- 0.019
Removing the `-fno-guess-branch-probability' compiler option shows that `BRANCH_PREDICT' now makes a difference in both executable size and speed. Comparing the assembly-language files generated by `gcc -S' confirms that __builtin_expect is being ignored when `-fno-guess-branch-probability' is used.

This certainly is not the behavior described by the documentation quoted above. A search of the gcc-bug archives does not find mention of this issue.

Not being able to work from a known state of branch defaults makes it difficult to tune branch predictions.

As will be shown later, compiling with `-fno-guess-branch-probability' yields most of the possible speed increase. The arithmetic code in scm/scl.c is written so that the fixnum-only case is at the end of each function. Thus, if branches were always predicted to fail, we would expect fixnum arithmetic speed to be little affected by compiling in bignums and flonums. Always predicting failure would also be the correct action for the ASRTER and ASRTGO macros. It appears that `gcc -fno-guess-branch-probability' predicts all branches false.
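
A sketch of that layout (illustrative encoding and names, not the actual scl.c code): the non-fixnum checks are forward branches that a predict-everything-false policy assumes are not taken, so control falls through to the fixnum-only code at the end of the function.

typedef long SCM;

#define INUMP(x)   (2 & (int)(x))            /* immediate-number predicate */
#define NINUMP(x)  (!INUMP(x))               /* not an immediate number */

SCM sum_boxed(SCM x, SCM y);                 /* hypothetical bignum/flonum path */

SCM sum_sketch(SCM x, SCM y)
{
  if (NINUMP(x)) return sum_boxed(x, y);     /* predicted false: x is a fixnum */
  if (NINUMP(y)) return sum_boxed(x, y);     /* predicted false: y is a fixnum */
  {                                          /* fixnum-only case at the end */
    long n = (x >> 2) + (y >> 2);            /* illustrative untag, add ...   */
    return (SCM)((n << 2) | 2);              /* ... and re-tag (no overflow check) */
  }
}

SCM sum_boxed(SCM x, SCM y) { (void)x; (void)y; return 0; }  /* stub */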

Boxed Numerical Datatypes

With correct branch predictions, compiling in bignum or flonum support should not appreciably slow the fixnum benchmark. For boxed bignums and flonums, the time to retrieve the cache-line containing the data will dominate those operations (in the aggregate).

SCM's bignums and flonums are both boxed types (versus immediate fixnums). The data fetch for a predicted bignum will be from the same address as for a flonum; there should be no penalty choosing to predict a bignum versus a flonum, although it may be difficult to convince the compiler of that.

The `IMP' predicate (IMmediate-Predicate) returns true for immediate operands; it is invoked by many other predicates. If GCC propagates branch predictions through logical connectives, then hinting IMP sets precisely the right prediction -- always expect immediates. The expectation is also set for `INUMP' (Immediate-NUMber-Predicate).
#define IMP(x) (6 & (int)(x))
#define INUMP(x) (2 & (int)(x))
in scm/scm.h change to:
#define IMP(x) SCM_EXPECT_TRUE(6 & (int)(x))
#define INUMP(x) SCM_EXPECT_TRUE(2 & (int)(x))
This patch uses the final form of the branch hints: `SCM_EXPECT_TRUE' and `SCM_EXPECT_FALSE'. Also shown are some other branch-hinting macros.
#define SCM_EXPECT_TRUE(expr) (__builtin_expect(expr, !0))            /* hint: usually true */
#define SCM_EXPECT_FALSE(expr) (__builtin_expect(expr, 0))            /* hint: usually false */
#define POSFIXABLE(n) SCM_EXPECT_TRUE((n) <= MOST_POSITIVE_FIXNUM)    /* fits in a positive fixnum */
#define NEGFIXABLE(n) SCM_EXPECT_TRUE((n) >= MOST_NEGATIVE_FIXNUM)    /* fits in a negative fixnum */
#define UNEGFIXABLE(n) SCM_EXPECT_TRUE((n) <= -MOST_NEGATIVE_FIXNUM)  /* magnitude fits when negated */
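
A short sketch of how the hinted predicates are meant to compose (assuming, as discussed above, that GCC propagates the expectation through logical connectives; the derived names follow SCM's style, but the example itself is hypothetical):

#define SCM_EXPECT_TRUE(expr)  (__builtin_expect(expr, !0))

#define IMP(x)   SCM_EXPECT_TRUE(6 & (int)(x))   /* expect an immediate operand */
#define NIMP(x)  (!IMP(x))                       /* negation inherits the prediction */

long boxed_tag(long x);                          /* hypothetical heap-cell reader */

/* A compound predicate: the heap access on the right of && is reached
   only when the (predicted-true) immediate test fails, so NIMP needs
   no hint of its own. */
int is_boxed_number(long x)
{
  return NIMP(x) && boxed_tag(x) != 0;
}

long boxed_tag(long x) { (void)x; return 0; }    /* stub so the sketch compiles */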

IMP-predictions turned out to be the key; the improvement from them was larger than that from `-fno-guess-branch-probability'. Reaching performance better than the baseline makes it possible to evaluate the incremental improvements from other changes.

Adding SCM_EXPECT_FALSE to `NIMP' (Not-IMmediate-Predicate) brings no further improvement; evidence that predictions are correctly propagated.

The script measures both (pi 2000 4) and (pi 2000 16). The latter test exercises bignum arithmetic.

Each configuration of `-F' options is built (and run) three ways:

--compiler-options="-O3" --defines"=__builtin_expect(expr,expected) (expr)"
Disables the branch prediction macro. Branches predictions are made "randomly" by GCC.
--compiler-options="-O3 -fno-guess-branch-probability"
GCC does not guess branch probabilities using a randomized model; also disables branch prediction macro.
--compiler-options="-O3"
Makes the branch predictions listed above. Unconstrained branches predictions are made "randomly" by GCC.

Fixnums

$ ./build -h system -o tscm -F none --compiler-options=-O3 --defines=__builtin_expect(expr,expected) (expr)
$ size ./tscm
   text	   data	    bss	    dec	    hex	filename
 158772	   6664	  18512	 183948	  2ce8c	./tscm
$ (time ./tscm -l pi.scm -e"(pi 2000 4)" > /dev/null) | grep user
user	3.077
user	3.065
user	3.067
user	3.050
3.065 +/- 0.010

$ ./build -h system -o tscm -F none --compiler-options=-O3 -fno-guess-branch-probability
$ size ./tscm
   text	   data	    bss	    dec	    hex	filename
 154992	   6664	  18512	 180168	  2bfc8	./tscm
$ (time ./tscm -l pi.scm -e"(pi 2000 4)" > /dev/null) | grep user
user	2.848
user	2.843
user	2.825
user	2.821
2.834 +/- 0.011

$ ./build -h system -o tscm -F none --compiler-options=-O3
$ size ./tscm
   text	   data	    bss	    dec	    hex	filename
 159312	   6664	  18512	 184488	  2d0a8	./tscm
$ (time ./tscm -l pi.scm -e"(pi 2000 4)" > /dev/null) | grep user
user	2.844
user	2.858
user	2.863
user	2.873
2.860 +/- 0.010
A 7% speed improvement.

Bignums and Fixnums

$ ./build -h system -o tscm -F bignums --compiler-options=-O3 --defines=__builtin_expect(expr,expected) (expr)
$ size ./tscm
   text	   data	    bss	    dec	    hex	filename
 172330	   6696	  18512	 197538	  303a2	./tscm
$ (time ./tscm -l pi.scm -e"(pi 2000 4)" > /dev/null) | grep user
user	3.162
user	3.159
user	3.170
user	3.191
3.170 +/- 0.012

$ (time ./tscm -l pi.scm -e"(pi 2000 16)" > /dev/null) | grep user
user	2.006
user	1.971
user	1.996
user	1.989
1.990 +/- 0.013

$ ./build -h system -o tscm -F bignums --compiler-options=-O3 -fno-guess-branch-probability
$ size ./tscm
   text	   data	    bss	    dec	    hex	filename
 168122	   6696	  18512	 193330	  2f332	./tscm
$ (time ./tscm -l pi.scm -e"(pi 2000 4)" > /dev/null) | grep user
user	2.896
user	2.843
user	2.857
user	2.863
2.865 +/- 0.019

$ (time ./tscm -l pi.scm -e"(pi 2000 16)" > /dev/null) | grep user
user	1.871
user	1.873
user	1.874
user	1.866
1.871 +/- 0.003

$ ./build -h system -o tscm -F bignums --compiler-options=-O3
$ size ./tscm
   text	   data	    bss	    dec	    hex	filename
 172802	   6696	  18512	 198010	  3057a	./tscm
$ (time ./tscm -l pi.scm -e"(pi 2000 4)" > /dev/null) | grep user
user	2.747
user	2.705
user	2.711
user	2.720
2.721 +/- 0.016

$ (time ./tscm -l pi.scm -e"(pi 2000 16)" > /dev/null) | grep user
user	2.419
user	2.383
user	2.408
user	2.387
2.399 +/- 0.015
Fixnum speed improvements of 9% and 14%.

Flonums and Fixnums

$ ./build -h system -o tscm -F inexact --compiler-options=-O3 --defines=__builtin_expect(expr,expected) (expr)
$ size ./tscm
   text	   data	    bss	    dec	    hex	filename
 175848	   7196	  18544	 201588	  31374	./tscm
$ (time ./tscm -l pi.scm -e"(pi 2000 4)" > /dev/null) | grep user
user	3.156
user	3.150
user	3.145
user	3.142
3.148 +/- 0.005

$ ./build -h system -o tscm -F inexact --compiler-options=-O3 -fno-guess-branch-probability
$ size ./tscm
   text	   data	    bss	    dec	    hex	filename
 171672	   7196	  18544	 197412	  30324	./tscm
$ (time ./tscm -l pi.scm -e"(pi 2000 4)" > /dev/null) | grep user
user	2.751
user	2.717
user	2.716
user	2.715
2.725 +/- 0.015

$ ./build -h system -o tscm -F inexact --compiler-options=-O3
$ size ./tscm
   text	   data	    bss	    dec	    hex	filename
 176456	   7196	  18544	 202196	  315d4	./tscm
$ (time ./tscm -l pi.scm -e"(pi 2000 4)" > /dev/null) | grep user
user	2.703
user	2.663
user	2.673
user	2.673
2.678 +/- 0.015
Speed improvements of 13% and 15%.

Flonums, Bignums, and Fixnums

$ ./build -h system -o tscm -F bignums inexact --compiler-options=-O3 --defines=__builtin_expect(expr,expected) (expr)
$ size ./tscm
   text	   data	    bss	    dec	    hex	filename
 193178	   7292	  18544	 219014	  35786	./tscm
$ (time ./tscm -l pi.scm -e"(pi 2000 4)" > /dev/null) | grep user
user	3.110
user	3.149
user	3.151
user	3.144
3.138 +/- 0.017

$ (time ./tscm -l pi.scm -e"(pi 2000 16)" > /dev/null) | grep user
user	2.004
user	1.988
user	1.986
user	1.986
1.991 +/- 0.008

$ ./build -h system -o tscm -F bignums inexact --compiler-options=-O3 -fno-guess-branch-probability
$ size ./tscm
   text	   data	    bss	    dec	    hex	filename
 188534	   7292	  18544	 214370	  34562	./tscm
$ (time ./tscm -l pi.scm -e"(pi 2000 4)" > /dev/null) | grep user
user	2.752
user	2.737
user	2.754
user	2.757
2.750 +/- 0.008

$ (time ./tscm -l pi.scm -e"(pi 2000 16)" > /dev/null) | grep user
user	1.935
user	1.880
user	1.867
user	1.885
1.892 +/- 0.026

$ ./build -h system -o tscm -F bignums inexact --compiler-options=-O3
$ size ./tscm
   text	   data	    bss	    dec	    hex	filename
 193690	   7292	  18544	 219526	  35986	./tscm
$ (time ./tscm -l pi.scm -e"(pi 2000 4)" > /dev/null) | grep user
user	2.994
user	2.995
user	3.009
user	2.999
2.999 +/- 0.006

$ (time ./tscm -l pi.scm -e"(pi 2000 16)" > /dev/null) | grep user
user	1.979
user	1.984
user	1.974
user	1.978
1.979 +/- 0.004
IMP-predictions improve fixnum speed by 4%; the -fno-guess-branch-probability build does better, at 12%.

The bignum-and-fixnum and flonum-and-fixnum builds run faster than the fixnum-only build, establishing that arithmetic operations restricted to fixnums will run no faster than the generic operations.

Flonums Versus Bignums

Although fixnums should be predicted over bignums and flonums, does the choice between bignums and flonums mean that an interpreter supporting both will be suboptimal for one of them?

I ran tests computing forward and backward 8192-point (SLIB) FFTs on random data, then calculating the Euclidean distance between the original and round-trip 8192-element vectors.

Flonums and Fixnums

$ ./build -h system -o /home/jaffer/scm/tscm -F inexact --compiler-options=-O3 --defines=__builtin_expect(expr,expected) (expr)
$ size /home/jaffer/scm/tscm
   text	   data	    bss	    dec	    hex	filename
 175848	   7196	  18544	 201588	  31374	/home/jaffer/scm/tscm
$ /home/jaffer/scm/tscm -l fft.scm >/dev/null
user	6.597
user	6.588
user	6.576
user	6.543
6.576 +/- 0.020

$ ./build -h system -o /home/jaffer/scm/tscm -F inexact --compiler-options=-O3 -fno-guess-branch-probability
$ size /home/jaffer/scm/tscm
   text	   data	    bss	    dec	    hex	filename
 171672	   7196	  18544	 197412	  30324	/home/jaffer/scm/tscm
$ /home/jaffer/scm/tscm -l fft.scm >/dev/null
user	5.785
user	5.786
user	5.802
user	5.781
5.788 +/- 0.008

$ ./build -h system -o /home/jaffer/scm/tscm -F inexact --compiler-options=-O3
$ size /home/jaffer/scm/tscm
   text	   data	    bss	    dec	    hex	filename
 176456	   7196	  18544	 202196	  315d4	/home/jaffer/scm/tscm
$ /home/jaffer/scm/tscm -l fft.scm >/dev/null
user	5.674
user	5.714
user	5.738
user	5.679
5.701 +/- 0.026

Flonums, Bignums, and Fixnums

$ ./build -h system -o /home/jaffer/scm/tscm -F bignums inexact --compiler-options=-O3 --defines=__builtin_expect(expr,expected) (expr)
$ size /home/jaffer/scm/tscm
   text	   data	    bss	    dec	    hex	filename
 193178	   7292	  18544	 219014	  35786	/home/jaffer/scm/tscm
$ /home/jaffer/scm/tscm -l fft.scm >/dev/null
user	6.766
user	6.750
user	6.747
user	6.671
6.733 +/- 0.037

$ ./build -h system -o /home/jaffer/scm/tscm -F bignums inexact --compiler-options=-O3 -fno-guess-branch-probability
$ size /home/jaffer/scm/tscm
   text	   data	    bss	    dec	    hex	filename
 188534	   7292	  18544	 214370	  34562	/home/jaffer/scm/tscm
$ /home/jaffer/scm/tscm -l fft.scm >/dev/null
user	5.708
user	5.656
user	5.710
user	5.698
5.693 +/- 0.022

$ ./build -h system -o /home/jaffer/scm/tscm -F bignums inexact --compiler-options=-O3
$ size /home/jaffer/scm/tscm
   text	   data	    bss	    dec	    hex	filename
 193690	   7292	  18544	 219526	  35986	/home/jaffer/scm/tscm
$ /home/jaffer/scm/tscm -l fft.scm >/dev/null
user	5.851
user	5.895
user	5.824
user	5.869
5.860 +/- 0.026

For bignum or flonum calculations with all numeric support compiled in, the -fno-guess-branch-probability build is the fastest, and it runs no slower (within the RMS deviation) than builds with only bignums or only flonums compiled in. Thus it appears that the cost of mispredicted branches is much less than the cost of mispredicted fetches on Pentium 4 CPUs. This supports the assertion in "Boxed Numerical Datatypes":
The data fetch for a predicted bignum will be from the same address as for a flonum; there should be no penalty choosing to predict a bignum versus a flonum ...

Real Programs

Here are the times for the SCM interpreter running JACAL through its regression suite. In addition to the three configurations tested before, a fourth is added:
-F bignums --compiler-options="-O3 -fbranch-probabilities" --defines="__builtin_expect(expr,expected) (expr)"
Disables the branch prediction macro. Branch predictions are made from profiling a JACAL regression run.

Bignums and Fixnums

$ ./build -h system -o /home/jaffer/scm/tscm -F bignums --compiler-options=-O3 --defines=__builtin_expect(expr,expected) (expr)
$ size /home/jaffer/scm/tscm
   text	   data	    bss	    dec	    hex	filename
 172330	   6696	  18512	 197538	  303a2	/home/jaffer/scm/tscm
$ time jacal /home/jaffer/scm/tscm <test.math >/tmp/jacal.output | grep user
user	4.297
user	4.314
user	4.295
user	4.302
4.302 +/- 0.007

$ ./build -h system -o /home/jaffer/scm/tscm -F bignums --compiler-options=-O3 -fno-guess-branch-probability
$ size /home/jaffer/scm/tscm
   text	   data	    bss	    dec	    hex	filename
 168122	   6696	  18512	 193330	  2f332	/home/jaffer/scm/tscm
$ time jacal /home/jaffer/scm/tscm <test.math >/tmp/jacal.output | grep user
user	3.870
user	3.877
user	3.858
user	3.868
3.868 +/- 0.007

$ ./build -h system -o /home/jaffer/scm/tscm -F bignums --compiler-options=-O3
$ size /home/jaffer/scm/tscm
   text	   data	    bss	    dec	    hex	filename
 172802	   6696	  18512	 198010	  3057a	/home/jaffer/scm/tscm
$ time jacal /home/jaffer/scm/tscm <test.math >/tmp/jacal.output | grep user
user	3.796
user	3.742
user	3.792
user	3.783
3.778 +/- 0.021

$ ./build -h system -o /home/jaffer/scm/tscm -F bignums --compiler-options=-O3 -fbranch-probabilities --defines=__builtin_expect(expr,expected) (expr)
$ size /home/jaffer/scm/tscm
   text	   data	    bss	    dec	    hex	filename
 171569	   6696	  18512	 196777	  300a9	/home/jaffer/scm/tscm
$ time jacal /home/jaffer/scm/tscm <test.math >/tmp/jacal.output | grep user
user	3.773
user	3.792
user	3.739
user	3.795
3.775 +/- 0.022
Speed improvements of 10%, 12%, and 12%.

Flonums, Bignums, and Fixnums

$ ./build -h system -o /home/jaffer/scm/tscm -F bignums inexact --compiler-options=-O3 --defines=__builtin_expect(expr,expected) (expr)
$ size /home/jaffer/scm/tscm
   text	   data	    bss	    dec	    hex	filename
 193178	   7292	  18544	 219014	  35786	/home/jaffer/scm/tscm
$ time jacal /home/jaffer/scm/tscm <test.math >/tmp/jacal.output | grep user
user	4.222
user	4.201
user	4.182
user	4.199
4.201 +/- 0.014

$ ./build -h system -o /home/jaffer/scm/tscm -F bignums inexact --compiler-options=-O3 -fno-guess-branch-probability
$ size /home/jaffer/scm/tscm
   text	   data	    bss	    dec	    hex	filename
 188534	   7292	  18544	 214370	  34562	/home/jaffer/scm/tscm
$ time jacal /home/jaffer/scm/tscm <test.math >/tmp/jacal.output | grep user
user	3.781
user	3.795
user	3.794
user	3.802
3.793 +/- 0.008

$ ./build -h system -o /home/jaffer/scm/tscm -F bignums inexact --compiler-options=-O3
$ size /home/jaffer/scm/tscm
   text	   data	    bss	    dec	    hex	filename
 193690	   7292	  18544	 219526	  35986	/home/jaffer/scm/tscm
$ time jacal /home/jaffer/scm/tscm <test.math >/tmp/jacal.output | grep user
user	3.796
user	3.766
user	3.741
user	3.800
3.776 +/- 0.024

$ ./build -h system -o /home/jaffer/scm/tscm -F bignums inexact --compiler-options=-O3 -fbranch-probabilities --defines=__builtin_expect(expr,expected) (expr)
$ size /home/jaffer/scm/tscm
   text	   data	    bss	    dec	    hex	filename
 192331	   7292	  18544	 218167	  35437	/home/jaffer/scm/tscm
$ time jacal /home/jaffer/scm/tscm <test.math >/tmp/jacal.output | grep user
user	3.728
user	3.748
user	3.706
user	3.724
3.726 +/- 0.015
Speed improvements of 9%, 10%, and 11%.

Branch predictions are definitely worth using, netting speed improvements of 10% on uncontrived programs. The `-fbranch-probabilities' builds are matched to the timed run. They minimize the number of mispredicted branches, but don't take into account the differing time impact of individual branches. The IMP-predictions are based not on the number of missed branches, but on the cost of branches. Thus it may be possible to best the performance of `-fbranch-probabilities'. That `-fno-guess-branch-probability' foils `__builtin_expect' is an obstacle to progress in this area.

GCC-4.0.0 doesn't suffer from this bug. JACAL runs 1.6% faster with `-fno-guess-branch-probability' and `__builtin_expect' together than with just `__builtin_expect'. Unfortunately, the `-fbranch-probabilities' build gets link errors; so the question remains unanswered.

Conclusion

For an interpreter, using branch prediction to prevent speculative fetches can make generic arithmetic operations as fast as type-restricted ones.

GCC compiling the SCM interpreter with IMP-predictions results in a 10% speed improvement running JACAL through its regression suite. These predictions are being incorporated into SCM, conditioned on `__GNUC__'. The Guile interpreter, which is based on an early version of SCM, might also benefit from IMP-predictions or from the use of `-O3 -fno-guess-branch-probability'.

GCC compiling the SCM interpreter with `-O3 -fno-guess-branch-probability' results in a 12% speed improvement calculating digits of pi using fixnums; the improvement is 5% using bignums. JACAL (using the SCM interpreter) running its regression suite improves by 10%.

This flies in the face of GCC's documentation stating "This switch allows users to reduce non-determinism, possibly at the expense of inferior optimization." The `-fno-guess-branch-probability' flag significantly improves performance of the SCM interpreter on numerical and symbolic tasks. This raises the question: which program(s) do the GCC developers evaluate optimization on?

The behavior observed in these experiments is consistent with `gcc -fno-guess-branch-probability' predicting all branches false. To allow programmers to use its advanced features, the GCC documentation should be more informative about how branch prediction is determined.

Bibliography

[1]
Jaffer, Aubrey,
Interpreter Speed,
http://people.csail.mit.edu/jaffer/CNS/interpreter-speed,
2002.
[2]
Clinger, William D and Sperber, Michael,
Preliminary Proposal for R6RS Arithmetic,
http://srfi.schemers.org/srfi-77/srfi-77.html,
srfi.schemers.org, 2005.
[3]
Watson, Alan,
Re: arithmetic issues,
http://srfi.schemers.org/srfi-77/mail-archive/msg00091.html,
srfi.schemers.org, 2005.
[4]
Free Software Foundation,
Using the GNU Compiler Collection (GCC),
http://gcc.gnu.org/onlinedocs/gcc-3.2.3/gcc/Optimize-Options.html,
Free Software Foundation, Boston, MA, 2002.

Copyright © 2006 Aubrey Jaffer
