http://people.csail.mit.edu/jaffer/CNS/interpreter-branch | |
Branch Prediction and Interpreter Speed |
To achieve these fast generic arithmetics, branch predictions for the type dispatching code must be set so that speculative fetches are not initiated.
These principles are applied to the SCM Scheme interpreter compiled by GCC, yielding 10% speed improvements on the mix of arithmetic and symbolic operations running the JACAL symbolic mathematics program.
Date: Fri, 21 Oct 2005 19:26:19 -0500 From: Alan Watson . . . Aubrey Jaffer wrote:What you write is correct as far as it goes, but imagine having two number types (say fixnums and flonums) and comparing specific operators for each type with a generic operator. On at least some processors, the generic operator will mispredict the branch for one type or the other. Mispredicted branches are often more painful than L1 cache misses. Specific operators can almost always manage to avoid misdirected branches. |
Results from multiple runs of the suite were not combined because baseline shifts of several percent were observed between suite runs. The final test of each suite compiled and ran the baseline to detect if it shifted during the suite. In the datasets presented here, the durations of the two baseline tests differ less than the RMS deviations.
Without branch-prediction modifications, SCM interpreters providing bignums and/or flonums ran 2% to 3% slower than SCM interpreters providing only immediate-fixnums (see the first entry in each section of Boxed Numerical Datatypes)
The first baseline candidate for integer experiments was:
$ ./build -h system -o tscm -F reckless --compiler-options=-O3 $ size ./tscm text data bss dec hex filename 143562 6632 18512 168706 29302 ./tscm $ (time ./tscm -l pi.scm -e"(pi 2000 4)" > /dev/null) | grep user user 3.019 user 3.036 user 3.018 user 3.046 3.030 +/- 0.012 |
Using the GNU Compiler Collection (GCC) [4] states:
-fno-guess-branch-probability
Sometimes gcc will opt to use a randomized model to guess branch
probabilities, when none are available from either profiling feedback
(-fprofile-arcs
) or __builtin_expect
. This means that
different runs of the compiler on the same program may produce different
object code.
In a hard real-time system, people don't want different runs of the compiler to produce code that has different behavior; minimizing non-determinism is of paramount import. This switch allows users to reduce non-determinism, possibly at the expense of inferior optimization.
-fno-guess-branch-probability
:
$ ./build -h system -o tscm -F reckless --compiler-options=-O3 -fno-guess-branch-probability $ size ./tscm text data bss dec hex filename 140298 6632 18512 165442 28642 ./tscm $ (time ./tscm -l pi.scm -e"(pi 2000 4)" > /dev/null) | grep user user 2.728 user 2.745 user 2.727 user 2.759 2.740 +/- 0.013 |
-fno-guess-branch-probability
are repeatably faster than the baseline on the pi benchmark. A
floating-point benchmark is also faster with
-fno-guess-branch-probability
. So this is a better
baseline.
$ ./build -h system -o tscm -F none --compiler-options=-O3 -fno-guess-branch-probability $ size ./tscm text data bss dec hex filename 154992 6664 18512 180168 2bfc8 ./tscm $ (time ./tscm -l pi.scm -e"(pi 2000 4)" > /dev/null) | grep user user 2.750 user 2.740 user 2.775 user 2.751 2.754 +/- 0.013 |
Replacing
# define ASRTER(_cond, _arg, _pos, _subr) if (!(_cond))wta(_arg, (char *)(_pos), _subr); # define ASRTGO(_cond, _label) if (!(_cond)) goto _label; |
# define ASRTER(_cond, _arg, _pos, _subr) if (__builtin_expect(!(_cond),BRANCH_PREDICT)) wta(_arg, (char *)(_pos), _subr); # define ASRTGO(_cond, _label) if (__builtin_expect(!(_cond),BRANCH_PREDICT)) goto _label; |
$ ./build -h system -o tscm -F none --compiler-options=-O3 -fno-guess-branch-probability --defines=BRANCH_PREDICT 0 $ size ./tscm text data bss dec hex filename 154992 6664 18512 180168 2bfc8 ./tscm $ (time ./tscm -l pi.scm -e"(pi 2000 4)" > /dev/null) | grep user user 2.758 user 2.735 user 2.780 user 2.728 2.750 +/- 0.020 |
But compiling with false predictions:
$ ./build -h system -o tscm -F none --compiler-options=-O3 -fno-guess-branch-probability --defines=BRANCH_PREDICT 1 $ size ./tscm text data bss dec hex filename 154992 6664 18512 180168 2bfc8 ./tscm $ (time ./tscm -l pi.scm -e"(pi 2000 4)" > /dev/null) | grep user user 2.773 user 2.759 user 2.792 user 2.736 2.765 +/- 0.020 |
__builtin_expect
is being ignored,
or the "pi.scm" benchmark is exercising ASRTER and ASRTGO macros only
infrequently.
$ ./build -h system -o tscm -F none --compiler-options=-O3 --defines=BRANCH_PREDICT 0 $ size ./tscm text data bss dec hex filename 159480 6664 18512 184656 2d150 ./tscm $ (time ./tscm -l pi.scm -e"(pi 2000 4)" > /dev/null) | grep user user 3.036 user 2.995 user 3.020 user 3.004 3.014 +/- 0.016 $ ./build -h system -o tscm -F none --compiler-options=-O3 --defines=BRANCH_PREDICT 1 $ size ./tscm text data bss dec hex filename 158592 6664 18512 183768 2cdd8 ./tscm $ (time ./tscm -l pi.scm -e"(pi 2000 4)" > /dev/null) | grep user user 3.040 user 3.089 user 3.045 user 3.056 3.058 +/- 0.019 |
__builtin_expect
is being ignored when
`-fno-guess-branch-probability' is used.
This certainly is not the behavior described by the documentation quoted above. A search of the gcc-bug archives does not find mention of this issue.
Not being able to work from a known state of branch defaults makes it difficult to tune branch predictions.
As will be shown later, compiling with `-fno-guess-branch-probability' yields most of the speed increase possible. The arithmetic code in scm/scl.c is written so that the fixnum-only case is at the end of each function. Thus if branches were always predicted to fail, we would expect fixnum arithmetic speed to be little effected by compiling in bignums and flonums. Always predicting failure would also be the correct action for the ASRTER and ASRTGO macros. It appears to be the case that `gcc -fno-guess-branch-probability' predicts all branches false.
SCM's bignums and flonums are both boxed types (versus immediate fixnums). The data fetch for a predicted bignum will be from the same address as for a flonum; there should be no penalty choosing to predict a bignum versus a flonum, although it may be difficult to convince the compiler of that.
The `IMP' predicate (IMmediate-Predicate) returns true for immediate operands; it is invoked by many other predicates. If GCC propagates branch predictions through logical connectives, then it is precisely the right prediction -- always expect immediates. Expectation is also set for `INUMP' (Immediate-NUMber-Predicate).
#define IMP(x) (6 & (int)(x)) #define INUMP(x) (2 & (int)(x)) |
#define IMP(x) SCM_EXPECT_TRUE(6 & (int)(x)) #define INUMP(x) SCM_EXPECT_TRUE(2 & (int)(x)) |
#define SCM_EXPECT_TRUE(expr) (__builtin_expect(expr, !0)) #define SCM_EXPECT_FALSE(expr) (__builtin_expect(expr, 0)) #define POSFIXABLE(n) SCM_EXPECT_TRUE((n) <= MOST_POSITIVE_FIXNUM) #define NEGFIXABLE(n) SCM_EXPECT_TRUE((n) >= MOST_NEGATIVE_FIXNUM) #define UNEGFIXABLE(n) SCM_EXPECT_TRUE((n) <= -MOST_NEGATIVE_FIXNUM) |
IMP-predictions turned out to be the key; the improvement from the them was larger than that from `-fno-guess-branch-probability'. Reaching performance better than the baseline makes possible the evaluation of incremental improvements resulting from other changes.
Adding SCM_EXPECT_FALSE
to `NIMP'
(Not-IMmediate-Predicate) brings no further improvement; evidence that
predictions are correctly propagated.
The script measures both (pi 2000 4)
and
(pi 2000 16)
. The latter test exercises bignum
arithmetic.
Each configuration of `-F' options is built (and run) three ways:
--compiler-options="-O3" --defines"=__builtin_expect(expr,expected) (expr)"
--compiler-options="-O3 -fno-guess-branch-probability"
--compiler-options="-O3"
$ ./build -h system -o tscm -F none --compiler-options=-O3 --defines=__builtin_expect(expr,expected) (expr) $ size ./tscm text data bss dec hex filename 158772 6664 18512 183948 2ce8c ./tscm $ (time ./tscm -l pi.scm -e"(pi 2000 4)" > /dev/null) | grep user user 3.077 user 3.065 user 3.067 user 3.050 3.065 +/- 0.010 $ ./build -h system -o tscm -F none --compiler-options=-O3 -fno-guess-branch-probability $ size ./tscm text data bss dec hex filename 154992 6664 18512 180168 2bfc8 ./tscm $ (time ./tscm -l pi.scm -e"(pi 2000 4)" > /dev/null) | grep user user 2.848 user 2.843 user 2.825 user 2.821 2.834 +/- 0.011 $ ./build -h system -o tscm -F none --compiler-options=-O3 $ size ./tscm text data bss dec hex filename 159312 6664 18512 184488 2d0a8 ./tscm $ (time ./tscm -l pi.scm -e"(pi 2000 4)" > /dev/null) | grep user user 2.844 user 2.858 user 2.863 user 2.873 2.860 +/- 0.010 |
$ ./build -h system -o tscm -F bignums --compiler-options=-O3 --defines=__builtin_expect(expr,expected) (expr) $ size ./tscm text data bss dec hex filename 172330 6696 18512 197538 303a2 ./tscm $ (time ./tscm -l pi.scm -e"(pi 2000 4)" > /dev/null) | grep user user 3.162 user 3.159 user 3.170 user 3.191 3.170 +/- 0.012 $ (time ./tscm -l pi.scm -e"(pi 2000 16)" > /dev/null) | grep user user 2.006 user 1.971 user 1.996 user 1.989 1.990 +/- 0.013 $ ./build -h system -o tscm -F bignums --compiler-options=-O3 -fno-guess-branch-probability $ size ./tscm text data bss dec hex filename 168122 6696 18512 193330 2f332 ./tscm $ (time ./tscm -l pi.scm -e"(pi 2000 4)" > /dev/null) | grep user user 2.896 user 2.843 user 2.857 user 2.863 2.865 +/- 0.019 $ (time ./tscm -l pi.scm -e"(pi 2000 16)" > /dev/null) | grep user user 1.871 user 1.873 user 1.874 user 1.866 1.871 +/- 0.003 $ ./build -h system -o tscm -F bignums --compiler-options=-O3 $ size ./tscm text data bss dec hex filename 172802 6696 18512 198010 3057a ./tscm $ (time ./tscm -l pi.scm -e"(pi 2000 4)" > /dev/null) | grep user user 2.747 user 2.705 user 2.711 user 2.720 2.721 +/- 0.016 $ (time ./tscm -l pi.scm -e"(pi 2000 16)" > /dev/null) | grep user user 2.419 user 2.383 user 2.408 user 2.387 2.399 +/- 0.015 |
$ ./build -h system -o tscm -F inexact --compiler-options=-O3 --defines=__builtin_expect(expr,expected) (expr) $ size ./tscm text data bss dec hex filename 175848 7196 18544 201588 31374 ./tscm $ (time ./tscm -l pi.scm -e"(pi 2000 4)" > /dev/null) | grep user user 3.156 user 3.150 user 3.145 user 3.142 3.148 +/- 0.005 $ ./build -h system -o tscm -F inexact --compiler-options=-O3 -fno-guess-branch-probability $ size ./tscm text data bss dec hex filename 171672 7196 18544 197412 30324 ./tscm $ (time ./tscm -l pi.scm -e"(pi 2000 4)" > /dev/null) | grep user user 2.751 user 2.717 user 2.716 user 2.715 2.725 +/- 0.015 $ ./build -h system -o tscm -F inexact --compiler-options=-O3 $ size ./tscm text data bss dec hex filename 176456 7196 18544 202196 315d4 ./tscm $ (time ./tscm -l pi.scm -e"(pi 2000 4)" > /dev/null) | grep user user 2.703 user 2.663 user 2.673 user 2.673 2.678 +/- 0.015 |
$ ./build -h system -o tscm -F bignums inexact --compiler-options=-O3 --defines=__builtin_expect(expr,expected) (expr) $ size ./tscm text data bss dec hex filename 193178 7292 18544 219014 35786 ./tscm $ (time ./tscm -l pi.scm -e"(pi 2000 4)" > /dev/null) | grep user user 3.110 user 3.149 user 3.151 user 3.144 3.138 +/- 0.017 $ (time ./tscm -l pi.scm -e"(pi 2000 16)" > /dev/null) | grep user user 2.004 user 1.988 user 1.986 user 1.986 1.991 +/- 0.008 $ ./build -h system -o tscm -F bignums inexact --compiler-options=-O3 -fno-guess-branch-probability $ size ./tscm text data bss dec hex filename 188534 7292 18544 214370 34562 ./tscm $ (time ./tscm -l pi.scm -e"(pi 2000 4)" > /dev/null) | grep user user 2.752 user 2.737 user 2.754 user 2.757 2.750 +/- 0.008 $ (time ./tscm -l pi.scm -e"(pi 2000 16)" > /dev/null) | grep user user 1.935 user 1.880 user 1.867 user 1.885 1.892 +/- 0.026 $ ./build -h system -o tscm -F bignums inexact --compiler-options=-O3 $ size ./tscm text data bss dec hex filename 193690 7292 18544 219526 35986 ./tscm $ (time ./tscm -l pi.scm -e"(pi 2000 4)" > /dev/null) | grep user user 2.994 user 2.995 user 3.009 user 2.999 2.999 +/- 0.006 $ (time ./tscm -l pi.scm -e"(pi 2000 16)" > /dev/null) | grep user user 1.979 user 1.984 user 1.974 user 1.978 1.979 +/- 0.004 |
-fno-guess-branch-probability
is better at 12%.
The bignum-and-fixnum and flonum-and-fixnum builds run faster than the fixnum-only build, establishing that arithmetic operations restricted to fixnums will run no faster than the generic operations.
I ran tests computing forward and backward 8192-point (SLIB) FFTs on random data; then calculating the Euclidean distance between the original and round-trip 8192-element vectors.
$ ./build -h system -o /home/jaffer/scm/tscm -F inexact --compiler-options=-O3 --defines=__builtin_expect(expr,expected) (expr) $ size /home/jaffer/scm/tscm text data bss dec hex filename 175848 7196 18544 201588 31374 /home/jaffer/scm/tscm $ /home/jaffer/scm/tscm -l fft.scm >/dev/null user 6.597 user 6.588 user 6.576 user 6.543 6.576 +/- 0.020 $ ./build -h system -o /home/jaffer/scm/tscm -F inexact --compiler-options=-O3 -fno-guess-branch-probability $ size /home/jaffer/scm/tscm text data bss dec hex filename 171672 7196 18544 197412 30324 /home/jaffer/scm/tscm $ /home/jaffer/scm/tscm -l fft.scm >/dev/null user 5.785 user 5.786 user 5.802 user 5.781 5.788 +/- 0.008 $ ./build -h system -o /home/jaffer/scm/tscm -F inexact --compiler-options=-O3 $ size /home/jaffer/scm/tscm text data bss dec hex filename 176456 7196 18544 202196 315d4 /home/jaffer/scm/tscm $ /home/jaffer/scm/tscm -l fft.scm >/dev/null user 5.674 user 5.714 user 5.738 user 5.679 5.701 +/- 0.026 |
$ ./build -h system -o /home/jaffer/scm/tscm -F bignums inexact --compiler-options=-O3 --defines=__builtin_expect(expr,expected) (expr) $ size /home/jaffer/scm/tscm text data bss dec hex filename 193178 7292 18544 219014 35786 /home/jaffer/scm/tscm $ /home/jaffer/scm/tscm -l fft.scm >/dev/null user 6.766 user 6.750 user 6.747 user 6.671 6.733 +/- 0.037 $ ./build -h system -o /home/jaffer/scm/tscm -F bignums inexact --compiler-options=-O3 -fno-guess-branch-probability $ size /home/jaffer/scm/tscm text data bss dec hex filename 188534 7292 18544 214370 34562 /home/jaffer/scm/tscm $ /home/jaffer/scm/tscm -l fft.scm >/dev/null user 5.708 user 5.656 user 5.710 user 5.698 5.693 +/- 0.022 $ ./build -h system -o /home/jaffer/scm/tscm -F bignums inexact --compiler-options=-O3 $ size /home/jaffer/scm/tscm text data bss dec hex filename 193690 7292 18544 219526 35986 /home/jaffer/scm/tscm $ /home/jaffer/scm/tscm -l fft.scm >/dev/null user 5.851 user 5.895 user 5.824 user 5.869 5.860 +/- 0.026 |
The data fetch for a predicted bignum will be from the same address as for a flonum; there should be no penalty choosing to predict a bignum versus a flonum ...
-F bignums --compiler-options="-O3 -fbranch-probabilities" --defines="__builtin_expect(expr,expected) (expr)"
$ ./build -h system -o /home/jaffer/scm/tscm -F bignums --compiler-options=-O3 --defines=__builtin_expect(expr,expected) (expr) $ size /home/jaffer/scm/tscm text data bss dec hex filename 172330 6696 18512 197538 303a2 /home/jaffer/scm/tscm $ time jacal /home/jaffer/scm/tscm <test.math >/tmp/jacal.output | grep user user 4.297 user 4.314 user 4.295 user 4.302 4.302 +/- 0.007 $ ./build -h system -o /home/jaffer/scm/tscm -F bignums --compiler-options=-O3 -fno-guess-branch-probability $ size /home/jaffer/scm/tscm text data bss dec hex filename 168122 6696 18512 193330 2f332 /home/jaffer/scm/tscm $ time jacal /home/jaffer/scm/tscm <test.math >/tmp/jacal.output | grep user user 3.870 user 3.877 user 3.858 user 3.868 3.868 +/- 0.007 $ ./build -h system -o /home/jaffer/scm/tscm -F bignums --compiler-options=-O3 $ size /home/jaffer/scm/tscm text data bss dec hex filename 172802 6696 18512 198010 3057a /home/jaffer/scm/tscm $ time jacal /home/jaffer/scm/tscm <test.math >/tmp/jacal.output | grep user user 3.796 user 3.742 user 3.792 user 3.783 3.778 +/- 0.021 $ ./build -h system -o /home/jaffer/scm/tscm -F bignums --compiler-options=-O3 -fbranch-probabilities --defines=__builtin_expect(expr,expected) (expr) $ size /home/jaffer/scm/tscm text data bss dec hex filename 171569 6696 18512 196777 300a9 /home/jaffer/scm/tscm $ time jacal /home/jaffer/scm/tscm <test.math >/tmp/jacal.output | grep user user 3.773 user 3.792 user 3.739 user 3.795 3.775 +/- 0.022 |
$ ./build -h system -o /home/jaffer/scm/tscm -F bignums inexact --compiler-options=-O3 --defines=__builtin_expect(expr,expected) (expr) $ size /home/jaffer/scm/tscm text data bss dec hex filename 193178 7292 18544 219014 35786 /home/jaffer/scm/tscm $ time jacal /home/jaffer/scm/tscm <test.math >/tmp/jacal.output | grep user user 4.222 user 4.201 user 4.182 user 4.199 4.201 +/- 0.014 $ ./build -h system -o /home/jaffer/scm/tscm -F bignums inexact --compiler-options=-O3 -fno-guess-branch-probability $ size /home/jaffer/scm/tscm text data bss dec hex filename 188534 7292 18544 214370 34562 /home/jaffer/scm/tscm $ time jacal /home/jaffer/scm/tscm <test.math >/tmp/jacal.output | grep user user 3.781 user 3.795 user 3.794 user 3.802 3.793 +/- 0.008 $ ./build -h system -o /home/jaffer/scm/tscm -F bignums inexact --compiler-options=-O3 $ size /home/jaffer/scm/tscm text data bss dec hex filename 193690 7292 18544 219526 35986 /home/jaffer/scm/tscm $ time jacal /home/jaffer/scm/tscm <test.math >/tmp/jacal.output | grep user user 3.796 user 3.766 user 3.741 user 3.800 3.776 +/- 0.024 $ ./build -h system -o /home/jaffer/scm/tscm -F bignums inexact --compiler-options=-O3 -fbranch-probabilities --defines=__builtin_expect(expr,expected) (expr) $ size /home/jaffer/scm/tscm text data bss dec hex filename 192331 7292 18544 218167 35437 /home/jaffer/scm/tscm $ time jacal /home/jaffer/scm/tscm <test.math >/tmp/jacal.output | grep user user 3.728 user 3.748 user 3.706 user 3.724 3.726 +/- 0.015 |
Branch predictions are definitely worth using, netting speed improvements of 10% on uncontrived programs. The `-fbranch-probabilities' builds are matched to the timed run. They minimize the number of mispredicted branches, but don't take account of the differences in the time impact of individual branches. The IMP-predictions are based, not on the number of missed branches, but on the cost of branches. Thus it may be possible to best the performance of `-fbranch-probabilities'. The `-fno-guess-branch-probability' foiling `__builtin_expect' is an obstacle to progress in this area.
GCC-4.0.0 doesn't suffer from this bug. JACAL runs 1.6% faster with `-fno-guess-branch-probability' and `__builtin_expect' together than with just `__builtin_expect'. Unfortunately, the `-fbranch-probabilities' build gets link errors; so the question remains unanswered.
GCC compiling the SCM interpreter with IMP-predictions results in 10% speed improvement running JACAL through its regression suite. These predictions are being incorporated into SCM, conditioned on `__GNUC__'. The Guile interpreter, which is based on an early version of SCM, might also benefit from IMP-predictions or the use of `-O3 -fno-guess-branch-probability'
GCC compiling the SCM interpreter with `-O3 -fno-guess-branch-probability' results in 12% speed improvement calculating digits of pi using fixnums; the improvement is 5% using bignums. JACAL (using the SCM interpreter) running its regression suite improves 10%.
This flies in the face of GCC's documentation stating "This switch allows users to reduce non-determinism, possibly at the expense of inferior optimization." The `-fno-guess-branch-probability' flag significantly improves performance of the SCM interpreter on numerical and symbolic tasks. This raises the question: which program(s) do the GCC developers evaluate optimization on?
The behavior of experiments is consistent with `gcc -fno-guess-branch-probability' predicting all branches false. In order to allow programmers to use its advanced features, the GCC documentation should be more informative about how branch prediction is determined.
Copyright © 2006 Aubrey Jaffer
I am a guest and not a member of the MIT Computer Science and Artificial Intelligence Laboratory.
My actions and comments do not reflect in any way on
MIT. | ||
Computer Natural Science | ||
agj @ alum.mit.edu | Go Figure! |