Relating Search Depth to the Strength of Chess Programs
To the best of our knowledge, the earliest attempts at modeling the
relationship between the playing strength of chess programs on one hand
and the available computing power or search depth on the other were
reported independently by Gillogly and Newborn in 1978. Gillogly
introduced his so-called ``technology curve'' [75] that plotted playing
strength against what he called ``machine power'' on a logarithmic
scale. Newborn related the number of nodes searched by different chess
programs in three minutes (the average time per move in tournament
games) to the playing strengths of those same programs as derived from
their actual tournament performances [164,165]. Later on,
Levy [138] and Levy and Newborn [131] refined Newborn's initial scheme
by contrasting the highest-rated tournament performances of the best
chess programs with the years of their achievement. All these
comparisons inevitably led to speculative extrapolations of the graphs,
an exercise which Levy aptly characterized as the ``meta-science of
prediction in computer chess'' in his most recent article on the
subject in 1997 [130].
Self-play matches, as pioneered by Thompson with his chess machine
BELLE in 1982 [206], represent a more rigorous method of investigating
the relationship between available computing power or search depth and
the playing strength of chess programs. A great advantage of such
matches is that the observed winning percentages quantify the
differences in playing strength between the various participating
versions of the same program. Despite lingering doubts and questions
regarding the magnitude of self-play rating differences [28], many
researchers felt and still feel that self-play is the best of the
available approaches for assessing the expected ``diminishing returns''
of additional search in chess. In self-play matches, the presence of
diminishing returns should manifest itself in notably reduced winning
percentages of the deeper or longer searching program versions as
search depths and search times increase.
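Because the discussion below constantly converts between winning
percentages and rating points, it helps to recall the standard logistic
Elo relation that underlies all the cited experiments; the formula is
standard and not specific to any one of the cited works. An expected
score $p$ against an opponent corresponds to a rating difference $D$ via

\[
  p \;=\; \frac{1}{1 + 10^{-D/400}}
  \qquad\Longleftrightarrow\qquad
  D \;=\; 400 \, \log_{10} \frac{p}{1-p} .
\]

A 70% score thus corresponds to a difference of roughly 147 rating
points, and a 60% score to roughly 70 points.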
Up to now, nobody has proved the existence of diminishing returns for
self-play in computer chess by means of statistically significant
experiments. For matches of only 20 games, even high winning
percentages of 70%-75% barely suffice to decide with good confidence
that the victor is indeed stronger than its opponent in general.
Consequently, such small sample sizes do not allow for any confident
quantification of the absolute rating differences between the
opponents. In order to assess rating differences of 70 points
confidently, we need about 500 games per match, and for differences of
40-50 rating points the number of necessary games per match increases
to 1000. Hence, we completely agree with Mysliwietz [158], who already
criticized the statistical uncertainty of many famous self-play
experiments in computer chess back in 1994. Based on these
explanations, it is now easy to understand why we label many
experimental results as ``not statistically significant'' in the
descriptions of all prominent self-play experiments in computer chess
below.
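To make these sample-size requirements concrete, the following minimal
Python sketch of our own combines the normal approximation to the
binomial distribution with the Elo relation above to turn an observed
match score into an approximate 95% confidence interval in rating
points. The helper names (elo_diff, score_interval, report) are ours
and do not stem from any of the cited experiments.

import math

def elo_diff(p: float) -> float:
    """Elo rating difference implied by an expected score p (0 < p < 1)."""
    return 400.0 * math.log10(p / (1.0 - p))

def score_interval(wins: float, games: int, z: float = 1.96):
    """Approximate 95% confidence interval for the true match score,
    using the normal approximation to the binomial distribution
    (draws count as half a win)."""
    p = wins / games
    se = math.sqrt(p * (1.0 - p) / games)
    # Clamp away from 0 and 1 so that elo_diff() stays defined.
    return max(p - z * se, 1e-9), min(p + z * se, 1.0 - 1e-9)

def report(wins: float, games: int) -> None:
    lo, hi = score_interval(wins, games)
    print(f"{games:5d} games, score {wins / games:.0%}: "
          f"Elo difference between {elo_diff(lo):+6.1f} and {elo_diff(hi):+6.1f}")

# A 70% score over 20 games barely separates the two versions at all ...
report(14, 20)
# ... whereas the same score over 500 or 1000 games pins it down far better.
report(350, 500)
report(700, 1000)

Running the sketch shows that a 70% score over 20 games is compatible
with rating differences from roughly 0 to almost 400 points, whereas
the same score over 500 or 1000 games narrows the interval to a few
dozen points, in line with the sample sizes quoted above.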
- 1982: Thompson [206].
- Thompson's pioneering experiment featured 100 self-play games with
matches of 20 games each between versions of BELLE differing by
exactly one ply in lookahead for fixed search depths of 3-8
plies. The gain in playing strength averaged 246 rating points per
ply of search. The experiment showed no diminishing returns at any
search depth, but its results are not statistically significant in any
case.
- 1983: Condon and Thompson [57].
- In the second experiment, Condon and Thompson let BELLE self-play
300 games in round-robin style with matches of 20 games each between all
program versions for fixed search depths of 3-9 plies. The gain
in playing strength averaged 217 rating points per ply of search. This
time the observed ratings hinted at limited diminishing returns from a
fixed search depth of 6 plies onwards. Yet the results of the
experiment are not statistically significant.
- 1988: Szabo and Szabo [198].
- The two Szabos determined the technology curve of their chess program
TECHMATE, which self-played 6882 games on two Atari ST computers
linked together via their MIDI ports. The number of games per match
between longer and shorter searching versions of the program varied
widely, from a minimum of 32 to a maximum of 1367. The gain in playing
strength averaged 156 rating points per doubling of available search
time (computing power). The experimental data indicated slight
diminishing returns at longer search times. Several results from the
experiment are statistically significant. Unfortunately, however, this
does not hold for the results of the most interesting program versions
with the longest search times: they simply did not play enough games to
allow reliable conclusions.
- 1990: Berliner et al. [28].
- The HITECH team made their chess machine self-play 1056 games in a
round-robin setting with matches of 16 games each between all program
versions of HITECH and LOTECH (a knowledge-wise scaled-down variant
of HITECH) for fixed search depths of 4-9 plies. The gain in
playing strength averaged 195 rating points per ply of search for
HITECH and 232 rating points per ply for LOTECH. The overall
ratings showed signs of limited diminishing returns starting at a fixed
search depth of 6 plies. But there was no clear trend of diminishing
returns at higher search depths, and the experimental results are not
statistically significant.
- 1994: Mysliwietz [158].
- In his own experiment, Mysliwietz let the parallel chess program
ZUGZWANG self-play 450 games with 50 games per match between
program versions that differed roughly by a factor of 2 in search speed
due to varying numbers of allotted processors. The gain in playing
strength averaged 109 rating points per doubling of search speed for
9 successive doubling steps. The observed ratings do not exhibit any
diminishing returns at all. Most of the match results from this
experiment are statistically significant in the sense that they allow
for the discrimination of stronger and weaker opponents with 95%
confidence. The true general rating gain for the version of ZUGZWANG
used in the experiment lay in the estimated range of 76-143 points per
doubling of search speed with 95% confidence. Hence, Mysliwietz's
experiment neither supports nor falsifies the hypothesis of diminishing
returns for self-play of ZUGZWANG with good confidence.
- 1997: Junghanns et al. [121].
- The self-play experiment with Junghanns' chess program THE TURK
featured 480 games with matches of 80 games each between program
versions differing by exactly one ply in lookahead for fixed search
depths of 3-9 plies. The gain in playing strength averaged
around 200 rating points per ply of search. The winning percentages of
the deeper searching versions of THE TURK actually increased steadily
from fixed search depths of 6 plies onwards, thus even hinting at
increasing rather than diminishing returns at higher search depths.
The match results from this experiment are statistically significant
in the sense that they allow for the discrimination of stronger and
weaker opponents with 95% confidence. As in the case of Mysliwietz's
experiment, however, the resulting data of THE TURK neither support
nor falsify the hypothesis of diminishing returns for self-play of
this program with good confidence.
In their article, Junghanns et al. [121] then continued to look for
diminishing returns by means of metrics other than self-play. They
finally claimed to have found empirical evidence in this respect.
According to their explanations, the low search quality of chess
programs (i.e., their high error probability) and the abnormally great
lengths of self-play games inadvertently hide the diminishing returns
that doubtless exist in computer chess. Although we greatly appreciate
Junghanns et al.'s fresh attempt at a better understanding of
diminishing returns in computer chess, we are not convinced that their
claims hold when subjected to rigorous methodological and statistical
testing. In our opinion, the quest for indisputable and statistically
significant demonstrations of diminishing returns for additional search
in computer chess has yet to be concluded.