Relating Search Depth to the Strength of Chess Programs

To the best of our knowledge, Gillogly and Newborn in 1978 independently reported the earliest attempts at modeling the relationship between the playing strength of chess programs on one hand and the available computing power or search depth on the other. Gillogly introduced his so-called ``technology curve'' [75] that plotted the playing strength against what he called ``machine power'' on a logarithmic scale. Newborn related the numbers of nodes as searched by different chess programs in three minutes (average time per move in tournament games) to the playing strengths of the very same programs as derived from their actual performances in tournaments [164,165]. Later on, Levy [138] and Levy and Newborn [131] refined Newborn's initial scheme by contrasting the highest rated tournament performances of the best chess programs with the years of their achievement. All these comparisons inevitably led to speculative extrapolations of the graphs which Levy characterized to the point as the ``meta-science of prediction in computer chess'' in his latest article about the subject in 1997 [130].

Self-play matches as pioneered by Thompson with his chess machine BELLE in 1982 [206] represent a more rigorous method of investigating the relationship between available computing power or search depth and the playing strength of chess programs. A great advantage of such matches is that the encountered winning percentages quantify the differences in playing strength of the various participating versions of the same program. Despite lingering doubts and questions regarding the magnitude of self-play rating differences [28], many researchers felt and still feel that self-play is the best of the available approaches in order to assess the expected ``diminishing returns'' of additional search in chess. In the context of self-play matches the presence of diminishing returns should lead to notably reduced winning percentages of the deeper or longer searching program versions with the progression towards higher search depths and search times.

Up to now nobody proved the existence of diminishing returns for self-play in computer chess by means of statistically significant experiments. For matches of only 20 games, even high winning percentages of 70%-75% barely suffice to decide with good confidence that the victor is indeed stronger than his opponent in general. Consequently, such small sample sizes do not allow for any confident quantification of the absolute rating differences between the opponents. In order to confidently assess rating differences of 70 points we need about 500 games per match and for differences of 40-50 rating points the number of necessary games per match increases to 1000. Hence, we completely agree with Mysliwietz [158] who already criticized the statistical uncertainty of many famous self-play experiments in computer chess back in 1994. Based on these explanations it is now easy to understand why we label many experimental results as ``not statistically significant'' while describing all prominent self-play experiments in computer chess below.

1982: Thompson [206].

Thompson's pioneering experiment featured 100 self-play games with matches of 20 games each between versions of BELLE differing by exactly one ply in lookahead for fixed search depths of 3-8 plies. The gain in playing strength averaged at 246 rating points per ply of search. The experiment showed no diminishing returns at any search depth but its results are not statistically significant anyway.

1983: Condon and Thompson [57].

In the second experiment, Condon and Thompson let BELLE self-play 300 games in round-robin style with matches of 20 games each between all program versions for fixed search depths of 3-9 plies. The gain in playing strength averaged at 217 rating points per ply of search. Now the observed ratings hinted at limited diminishing returns from a fixed search depth of 6 plies onwards. Yet the results of the experiment are not statistically significant.

1988: Szabo and Szabo [198].

The two Szabos determined the technology curve of their chess program TECHMATE that self-played 6882 games on two Atari ST computers linked together via their MIDI ports. The number of games per match between longer and shorter searching versions of the program varied strongly from a minimum of 32 to a maximum of 1367. The gain in playing strength averaged at 156 rating points per doubling of available search time (computing power). The experimental data indicated slight diminishing returns at longer search times. Several results from the experiment are statistically significant. Unfortunately, however, this does not hold for the results of the most interesting program versions with really long search times. They simply did not play enough games to draw reliable conclusions.

1990: Berliner et al. [28].

The HITECH team made their chess machine self-play 1056 games in a round-robin setting with matches of 16 games each between all program versions of HITECH and LOTECH (a variant of HITECH scaled down knowledge-wise) for fixed search depths of 4-9 plies. The gain in playing strength averaged at 195 rating points per ply of search for HITECH and at 232 rating points per ply for LOTECH. The overall ratings showed signs of limited diminishing returns starting at a fixed search depth of 6 plies. But there was no clear trend of diminishing returns at higher search depths and the experimental results are not statistically significant.

1994: Mysliwietz [158].

In his own experiment, Mysliwietz let the parallel chess program ZUGZWANG self-play 450 games with 50 games per match between program versions that differed roughly by a factor of 2 in search speed due to varying numbers of allotted processors. The gain in playing strength averaged at 109 rating points per doubling of search speed for 9 successive doubling steps. The observed ratings do not exhibit any diminishing returns at all. Most of the match results from this experiment are statistically significant in the sense that they allow for the discrimination of stronger and weaker opponents with 95% confidence. The true general rating gain for the version of ZUGZWANG used in the experiment lay in the estimated range of 76-143 points per ply of search with 95% confidence. Thence, Mysliwietz's experiment does neither support nor falsify the hypothesis of diminishing returns for self-play of ZUGZWANG with good confidence.

1997: Junghanns et al. [121].

The self-play experiment with Junghanns' chess program THE TURK featured 480 games with matches of 80 games each between program versions differing by exactly one ply in lookahead for fixed search depths of 3-9 plies. The gain in playing strength averaged around 200 rating points per ply of search.^1.8 The winning percentages of the deeper searching versions of THE TURK actually increased steadily from fixed search depths of 6 plies onwards, thus even hinting at additional gains in returns for higher search depths rather than diminishing ones. The match results from this experiment are statistically significant in the sense that they allow for the discrimination of stronger and weaker opponents with 95% confidence. As in the case of Mysliwietz's experiment, however, the resulting data of the THE TURK neither supports nor falsifies the hypothesis of diminishing returns for self-play of this program with good confidence.

In their article Junghanns et al. [121] then continued to look for diminishing returns by means of other metrics than self-play. They finally claimed to have found empirical evidence in this respect. According to their explanations the low search quality of chess programs (i.e. their high error probability) and the abnormally large lengths of self-play games inadvertently hide the doubtlessly existent diminishing returns in computer chess. Although we greatly appreciate Junghanns et al.'s new trial aimed at the better understanding of diminishing returns in computer chess, we are not convinced that their claims hold when subjected to rigorous methodological and statistical testing. In our opinion the quest for indisputable and statistically significant demonstrations of diminishing returns for additional search in computer chess still remains to be concluded.

Created by Ernst A. Heinz, Thu Dec 16 23:28:11 EST 1999