Newborn's Hypothesis Concerning the Strength of Chess Programs

In 1985 Newborn discovered that the self-play results of BELLE correlated closely with the rates at which it chose new best moves during searches to fixed depths of 11 plies on a set of 447 test positions from real games. Moreover, the behaviour of TECH III [199] during searches to fixed depths of 11 plies on the 300 well-known ``Win At Chess'' test positions resembled the behaviour of BELLE during its equally deep searches. Based on these two observations, Newborn [163] formulated an interesting hypothesis concerning the playing strength of chess programs in general. His hypothesis relates the increase in playing strength gained from deeper searches to the rates of new best moves chosen at consecutive search depths (deeper versus shallower). Specifically, let BC(k) denote the rate (or number) of new best moves observed in iteration #k, and let RI(k) denote the actual increase in playing strength for searches of depth k in comparison with searches of depth k-1.

Newborn's Hypothesis.
Suppose values for BC(k), BC(k+1), and RI(k) have been determined over a wide sample of game positions. When going from searches of depth k to searches of depth k+1, the rating of the program in question can then be expected to improve by

\begin{displaymath}RI(k+1) = RI(k) \cdot \frac{BC(k+1)}{BC(k)}\,.
\end{displaymath}
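
A single application of the formula can be sketched in a few lines of Python. The function name and the sample numbers are purely illustrative assumptions and are not taken from Newborn's data; BC is given here as a count of new best moves over a fixed set of positions.

```python
def expected_rating_increase(ri_k, bc_k, bc_k_plus_1):
    """Return RI(k+1) = RI(k) * BC(k+1) / BC(k).

    ri_k        -- measured rating increase RI(k) from depth k-1 to depth k
    bc_k        -- rate or count of new best moves in iteration k
    bc_k_plus_1 -- rate or count of new best moves in iteration k+1
    """
    if bc_k <= 0:
        raise ValueError("BC(k) must be positive")
    return ri_k * bc_k_plus_1 / bc_k


# Hypothetical values: RI(k) = 200 points, BC(k) = 150, BC(k+1) = 120.
print(expected_rating_increase(200, 150, 120))  # prints 160.0
```

Note that the formula needs a measured RI(k) as input; it only predicts the next increase and never produces a first one on its own.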

The hypothesis intuitively relies on the straightforward yet very important assumption that the new best moves determined by deeper searches are of higher quality than the previous best moves which they displace. If this fundamental assumption does not hold, the hypothesis does not make much sense at all. Imagine a chess program that simply switches back and forth between a few good moves all the time. Such behaviour surely does not increase the playing strength of the program at any search depth. Therefore, the discovery of ``fresh ideas'' (i.e. new best moves which the program never deemed best before) looks like a much better and more meaningful indicator of increases in playing strength than the mere change of the preferred choices. Surprisingly enough, nobody seems to have investigated this aspect of Newborn's hypothesis prior to us. In Section 1.5.6 we present empirical evidence that supports the validity of the hypothesis in this respect. According to the experimental results of both CRAFTY and DARKTHOUGHT, a sizable 30%-50% of all new best moves chosen by modern chess programs represent ``fresh ideas'' on average.
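
The distinction between a mere change of the best move and a ``fresh idea'' can be made concrete with a small Python sketch. It assumes that the best move of each iteration is available as a list for one position; the function name and the sample moves are hypothetical and do not describe the actual instrumentation of CRAFTY or DARKTHOUGHT.

```python
def best_change_stats(best_by_iteration):
    """Classify the best move of every iteration after the first.

    best_by_iteration -- best moves of one position, index 0 = iteration 1
    Returns a list of (iteration, changed, fresh) tuples, where `changed`
    means the move differs from the previous iteration's best move and
    `fresh` means it was never the best move in any earlier iteration.
    """
    seen = {best_by_iteration[0]}
    stats = []
    for i in range(1, len(best_by_iteration)):
        move = best_by_iteration[i]
        changed = move != best_by_iteration[i - 1]
        fresh = move not in seen
        stats.append((i + 1, changed, fresh))
        seen.add(move)
    return stats


# A program that flips back to an old choice in iteration 3:
print(best_change_stats(["e4", "d4", "e4", "Nf3"]))
# prints [(2, True, True), (3, True, False), (4, True, True)]
```

Iteration 3 counts as a new best move but not as a fresh idea, which is exactly the back-and-forth case described above.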

Newborn's hypothesis neatly fitted the results of the self-play experiments and the behavioural experiments to fixed search depths of 11 plies as available for BELLE in 1985. To the best of our knowledge, however, no researcher has ever tested the validity of the hypothesis for search depths beyond 11 plies or for chess programs other than BELLE. Nevertheless, in 1997 Hyatt and Newborn [114] applied the above formula of expected rating increase for an additional ply of search to their experimental results of CRAFTY in order to calculate ``extrapolated'' ratings of the program at search depths of 6-14 plies. They seeded their calculations with the corresponding ratings of BELLE from 1985 as given values for the 4-ply and 5-ply searches. Hyatt and Newborn evidently interpreted the similar search behaviours of BELLE and CRAFTY on different sets of test positions as sufficient reason to assume the validity of Newborn's hypothesis for CRAFTY as well.
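
The following Python sketch shows how such a seeded extrapolation proceeds in principle. It is a reading of the procedure as described above, not Hyatt and Newborn's actual computation; the function name and all numbers are made up for illustration and are neither BELLE's nor CRAFTY's values.

```python
def extrapolate_ratings(rating_prev, rating_seed, seed_depth, bc, max_depth):
    """Extrapolate ratings from two seed ratings and best-change rates.

    rating_prev -- seed rating at depth seed_depth - 1
    rating_seed -- seed rating at depth seed_depth
    bc          -- dict mapping depth k to BC(k) for seed_depth..max_depth
    Returns a dict mapping depth to the (seeded or extrapolated) rating.
    """
    ratings = {seed_depth - 1: rating_prev, seed_depth: rating_seed}
    # The only rating increase that comes from outside the formula:
    ri = rating_seed - rating_prev
    for k in range(seed_depth, max_depth):
        # Each new RI(k+1) is computed from the previously computed RI(k).
        ri = ri * bc[k + 1] / bc[k]
        ratings[k + 1] = ratings[k] + ri
    return ratings


# Hypothetical seeds for depths 4 and 5, hypothetical BC counts for 5..7.
print(extrapolate_ratings(1000, 1200, 5, {5: 100, 6: 80, 7: 60}, 7))
# prints {4: 1000, 5: 1200, 6: 1360.0, 7: 1480.0}
```

Apart from the two seed ratings, the loop consumes only best-change rates; no self-play result enters for any depth beyond the seed.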

We object to Hyatt and Newborn's ``extrapolation'' because in our opinion neither the application of the hypothesis itself nor the repeatedly recursive style of its application was appropriate in this case. First and foremost, there exist no published results of self-play experiments for CRAFTY at any fixed search depth, so that not even a single value of RI(k) could be determined in compliance with the prescriptions of the hypothesis. Instead, Hyatt and Newborn calculated these values, too, by repeatedly applying the recursive formula of expected rating increase to each additional ply of search. This essentially means that Hyatt and Newborn employed the hypothesis in order to derive supposedly independent preconditions of itself. These fabricated preconditions then served as the basis for the next application of the hypothesis, yielding the calculation of an absolute rating, and so forth. Hence, the ratings of CRAFTY as postulated by Hyatt and Newborn in Figure 7 of their article (e.g. 2601 points at a fixed search depth of 11 plies and 2983 points at a fixed search depth of 14 plies) lack real substance.
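
The circularity becomes explicit when the recursion is unrolled. As a sketch of the algebra, assuming that the formula of the hypothesis is applied n times in a row exactly as stated and without any further correction, the intermediate terms cancel:

\begin{displaymath}RI(k+n) = RI(k) \cdot \prod_{i=1}^{n} \frac{BC(k+i)}{BC(k+i-1)} = RI(k) \cdot \frac{BC(k+n)}{BC(k)}\,.
\end{displaymath}

Every extrapolated increase is thus the single seed value RI(k) rescaled by a ratio of best-change rates. In the case discussed here the seed stems from the ratings of BELLE, so no measured playing-strength result of CRAFTY enters the calculation at any step.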



Created by Ernst A. Heinz, Thu Dec 16 23:28:11 EST 1999