First Prize in the 1998 ICFP Functional
Programming Contest went to Cilk Pousse, a program written in the
Cilk multithreaded
programming language being developed by the
Supercomputing Technologies research group (SuperTech) of the MIT
Laboratory for Computer Science (LCS). Although Cilk is not a
functional language, it provides high-level linguistic abstractions
for parallelism, and the contest was explicitly open to programs
written in any language. (Half the entries were programmed in C.)
The challenge
task was to write a program to play the game of ``Pousse'', a
variant of tic-tac-toe played on an N-by-N board.
The programs competed on a 4-processor Pentium Pro machine running the
Linux operating system. Teams had 72 hours to complete the
programming. Almost 50 programs were entered into the contest. In
the course of the competition, the Cilk Pousse program was undefeated
in all games, even though in Pousse, the first player has a
significant advantage. The First Prize citation recommends Cilk as
``the superior programming tool of choice for discriminating
hackers.''
Download
The code for Cilk Pousse is available here.
Cilk Pousse Team (in increasing order of seniority)
- Reid Barton, high-school sophomore and Visiting Scholar in LCS;
- Dan Adkins, MIT sophomore majoring in computer science;
- Harald Prokop, grad student in LCS;
- Matteo Frigo, grad student in LCS;
- Chris Joerg, research scientist at Compaq Cambridge Research Center;
- Martin Rinard, assistant professor in LCS;
- Don Dailey, systems administrator in LCS;
- Charles Leiserson, professor in LCS.
Matteo, who is an ML fan (he is coauthor of FFTW, the Fastest Fourier
Transform in the West, a high-performance C code generated by an ML
metaprogram), originally noticed the ICFP contest announcement. Since
the contest was to be run on a 4-processor shared-memory machine, and
the SuperTech group had been developing the Cilk multithreaded
language specifically for this kind of platform, Charles and Matteo
decided to take a look when the challenge task was posted and see if
we wanted to enter.
Reid, Harald, and Don of the SuperTech group agreed to participate, as
did Chris, who had been a member of the group before getting his
Ph.D. from MIT several years ago. All were experienced Cilk
programmers (which isn't hard, since Cilk takes almost no time to
learn). At the last minute, we also gathered Dan and Martin, who had
no Cilk experience, but who knew C well.
When the challenge task of a Pousse-playing program was announced at
5pm on Thursday, August 27, 1998, we immediately decided to enter.
The SuperTech group has a long history of world-competitive parallel
chess-playing programs, including *Socrates and Cilkchess. We knew
the domain, and we had the Cilk language technology to exploit
parallelism.
By midnight on Thursday, we had a working C version of a Pousse
program which, except for time control, could have been submitted. We
also had an autotester running that allowed us to play versions of the
program against each other. By Friday noon, we had a parallel Cilk version
running with time control. All that remained was to optimize the
program and tune the evaluation function. We spent the remainder of
the weekend doing just that.
Our description of Cilk Pousse
- A brief overview of the Cilk language
- Parallel search algorithm
- Move ordering
- The hash table and repetitions
- Evaluation function
- Time control
- Testing
- Conclusion
A brief overview of the Cilk language
We programmed our entry in Cilk, a C-based language for
multithreaded programming. Cilk is an algorithmic multithreaded
language. The philosophy behind Cilk is that a programmer should
concentrate on structuring the program to expose parallelism and
exploit locality, leaving Cilk's runtime system with the responsibility
of scheduling the computation to run efficiently on a given platform.
Thus, the Cilk runtime system takes care of details like load
balancing, paging, and communication protocols. Unlike other
multithreaded languages, however, Cilk is algorithmic in that the
runtime system guarantees efficient and predictable performance.
To give you an idea of how simple it is to write parallel programs in
Cilk, here is a Cilk implementation of the familiar recursive
Fibonacci program in which the recursive calls are executed in
parallel:
cilk int fib (int n)
{
    if (n < 2) return n;
    else
    {
        int x, y;
        x = spawn fib (n-1);
        y = spawn fib (n-2);
        sync;
        return (x + y);
    }
}
Notice that if you elide the three Cilk keywords (cilk, spawn, and sync), you obtain a C program, called the serial
elision or C elision of the Cilk program. Cilk is a
faithful extension of C in that the C elision of any Cilk
program is always a legal implementation of the semantics of the Cilk
program.
The keyword cilk identifies a
Cilk procedure, which is the parallel version of a C function.
A Cilk procedure may spawn subprocedures in parallel and synchronize
upon their completion. A Cilk procedure definition is identified by
the keyword cilk and has an argument
list and body just like a C function.
Most of the work in a Cilk procedure is executed serially, just like
C, but parallelism is created when the invocation of a Cilk procedure
is immediately preceded by the keyword spawn. A spawn is the parallel analog of a C
function call, and like a C function call, when a Cilk procedure is
spawned, execution proceeds to the child. Unlike a C function call,
however, where the parent is not resumed until after its child
returns, in the case of a Cilk spawn, the parent can continue to
execute in parallel with the child. Indeed, the parent can continue
to spawn off children, producing a high degree of parallelism. Cilk's
scheduler takes the responsibility of scheduling the spawned
procedures on the processors of the parallel computer.
A Cilk procedure cannot safely use the return values of the children
it has spawned until it executes a sync statement. If all of its children have
not completed when it executes a sync, the procedure suspends and does not
resume until all of its children have completed. The sync statement is a local ``barrier,'' not a
global one as, for example, is sometimes used in message-passing
programming. In Cilk, a sync waits
only for the spawned children of the procedure to complete, not for
the whole world. When all of its children return, execution of the
procedure resumes at the point immediately following the sync statement. In the Fibonacci example, a
sync statement is required before
the statement return (x+y), to avoid the anomaly that would
occur if x and y were summed before each had been
computed. (In general, sync
keywords cannot be inferred by the compiler, because Cilk is not
functional. The compiler has no way of knowing, for example, if a
void Cilk procedure is performing a side-effect on shared memory that
must be waited for.) A Cilk programmer uses spawn and sync keywords to expose the parallelism in a
program, and the Cilk runtime system takes the responsibility of
scheduling the execution of the procedures efficiently.
The Cilk Pousse program takes advantage of some of Cilk's advanced
features for nondeterministic programming. Ordinarily, when a spawned
procedure returns a value, it is simply stored into a variable in its
parent's frame:
x = spawn foo(y);
Cilk's inlet feature allows the returned value to be
incorporated into the parent's frame in a more complex fashion. An
inlet is essentially a C function internal to a Cilk procedure.
Normally in Cilk, the spawning of a procedure must occur as a separate
statement and not in an expression. An exception is made to this rule
if the spawn is performed as an argument to an inlet. In this case,
the procedure is spawned, and when it returns, the inlet is invoked.
In the meantime, control of the parent procedure proceeds to the
statement following the inlet. The following code illustrates how the
fib() function might be coded using inlets.
cilk int fib (int n)
{
    int x = 0;
    inlet void summer (int result)
    {
        x += result;
        return;
    }
    if (n < 2) return n;
    else {
        summer(spawn fib (n-1));
        summer(spawn fib (n-2));
        sync;
        return (x);
    }
}
The inlet summer() is defined to take a returned value
result and add it to the variable x in the frame of
the procedure that does the spawning. All the variables of
fib() are available within summer(), since it is an
internal function of fib().
One last advanced feature of Cilk was used in the Cilk Pousse program.
Sometimes, a procedure spawns off parallel work which it later
discovers is unnecessary. Cilk allows this ``speculative'' work to be
aborted through the use of Cilk's abort statement. The abort statement, when executed inside an
inlet, causes all of the already-spawned descendants of the procedure
to terminate prematurely. Cilk takes the responsibility of tracking
down all the descendants and terminating them.
Parallel search algorithm
For a game-tree search, such as that required for Pousse, the natural
choice is alpha-beta minimax search with iterative deepening.
Alpha-beta is a well-known technique that uses information learned
from partial searches to prune branches of the tree without
compromising the effectiveness of the search. Alpha-beta search is an
inherently serial algorithm, however, and thus we had to parallelize
it.
Alpha-beta search is described in any textbook on artificial
intelligence that covers heuristic search, so we shall assume some
familiarity with this searching strategy. The basic idea is that if
Player X has found a move in a position so good that Player O would
never play the move leading to that position, then there is no point
in searching X's remaining moves from that position. Those additional
moves can be pruned. In order to get maximal pruning, therefore, it
is advantageous to search the moves at a node in the search tree in
best-first order.
Iterative deepening is the process of first looking 1 move ahead, then
starting from scratch searching 2 moves ahead, and so on until you run
out of time. This strategy has the advantage that when the time limit
expires (30 seconds in the case of the Pousse competition), the
program always has a move available to play, namely the best move from
the most recently completed iteration. Since the cost of each iteration
grows exponentially, little time is lost in searching all iterations
up to the last.
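In its C elision, the driver loop is only a few lines. Here is a
minimal sketch (the names timed_out(), root_search(), and NO_MOVE are
our own illustration, not the actual Cilk Pousse interface):
int iterative_deepening (position *root)
{
    int best_move = NO_MOVE;
    int depth;
    for (depth = 1; !timed_out(); depth++) {
        int mv = root_search(root, depth); /* complete search to this depth */
        if (timed_out()) break;            /* discard the interrupted iteration */
        best_move = mv;                    /* keep the deepest completed result */
    }
    return best_move;    /* there is always a move to play when time expires */
}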
In fact, iterative deepening works better than a straight alpha-beta
search to a given depth. Recall that alpha-beta search is most
efficient if the best move at a node in the search tree is searched
first. But what is the best move at a given node? (Isn't that what
the search is supposed to determine?) Although it is not possible to
predict the best move from a node, a good guess is the best move from
the last time this position was searched. During an iteration, this
best-move information is stored in Cilk Pousse's hash
table, and then looked up during subsequent deeper iterations.
For instance, the depth-9 search places best-move information in the
hash table to provide a starting move to try on the depth-10 search.
This information speeds up the depth-10 search significantly, because
alpha-beta pruning causes the search to examine many fewer nodes.
We'll further discuss this issue in the section on move ordering.
The basic alpha-beta searching algorithm is inherently serial, using
information from the search of one child of a node to prune subsequent
children. For Cilk Pousse, we wanted to take advantage of
parallelism. But, when children are searched in parallel, it is hard
to use information gained from searching one child to prune another.
If one looks at an optimal game tree, however, one finds an
interesting property: all of the nodes are either full (all of the
children are searched) or singular (only one of the children is
searched).
Cilk Pousse's parallel search algorithm takes advantage of this
property. It first searches what it considers to be the best child,
just like serial alpha-beta search. When that child returns, it may
be that the alpha-beta algorithm prunes the rest of the children (a
so-called beta-cutoff), and the search returns immediately.
Otherwise, Cilk Pousse's algorithm speculates that the node is full,
and it spawns off all the remaining children in parallel. If one
returns with a score that causes a beta-cutoff, the other children are
aborted, since their work has been rendered unnecessary. (A similar
parallel algorithm was used for our
*Socrates parallel chess program.) On 4 processors, we see a
speedup of over 3 in the 30-second search.
Some of our opponents adopted the simple strategy of creating four
parallel threads, each of which searches 1/4 of the moves at the root.
This strategy makes ineffective use of parallelism, since many moves
are searched that would otherwise be pruned in a serial search. In
contrast, the recursive parallel search used by Cilk Pousse exposes
much more parallelism without searching many nodes that would
otherwise be pruned. Cilk's runtime system automatically schedules
the abstract computation to make efficient use of the available
processors.
Below is some code abstracted from the Cilk Pousse program to
illustrate how the parallel search strategy is supported by Cilk's
linguistic primitives. Although the code has been stripped of some of
the low-level details, it is almost completely usable as is. It
also serves to demonstrate how easy it is to write a complex parallel
search algorithm in Cilk. (The principal missing elements are the
move-ordering code and checking of hash-table moves.) Pay particular
attention to the inlet catch, which contains an abort statement when a beta-cutoff occurs.
cilk int search( position *prev, int depth, int update_move )
{
    position cur;
    int bestscore = -INF;
    int x;
    int mv;
    int sc;
    int cutoff = 0;
    int aborted = 0;

    inlet void catch(int ret_sc, int ret_mv) {
        /* ret_mv identifies which child returned; the full program
           records it as the best move */
        ret_sc = -ret_sc;              /* negamax: negate the child's score */
        if (ret_sc > bestscore)
        {
            bestscore = ret_sc;
            if (ret_sc >= cur.beta) {  /* beta-cutoff: kill the other children */
                aborted = 1;
                abort;
            }
            if (ret_sc > cur.alpha) cur.alpha = ret_sc;
        }
    }

    x = make_move( prev, &cur, update_move );
    sc = eval(&cur);
    if (abs(sc) >= MATE || depth <= 0)
    {
        return( sc );
    }
    cur.alpha = -prev->beta;
    cur.beta = -prev->alpha;

    /* search the first move serially (hopefully the best move) */
    if (!cutoff)
    {
        sc = spawn search( &cur, depth-1, first_move );
        sync; /* the sync makes this first move serial */
        sc = -sc;
        if (sc > bestscore)
        {
            bestscore = sc;
            if (sc > cur.alpha) cur.alpha = sc;
            if (sc >= cur.beta) {
                cutoff = 1;
            }
        }
    }

    /* cycle through all the remaining moves in parallel */
    if (!cutoff) {
        if (!aborted)
            for (mv = second_move; mv <= last_move; mv++)
            {
                catch( spawn search( &cur, depth-1, mv ), mv );
                if (aborted) break;
            }
    }
    sync; /* this sync is done outside the loop so that these
             moves are searched in parallel */
    return( bestscore );
}
Move ordering
The order in which moves are searched can have a great effect on the
number of nodes that must be visited to complete a search. If we
search the best node first, we may only have to search that one move,
but if we search the best move last, we may have to search all moves.
Even if we don't try the best move first, trying a ``good''
move first is still helpful, as the score returned by this move may
allow us to decrease the search window (the region between alpha and
beta) for future moves.
There are two heuristics we use to improve our move ordering.
The first is that if there is a hash-table move available (see the
section on the hash table), we search that move
first. This move was the best move last time we visited this
position, so it is likely to be so again.
The second heuristic is to keep a table of countermoves.
Each time in the search that a move is found which beats beta, we
store that move in a table indexed by the opponent's previous move.
(There is a separate table for each side.) In effect we are building
up a table that says, ``Last time the opponent made move x, move y was
a good response.'' When we begin a search of a node, we do a lookup
in this table (indexed by the previous move) to find a move. We
search this move first if there is no hash-table move (which is most
of the time), or second if there is a hash-table move. As a further
improvement, it is possible to store the last r good
countermoves for each move, and try those moves early in a search. Cilk
Pousse actually uses r=2.
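A minimal sketch of such a table, assuming moves are encoded as small
integers and using our own names (the actual Cilk Pousse data
structures may differ):
#define MAX_MOVES 128           /* illustrative bound on move encodings */
int counter[2][MAX_MOVES][2];   /* [side][opponent's previous move][r=2 slots] */

/* called when `reply` beats beta after the opponent played prev_move */
void store_countermove (int side, int prev_move, int reply)
{
    counter[side][prev_move][1] = counter[side][prev_move][0];
    counter[side][prev_move][0] = reply;   /* most recent reply first */
}

/* during move ordering, try counter[side][prev_move][0] and [1] early,
   after the hash-table move (if any) */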
Using this countermove table had a profound effect on the speed of the
search. On the (admittedly few) positions on which we tested this
heuristic, we found that using this reply table gave us a 5 to 10
times reduction in the number of nodes that we needed to visit
when searching those positions to a given depth.
The hash table and repetitions
Our program keeps a hash table of searched positions: the standard
transposition table commonly used in chess programs, with a few
modifications to suit it to Pousse. Briefly, for each position
searched, a transposition table stores
- the move chosen by the search,
- the score the search returned, and
- the depth of the search.
Then, each time a search of a position is begun, a hash-table lookup
is done to see if that position has been searched before. If so, the
hash table information is used for two purposes:
- Improving move ordering: The hash-table move is used as the
first move searched. This is likely to be the best move, and starting
with a good move reduces the amount of searching to be done.
- Pruning the search: If the previous search of this node was ``deep
enough,'' then we may be able to use the score and avoid researching
the position. (For the Pousse program, we also require that the
hashed search result ends with the same side-to-move as does the
current search, as side-to-move can have a large (and varying)
effect on the value of a position).
The hash function on positions used the bitwise exclusive-or (XOR) of
a collection of random numbers stored in a table. For each square on the
board and for each piece type (X, O, or empty) on the square, the
table contains a random number. The hash function is simply the XOR
of the table entries for the pieces on the board. Although this
function might sound expensive to compute, it can be updated
incrementally as moves are made, since XOR is its own inverse.
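A sketch of this scheme (often called Zobrist hashing), with our own
names:
typedef unsigned long long u64;

u64 zobrist[MAXSQ][3];            /* one random number per (square, contents) */

u64 hash_position (position *p)   /* from-scratch version */
{
    u64 h = 0;
    int sq;
    for (sq = 0; sq < N*N; sq++)
        h ^= zobrist[sq][p->board[sq]];   /* contents: X, O, or EMPTY */
    return h;
}

/* incremental version, when square sq changes contents:
       h ^= zobrist[sq][old_contents];    XOR cancels the old term
       h ^= zobrist[sq][new_contents];    and folds in the new one   */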
Detection of repeated positions was essential for the game of Pousse,
since the side that causes a repeated position loses. Most
game-playing programs use their hash table to detect repeated
positions. When a position is encountered, it is marked as ``open''
while the search of its descendants proceeds. Once the search on that
node is complete, the position is marked as ``closed.'' If the search
encounters an open position, the position is repeated. Unfortunately,
this simple strategy does not work for parallel search, since it is
hard to tell which positions were opened by your ancestor as opposed
to by a parallel thread.
To detect repeated positions, Cilk Pousse maintains for every board
position a pointer to its parent's position. When evaluating a
position, we walk the chain of ancestors from the current position
backwards to the beginning of the game and compare to see if the same
position already appears in the chain. (We actually check only every
second position, since a repeated position must have the same
side-to-move.) Each position is labeled with a hash value to make
this comparison easy and fast. We hash each position into a 64-bit
number and hope that the hash function is perfect, i.e., two different
positions do not collide.
An optimization was to stop the scan of the ancestor chain as soon as
an irreversible move was found. In the Pousse game, the
number of pieces on the board cannot decrease during the game. Thus,
any move that does not push a piece off the board is irreversible.
No earlier position can ever appear again later in the game.
Consequently, there is no point in looking for a repeated position once
the scan of the ancestor chain reaches an irreversible move.
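Combining the chain walk with this cutoff gives a loop like the
following (a sketch with our own field names; each position records
its 64-bit hash, a parent pointer, and its piece count):
int is_repetition (position *cur)
{
    /* step two plies at a time: a repetition has the same side-to-move */
    position *anc = cur->parent ? cur->parent->parent : NULL;
    while (anc != NULL) {
        if (anc->npieces < cur->npieces)
            return 0;      /* an irreversible move intervened; stop scanning */
        if (anc->hash == cur->hash)
            return 1;      /* same position (assuming no hash collision) */
        anc = anc->parent ? anc->parent->parent : NULL;
    }
    return 0;
}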
Towards the end of the game, most moves are reversible, and we
worried that the scanning of the ancestor chain might require too much
time. We had not observed this problem, but on the other hand,
we had not played many long games. Consequently, we made a small hash
table of all positions already encountered in the actual game, since
these are fixed throughout the search. When a position is encountered
in the search, a check is first made into this hash table. If the
position is not there, then the ancestor chain is scanned as before,
terminating at the root of the search tree. Although this
modification did not appear to have any effect on the performance of
Cilk Pousse, we decided to incorporate it into the final submission
because it produced a better guarantee of performance in the endgame.
The repeated position rule causes a major wrinkle in using a hash
table for Pousse. Specifically, it can be dangerous to use the score
from the hash table. (The ``best move'' in the hash table is always
safe to use, since it is only a heuristic suggestion.) The problem is
that the score of a position may depend on the path that leads to it.
Specifically, if the path contains the position, then the score is a
loss for the player on move.
One solution to this problem is just to ignore it and always accept
the score, hoping that the benefits from the correct uses of the score
outweigh the problems from using incorrect scores. Some chess
programs adopt this approach. For Pousse, however, the value of a
repeated position is a loss, rather than a draw as in chess, and many
repeated positions occur in the endgame. Thus, we decided it would be
too dangerous to adopt this approach. The alternative of not using
the hash-table scores at all struck us as wasteful, however, because
it leads to excessive redundant work.
Our solution was to select a subset of the circumstances in which it
is safe to use the score, and only use the score in those
circumstances. We made the observation that when an irreversible move
is made, no position reached before that move can ever be reached
again. Our algorithm for using hash table scores was the following.
When at position p:
- Store a score for p in the hash table only if
the move leading to p is irreversible.
- Look up a hash score for p only if the move
leading to p is irreversible.
By ensuring that the move leading to position p is
irreversible, we guarantee that the search beginning at p is
not (was not) influenced by the positions leading up to p.
There are two optimizations to this algorithm that we used (A and B
below), each consisting of a lookup rule and a store rule. Both take
advantage of additional circumstances in which the moves leading to
p are irrelevant.
- A1. If the lookup move/score beats beta and is an irreversible move,
then it is safe to use.
- A2. If the searched move/score beats beta and is an irreversible move,
then it is safe to put into the hash table.
- B1. If the lookup move/score is a losing score (i.e., doesn't beat
beta), and the depth is 1, then use the score.
- B2. If the searched move/score is a losing score, and the depth is 1,
and we do not lose because of a repeat,
then it is safe to put the score into the hash table.
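In code, the lookup side of these rules reduces to a single guard. A
sketch under our own naming, where update_move is the move that led to
the current position and e is the hash-table entry found (entry,
move_is_irreversible(), and the field names are assumptions):
int may_use_hash_score (entry *e, int update_move, int beta, int depth)
{
    return e != NULL && e->depth >= depth
        && (move_is_irreversible(update_move)          /* base rule */
            || (e->score >= beta
                && move_is_irreversible(e->move))      /* A1 */
            || (e->score < beta && depth == 1));       /* B1 */
}
/* the store is guarded symmetrically, per the base rule and A2/B2 */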
Evaluation function
Immediately upon receiving the challenge task, we played Pousse games
among ourselves to get a feel for how the game is played and decide
what terms to put in our evaluation function. By the end of Thursday
evening, we had come up with four terms:
- Piece count: Each of your pieces is worth a constant
number of points. Initially, for an N-by-N board,
we assigned the weight 4N to this term, but later, we reduced
it to 4.
- Centrality: Each of your pieces receives a bonus equal
to the Manhattan distance from the piece to the nearest corner. That
is, a piece on file i and rank j receives a bonus of
min(i,N-i) + min(j,N-j).
- Frank-squared: We coined the term ``frank'' to refer to
a file or rank. You score k² points for
each frank having k of your pieces.
- Connectivity: (The final program did not use this term.)
Each pair of your pieces that are orthogonally adjacent receives a
bonus of 2 points. Each diagonally adjacent pair receives 1 point.
Our initial evaluation function was obtained by summing these terms
for Player X and subtracting the analogous sum for Player O.
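As a from-scratch sketch of the submitted evaluation (our names;
coordinates are 1-based as in the centrality formula above, piece_at()
is a hypothetical accessor, and the connectivity term is omitted as in
the final program):
#define min(a,b) ((a) < (b) ? (a) : (b))

int eval_side (position *p, int side)
{
    int i, j, k, score = 0;

    for (i = 1; i <= N; i++)
        for (j = 1; j <= N; j++)
            if (piece_at(p, i, j) == side)
                score += 4                            /* piece count */
                       + min(i, N-i) + min(j, N-j);   /* centrality */

    for (i = 1; i <= N; i++) {        /* frank-squared over the N files */
        k = 0;
        for (j = 1; j <= N; j++)
            if (piece_at(p, i, j) == side) k++;
        score += k*k;
    }
    for (j = 1; j <= N; j++) {        /* ... and over the N ranks */
        k = 0;
        for (i = 1; i <= N; i++)
            if (piece_at(p, i, j) == side) k++;
        score += k*k;
    }
    return score;
}
/* full evaluation: eval_side(p, X) - eval_side(p, O) */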
Once we had a working autotester capable of playing multiple games, we
tried changing the relative weights of the terms, as well as adding
and removing terms. We ran these tests mostly on 6-by-6 boards and
occasionally on 4-by-4, 5-by-5, or 7-by-7 boards. Both sides
were run with a fixed-depth search, rather than with a timed search,
because we wanted to gauge the benefit of the heuristic before
worrying about optimizing performance.
We found that the initial weight 4N for the piece-count term
was too large. Reducing it to 0 improved the program. On Saturday
night, however, we used a 4-processor Alphaserver to run a
minitournament consisting of 3600 games played on a 6-by-6 board.
Among 9 programs with different piece values ranging from -4 to 24, we
found that a piece-count value of 4 produced a statistically
significant advantage, which we then used for our submission. As a
simple optimization, we incorporated piece-count into the centrality
bonus by building a single look-up table indexed by square containing
the sum of the centrality and the piece-count bonuses.
We tried a couple of nonlinear functions for centrality, but they did
not help and sometimes hurt. Leaving out the centrality term hurt,
however, because the program would tend to lose by tactics in the
early game. Apparently, it is important in the opening to favor the
center. Later in the game, centrality appears not to be as important.
So, we stuck with our original term.
We tried replacing the frank-squared term with either a frank-cubed or
frank-to-the-1.5 term, but neither performed better nor worse when
autotested against the frank-squared term. So, we stuck with our
original term.
We found that the connectivity term produced no effect one way or the
other, so we removed it altogether. We were glad to get rid of it, as
it took a long time to compute.
Over the three days of the competition, we tried changing the
evaluation function in many ways, but all we ended up doing was
eliminating the connectivity term and tuning the piece-count term. No
new evaluation term outperformed the original (and many did worse).
So, we just left the evaluation function as it was, and concentrated
instead on optimizing it.
Our main technique to optimize the performance of the evaluation function
is to incrementally update the score rather than to compute it from
scratch for each new position. In addition to the position itself, we
maintain a count for each frank and player of the number of
pieces that that player has in the frank. When a move is made, we update
only the frank counts that change. Then, we sum the squares (using
table lookup) of the frank counts. This process changes what would naively
be a quadratic-time calculation into a linear-time evaluation.
Likewise, we update the piece-count/centrality score incrementally
when the move is made by subtracting out the old value of each square
that changes and adding in the new. We incrementally update the total
number of pieces for each player, which is used for repetition
detection. We also incrementally update the hash function by XOR'ing
out the old random values for squares that change and XOR'ing in the
new values.
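Put together, the per-move update looks roughly like this (a sketch
with our own names; in the real program this logic is folded into
move making). Here sq_table holds k², and bonus is the per-square
table combining the piece-count and centrality bonuses:
void add_piece (position *p, int side, int f, int r)
{
    /* frank-squared: back out the old k*k terms, bump the counts, add new */
    p->frankscore[side] -= sq_table[p->filecount[side][f]]
                         + sq_table[p->rankcount[side][r]];
    p->filecount[side][f]++;
    p->rankcount[side][r]++;
    p->frankscore[side] += sq_table[p->filecount[side][f]]
                         + sq_table[p->rankcount[side][r]];

    p->sqscore[side] += bonus[f][r];    /* piece count + centrality */
    p->npieces[side]++;                 /* used by repetition detection */
    p->hash ^= zobrist[square(f, r)][side];   /* incremental hash update */
}
/* removing a piece pushed off the board is symmetric, with -- and -= */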
Time control
The 30-second time control was implemented with a hack. Cilk Pousse
spawns a timer thread that checks the time every few milliseconds,
setting a flag at timeout. After spawning the timer thread, the program spawns
the main search routine. The search routine is not spawned once,
however, but many times, due to iterative deepening. The loop
spawning the search needs to sync after every search, in order to save
the result of the search. In Cilk one cannot wait for a single thread
to complete. Instead, the sync
statement waits for all threads spawned by the current procedure to
complete. In particular, it waits until the timing thread terminates,
which is not the behavior we wanted.
During the 72-hour contest, we completely missed the obvious solution
to this problem. The obvious solution is to spawn the timer thread
and then spawn a separate procedure performing the iterative-deepening
loop. In this way, the iterative-deepening procedure can sync without
having to worry about the timer thread. Instead, we implemented a
complicated protocol, such that whoever returns first (whether the
search or the timer) terminates the other thread. The details of the
mechanism are in the code. We don't recommend it for light reading.
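For the record, the missed solution is only a few lines of Cilk (a
sketch with hypothetical names; root_search() is assumed to be a Cilk
procedure):
int timeout_flag = 0;        /* shared; set once by the timer */

cilk void timer (void)
{
    while (!time_expired())  /* check the clock every few milliseconds */
        short_sleep();
    timeout_flag = 1;
}

cilk void deepening (position *root)
{
    int depth, mv;
    for (depth = 1; !timeout_flag; depth++) {
        mv = spawn root_search(root, depth);
        sync;   /* waits only on root_search: the timer is not our child */
        if (!timeout_flag) save_best_move(mv);
    }
}

cilk void play (position *root)
{
    spawn timer();
    spawn deepening(root);   /* deepening syncs without waiting for timer */
    sync;                    /* here we do wait for both to finish */
}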
The timeout was conservatively set to 27 seconds, figuring that 3
seconds was long enough for us to shut down. Chris suggested that we
should use 25 seconds, just in case. While nobody else took the
proposal seriously, we got a good scare during the tournament.
In fact, time control turned out to be the most problematic aspect of
Cilk Pousse after the tournament started. The tournament directors
reported that our program was systematically running out of time. The
program was printing an answer after 27 seconds, but it took as much
as 5 more seconds before control was returned to the referee program.
Although the reason for this delay has not yet been completely
understood, it appears to be a weird interaction between the Cilk
memory allocation mechanism, which uses mmap(), the Linux
kernel, and the referee. In short, Linux seems to need a long time to
dump all mmap-ed pages to disk (even though those pages are not needed
afterwards and could be discarded). This process occurs after the
Cilk program exits, but before the program is officially declared dead
by Linux. The referee was killing processes created by the Cilk
program after it printed an answer, but before page dumping was
complete. Apparently, this action by the referee, which involves a
tight loop of system calls, interferes with and further delays the
dump. The problem disappeared after the tournament directors modified
the referee to wait 30 seconds in all cases before killing a game
program, even if it prints an answer earlier.
Testing
To improve our program, we implemented an autotester that allowed us
to play games between different versions of the program, as well as
with a human. We had considerable power at our disposal (several
4-processor 467-MHz Alphaservers, a bunch of 8-processor Sun
Enterprise 5000's, and a couple of Pentium Pro multiprocessors running
Linux). A first version of the tester was ready by Thursday night of
the contest. It enforced the repetition rule, and it also allowed
humans to play. By Friday morning, the 30-second timeout rule was
also in place. Over the next couple of days, various improvements
were made, including the possibility of playing multiple games, which
was the ultimate goal of the testing program, and an improved human
playing mode that allowed moves to be retracted.
By Friday night, the team working on evaluation functions had
implemented several strategies that they wanted to test. The two
professors, who were on this team, set up the autotester to run
overnight to play hundreds of games. Embarrassingly, the professors
forgot to invoke the autotester with the -r option, which
randomizes the opening. The autotester played the same games over and
over all night, wasting an estimated 67,248,000,000,000 Alphaserver
cycles. (It seemed strange in the morning that all games in a match
were always won by the same program, or else both sides had won
exactly 50 percent.) The Alphaservers were later used effectively to
tune the terms of the evaluation function.
On Saturday, we exchanged binaries with Bradley Kuszmaul of Yale,
who wrote the Alpha
Beta Soupa entry. Bradley was one of the original developers of
Cilk when he was a grad student at MIT. He was also coauthor of MIT's
*Socrates parallel chess program, which uses a parallel-search
algorithm similar to that of Cilk Pousse. We made a gentlemen's
agreement not to analyze each others' binaries, but just run them to
test our own programs against an adversary other than ourselves. We
gained confidence from a few test runs against Bradley's program, but
having Bradley's program around may have been more of a distraction
than an actual help.
Because Bradley is an expert in parallel chess programs and a
developer of Cilk, we naturally assumed that he was programming in
Cilk. After the contest was over, however, we discovered that he had
eschewed parallelism and written his program entirely in C. Perhaps
he would have done better (he tied for second, but lost in the
playoff) had he used the technology he helped develop. (Serves you
right, Bradley, but we still love you.)
One of the difficult bugs that arises in parallel code is that of
a data race. A data race occurs when two logically parallel
threads, holding no lock in common, access the same location, and at
least one of the two threads performs a write. A data-race bug is
usually fixed by ensuring the two threads acquire a common lock during
their accesses, or else by serializing the computation so that the
threads do not operate in parallel. Race bugs are insidious, because
they often do not arise during ordinary execution, but can cause
serious errors when they do arise.
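Here is a tiny example of a Cilk data race (ours, not from the Pousse
code):
int x = 0;    /* shared global */

cilk void bump (void)
{
    x = x + 1;    /* unprotected read-modify-write */
}

cilk void race (void)
{
    spawn bump();
    spawn bump();   /* logically parallel with the first spawn */
    sync;
    /* x may now be 1 or 2, depending on how the updates interleave */
}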
The Cilk development environment provides a race-detection tool,
called the Nondeterminator.
By using this tool, we were able to certify with high confidence that
no inadvertent race bugs existed in the code. The Nondeterminator
executes the Cilk code serially on a given input, using a novel data
structure that keeps track of what threads operate logically in
parallel. Every read and write by the program is instrumented to see
if a data race exists. The Nondeterminator is not a verification
tool, since it simulates actual execution on a given input, but it
does provide certain guarantees of finding races if they exist.
We intentionally left one race bug in the code. We do no locking on
accesses to the hash table. Consequently, a read may think it is
accessing the score of a given position, but the hash-table entry may
be nonatomically updated by another thread between the read of the
position and the read of the score. We determined that the overhead
for locking would weaken the program more on average than if we did no
locking and risked that a race might occur during the competition that
actually affected the outcome. Thus, Cilk Pousse is provably
non-bug-free. :)
Conclusion
The 72-hour contest was exhausting, but a great experience for all the
team members. We entered the contest to test Cilk technology, and we
feel it measured up well. We were the only team with a sophisticated
parallel search algorithm, the coding of which was well supported by
the Cilk language. As a tribute to the effectiveness of Cilk, most of
our time was spent not on parallel programming, but on the other
aspects of the task. We indeed benefited from our experience with
chess-playing programs, but mainly in that it allowed us to organize a
much larger team than would have been possible had we no experience
with computer game playing.
We would like to thank the ICFP tournament directors, Marc Feeley and Olin Shivers, for their
efforts in running the tournament. They worked tirelessly to get as
many programs as possible to run when system problems arose. We are
grateful for their dedication to what must have been an onerous task.