Cilk Pousse


First Prize in the 1998 ICFP Functional Programming Contest went to Cilk Pousse, a program written in the Cilk multithreaded programming language being developed by the Supercomputing Technologies research group (SuperTech) of the MIT Laboratory for Computer Science (LCS). Although Cilk is not a functional language, it provides high-level linguistic abstractions for parallelism, and the contest was explicitly open to programs written in any language. (Half the entries were programmed in C.) The challenge task was to write a program to play the game of ``Pousse'', a variant of tic-tac-toe played on an N-by-N board. The programs competed on a 4-processor Pentium Pro machine running a Linux operating system. Teams had 72 hours to complete the programming. Almost 50 programs were entered into the contest. In the course of the competition, the Cilk Pousse program was undefeated in all games, even though in Pousse, the first player has a significant advantage. The First Prize citation recommends Cilk as ``the superior programming tool of choice for discriminating hackers.''

Download

The code for Cilk Pousse is available here.

Cilk Pousse Team (in increasing order of seniority)

  • Reid Barton, high-school sophomore and Visiting Scholar in LCS;
  • Dan Adkins, MIT sophomore majoring in computer science;
  • Harald Prokop, grad student in LCS;
  • Matteo Frigo, grad student in LCS;
  • Chris Joerg, research scientist at Compaq Cambridge Research Center;
  • Martin Rinard, assistant professor in LCS;
  • Don Dailey, systems administrator in LCS;
  • Charles Leiserson, professor in LCS.
Matteo, who is an ML fan (he is coauthor of FFTW, the Fastest Fourier Transform in the West, a high-performance C code generated by an ML metaprogram), originally noticed the ICFP contest announcement. Since the contest was to be run on a 4-processor shared-memory machine, and the SuperTech group had been developing the Cilk multithreaded language specifically for this kind of platform, Charles and Matteo decided to take a look when the challenge task was posted and see if we wanted to enter.

Reid, Harald, and Don of the SuperTech group agreed to participate, as did Chris, who had been a member of the group before getting his Ph.D. from MIT several years ago. All were experienced Cilk programmers (which isn't hard, since Cilk takes almost no time to learn). At the last minute, we also gathered Dan and Martin, who had no Cilk experience, but who knew C well.

When the challenge task of a Pousse-playing program was announced at 5pm on Thursday, August 27, 1998, we immediately decided to enter. The SuperTech group has a long history of world-competitive parallel chess-playing programs, including *Socrates and Cilkchess. We knew the domain, and we had the Cilk language technology to exploit parallelism.

By midnight on Thursday, we had a working C version of a Pousse program, which, except for time control, could have been submitted. We also had an autotester running that allowed us to play versions of the program against one another. By Friday noon, we had a parallel Cilk version running with time control. All that remained was to optimize the program and tune the evaluation function. We spent the remainder of the weekend doing just that.

Our description of Cilk Pousse

  1. A brief overview of the Cilk language
  2. Parallel search algorithm
  3. Move ordering
  4. The hash table and repetitions
  5. Evaluation function
  6. Time control
  7. Testing
  8. Conclusion

A brief overview of the Cilk language

We programmed our entry in Cilk, a C-based language for multithreaded programming. Cilk is an algorithmic multithreaded language. The philosophy behind Cilk is that a programmer should concentrate on structuring the program to expose parallelism and exploit locality, leaving Cilk's runtime system with the responsibility of scheduling the computation to run efficiently on a given platform. Thus, the Cilk runtime system takes care of details like load balancing, paging, and communication protocols. Unlike other multithreaded languages, however, Cilk is algorithmic in that the runtime system guarantees efficient and predictable performance.

To give you an idea of how simple it is to write parallel programs in Cilk, here is a Cilk implementation of the familiar recursive Fibonacci program in which the recursive calls are executed in parallel:

cilk int fib (int n)
{
    if (n<2) return n;
    else
    {
        int x, y;

        x = spawn fib (n-1);
        y = spawn fib (n-2);

        sync;

        return (x+y);
    }
}
Notice that if you elide the three Cilk keywords (cilk, spawn, and sync), you obtain a C program, called the serial elision or C elision of the Cilk program. Cilk is a faithful extension of C in that the C elision of any Cilk program is always a legal implementation of the semantics of the Cilk program.

The keyword cilk identifies a Cilk procedure, which is the parallel version of a C function. A Cilk procedure may spawn subprocedures in parallel and synchronize upon their completion. A Cilk procedure definition is identified by the keyword cilk and has an argument list and body just like a C function.

Most of the work in a Cilk procedure is executed serially, just like C, but parallelism is created when the invocation of a Cilk procedure is immediately preceded by the keyword spawn. A spawn is the parallel analog of a C function call, and like a C function call, when a Cilk procedure is spawned, execution proceeds to the child. Unlike a C function call, however, where the parent is not resumed until after its child returns, in the case of a Cilk spawn, the parent can continue to execute in parallel with the child. Indeed, the parent can continue to spawn off children, producing a high degree of parallelism. Cilk's scheduler takes the responsibility of scheduling the spawned procedures on the processors of the parallel computer.

A Cilk procedure cannot safely use the return values of the children it has spawned until it executes a sync statement. If all of its children have not completed when it executes a sync, the procedure suspends and does not resume until all of its children have completed. The sync statement is a local ``barrier,'' not a global one as, for example, is sometimes used in message-passing programming. In Cilk, a sync waits only for the spawned children of the procedure to complete, not for the whole world. When all of its children return, execution of the procedure resumes at the point immediately following the sync statement. In the Fibonacci example, a sync statement is required before the statement return (x+y), to avoid the anomaly that would occur if x and y were summed before each had been computed. (In general, sync keywords cannot be inferred by the compiler, because Cilk is not functional. The compiler has no way of knowing, for example, if a void Cilk procedure is performing a side-effect on shared memory that must be waited for.) A Cilk programmer uses spawn and sync keywords to expose the parallelism in a program, and the Cilk runtime system takes the responsibility of scheduling the execution of the procedures efficiently.

The Cilk Pousse program takes advantage of some of Cilk's advanced features for nondeterministic programming. Ordinarily, when a spawned procedure returns a value, it is simply stored into a variable in its parent's frame:

x = spawn foo(y);
Cilk's inlet feature allows the returned value to be incorporated into the parent's frame in a more complex fashion. An inlet is essentially a C function internal to a Cilk procedure. Normally in Cilk, the spawning of a procedure must occur as a separate statement and not in an expression. An exception is made to this rule if the spawn is performed as an argument to an inlet. In this case, the procedure is spawned, and when it returns, the inlet is invoked. In the meantime, control of the parent procedure proceeds to the statement following the inlet. The following code illustrates how the fib() function might be coded using inlets.
cilk int fib (int n)
{
    int x = 0;
    inlet void summer (int result)
    {
        x += result;
        return;
    }

    if (n<2) return n;
    else {
        summer(spawn fib (n-1));
        summer(spawn fib (n-2));
        sync;
        return (x);
    }
}
The inlet summer() is defined to take a returned value result and add it to the variable x in the frame of the procedure that does the spawning. All the variables of fib() are available within summer(), since it is an internal function of fib().

One last advanced feature of Cilk was used in the Cilk Pousse program. Sometimes, a procedure spawns off parallel work which it later discovers is unnecessary. Cilk allows this ``speculative'' work to be aborted through the use of Cilk's abort statement. The abort statement, when executed inside an inlet, causes all of the already-spawned descendants of the procedure to terminate prematurely. Cilk takes the responsibility of tracking down all the descendants and terminating them.

Parallel search algorithm

For a game-tree search, such as that required for Pousse, the natural choice is alpha-beta minimax search with iterative deepening. Alpha-beta is a well-known technique that uses information learned from partial searches to prune branches of the tree without compromising the effectiveness of the search. Alpha-beta search is an inherently serial algorithm, however, and thus we had to parallelize it.

Alpha-beta search is described in any textbook on artificial intelligence that covers heuristic search, so we shall assume some familiarity with this searching strategy. The basic idea is that if Player X makes a move in a position that is so good that Player O won't make the move that leads to the position, then there is no point in searching X's other moves from that position. Those additional moves can be pruned. In order to get maximal pruning, therefore, it is advantageous to search the moves at a node in the search tree in best-first order.

Iterative deepening is the process of first looking 1 move ahead, then starting from scratch searching 2 moves ahead, and so on until you run out of time. This strategy has the advantage that when the time limit expires (30 seconds in the case of the Pousse competition), the program always has a move available to play, namely the best move from the most recently completed iteration. Since the cost of each iteration grows exponentially, little time is lost in searching all iterations up to the last.
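
As a rough sketch of this driver loop (not the actual Cilk Pousse code), the structure might look as follows in C. The names search_to_depth() and time_expired() are hypothetical stand-ins for the real search routine and timer.

/* Sketch of an iterative-deepening driver.  search_to_depth() and
   time_expired() are hypothetical helpers, not the actual Cilk Pousse code. */
typedef struct position position;      /* opaque board position */
typedef int move_t;
#define NO_MOVE (-1)

extern move_t search_to_depth(position *root, int depth);
extern int    time_expired(void);

move_t iterative_deepening(position *root)
{
    move_t best = NO_MOVE;     /* best move from the last completed iteration */
    int depth;

    for (depth = 1; ; depth++) {
        move_t candidate = search_to_depth(root, depth);
        if (time_expired())    /* this iteration may not have finished; discard it */
            break;
        best = candidate;      /* iteration completed: remember its move */
    }
    return best;               /* there is always a move available to play */
}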

In fact, iterative deepening works better than a straight alpha-beta search to a given depth. Recall that alpha-beta search is most efficient if the best move at a node in the search tree is searched first. But what is the best move at a given node? (Isn't that what the search is supposed to determine?) Although it is not possible to predict the best move from a node, a good guess is the best move from the last time this position was searched. During an iteration, this best-move information is stored in Cilk Pousse's hash table, and then looked up during subsequent deeper iterations. For instance, the depth-9 search places best-move information in the hash table to provide a starting move to try on the depth-10 search. This information speeds up the depth-10 search significantly, because alpha-beta pruning causes the search to examine many fewer nodes. We'll further discuss this issue in the section on move ordering.

The basic alpha-beta searching algorithm is inherently serial, using information from the search of one child of a node to prune subsequent children. For Cilk Pousse, we wanted to take advantage of parallelism. But, when children are searched in parallel, it is hard to use information gained from searching one child to prune another. If one looks at an optimal game tree, however, one finds an interesting property: all of the nodes are either full (all of the children are searched) or singular (only one of the children is searched).

Cilk Pousse's parallel search algorithm takes advantage of this property. It first searches what it considers to be the best child, just like serial alpha-beta search. When that child returns, it may be that the alpha-beta algorithm prunes the rest of the children (a so-called beta-cutoff), and the search returns immediately. Otherwise, Cilk Pousse's algorithm speculates that the node is full, and it spawns off all the remaining children in parallel. If one returns with a score that causes a beta-cutoff, the other children are aborted, since their work has been rendered unnecessary. (A similar parallel algorithm was used for our *Socrates parallel chess program.) On 4 processors, we see a speedup of over 3 in the 30-second search.

Some of our opponents adopted the simple strategy of creating four parallel threads, each of which searches 1/4 of the moves at the root. This strategy makes ineffective use of parallelism, since many moves are searched that would otherwise be pruned in a serial search. In contrast, the recursive parallel search used by Cilk Pousse exposes much more parallelism without searching many nodes that would otherwise be pruned. Cilk's runtime system automatically schedules the abstract computation to make efficient use of the available processors.

Below is some code abstracted from the Cilk Pousse program to illustrate how the parallel search strategy is supported by Cilk's linguistic primitives. Although the code has been stripped of some of the low-level details, it is almost completely usable as is. It also serves to demonstrate how easy it is to write a complex parallel search algorithm in Cilk. (The principal missing elements are the move-ordering code and checking of hash-table moves.) Pay particular attention to the inlet catch, which contains an abort statement when a beta-cutoff occurs.

cilk int search( position *prev, int depth, int update_move )
{
    position   cur;
    int        bestscore = -INF;
    int        x;
    int        mv;
    int        sc;
    int        cutoff = 0;
    int        aborted = 0;

    inlet void catch(int ret_sc, int ret_mv) {
        ret_sc = -ret_sc;

        if (ret_sc > bestscore)
        {
            bestscore = ret_sc;
            if (ret_sc >= cur.beta) {
                aborted = 1;
                abort;
            }

            if (ret_sc > cur.alpha) cur.alpha = ret_sc;
        }
    }

    x = make_move( prev, &cur, update_move );

    sc = eval(&cur);
    if (abs(sc) >= MATE || depth <= 0)
    {
        return( sc );
    }

    cur.alpha = -prev->beta;
    cur.beta  = -prev->alpha;

    /* try the first move serially (hopefully the best move) */

    if (!cutoff)
    {
        sc = spawn search( &cur, depth-1, first_move );
        sync;   /* the sync makes this first move serial */

        sc = -sc;
        if (sc > bestscore)
        {
            bestscore = sc;
            if (sc > cur.alpha) cur.alpha = sc;
            if (sc >= cur.beta) {
                cutoff = 1;
            }
        }
    }

    /* cycle through all the remaining moves */

    if (!cutoff)
    {
        for (mv = second_move; mv <= last_move; mv++)
        {
            catch( spawn search( &cur, depth-1, mv ), mv );
            if (aborted) break;
        }
    }
    sync;  /* this sync is done outside the loop to search these
              moves in parallel */

    return( bestscore );
}

Move ordering

The order in which moves are searched can have a great effect on the number of nodes that must be visited to complete a search. If we search the best move first, we may only have to search that one move, but if we search the best move last, we may have to search all moves. Even if we don't try the best move first, trying a ``good'' move first is still helpful, as the score returned by this move may allow us to narrow the search window (the region between alpha and beta) for future moves.

There are two heuristics we use to improve our move ordering.

The first is that if there is a hash-table move available (see the section on the hash table), we search that move first. This move was the best move last time we visited this position, so it is likely to be so again.

The second heuristic is to keep a table of countermoves. Each time in the search that a move is found which beats beta, we store that move in a table indexed by the opponent's previous move. (There is a separate table for each side.) In effect we are building up a table that says, ``Last time the opponent made move x, move y was a good response.'' When we begin a search of a node, we do a lookup in this table (indexed by the previous move) to find a move. We search this move first if there is no hash-table move (which is most of the time), or second if there is a hash-table move. As a further improvement, it is possible to store the last r good countermoves for each move, and try those moves early in a search. Cilk Pousse actually uses r=2.
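
A minimal sketch of such a countermove table with r=2 appears below. The array bounds and the integer move encoding are assumptions for illustration, not the actual Cilk Pousse data structures.

/* Sketch of a countermove (reply) table with r = 2 slots, indexed by side to
   move and by the opponent's previous move.  MAX_MOVES and the integer move
   encoding are assumptions, not the actual Cilk Pousse layout. */
#define MAX_MOVES 64           /* assumed upper bound on distinct moves */
#define REPLIES    2           /* r = 2 stored replies per opponent move */

static int counter[2][MAX_MOVES][REPLIES];

/* Record a move that produced a beta-cutoff as a reply to prev_move. */
void remember_counter(int side, int prev_move, int good_move)
{
    counter[side][prev_move][1] = counter[side][prev_move][0];  /* age slot 0 */
    counter[side][prev_move][0] = good_move;
}

/* Fetch the stored replies so the search can try them early. */
void lookup_counters(int side, int prev_move, int replies[REPLIES])
{
    replies[0] = counter[side][prev_move][0];
    replies[1] = counter[side][prev_move][1];
}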

Using this countermove table had a profound effect on the speed of the search. On the (admittedly few) positions on which we tested this heuristic, we found that using this reply table gave us a 5 to 10 times reduction in the number of nodes that we needed to visit when searching those positions to a given depth.

The hash table and repetitions

Our program keeps a hash table of searched positions. This is the standard transposition table commonly used in chess programs, with a few modifications to suit Pousse. Briefly, for each position searched, a transposition table stores

  • the move chosen by the search,
  • the score the search returned, and
  • the depth of the search.
Then, each time a search of a position is begun, a hash-table lookup is done to see if that position has been searched before. If so, the hash table information is used for two purposes:
  • Improving move ordering: The hash-table move is used as the first move searched. This is likely to be the best move, and starting with a good move reduces the amount of searching to be done.
  • Pruning the search: If the previous search of this node was ``deep enough,'' then we may be able to use the score and avoid researching the position. (For the Pousse program, we also require that the hashed search result ends with the same side-to-move as does the current search, as side-to-move can have a large (and varying) effect on the value of a position).
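
As a sketch, an entry and probe for such a table might look like the following. The field names, the 64-bit key, and the table size are assumptions; the actual Cilk Pousse layout may differ.

/* Sketch of a transposition-table entry and probe (assumed names and sizes). */
#include <stdint.h>

typedef struct {
    uint64_t key;     /* 64-bit hash of the position */
    int      move;    /* best move found by the previous search */
    int      score;   /* score that search returned */
    int      depth;   /* depth of that search */
    int      side;    /* side to move, so scores are compared like-for-like */
} tt_entry;

#define TT_SIZE (1 << 18)                 /* assumed table size (a power of two) */
static tt_entry tt[TT_SIZE];

/* Return the entry for this key, or NULL if the slot holds another position. */
tt_entry *tt_probe(uint64_t key)
{
    tt_entry *e = &tt[key & (TT_SIZE - 1)];
    return (e->key == key) ? e : 0;
}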

The hash function on positions used the bitwise exclusive-or (XOR) of a collection of random numbers stored in a table. For each square on the board and for each piece type (X, O, or empty) on the square, the table contains a random number. The hash function is simply the XOR of the table entries for the pieces on the board. Although this function might sound expensive to compute, it can be updated incrementally as moves are made, since XOR is its own inverse.
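
The following is a small sketch of this style of XOR hashing with incremental update. The names, the 8-by-8 bound, and the placeholder random-number source are assumptions for illustration only.

/* Sketch of the XOR position hash described above (assumed names and bounds). */
#include <stdint.h>
#include <stdlib.h>

#define MAX_SQ 64                       /* assumed maximum number of squares */
enum { EMPTY = 0, PIECE_X = 1, PIECE_O = 2, NPIECE = 3 };

static uint64_t zobrist[MAX_SQ][NPIECE];

static uint64_t random64(void)          /* placeholder random-number source */
{
    return ((uint64_t)rand() << 32) ^ (uint64_t)rand();
}

void init_hash(void)
{
    int sq, p;
    for (sq = 0; sq < MAX_SQ; sq++)
        for (p = 0; p < NPIECE; p++)
            zobrist[sq][p] = random64();
}

/* Full hash: XOR of one table entry per square. */
uint64_t hash_position(const int board[MAX_SQ])
{
    uint64_t h = 0;
    int sq;
    for (sq = 0; sq < MAX_SQ; sq++)
        h ^= zobrist[sq][board[sq]];
    return h;
}

/* Incremental update when one square changes: XOR removes the old piece and
   XOR adds the new one, because XOR is its own inverse. */
uint64_t hash_update(uint64_t h, int sq, int old_piece, int new_piece)
{
    return h ^ zobrist[sq][old_piece] ^ zobrist[sq][new_piece];
}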

Detection of repeated positions was essential for the game of Pousse, since the side that causes a repeated position loses. Most game-playing programs use their hash table to detect repeated positions. When a position is encountered, it is marked as ``open'' while the search of its descendants proceeds. Once the search on that node is complete, the position is marked as ``closed.'' If the search encounters an open position, the position is repeated. Unfortunately, this simple strategy does not work for parallel search, since it is hard to tell which positions were opened by your ancestor as opposed to by a parallel thread.

To detect repeated positions, Cilk Pousse maintains for every board position a pointer to its parent's position. When evaluating a position, we walk the chain of ancestors from the current position backwards to the beginning of the game and compare to see if the same position already appears in the chain. (We actually check only every second position, since a repeated position must have the same side-to-move.) Each position is labeled with a hash value to make this comparison easy and fast. We hash each position into a 64-bit number and hope that the hash function is perfect, i.e., two different positions do not collide.

An optimization was to stop the scan of the ancestor chain as soon as an irreversible move was found. In the Pousse game, the number of pieces on the board cannot decrease during the game. Thus, any move that does not push a piece off the board is irreversible: no position preceding it can ever appear again later in the game. Consequently, there is no point in looking for a repeated position once the scan reaches an irreversible move.
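
A sketch of this ancestor-chain scan appears below, with assumed field names. It compares only every second ancestor (same side to move) and stops at the first irreversible move on the path.

/* Sketch of repetition detection by walking the ancestor chain (assumed names). */
#include <stdint.h>

typedef struct node {
    struct node *parent;        /* position this one was reached from */
    uint64_t     hash;          /* 64-bit hash of the position */
    int          irreversible;  /* nonzero if the move leading here did not push a piece off */
} node;

/* Return nonzero if cur repeats an earlier position on the path leading to it. */
int is_repetition(const node *cur)
{
    const node *p = cur;
    int steps = 0;

    /* Stop as soon as a move on the path back is irreversible: positions
       before that move have fewer pieces and can never match cur. */
    while (!p->irreversible && p->parent != 0) {
        p = p->parent;
        steps++;
        /* only every second ancestor has the same side to move */
        if ((steps & 1) == 0 && p->hash == cur->hash)
            return 1;           /* 64-bit hashes are assumed not to collide */
    }
    return 0;
}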

Towards the end of the game, most moves are reversible, and we worried that the scanning of the ancestor chain might require too much time. We had not observed this problem, but on the other hand, we had not played many long games. Consequently, we made a small hash table of all positions already encountered in the actual game, since these are fixed throughout the search. When a position is encountered in the search, a check is first made into this hash table. If the position is not there, then the ancestor chain is scanned as before, terminating at the root of the search tree. Although this modification did not appear to have any effect on the performance of Cilk Pousse, we decided to incorporate it into the final submission because it produced a better guarantee of performance in the endgame.

The repeated position rule causes a major wrinkle in using a hash table for Pousse. Specifically, it can be dangerous to use the score from the hash table. (The ``best move'' in the hash table is always safe to use, since it is only a heuristic suggestion.) The problem is that the score of a position may depend on the path that leads to it. Specifically, if the path contains the position, then the score is a loss for the player on move.

One solution to this problem is just to ignore it and always accept the score, hoping that the benefits from the correct uses of the score outweigh the problems from using incorrect scores. Some chess programs adopt this approach. For Pousse, however, the value of a repeated position is a loss, rather than a draw as in chess, and many repeated positions occur in the endgame. Thus, we decided it would be too dangerous to adopt this approach. The alternative of not using the hash-table scores at all struck us as wasteful, however, because it leads to excessive redundant work.

Our solution was to select a subset of the circumstances in which it is safe to use the score, and only use the score in those circumstances. We made the observation that when an irreversible move is made, no position reached before that move can ever be reached again. Our algorithm for using hash table scores was the following. When at position p:

  • Store a score for p in the hash table only if the move leading to p is irreversible.
  • Look up a hash score for p only if the move leading to p is irreversible.
By ensuring that the move leading to position p is irreversible, we guarantee that the search beginning at p is not (was not) influenced by the positions leading up to p.

There are two optimizations to this algorithm that we used, both of which take advantage of additional circumstances when the moves leading to p are irrelevant.

  • A1. If the lookup move/score beats beta and is an irreversible move, then it is safe to use.
  • A2. If the searched move/score beats beta and is an irreversible move, then it is safe to put into the hash table.
  • B1. If the lookup move/score is a losing score (i.e., doesn't beat beta), and the depth is 1, then use the score.
  • B2. If the searched move/score is a losing score, and the depth is 1, and we do not lose because of a repeat, then it is safe to put the score into the hash table.

Evaluation function

Immediately upon receiving the challenge task, we played Pousse games among ourselves to get a feel for how the game is played and decide what terms to put in our evaluation function. By the end of Thursday evening, we had come up with four terms:

  • Piece count: Each of your pieces is worth a constant number of points. Initially, for an N-by-N board, we assigned the weight 4N to this term, but later, we reduced it to 4.
  • Centrality: Each of your pieces receives a bonus equal to the Manhattan distance from the piece to the nearest corner. That is, a piece on file i and rank j receives a bonus of min(i,N-i) + min(j,N-j).
  • Frank-squared: We coined the term ``frank'' to refer to a file or rank. You score k² points for each frank having k of your pieces.
  • Connectivity: (The final program did not use this term.) Each pair of your pieces that are orthogonally adjacent receives a bonus of 2 points. Each diagonally adjacent pair receives 1 point.
Our initial evaluation function was obtained by summing these terms for Player X and subtracting the analogous sum for Player O.
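
To make the terms concrete, here is a from-scratch sketch of the one-sided evaluation (the real program computes it incrementally, as described later). The board layout, 0-based indexing (so centrality becomes min(i, N-1-i) + min(j, N-1-j)), and names are assumptions for illustration.

/* Sketch of the evaluation terms for one player (assumed layout and names). */
#define MAXN 16                          /* assumed upper bound on board size */
enum { EMPTY = 0, PLAYER_X = 1, PLAYER_O = 2 };

static int min_int(int a, int b) { return a < b ? a : b; }

int eval_side(int board[MAXN][MAXN], int n, int side, int piece_weight)
{
    int i, j, score = 0;
    int file_count[MAXN] = {0}, rank_count[MAXN] = {0};

    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            if (board[i][j] == side) {
                score += piece_weight;                      /* piece count */
                score += min_int(i, n - 1 - i)
                       + min_int(j, n - 1 - j);             /* centrality */
                file_count[i]++;
                rank_count[j]++;
            }

    for (i = 0; i < n; i++)                                 /* frank-squared */
        score += file_count[i] * file_count[i]
               + rank_count[i] * rank_count[i];

    return score;
}

/* The overall evaluation is eval_side(..., PLAYER_X, w) - eval_side(..., PLAYER_O, w). */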

Once we had a working autotester capable of playing multiple games, we tried changing the relative weights of the terms, as well as adding and removing terms. We ran these tests mostly on 6-by-6 boards and occasionally on 4-by-4, 5-by-5, or 7-by-7 boards. Both sides were run with a fixed-depth search, rather than with a timed search, because we wanted to gauge the benefit of the heuristic before worrying about optimizing performance.

We found that the initial weight 4N for the piece-count term was too large. Reducing it to 0 improved the program. On Saturday night, however, we used a 4-processor Alphaserver to run a minitournament consisting of 3600 games played on a 6-by-6 board. Among 9 programs with different piece values ranging from -4 to 24, we found that a piece-count value of 4 produced a statistically significant advantage, which we then used for our submission. As a simple optimization, we incorporated piece-count into the centrality bonus by building a single look-up table indexed by square containing the sum of the centrality and the piece-count bonuses.
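
As a sketch of that combined table (assumed names, 0-based indices; not the actual Cilk Pousse code), the per-square bonus might be built once per board size as follows.

/* Sketch: one table entry per square holding piece weight plus centrality. */
#define MAXN 16
static int square_bonus[MAXN][MAXN];

void build_square_bonus(int n, int piece_weight)
{
    int i, j;
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++) {
            int ci = (i < n - 1 - i) ? i : n - 1 - i;   /* distance to nearer edge */
            int cj = (j < n - 1 - j) ? j : n - 1 - j;
            square_bonus[i][j] = piece_weight + ci + cj;
        }
}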

We tried a couple of nonlinear functions for centrality, but they did not help and sometimes hurt. Leaving out the centrality term hurt, however, because the program would tend to lose by tactics in the early game. Apparently, it is important in the opening to favor the center. Later in the game, centrality appears not to be as important. So, we stuck with our original term.

We tried replacing the frank-squared term with either a frank-cubed or a frank-to-the-1.5 term, but neither performed better nor worse when autotested against the frank-squared term. So, we stuck with our original term.

We found that the connectivity term produced no effect one way or the other, so we removed it altogether. We were glad to get rid of it, as it took a long time to compute.

Over the three days of the competition, we tried changing the evaluation function in many ways, but all we ended up doing was eliminating the connectivity term and tuning the piece-count term. No new evaluation term outperformed the original (and many did worse). So, we just left the evaluation function as it was, and concentrated instead on optimizing it.

Our main technique for optimizing the performance of the evaluation function is to update the score incrementally rather than to compute it from scratch for each new position. In addition to the position itself, we maintain a count for each frank and player of the number of pieces that that player has in the frank. When a move is made, we update only the frank counts that change. Then, we sum the squares of the frank counts (using table lookup). This process changes what would naively be a quadratic-time calculation into a linear-time evaluation. Likewise, we update the piece-count/centrality score incrementally when the move is made by subtracting out the old value of each square that changes and adding in the new. We incrementally update the total number of pieces for each player, which is used for repetition detection. We also incrementally update the hash function by XOR'ing out the old random values for squares that change and XOR'ing in the new values.
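
The following is a small sketch of the incremental frank-count update, with assumed names. When a square changes occupancy, only the affected file and rank counts are adjusted, and the frank-squared total is patched using a precomputed square table rather than recomputed from scratch.

/* Sketch of incremental frank-squared maintenance (assumed names and layout). */
#define MAXN 16

static int sq_table[MAXN + 1];          /* sq_table[k] = k*k, filled at startup */
static int file_count[2][MAXN];         /* pieces per file, per player */
static int rank_count[2][MAXN];         /* pieces per rank, per player */
static int frank_score[2];              /* current sum of squared frank counts */

void init_sq_table(int n)
{
    int k;
    for (k = 0; k <= n; k++)
        sq_table[k] = k * k;
}

/* Player `side` gains (delta = +1) or loses (delta = -1) a piece on (file, rank). */
void update_franks(int side, int file, int rank, int delta)
{
    frank_score[side] -= sq_table[file_count[side][file]]
                       + sq_table[rank_count[side][rank]];
    file_count[side][file] += delta;
    rank_count[side][rank] += delta;
    frank_score[side] += sq_table[file_count[side][file]]
                       + sq_table[rank_count[side][rank]];
}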

Time control

The 30-second time control was implemented with a hack. Cilk Pousse spawns a timer thread that checks the time every few milliseconds, setting a flag at timeout. After spawning the timer thread, it spawns the main search routine. The search routine is not spawned once, however, but many times, due to iterative deepening. The loop spawning the search needs to sync after every search, in order to save the result of the search. In Cilk one cannot wait for a single thread to complete. Instead, the sync statement waits for all threads spawned by the current procedure to complete. In particular, it waits until the timing thread terminates, which is not the behavior we wanted.

During the 72-hour contest, we completely missed the obvious solution to this problem. The obvious solution is to spawn the timer thread and then spawn a separate procedure performing the iterative-deepening loop. In this way, the iterative-deepening procedure can sync without having to worry about the timer thread. Instead, we implemented a complicated protocol, such that whoever returns first (whether the search or the timer) terminates the other thread. The details of the mechanism are in the code. We don't recommend it for light reading.
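
For concreteness, here is a rough Cilk-style sketch of that obvious structure, with hypothetical names (elapsed_seconds, search_root, output_move). It is not the protocol used in the actual submission, and a real version would also have to cut off the search in progress when the flag is set.

/* Rough Cilk sketch of the "obvious" time-control structure (hypothetical names). */
volatile int timed_out = 0;
int best_move = -1;

cilk void timer_thread(void)
{
    /* the real timer checks the clock every few milliseconds */
    while (!timed_out && elapsed_seconds() < 27.0)
        ;
    timed_out = 1;
}

cilk void deepen(position *root)
{
    int depth, mv;
    for (depth = 1; !timed_out; depth++) {
        mv = spawn search_root(root, depth);   /* assumed wrapper around search() */
        sync;          /* this sync no longer waits on the timer thread */
        if (!timed_out)
            best_move = mv;        /* keep only fully completed iterations */
    }
    timed_out = 1;                 /* lets the timer exit if the search finished early */
}

cilk void play(position *root)
{
    spawn timer_thread();
    spawn deepen(root);
    sync;
    output_move(best_move);
}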

The timeout was conservatively set to 27 seconds, figuring that 3 seconds was long enough for us to shut down. Chris suggested that we should use 25 seconds, just in case. While nobody else took the proposal seriously, we got a good scare during the tournament. In fact, time control turned out to be the most problematic aspect of Cilk Pousse after the tournament started. The tournament directors reported that our program was systematically running out of time. The program was printing an answer after 27 seconds, but it took as much as 5 more seconds before control was returned to the referee program.

Although the reason for this delay has not yet been completely understood, it appears to be a weird interaction between the Cilk memory allocation mechanism, which uses mmap(), the Linux kernel, and the referee. In short, Linux seems to need a long time to dump all mmap-ed pages to disk (even though those pages are not needed afterwards and could be discarded). This process occurs after the Cilk program exits, but before the program is officially declared dead by Linux. The referee was killing processes created by the Cilk program after it printed an answer, but before page dumping was complete. Apparently, this action by the referee, which involves a tight loop of system calls, interferes with and further delays the dump. The problem disappeared after the tournament directors modified the referee to wait 30 seconds in all cases before killing a game program, even if it prints an answer earlier.

Testing

To improve our program, we implemented an autotester that allowed us to play games between different versions of the program, as well as with a human. We had considerable power at our disposal (several 4-processor 467-MHz Alphaservers, a bunch of 8-processor Sun Enterprise 5000's, and a couple of Pentium Pro multiprocessors running Linux). A first version of the tester was ready by Thursday night of the contest. It enforced the repetition rule, and it also allowed humans to play. By Friday morning, the 30-second timeout rule was also in place. Over the next couple of days, various improvements were made, including the possibility of playing multiple games, which was the ultimate goal of the testing program, and an improved human playing mode that allowed moves to be retracted.

By Friday night, the team working on evaluation functions had implemented several strategies that they wanted to test. The two professors, who were on this team, set up the autotester to run overnight to play hundreds of games. Embarrassingly, the professors forgot to invoke the autotester with the -r option, which randomizes the opening. The autotester played the same games over and over all night, wasting an estimated 67,248,000,000,000 Alphaserver cycles. (It seemed strange in the morning that all games in a match were always won by the same program, or else both sides had won exactly 50 percent.) The Alphaservers were later used effectively to tune the terms of the evaluation function.

On Saturday, we exchanged binaries with Bradley Kuszmaul of Yale, who wrote the Alpha Beta Soupa entry. Bradley was one of the original developers of Cilk when he was a grad student at MIT. He was also coauthor of MIT's *Socrates parallel chess program, which uses a parallel-search algorithm similar to that of Cilk Pousse. We made a gentlemen's agreement not to analyze each others' binaries, but just run them to test our own programs against an adversary other than ourselves. We gained confidence from a few test runs against Bradley's program, but having Bradley's program around may have been more of a distraction than providing any actual help.

Because Bradley is an expert in parallel chess programs and a developer of Cilk, we naturally assumed that he was programming in Cilk. After the contest was over, however, we discovered that he had eschewed parallelism and written his program entirely in C. Perhaps he would have done better (he tied for second, but lost in the playoff) had he used the technology he helped develop. (Serves you right, Bradley, but we still love you.)

One of the difficult bugs that arises in parallel code is that of a data race. A data race occurs when two logically parallel threads, holding no lock in common, access the same location, and at least one of the two threads performs a write. A data-race bug is usually fixed by ensuring the two threads acquire a common lock during their accesses, or else by serializing the computation so that the threads do not operate in parallel. Race bugs are insidious, because they often do not arise during ordinary execution, but can cause serious errors when they do arise.
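
A tiny, self-contained illustration of a data race (hypothetical code, not from Cilk Pousse): two logically parallel children update the same counter with no common lock, so one update can be lost.

int counter = 0;

cilk void bump(void)
{
    counter = counter + 1;    /* unprotected read-modify-write of shared data */
}

cilk void racy(void)
{
    spawn bump();
    spawn bump();             /* runs logically in parallel with the first bump() */
    sync;
    /* counter may be 1 instead of 2 if the two updates interleave */
}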

The Cilk development environment provides a race-detection tool, called the Nondeterminator. By using this tool, we were able to certify with high confidence that no inadvertent race bugs existed in the code. The Nondeterminator executes the Cilk code serially on a given input, using a novel data structure that keeps track of what threads operate logically in parallel. Every read and write by the program is instrumented to see if a data race exists. The Nondeterminator is not a verification tool, since it simulates actual execution on a given input, but it does provide certain guarantees of finding races if they exist.

We intentionally left one race bug in the code. We do no locking on accesses to the hash table. Consequently, a read may think it is accessing the score of a given position, but the hash-table entry may be nonatomically updated by another thread between the read of the position and the read of the score. We determined that the overhead for locking would weaken the program more on average than if we did no locking and risked that a race might occur during the competition that actually affected the outcome. Thus, Cilk Pousse is provably non-bug-free. :)

Conclusion

The 72-hour contest was exhausting, but a great experience for all the team members. We entered the contest to test Cilk technology, and we feel it measured up well. We were the only team with a sophisticated parallel search algorithm, the coding of which was well supported by the Cilk language. As a tribute to the effectiveness of Cilk, most of our time was spent not on parallel programming, but on the other aspects of the task. We indeed benefited from our experience with chess-playing programs, but mainly in that it allowed us to organize a much larger team than would have been possible had we no experience with computer game playing.

We would like to thank the ICFP tournament directors, Marc Feeley and Olin Shivers, for their efforts in running the tournament. They worked tirelessly to get as many programs as possible to run when system problems arose. We are grateful for their dedication to what must have been an onerous task.

For questions, comments, or to negotiate a license for Pousse technology ;-), please contact: cilk-support@supertech.lcs.mit.edu

Last updated: $Date: 1998/10/20 03:17:29 $ by $Author: prokop $