Cilk Plus Solver for a Chess Puzzle or: How I Learned to Love Fast Rejection

Published: 02/14/2013, Last Updated: 02/14/2013

In honor of Valentine’s Day, I’ll note that there is much to be said for fast rejection. It saves time and effort that can be better spent searching elsewhere. In this article I’ll discuss a parallel algorithm for solving a chess puzzle that exploits fast rejection.  It makes a good demonstration of basic Intel® Cilk™ Plus programming to solve an interesting puzzle.

The puzzle is whether a player’s eight chess pieces (excluding pawns) can attack all squares on a chess board, assuming that the two bishops must be on opposite-color squares.  Some others and I published a serial algorithm for the problem in 1989.  The algorithm relies on an interesting rejection test that quickly rejects large portions of the search space.  It places more than eight pieces on the board at once and checks whether all squares are under attack, ignoring the blocking effects of the pieces.  If they are not, then no subset of those pieces can attack all squares either.  The paper’s opening paragraph is worth reading for Skiena’s sly remark about pruning; we were surprised that the editors of an academic journal kept it.
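To make the rejection test concrete, here is a minimal sketch in standard C++ that handles rooks only.  The names Bitboard, Square, rookAttacks, coverage, and reject are illustrative assumptions and are not taken from the attached code, which deals with all piece types.  The sketch also assumes that a piece does not attack its own square.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Bit (8*row + col) is set when that square is attacked.
using Bitboard = std::uint64_t;
constexpr Bitboard kAllSquares = ~Bitboard{0};

struct Square { int row, col; };

// Squares a rook on (row, col) attacks when blocking is ignored.
Bitboard rookAttacks(int row, int col) {
    assert(0 <= row && row < 8 && 0 <= col && col < 8);
    Bitboard a = 0;
    for (int i = 0; i < 8; ++i) {
        a |= Bitboard{1} << (8 * row + i);   // whole rank
        a |= Bitboard{1} << (8 * i + col);   // whole file
    }
    // Assumption: a piece does not attack the square it stands on.
    return a & ~(Bitboard{1} << (8 * row + col));
}

// Union of the attacks of rooks placed on every candidate square at once.
Bitboard coverage(const std::vector<Square>& candidates) {
    Bitboard u = 0;
    for (const Square& s : candidates) u |= rookAttacks(s.row, s.col);
    return u;
}

// Fast rejection: coverage can only shrink when pieces are removed, so if
// the full candidate set misses a square, every subset misses it too.
bool reject(const std::vector<Square>& candidates) {
    return coverage(candidates) != kAllSquares;
}
```

For example, eight rooks along one rank pass the test, while eight rooks on the main diagonal are rejected because no rook attacks a diagonal square.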

The paper notes that the original program took 75 minutes in 1988 on a Sun 3/360. Machines have gotten much faster since then.  I lost the original code, but was able to rewrite a parallel version from scratch, without the one-level look-ahead mentioned in the paper.  The parallel version can solve the same problem in less than two seconds on a high-end 16-core machine (a two-socket Intel(R) Xeon(R) Processor E5-2670L).

Here I will explain the parallelization of the algorithm.  I’ll assume that you have already at least skimmed over sections 2-3 of the paper to understand the serial algorithm.   

The code is attached.  It is a single source file.  I recommend reading it top to bottom.  Two macros affect its behavior:

  • Compile with -DPARALLEL=0 to build the serial version.
  • Compile with -DBISHOPS_CAN_BE_ON_SAME_COLOR=0 to solve the original problem (bishops on opposite colors), for which I stated times.
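One common way to wire up a switch like PARALLEL is to define the Cilk Plus keywords away when building serially.  The sketch below is an assumption about how that can be done, not a copy of the attached source.  The routine fib is a hypothetical stand-in, and PARALLEL defaults to 0 here (unlike the attached code) so that the sketch builds with any C++ compiler.

```cpp
#include <cassert>

// Assumption: default to serial so no Cilk Plus compiler is required.
#ifndef PARALLEL
#define PARALLEL 0
#endif

#if PARALLEL
// Real keywords; requires a compiler with Cilk Plus support.
#include <cilk/cilk.h>
#else
// Serial elision: with the keywords defined away, the same source
// is an ordinary serial C++ program.
#define cilk_spawn
#define cilk_sync
#endif

// Hypothetical example routine, not from the attached source.
long fib(int n) {
    assert(n >= 0);
    if (n < 2) return n;
    long a = cilk_spawn fib(n - 1);   // may run in parallel with the next call
    long b = fib(n - 2);
    cilk_sync;                        // wait for the spawned call before reading a
    return a + b;
}
```

Because the serial build is the same source with the keywords removed, timing both builds gives a direct serial versus parallel comparison.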

Removing the bishop constraint approximately doubles the work.  I made the harder problem (unconstrained bishops) the default because it enables the program to show some solutions, and modern machines are fast enough to solve it within my patience limit.


Parallelizing with Cilk Plus requires only minor changes to the serial code.  The following sections explain these changes.


The algorithm performs recursive divide-and-conquer.  See the paper for details.  Here is the key routine:

void Search( const Board& b ) {
    if( !b.reject() ) {
        int i = b.chooseAxis();
        if( i<0 ) {
            // Found a weak solution
            ++WeakCount;
            if( b.strongAttacks().isAll() )
                // Found a strong solution
                if( !b.hasSuperposition() )
                    // Found solution with no superposition.  Print it.
                    Output << b << std::endl;
        } else {
            // Unfold on axis i and search both halves in parallel
            cilk_spawn Search( Board(b,i,0) );
            Search( Board(b,i,1) );
        }
    }
    // implicit cilk_sync
}

To parallelize it, I had to indicate that the two recursive calls to Search can run in parallel.  To do this, I prefixed the first call with cilk_spawn, which says that the caller can keep on going without waiting for the callee to return.   I could have also inserted a cilk_sync after the two calls, which would say to wait until the spawned callee returns, but I didn’t since it would be redundant in this example.  Cilk Plus always has an implicit cilk_sync at the end of a routine.
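For readers without a Cilk Plus compiler, the spawn and sync semantics can be approximated in standard C++ with std::async.  This is a swapped-in technique, not what the attached code does: std::async may create a thread per task, whereas Cilk Plus uses a work-stealing scheduler.  The function countLeaves and its spawnDepth cutoff are hypothetical names for illustration.

```cpp
#include <cassert>
#include <future>

// Counts the leaves of a complete binary tree of the given depth.
// Only the top spawnDepth levels fork a task, to limit thread creation.
long countLeaves(int depth, int spawnDepth) {
    assert(depth >= 0);
    if (depth == 0) return 1;
    if (spawnDepth > 0) {
        // Rough analogue of: cilk_spawn countLeaves(...)
        std::future<long> left = std::async(std::launch::async, countLeaves,
                                            depth - 1, spawnDepth - 1);
        // The caller keeps going without waiting for the forked call.
        long right = countLeaves(depth - 1, spawnDepth - 1);
        // Rough analogue of cilk_sync: wait for the forked call to return.
        return left.get() + right;
    }
    // Below the cutoff, recurse serially.
    return countLeaves(depth - 1, 0) + countLeaves(depth - 1, 0);
}
```

The explicit cutoff is needed only because of the thread-per-task cost in this approximation.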


There is more parallel magic in routine Search than meets the eye.  Note the two lines “++WeakCount;” and “Output << b << std::endl;”.  Both lines operate on global variables unprotected by locks.  If I were writing ordinary multithreaded code, these lines would almost surely lead to missing updates to WeakCount and non-deterministic output.  But the program is deterministic because I declared WeakCount and Output as reducers.

Reducers are Cilk Plus objects for which different threads get different “views”, and the views are automatically merged in a way that delivers the same result as the equivalent serial program.  The views of WeakCount are partial sums that are automatically added together to get the correct total.  The program checks that the total matches the value (8715) reported in the paper (bishops constrained).  The reducer Output acts like a std::ostream, except that it cleverly merges partial output such that the final output is identical to what the serial version of the program prints.
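The idea of views can be illustrated without Cilk Plus.  The sketch below is a hand-rolled model in standard C++, not the Cilk Plus reducer library: View, merge, and visit are hypothetical names.  It deliberately runs the right branch before the left one to mimic out-of-order parallel execution, and shows that merging the views left-then-right still yields the serial result.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// One "view": a private partial count plus a private chunk of output.
struct View {
    long count = 0;
    std::ostringstream out;
};

// Fold the right-hand view into the left-hand one, keeping serial order.
void merge(View& left, const View& right) {
    left.count += right.count;
    left.out << right.out.str();
}

// Visit leaves lo, lo+1, ..., hi-1, recording each one in view v.
void visit(int lo, int hi, View& v) {
    assert(lo < hi);
    if (hi - lo == 1) {
        ++v.count;
        v.out << lo << '\n';
        return;
    }
    int mid = lo + (hi - lo) / 2;
    View rightView;              // fresh view for the branch run elsewhere
    visit(mid, hi, rightView);   // right half runs FIRST, on its own view
    visit(lo, mid, v);           // left half continues on the current view
    merge(v, rightView);         // merge order, not run order, fixes the result
}
```

Calling visit(0, 8, v) on an empty View leaves v.count at 8 and v.out holding the numbers 0 through 7 in ascending order, one per line, exactly as a serial left-to-right traversal would print them.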


The code scales well because it has a lot of parallel slack (excess available parallelism) and is not memory intensive.  If you have Cilk Plus on your system, I invite you to time the serial versus parallel versions of the code.  Here are the recommended command lines for compiling it with the Intel compiler on Linux* or Windows*:

            Linux                                               Windows
serial      icc -O2 -xHost chess-cover.cpp -lrt -DPARALLEL=0    icl /O2 /QxHost chess-cover.cpp /DPARALLEL=0
parallel    icc -O2 -xHost chess-cover.cpp -lrt                 icl /O2 /QxHost chess-cover.cpp

The options presume that your compiler paths are set up for using TBB, which I used for its portable wallclock timing facility.  The option -xHost tells the compiler to optimize for the host machine’s processor.  Using it gained me about a 15% improvement.

Theoretical analysis of the program’s parallel speedup requires two numbers:

  • Work: The total number of instructions executed. 
  • Span: The number of instructions on the critical path.

The ratio work/span is a formal measure of parallelism in the program.  For example, if work=span, the parallelism equals one; that is, the program is serial.

My program does relatively little work (only 1,832 instructions on average) between fork/join actions, so it is better to use something called “Burdened Span”, which accounts for synchronization overheads.   

Since leaves in the search tree have different depths, estimating the span is a bit tricky.  So I let the Cilk view scalability analyzer (available in the Intel(R) Cilk(TM) Plus SDK) do the work for me.  It reports the following statistics for solving the problem with unconstrained bishops:

    Work :                                162,169,367,032 instructions
    Span :                                58,705 instructions
    Burdened span :                       1,123,705 instructions
    Parallelism :                         2762445.57
    Burdened parallelism :                144316.67
    Number of spawns/syncs:               129,851,246
    Average instructions / strand :       416
    Strands along span :                  43

A strand is a sequence of serial instructions between synchronization operations (spawn or sync operations).  The average strand is only 416 instructions.  Given that small amount, Cilk Plus’s low-cost fork/join is helpful.  The “burdened parallelism” is the ratio “work”/“burdened span”.  Its value indicates that this program can theoretically scale to over 100 thousand hardware threads on an ideal machine.  Of course, that is only an estimate.  However, I’ve seen the program speed up by 28x on a real 40-core machine.
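The two parallelism figures can be checked by hand from the reported work and span.  The short sketch below does that arithmetic and also applies the standard work and span laws, which bound the speedup on p workers by the smaller of p and work/span.  The function and constant names are illustrative assumptions.

```cpp
#include <algorithm>
#include <cassert>

// Parallelism as defined above: work divided by span.
double parallelism(double work, double span) {
    assert(span > 0);
    return work / span;
}

// Upper bound on speedup with p workers: no more than p (work law)
// and no more than work/span (span law).
double speedupBound(double p, double work, double span) {
    return std::min(p, parallelism(work, span));
}

// Figures reported by the analyzer for unconstrained bishops.
const double kWork = 162169367032.0;
const double kSpan = 58705.0;
const double kBurdenedSpan = 1123705.0;
```

With these inputs, parallelism(kWork, kSpan) comes to about 2762445.57 and parallelism(kWork, kBurdenedSpan) to about 144316.67, matching the report.  On 40 workers the bound is simply 40, so at that scale the core count, not the available parallelism, is the limit.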


Intel Cilk Plus enabled speeding up the puzzle solver with a few changes, and the resulting program scales well and behaves deterministically.

For more information about Intel Cilk Plus, see the website.  For questions and discussions about Intel Cilk Plus, see the forum.
