Multicore-enabling the N-Queens Problem Using Cilk++

Submit New Article

Published On :   October 29, 2009 1:00 AM PDT
Rate
 


by Justin Zhang

I was an intern at CilkArts this summer and my first engagement with Cilk++ was to solve the N-Queens problem with it. The N-Queens problem asks this question: given an n-by-n chessboard and n queens, is it possible to place the n queens on the chessboard without any conflict? Is so, how many different ways can they be placed? An arrangement is said to be conflicting if at least two queens conflict with each other. Two queens conflict if they share the same row, column, or diagonal.  For example, the following is a solution for eight queens:

 

Wikipedia provides an excellent overview of the problem.

Algorithm

The classical solution to N-Queens problem is brute force enumeration with pruning. Brute force enumeration tries every possible arrangement of the n queens and checks if any of them satisfies the non-conflicting criteria.  For example, there are n^2 locations on the board, so there are n^2 possible locations for the first queen, n^2-1 for the second, and so on. Thus we have choose(n^2, n) = (n^2)!/((n^2-n)! * n!) possible arrangements of queens.  By enumerating all of these possibilities and checking each of them, all solutions can be found. For n=10, this number is ~1.73e13.

Because there could be only one queen per row in a non-conflicting arrangement, we can constrain the first queen to locations in the first row, the second queen to the second row, and so on. This will give n possible locations for each queen, resulting in n^n possible arrangements. For n=10, this is 1e10. Observe that there could be one queen per column as well; the i-th queen will have only at most n-i+1 valid locations, because the previous i-1 queens already took i-1 columns. The number of possibilities is then reduced to n!, or 3.6e6 for n=10.

The final point to reduce enumerations is to prevent queens from sharing a diagonal. While this is straightforward to code, it is hard to provide an actual estimate of the number of enumeration after using it. The following is a pseudo-code for this enumeration plus pruning algorithm:

queens=[];     //locations of already put queens
try (int i) //put i-th queen
{
if (i>n)
{
found_a_solution;
return;
}
for j=1..n, //possible locations of i-th queen in i-th row
{
if (putting i-th queen at location (i,j)
does not conflict with previously put queens)
{
queens[i]=j; //store location of i-th queen
try(i+1); //put (i+1)-th queen
}
}
}

The complexity of this algorithm is upper-bounded by n!.

Implementation

I implemented the above algorithm in C++.  Storing the queens array variable is a bit tricky. There were three choices before me: an STL vector, an array, or pointers. There are definitely other ways this could be implemented, such as using an array for space and bookkeeping pointers manually, but here I only considered these three basic cases. They each have their pros and cons:

STL Vector Array Pointers, maintain a cactus stack

All three choices were implemented. Before multicore-enabling them, their serial versions were built first. After all, it's always good to have the serial version of a parallel implementation at hand. In the following, the black colored code is the serial version, and the modification to parallelize them with Cilk++ are highlighted in blue.

Parallelizing a piece of code with Cilk++  is easy. What I mainly needed to do was to insert cilk_spawn before every recursive call, and cilk_sync before I used the return value from such calls. Also I needed to remove any data races. The two main tools I used were Cilkscreen (a data race detector), and reducers (race-free global variables provided with Cilk++).

The following are the three implementation. (The STL version is illustrated below, and the other two implementations are available by clicking on the links.)

STL Vector Version
int n;   // # of queens                                                                                                                                   
int sn; // # of solutions
typedef std::vector<int> ilist_t; // queens array storage

//# of solutions, reducer
cilk::hyper_ptr<cilk::reducer_opadd<int> > count;


// qs: current queens. qs[i] is the column position of i-th queen
// newq: new queen column position
bool check(ilist_t qs, int newq)
{
int i, num=qs.size();
for (I = 0;I < num; i++)
{
int p = qs[i];
if (p == newq //same column
|| abs(p-newq) == num - i) //diagonal conflict
return false;
}
return true;
}

// naive way to clone a ilist_t
ilist_t clonelist (ilist_t *f)
{
ilist_t r; r.clear();
for (int i=0;isize();i++)
r.push_back((*f)[i]);
return r;
}

// search
// queens: already put queens
// row: row of the new queen to be put
void trysol(ilist_t* queens, int row)
{
if (row>=n) //got a solution
{
sn++;
(*count)++;


return;
}
for (int i=0; i<n; i++)
{
if (check(queens,i))
{
ilist_t* pnewl=new ilist_t(*queens);
pnewl->push_back(i);
cilk_spawn
trysol(pnewl,row+1); //pass it as a pointer
}
}
cilk_sync;

delete queens;
}
  • The Pointer Version (click to view code)
    (This program maintains a cactus stack made of nodes dynamically allocated. It uses function return values, instead of a global variable, to keep the number of solutions.)

Results

All three versions give correct results. Following is a list of the number of solutions for each n from 4 to 14.

n       # of solutions
4 2
5 10
6 4
7 40
8 92
9 352
10 724
11 2680
12 14200
13 73712
14 365596

Here the rotation or mirror of a solution is counted as a different solution.

Performance

The time these three programs spent on n=13 is summarized in Table 1 and Figure 1.

(seconds)

1 core

2 cores

3 cores

4 cores

STL vector

4.171

2.453

3.000

3.062

Array

2.312

1.296

1.390

1.375

Pointer

5.594

3.172

3.672

3.547

Table 1: Performance using the standard Windows memory manager

 

Figure 1: Performance using the standard Windows memory manager

Apparently all three methods encounter a bottleneck at 3 cores. By analysis, we found that all three implementations use malloc in the inner loop. The Windows malloc acquires an operating lock - and this can result in significant overhead.

At Cilk Arts, we have developed an alternate memory manager for Windows named "Miser," to remove this bottleneck. Table 2 and Figure 2 show the performance with Miser. As we can see in Figure 3, using the Miser memory manager the program achieves almost perfect speed-up.

(seconds)

1 core

2 cores

3 cores

4 cores

STL vector

3.718

1.875

1.265

0.953

Array

2.235

1.687

0.782

0.578

Pointer

5.062

2.625

1.734

1.500

Table 2: Performance using the Miser memory manager


Figure 2: Performance using the Miser memory manager


Figure 3: Speedup relative to 1 core (using the Miser memory manager)

Work/Span Measurement

It can be quite informative to see the work and span of a program, so I performed work and span measurement on the three implementations. (Charles gives a brief review of the terms in this blog post.)

There are two tools available to measure the work and span. One is a "sim-time" instrumentation tool (dubbed "Cilkometer" for the time being). It reads meta data inserted into the executable by the Cilk++ compiler and analyze the instruction flow of the executable offline, outputting a work/span estimate in number of instructions. It is useful for accurate analysis of the instruction flow.

The other tool is a real-time instrumentation tool. It uses function calls inserted into the executable. These functions keep track of the time spent in each stack frame, outputting a work/span estimation based on the actual execution of the program. The result is given in number of cycles. This latter tool is useful for estimating the actual performance of program.

They each have their pros and cons:

"Sim-time" instrumentation
  • Pro: stable, always gives the same number
  • Con: may vary from actual performance
Real-time instrumentation
  • Pro: measures actual performance
  • Con: the result can change from run to run

Tables 3 and 4, and Figure 4 summarize the instrumentation result on the STL vector implementation. For the sim-time instrumentation, it is measured during serial execution. For the real-time instrumentation, it is done on 4 cores on an Intel Core2 Quad 2.4Ghz machine.


work (instructions)

span
(instructions)

parallelism

number of strands

n=4

579358

542326

1.0683

48

n=6

1140903

564009

2.0228

454

n=8

10192604

590459

17.2622

6078

n=10

196482482

624600

314.5733

105892

n=12

5514385181

676080

8156.4093

2554366

Table 3: Sim time instrumentation results on nQueens STL vector implementation version



work (cycles)

span (cycles)

parallelism

number of strands

n=4

262494

205596

1.28

33

n=6

1640781

269217

6.09

305

n=8

17259462

344457

50.11

4113

n=10

232806402

381951

609.52

71077

n=12

5300370828

950364

5577.20

1712377

Table 4: Real-time instrumentation results on N Queens STL vector implementation version


Figure 4: Measuring Work, Span, Parallelism, Strands

First we see that the results in both cases agree with each other almost perfectly. Second, the work grows exponentially as n increases, while span only grows linearly, resulting in exponentially growing parallelism (in other words, this problem lends itself very well to multithreading).

Summary

The Cilk++ language is studied using the N-Queens problem. The N-Queens problem can be solved by brute force search and pruning.  I implemented the algorithm with several variations and tested them thoroughly. The result shows that performance is hampered by the Windows memory manager, but was excellent once Miser (a customized memory manager) replaced the Windows one. The instrumentation results, both offline and online, shows exponentially growing work and linearly growing span. Parallelizing my program with Cilk++ was straightforward; the change to the code was minimal, the speed up with the memory manager is great, and the instrumentation results provide insight into the amount of work, span, and parallelism in a particular program.

COMMENTS

You left out the most efficient solution. Use bits instead of arrays to determine whether or not the queens conflict. Then instead of looping you can just do a couple bit-wise ands to check a given solution.

posted @ Friday, September 05, 2008 9:45 PM by Pete


Hi Pete, 
You are right - using bit operations would be much more efficient. When I implemented the solutions I was not aware of this approach. (Charles 
mentioned it to me later, and I didn't have time to implemented it yet.) Would you be interested in writing a bit operation implementation, which 
we could "Cilkify" and run on the same machine to compare the performance? 

Justin 

posted @ Monday, September 08, 2008 4:33 PM by Justin Zhang


Actually, already had one written, have fun.

posted @ Monday, September 08, 2008 9:53 PM by Pete


http://therandombit.blogspot.com/2008/09/n-queens-implmented-with-bits.html

posted @ Monday, September 08, 2008 9:54 PM by Pete

In the STL-Example above there are two major errors, probably they were introduced while converting into HTML: 



| int i, num=qs.size(); 

| for (I = 0;I < num; i++) 



The "I" need to be lower-case "i". 



| ilist_t r; r.clear(); 

| for (int i=0;isize();i++) 



it shall be "i (lt) f.size()" instead. 





HTH, MFG

posted @ Tuesday, October 06, 2009 2:39 AM by Heiko Studt