# Parallel Computation of Sparse Rulers

This article explains the sparse ruler problem, two parallel codes for computing sparse rulers, and some new results that reveal a surprising "gap" behavior for solutions to the sparse ruler problem. The code and results are included in the attached zip file.

## Background

A complete sparse ruler is a ruler with M marks than can measure any integer distance between 0 and L units. For example, the following ruler has 6 marks (including the ends) and can measure integer distance from 0 to 13:

Measuring a desired distance with this kind of ruler sometimes requires not starting at either end. For instance, 8 units is measured by using the marks at 2 and 10. See the Wikipedia article on the subject for a more detailed introduction and more examples. As Peter Luschny's web page explains, a sparse ruler is perfect if there is no complete ruler of the same length with fewer marks, and optimal if no longer complete ruler with the same number of marks exists.

A recent Al Zimmerman Programming Contest problem (see "Graceful Graphs) asked a problem related to sparse rulers. Alas there turned out to be a shortcut known as Wichmann rulers that trivialized the contest. The first two contestants to use Wichmann's formula graciously declined the prize since the solutions were trivial. However, it is only conjectured, not proven, that Wichmann rulers are optimal for M>13. Hence it seemed worth exploring perfect rulers longer than those currently known to see if the Wichman formula could be beaten. No counter example to the conjecture was found, but there was an interesting "gap" result, explained later in this article.

The programs to be discussed solve the following problem:

For a given L and M, find all complete sparse rulers, assuming that the solutions will be perfect sparse rulers.

The "perfect" assumption avoids the need to waste time finding the number of ways that redundant marks could be added. The programs print out the rulers and a count of the number of rulers found. Mirror images of the perfect rulers are not printed since they are obvious, but are included in the count.  Peter Luschny has a table of known perfect-ruler counts.

## Algorithm

I wrote two algorithms for the problem. Both employ a similar recursive search technique, but differ in their representation of a set of distances. I'll explain the algorithms in the abstract, where they are identical, and then dive into details. The key recursive routine is called Scrunch because of the way it works incrementally inwards from the ends of the ruler towards the center. It takes four parameters:

• A partially marked ruler
• avail, the number of remaining marks that are available to inscribe on the ruler
• w, the width of regions near both ends that have already been considered for marking and excluded from further marking.
• uncovered, the set of distances not yet covered by the partially marked ruler

For example, consider a ruler where M=8 and L=23. The following picture shows the parameters to Scrunch part way through the search. The red positions indicate where the avail remaining marks can be inscribed.

Given this scenario, routine Scrunch considers four possibilities:

• inscribe a mark at position 5.
• inscribe a mark at position 18.
• inscribe marks at both positions
• Do not inscribe marks in either position.

It then widens the green regions one unit and recursively searches with the new partial ruler. Eventually the recursion stops for one of four reasons:

• A complete ruler is found.
• There are more distances in uncovered than can be generated by adding avail marks. Adding avail marks can only generate avail*(avail-1)/2 distances between themselves and avail*(M-avail) distances between themselves and the existing marks.
• There is not enough remaining space to fit avail marks.
• There is a long distance in uncovered that cannot possibly be covered by adding another mark in the white region.  The longest distance created by a new mark is bounded by d<L-w, since at most the new distance spans from the new mark to one of the ends of the ruler.

The second and third reasons are instances of the pigeon-hole principle.

The top level call to Scrunch is with a ruler that has only its endpoints marked. That is avail=M-2, w=0, and uncovered = {1..L-1}. Here is a sketch of routine Scrunch:

Scrunch( avail, w, ruler, uncovered ) {

u = number of distances in set uncovered.

If u>0 then

If uavail*(avail-1)/2+avail*(M-avail) then

return // Too many distances to cover

While there exists enough remaining space to hold avail marks do

w = w+1

Try putting mark(s) at position(s) w and/or L-w.  Recursively call Scrunch for each case.

If distance L-w is not covered yet

return  // Long distance not covered

Else

Report a complete ruler.

}

The "Try" part tries only three of the four possibilities previously mentioned. The fourth possibility (adding no marks) is done by looping back. The loop is essentially a hand-optimized tail-recursion for the fourth possibility, which avoids rechecking loop-invariant conditions that a fully recursive version would recheck.

The previously stated pigeon-hole termination tests are algorithmic speedups.  Some constant-factor improvements were worthwhile too. First, not explicit in the sketch, is avoiding redundant work for symmetrical solutions. The actual code avoids searching for a ruler and its mirror image.

Second, the algorithm is amenable to parallel execution. Each sub-search case is independent, and hence can proceed in parallel. The while loop does not even have to wait for the cases to finish, so it is free to continue looping even while these cases are in flight.  Given ideal parallel hardware, the parallelization provides asymptotic speedup, reducing the execution time down to thelength of the critical path.  In practice, parallel hardware is finite, so the parallelization yields a constant-factor improvement. But I had access to big hardware, and thus a big constant, as described later.

Third, operations on set uncovered should be fast, since that is where most of the time is spent computing the new uncovered for each recursive call. The next two section address two ways to represent uncovered.

## Bit Mask Representation of Ruler

Tomas Sirgedas first outlined this approach in the contest discussion group. The key is representing the ruler as two half-rulers. Each half-ruler is represented as a bit mask. The set uncovered is represented as a bit mask too, and "population count" instructions are used to get its size. Tomas's clever observation was to flip one of the half-rule masks, so that bit positions were relative to counting inwards from the ends. The flip allows uncovered to be updated quickly using bitwise operations. In each half-ruler mask, bit k corresponds to a mark at k units inwards from the end. The two halves are called near and far in the code. When a mark is tentatively placed, the half it was placed on has the role of near, and the other half is far.

The mask representation of uncovered consists of L+1 bits, partitioned between two submasks lo and hi.

• lo holds distances in the half-open interval [0,ceil(L/2)). Bit k corresponds to distance floor(L/2)-k.
• hi holds distances in the closed interval [ceil(L/2),L]. Bit k corresponds to distance L-k.

Given halves near and far of a ruler and a new mark at position w, set uncovered can be updated with shift and bitwise operations.

```
lo &= ~(near << (L/2-w))
lo &= ~(far >> (L-L/2-w))
hi &= ~(far << w)

```

The following drawing shows an example of the bit correspondence. Each bit mask is shown as a shaded box. The rightmost mark is implicitly bit 0, and ascending bit positions go leftwards. Numbers on the ruler indicate mark positions. The near half of the ruler is shown flipped. The partial ruler has marks at {0,1,4,17,21,23} and a mark at 6 (red) is being added. The red lines shows how the new distances induced by the mark at 6 are computed by shifting the half-ruler masks. For example, the mark at position 21 is at distance 15 from the new mark at 6, so the corresponding bit in uncovered has to be cleared.

The influence of far on lo becomes a right shift in the code since it is first being effectively shifted left by a half-ruler's length. The particular point at which uncovered is split was chosen to simplify the update: the near part never affects hi.

An additional benefit of flipping half the ruler is that is makes symmetry testing cheap.  That test is part of avoiding redundant work.

The file bitscrunch.cpp has two implementations of the key computations. Both use the intrinsic _mm_popcnt_u64 to compute the size of set uncovered. The differences in the implementations are:

• One uses mostly standard C++ operations for shifts and bitwise operations. It limits a half-ruler's length to the number of bits in a `size_t`.
• The other uses SSE to do 128- bit operations, and thus can operate up to L=255

The SSE version is enabled by default. Change the #if 0 in the source to #if 1 to use the other one.

## An Asymptotically Faster But Practically Slower Approach

For perfect rulers, L is quadratic in M, so the bit mask approach causes the key operations on uncovered to take time O(M2). Given that the contest involved problems of up to M=97, I initially steered away from using bit masks. My first algorithm (see file scrunch.cpp) used a byte array to represent uncovered. Each element uncovered[k] was 1 if element was present, and 0 otherwise. When adding a mark, the array could be updated in O(M) time for the O(M) new distances created by the new mark, and the size of the set could be maintained incrementally. Alas once the contest devolved into "beat Wichmann", it became clear that M was going to be no more than about 25, and so the constant factor advantages triumphed. However, my first algorithm has no limitations due to word sizes, so I used it to polish off some problems with L beyond the limits of my bit mask algorithm. For these problems, M was much too small relative to L for there to be solutions, so the search ran fast enough to get the desired negative results.

The recursive method scrunch in scrunch.cpp is essentially the same as previously discussed in the abstract, but is harder to follow because each level of recursion extends only one green zone at a time, alternating between left and right as it recurses.

## Parallelization

The code is parallelized with Intel® Cilk™ Plus, which excels at recursive parallelism. Recursive calls are spawned using cilk_spawn. Like many parallel recursive codes, there is a point at which the parallel scheduling overheads become serious drag, so the code uses serial recursion when avail<Cutoff. The following slightly simplified excerpt from bitscrunch.cpp shows the key logic. Macro SCRUNCH handles the cutoff logic.

```
template<typename T>

T Clone( const T& x ) {return x;}

#define SCRUNCH(a,w,l,r,u) do {
if( (a)>=Cutoff ) cilk_spawn Scrunch(a,w,Clone(l),Clone(r),Clone(u));
else Scrunch(a,w,l,r,u);
} while(0)

static void Scrunch( int avail, int w, const HalfRuler& left, const HalfRuler& right, const UncoveredSet& uncovered ) {
if( auto u = uncovered.count() ){
if( u > MaxUnique[avail] )
return;                                      // Too many distances to cover
while( 2*w < L-avail ) {
++w;
UncoveredSet ul(uncovered);
ul.strike( left, right, w );                 // Remove distances covered by mark at w on left half
UncoveredSet ur(uncovered);
ur.strike( right, left, w );                 // Remove distances covered by mark at w on right half
HalfRuler lw(left,w);                        // Left half with mark at w
HalfRuler rw(right,w);                       // Right half with mark at w
if( 2*w!=L ) {
SCRUNCH( avail-1, w, lw, right, ul );    // Try w on left
if( left!=right )
SCRUNCH( avail-1, w, left, rw, ur ); // Try w on right
if( avail>=2 ) {
UncoveredSet ub(ul&ur);              // Try w on both sides
ub.remove(L-2*w);
SCRUNCH( avail-2, w, lw, rw, ub );
}
} else {
Scrunch( avail-1, w, lw, rw, ul&ur );    // w is centerpoint
break;
}
if( uncovered.hasFar(w) )
break;                                   // Long distance not covered
}
} else {
TallyRuler(avail, left, right);                  // Found a complete ruler
}
}

```

Function template Clone solves some lifetime issues. When a function is spawned, any compiler-generated temporary objects for the call live until the spawnee returns. However, some of the parameters are references to local variables whose lifetime is limited to the body of the while loop. Because the while loop is allowed to proceed while the spawned calls are still running, it is essential to copy those local variable with function Clone. An alternative would have been to pass all parameters by value to Scrunch, but experiments indicated that was slightly slower.

Macro SCRUNCH must be a macro, not an inline function, because Cilk is fully strict, meaning that a function cannot return until all its spawnees have finished. An alternative, as in the language X10, is terminally strictsemantics where a function can return with spawnees still running. Though fully strict semantics simplifies reasoning about programs and simplifies implementation of Cilk, here terminally strict semantics would have been more convenient.

The lack of any synchronization in the while loop makes the "continuation-steal" semantics of Cilk essential for bounding memory consumption. In a "child-steal" system (such as TBB or OpenMP's "tied tasks"), idle worker threads steal the spawned tasks, so the current thread would execute the loop to completion. Doing so could pile up a lot of tasks before any executed, whenever the rest of the workers are busy, which should be the common case. In a "continuation-steal" system such as Cilk, the current thread dives immediately into the spawned task, and leaves the continuation of the caller to be stolen by an idle worker. That way the while loop self-throttles. It advances only when an idle worker becomes available.

### Determinism

The programs are deterministic, despite having no locks! The programs achieve this by using Cilk Plus reducers. Each reducer provides multiple views. Each parallel strand of execution sees a different view and thus each strand can update its view without interfering with other strands. The Cilk Plus run-time automatically merges views as appropriate.

Here are the declarations of the two reducers:

```
// Number of rulers found.  Asymmetric rules are counted twice.

// Output stream
static cilk::reducer_ostream Out(std::cout);

```

Each reducer acts like a "pointer to view". Here's the code in TallyRuler that updates Count and Out.

```
// Report a sparse ruler solution.
void TallyRuler( int avail, HalfRuler left, HalfRuler right ) {
*Out << "Ruler[" << M-avail << "]:";
// Print marks in sorted order.  Algorithm is quadratic, but takes relatively small
// effort compared to the exponential effort used to compute the marks.
for( Mark i=0; i<=L; ++i )
if( i<=HalfL && left[i] || RestL<=i && right[L-i] )
*Out << " " << unsigned(i);
*Out << std::endl;
*Count += 1 + (left!=right);
}

```

Reducers work for any associative operation, because associativity lets the run-time reassociate updates and merging of views. The Output stream operations are associative because output is partially buffered, and concatenation of buffers is an associative operation. For the full theory behind reduces, see "Reducers and Other Cilk++ Hyperobjects".

### Scaling

For large ruler problems, the code scales well, which is not surprising given that:

• There is much more available parallelism than the hardware can exploit ("parallel slack"). The slack permits Intel® Cilk™ Plus's work-stealing scheduler to automatically balance the load.
• The working set is tiny. Thus memory bandwidth does not become the resource bottleneck.

Here's a quick primer on measuring scaling. The execution time of a program running with P workers (hardware threads) is denoted TP.. The serial execution time is T1. The speedup is T1/TP. It's called relative speedup if T1 is the parallel program running on 1 worker, and absolute speedup if T1 is the time of the best known sequential program, which is not paying any parallel overhead or having to restrict itself to a parallelizable algorithm. The efficiency of a parallel program is speedup divided by P. It also comes in relative and absolute flavors, depending upon which kind of speedup is being used. An absolute efficiency of 1.0 is ideal, and generally hard to achieve.

I ran the programs on an SGI UV Altix 1000 that has 256 cores.  Each core has a single hardware thread.  The following graph shows the efficiency of bitscrunch for M=21 L=153 on that machine:

The graph shows that the absolute efficiency of the program is within a factor of 0.8 of ideal, even when scaled out to 256 workers! Note the cut off vertical axis. The "cliff" is really quite small -- the relative efficiency is within 4% of perfect relative scaling.

My longest run, for M=25 and L=208, took almost 51 hours using 256 cores.

## Inlining

Alas many programmers overuse C++'s inline feature. For example, it is often convenient to define member functions (not just declare them) inside a class, but doing so implicitly marks them inline. Doing too much inlining can hurt performance by increasing instruction-cache misses. Consequently modern compilers employ heuristics that consider the programmer's requests for inlining, but do not always obey them. Profiling an earlier version with Intel® VTune™ indicated that some key routines were not being inlined by icc that I really did want inlined. Hence the code has #pragma forceinline in places to force inlining. Similarly, the code has an explicit inline keyword (technically redundant) for method UncoveredSet::count, which enticed icc into some desired inlining.

## Ruler Results -- A Surprising Gap

The attached zip file [last updated 2014-Feb-10] contains a table SparseRulerCounts.pdf with results of the programs. It goes up to M=25. The Count column indicates the number of rulers found for a given M and L. Most of the results were generated by bitscrunch.cpp. Results for L>=250 were generated with scrunch.cpp.

The table confirms that Wichmann's formula produces the only optimal rulers for 13<M<=25.

The new discovery is the existence of gaps in the table: there are values of M and L for which a perfect ruler can be built, but not for M and L-1. For example, part of the table looks like this:

The rows in yellow were confirmed by Tomas Sirgedas with his independently written program. (He showed remarkable patience, as he was running on 4 cores, not a 256 cores.) The gap, when present, seems to grow with increasingly large rulers. It suggests that the Wichmann formula, and a "off by one" variant, not only beat the other solutions, but peel away from the other solutions as the problem size gets larger.

## Acknowledgments

Thanks to Tomas Sirgedas for suggesting the bit-mask flipping trick and verifying some of the gaps. He and Neal Faiman reviewed an earlier draft of this article. Thanks to the parallel run-time teams for letting me borrow the 256-core SGI UV Altix on the weekends. I felt like a teenager borrowing the family car for a spin.