# Parallel Computation of Sparse Rulers

Представлено: Arch D. Robison (Intel), опубликовано: 14 января 2014 г.

This article explains the sparse ruler problem, two parallel codes for computing sparse rulers, and some new results that reveal a surprising "gap" behavior for solutions to the sparse ruler problem. The code and results are included in the attached zip file.

## Background

A *complete *sparse ruler is a ruler with *M* marks than can measure any integer distance between 0 and *L* units. For example, the following ruler has 6 marks (including the ends) and can measure integer distance from 0 to 13:

Measuring a desired distance with this kind of ruler sometimes requires not starting at either end. For instance, 8 units is measured by using the marks at 2 and 10. See the Wikipedia article on the subject for a more detailed introduction and more examples. As Peter Luschny's web page explains, a sparse ruler is *perfect *if there is no complete ruler of the same length with fewer marks, and *optimal *if no longer complete ruler with the same number of marks exists.

A recent Al Zimmerman Programming Contest problem (see "Graceful Graphs) asked a problem related to sparse rulers. Alas there turned out to be a shortcut known as Wichmann rulers that trivialized the contest. The first two contestants to use Wichmann's formula graciously declined the prize since the solutions were trivial. However, it is only conjectured, not proven, that Wichmann rulers are optimal for *M*>13. Hence it seemed worth exploring perfect rulers longer than those currently known to see if the Wichman formula could be beaten. No counter example to the conjecture was found, but there was an interesting "gap" result, explained later in this article.

The programs to be discussed solve the following problem:

For a givenLandM, find all complete sparse rulers, assuming that the solutions will be perfect sparse rulers.

The "perfect" assumption avoids the need to waste time finding the number of ways that redundant marks could be added. The programs print out the rulers and a count of the number of rulers found. Mirror images of the perfect rulers are not printed since they are obvious, but are included in the count. Peter Luschny has a table of known perfect-ruler counts.

## Algorithm

I wrote two algorithms for the problem. Both employ a similar recursive search technique, but differ in their representation of a set of distances. I'll explain the algorithms in the abstract, where they are identical, and then dive into details. The key recursive routine is called *Scrunch * because of the way it works incrementally inwards from the ends of the ruler towards the center. It takes four parameters:

- A partially marked ruler
*avail*, the number of remaining marks that are available to inscribe on the ruler*w*, the width of regions near both ends that have already been considered for marking and excluded from further marking.*uncovered*, the set of distances not yet covered by the partially marked ruler

For example, consider a ruler where *M*=8 and *L*=23. The following picture shows the parameters to *Scrunch *part way through the search. The red positions indicate where the *avail *remaining marks can be inscribed.

Given this scenario, routine *Scrunch *considers four possibilities:

- inscribe a mark at position 5.
- inscribe a mark at position 18.
- inscribe marks at both positions
- Do not inscribe marks in either position.

It then widens the green regions one unit and recursively searches with the new partial ruler. Eventually the recursion stops for one of four reasons:

- A complete ruler is found.
- There are more distances in
*uncovered*than can be generated by adding*avail*marks. Adding*avail*marks can only generate*avail**(*avail*-1)/2 distances between themselves and*avail**(*M*-*avail*) distances between themselves and the existing marks. - There is not enough remaining space to fit
*avail*marks. - There is a long distance in
*uncovered*that cannot possibly be covered by adding another mark in the white region. The longest distance*d*created by a new mark is bounded by*d*<*L*-*w*, since at most the new distance spans from the new mark to one of the ends of the ruler.

The second and third reasons are instances of the pigeon-hole principle.

The top level call to *Scrunch *is with a ruler that has only its endpoints marked. That is *avail*=*M*-2, *w*=0, and *uncovered *= {1..*L*-1}. Here is a sketch of routine *Scrunch*:

Scrunch(avail,w,ruler,uncovered) {

u= number of distances in setuncovered.If

u>0 thenIf

u>avail*(avail-1)/2+avail*(M-avail) thenreturn

// Too many distances to coverWhile there exists enough remaining space to hold

availmarks do

w=w+1Try putting mark(s) at position(s)

wand/orL-w. Recursively callScrunchfor each case.If distance

L-wis not covered yet

return// Long distance not coveredElse

Report a complete ruler.

}

The "Try" part tries only three of the four possibilities previously mentioned. The fourth possibility (adding no marks) is done by looping back. The loop is essentially a hand-optimized tail-recursion for the fourth possibility, which avoids rechecking loop-invariant conditions that a fully recursive version would recheck.

The previously stated pigeon-hole termination tests are algorithmic speedups. Some constant-factor improvements were worthwhile too. First, not explicit in the sketch, is avoiding redundant work for symmetrical solutions. The actual code avoids searching for a ruler and its mirror image.

Second, the algorithm is amenable to parallel execution. Each sub-search case is independent, and hence can proceed in parallel. The while loop does not even have to wait for the cases to finish, so it is free to continue looping even while these cases are in flight. Given ideal parallel hardware, the parallelization provides asymptotic speedup, reducing the execution time down to thelength of the critical path. In practice, parallel hardware is finite, so the parallelization yields a constant-factor improvement. But I had access to big hardware, and thus a big constant, as described later.

Third, operations on set *uncovered *should be fast, since that is where most of the time is spent computing the new *uncovered* for each recursive call. The next two section address two ways to represent *uncovered*.

## Bit Mask Representation of Ruler

Tomas Sirgedas first outlined this approach in the contest discussion group. The key is representing the ruler as two *half-rulers*. Each half-ruler is represented as a bit mask. The set *uncovered* is represented as a bit mask too, and "population count" instructions are used to get its size. Tomas's clever observation was to flip one of the half-rule masks, so that bit positions were relative to counting inwards from the ends. The flip allows *uncovered* to be updated quickly using bitwise operations. In each half-ruler mask, bit *k* corresponds to a mark at *k* units inwards from the end. The two halves are called *near* and *far* in the code. When a mark is tentatively placed, the half it was placed on has the role of *near*, and the other half is *far*.

The mask representation of *uncovered* consists of *L*+1 bits, partitioned between two submasks *lo* and *hi*.

*lo*holds distances in the half-open interval [0,ceil(*L*/2)). Bit*k*corresponds to distance floor(*L*/2)-*k*.*hi*holds distances in the closed interval [ceil(*L*/2),*L*]. Bit*k*corresponds to distance*L*-*k*.

Given halves *near *and *far *of a ruler and a new mark at position *w*, set *uncovered* can be updated with shift and bitwise operations.

lo &= ~(near << (L/2-w)) lo &= ~(far >> (L-L/2-w)) hi &= ~(far << w)

The following drawing shows an example of the bit correspondence. Each bit mask is shown as a shaded box. The rightmost mark is implicitly bit 0, and ascending bit positions go leftwards. Numbers on the ruler indicate mark positions. The *near *half of the ruler is shown flipped. The partial ruler has marks at {0,1,4,17,21,23} and a mark at 6 (red) is being added. The red lines shows how the new distances induced by the mark at 6 are computed by shifting the half-ruler masks. For example, the mark at position 21 is at distance 15 from the new mark at 6, so the corresponding bit in uncovered has to be cleared.

The influence of *far* on *lo* becomes a right shift in the code since it is first being effectively shifted left by a half-ruler's length. The particular point at which *uncovered* is split was chosen to simplify the update: the *near* part never affects *hi*.

An additional benefit of flipping half the ruler is that is makes symmetry testing cheap. That test is part of avoiding redundant work.

The file *bitscrunch.cpp* has two implementations of the key computations. Both use the intrinsic *_mm_popcnt_u64* to compute the size of set uncovered. The differences in the implementations are:

- One uses mostly standard C++ operations for shifts and bitwise operations. It limits a half-ruler's length to the number of bits in a
`size_t`

. - The other uses SSE to do 128- bit operations, and thus can operate up to
*L*=255

The SSE version is enabled by default. Change the *#if 0 *in the source to #if 1 to use the other one.

## An Asymptotically Faster But Practically Slower Approach

For perfect rulers, *L *is quadratic in *M*, so the bit mask approach causes the key operations on *uncovered *to take time O(*M*^{2}). Given that the contest involved problems of up to M=97, I initially steered away from using bit masks. My first algorithm (see file *scrunch.cpp*) used a byte array to represent *uncovered*. Each element *uncovered*[*k*] was 1 if element *k *was present, and 0 otherwise. When adding a mark, the array could be updated in O(*M*) time for the O(*M*) new distances created by the new mark, and the size of the set could be maintained incrementally. Alas once the contest devolved into "beat Wichmann", it became clear that *M* was going to be no more than about 25, and so the constant factor advantages triumphed. However, my first algorithm has no limitations due to word sizes, so I used it to polish off some problems with *L *beyond the limits of my bit mask algorithm. For these problems, *M* was much too small relative to *L* for there to be solutions, so the search ran fast enough to get the desired negative results.

The recursive method *scrunch *in *scrunch.cpp* is essentially the same as previously discussed in the abstract, but is harder to follow because each level of recursion extends only one green zone at a time, alternating between left and right as it recurses.

## Parallelization

The code is parallelized with Intel® Cilk™ Plus, which excels at recursive parallelism. Recursive calls are spawned using *cilk_spawn*. Like many parallel recursive codes, there is a point at which the parallel scheduling overheads become serious drag, so the code uses serial recursion when *avail<Cutoff*. The following slightly simplified excerpt from *bitscrunch.cpp* shows the key logic. Macro SCRUNCH handles the cutoff logic.

template<typename T> T Clone( const T& x ) {return x;} #define SCRUNCH(a,w,l,r,u) do { if( (a)>=Cutoff ) cilk_spawn Scrunch(a,w,Clone(l),Clone(r),Clone(u)); else Scrunch(a,w,l,r,u); } while(0) static void Scrunch( int avail, int w, const HalfRuler& left, const HalfRuler& right, const UncoveredSet& uncovered ) { if( auto u = uncovered.count() ){ if( u > MaxUnique[avail] ) return; // Too many distances to cover while( 2*w < L-avail ) { ++w; UncoveredSet ul(uncovered); ul.strike( left, right, w ); // Remove distances covered by mark at w on left half UncoveredSet ur(uncovered); ur.strike( right, left, w ); // Remove distances covered by mark at w on right half HalfRuler lw(left,w); // Left half with mark at w HalfRuler rw(right,w); // Right half with mark at w if( 2*w!=L ) { SCRUNCH( avail-1, w, lw, right, ul ); // Try w on left if( left!=right ) SCRUNCH( avail-1, w, left, rw, ur ); // Try w on right if( avail>=2 ) { UncoveredSet ub(ul&ur); // Try w on both sides ub.remove(L-2*w); SCRUNCH( avail-2, w, lw, rw, ub ); } } else { Scrunch( avail-1, w, lw, rw, ul&ur ); // w is centerpoint break; } if( uncovered.hasFar(w) ) break; // Long distance not covered } } else { TallyRuler(avail, left, right); // Found a complete ruler } }

Function template *Clone* solves some lifetime issues. When a function is spawned, any compiler-generated temporary objects for the call live until the spawnee returns. However, some of the parameters are references to local variables whose lifetime is limited to the body of the *while *loop. Because the *while* loop is allowed to proceed while the spawned calls are still running, it is essential to copy those local variable with function *Clone*. An alternative would have been to pass all parameters by value to *Scrunch*, but experiments indicated that was slightly slower.

Macro *SCRUNCH *must be a macro, not an inline function, because Cilk is *fully strict*, meaning that a function cannot return until all its spawnees have finished. An alternative, as in the language X10, is *terminally strict*semantics where a function can return with spawnees still running. Though fully strict semantics simplifies reasoning about programs and simplifies implementation of Cilk, here terminally strict semantics would have been more convenient.

The lack of any synchronization in the *while *loop makes the "continuation-steal" semantics of Cilk essential for bounding memory consumption. In a "child-steal" system (such as TBB or OpenMP's "tied tasks"), idle worker threads steal the spawned tasks, so the current thread would execute the loop to completion. Doing so could pile up a lot of tasks before any executed, whenever the rest of the workers are busy, which should be the common case. In a "continuation-steal" system such as Cilk, the current thread dives immediately into the spawned task, and leaves the continuation of the caller to be stolen by an idle worker. That way the while loop self-throttles. It advances only when an idle worker becomes available.

### Determinism

The programs are deterministic, despite having no locks! The programs achieve this by using Cilk Plus reducers. Each reducer provides multiple views. Each parallel strand of execution sees a different view and thus each strand can update its view without interfering with other strands. The Cilk Plus run-time automatically merges views as appropriate.

Here are the declarations of the two reducers:

// Number of rulers found. Asymmetric rules are counted twice. static cilk::reducer_opadd<int> Count; // Output stream static cilk::reducer_ostream Out(std::cout);

Each reducer acts like a "pointer to view". Here's the code in *TallyRuler *that updates *Count *and *Out*.

// Report a sparse ruler solution. void TallyRuler( int avail, HalfRuler left, HalfRuler right ) { *Out << "Ruler[" << M-avail << "]:"; // Print marks in sorted order. Algorithm is quadratic, but takes relatively small // effort compared to the exponential effort used to compute the marks. for( Mark i=0; i<=L; ++i ) if( i<=HalfL && left[i] || RestL<=i && right[L-i] ) *Out << " " << unsigned(i); *Out << std::endl; *Count += 1 + (left!=right); }

Reducers work for any associative operation, because associativity lets the run-time reassociate updates and merging of views. The *Output *stream operations are associative because output is partially buffered, and concatenation of buffers is an associative operation. For the full theory behind reduces, see "Reducers and Other Cilk++ Hyperobjects".

### Scaling

For large ruler problems, the code scales well, which is not surprising given that:

- There is much more available parallelism than the hardware can exploit ("parallel slack"). The slack permits Intel® Cilk™ Plus's work-stealing scheduler to automatically balance the load.
- The working set is tiny. Thus memory bandwidth does not become the resource bottleneck.

Here's a quick primer on measuring scaling. The execution time of a program running with P workers (hardware threads) is denoted T_{P.}. The serial execution time is T_{1}. The *speedup *is T_{1}/T_{P}. It's called *relative *speedup if T_{1} is the parallel program running on 1 worker, and *absolute *speedup if T_{1} is the time of the best known sequential program, which is not paying any parallel overhead or having to restrict itself to a parallelizable algorithm. The *efficiency *of a parallel program is speedup divided by P. It also comes in relative and absolute flavors, depending upon which kind of speedup is being used. An absolute efficiency of 1.0 is ideal, and generally hard to achieve.

I ran the programs on an SGI UV Altix 1000 that has 256 cores. Each core has a single hardware thread. The following graph shows the efficiency of *bitscrunch* for M=21 L=153 on that machine:

The graph shows that the *absolute *efficiency of the program is within a factor of 0.8 of ideal, even when scaled out to 256 workers! Note the cut off vertical axis. The "cliff" is really quite small -- the relative efficiency is within *4%* of perfect *relative *scaling.

My longest run, for M=25 and L=208, took almost 51 hours using 256 cores.

## Inlining

Alas many programmers overuse C++'s inline feature. For example, it is often convenient to define member functions (not just declare them) inside a class, but doing so implicitly marks them *inline*. Doing too much inlining can hurt performance by increasing instruction-cache misses. Consequently modern compilers employ heuristics that consider the programmer's requests for inlining, but do not always obey them. Profiling an earlier version with Intel® VTune™ indicated that some key routines were not being inlined by icc that I really did want inlined. Hence the code has *#pragma forceinline* in places to force inlining. Similarly, the code has an explicit* inline *keyword (technically redundant) for method *UncoveredSet::count*, which enticed icc into some desired inlining.

## Ruler Results -- A Surprising Gap

The attached zip file [last updated 2014-Feb-10] contains a table *SparseRulerCounts.pdf* with results of the programs. It goes up to *M*=25. The Count column indicates the number of rulers found for a given *M *and *L*. Most of the results were generated by *bitscrunch.cpp*. Results for *L*>=250 were generated with *scrunch.cpp*.

The table confirms that Wichmann's formula produces the only optimal rulers for 13<*M*<=25.

The new discovery is the existence of gaps in the table: there are values of *M* and *L *for which a perfect ruler can be built, but not for *M* and *L*-1. For example, part of the table looks like this:

The rows in yellow were confirmed by Tomas Sirgedas with his independently written program. (He showed remarkable patience, as he was running on 4 cores, not a 256 cores.) The gap, when present, seems to grow with increasingly large rulers. It suggests that the Wichmann formula, and a "off by one" variant, not only beat the other solutions, but peel away from the other solutions as the problem size gets larger.

## Acknowledgments

Thanks to Tomas Sirgedas for suggesting the bit-mask flipping trick and verifying some of the gaps. He and Neal Faiman reviewed an earlier draft of this article. Thanks to the parallel run-time teams for letting me borrow the 256-core SGI UV Altix on the weekends. I felt like a teenager borrowing the family car for a spin.

See https://www.cilkplus.org/ to learn more about Intel® Cilk™ Plus