Are the cilk_sort functions parallel drop-in replacements for the C qsort function?

# Question on cilk_sort



I assume you are referring to the cilk_sort functions implemented in Cilkpub.

http://www.cilkplus.org/sites/default/files/contributions/cilkpub_v104.tar.gz

The cilk_sort_in_place and cilk_sort functions are intended to be drop-in replacements for a C++ std::sort. (They are implemented using C++ templates). In principle, it should be possible to implement a C interface for it, but it hasn't been done yet.

Cheers,

Jim

I've been trying to figure out a better way to take advantage of cilk_spawn for sorting in my C search program. I've taken a qsort function that has a recursive call to itself and modified it to use cilk_spawn. When I use this new qsort function, my parallelism numbers go down, not up.

Currently without this new qsort my script has the following numbers:

```
Cilk Parallel Region(s) Statistics - Elapsed time: 0.269 seconds
1) Parallelism Profile
   Work :                                  810,586,307 instructions
   Span :                                  469,320 instructions
   Burdened span :                         25,552,139 instructions
   Parallelism :                           1727.15
   Burdened parallelism :                  31.72
   Number of spawns/syncs :                2,349,266
   Average instructions / strand :         115
   Strands along span :                    926
   Average instructions / strand on span : 506
   Total number of atomic instructions :   2,349,570
   Frame count :                           4,698,532
   Entries to parallel region :            51
2) Speedup Estimate
   2 processors:   1.90 - 2.00
   4 processors:   3.45 - 4.00
   8 processors:   5.82 - 8.00
   16 processors:  8.87 - 16.00
   32 processors:  12.02 - 32.00
   64 processors:  14.62 - 64.00
   128 processors: 16.40 - 128.00
   256 processors: 17.46 - 256.00
```

With a new qsort using cilk_spawn I get the following numbers:

```
Cilk Parallel Region(s) Statistics - Elapsed time: 0.116 seconds
1) Parallelism Profile
   Work :                                  931,720,926 instructions
   Span :                                  58,014,130 instructions
   Burdened span :                         112,606,176 instructions
   Parallelism :                           16.06
   Burdened parallelism :                  8.27
   Number of spawns/syncs :                2,397,257
   Average instructions / strand :         129
   Strands along span :                    2,289
   Average instructions / strand on span : 25,344
   Total number of atomic instructions :   2,397,921
   Frame count :                           4,794,514
   Entries to parallel region :            96
2) Speedup Estimate
   2 processors:   1.66 - 2.00
   4 processors:   2.47 - 4.00
   8 processors:   3.28 - 8.00
   16 processors:  3.92 - 16.00
   32 processors:  4.34 - 16.06
   64 processors:  4.59 - 16.06
   128 processors: 4.72 - 16.06
   256 processors: 4.79 - 16.06
```

Here is the code I was using to try and take advantage of the cilk_spawns:

```c
static void cilkqsort(void *a, size_t n, size_t es,
    int (*cmp)(const void *, const void *));
static inline char *med3(char *, char *, char *,
    int (*)(const void *, const void *));
static inline void swapfunc(char *, char *, int, int);

#define min(a, b) (a) < (b) ? a : b

/*
 * Qsort routine from Bentley & McIlroy's "Engineering a Sort Function".
 */
#define swapcode(TYPE, parmi, parmj, n) { \
    long i = (n) / sizeof (TYPE); \
    TYPE *pi = (TYPE *) (parmi); \
    TYPE *pj = (TYPE *) (parmj); \
    do { \
        TYPE t = *pi; \
        *pi++ = *pj; \
        *pj++ = t; \
    } while (--i > 0); \
}

#define SWAPINIT(a, es) swaptype = ((char *)a - (char *)0) % sizeof(long) || \
    es % sizeof(long) ? 2 : es == sizeof(long) ? 0 : 1;

static inline void
swapfunc(char *a, char *b, int n, int swaptype)
{
    if (swaptype <= 1)
        swapcode(long, a, b, n)
    else
        swapcode(char, a, b, n)
}

#define swap(a, b) \
    if (swaptype == 0) { \
        long t = *(long *)(a); \
        *(long *)(a) = *(long *)(b); \
        *(long *)(b) = t; \
    } else \
        swapfunc(a, b, es, swaptype)

#define vecswap(a, b, n) if ((n) > 0) swapfunc(a, b, n, swaptype)

static inline char *
med3(char *a, char *b, char *c, int (*cmp)(const void *, const void *))
{
    return cmp(a, b) < 0 ?
        (cmp(b, c) < 0 ? b : (cmp(a, c) < 0 ? c : a))
       :(cmp(b, c) > 0 ? b : (cmp(a, c) < 0 ? a : c));
}

static void
cilkqsort(void *a, size_t n, size_t es, int (*cmp)(const void *, const void *))
{
    char *pa, *pb, *pc, *pd, *pl, *pm, *pn;
    int d, swaptype, swap_cnt;
    int r;

loop:   SWAPINIT(a, es);
    swap_cnt = 0;
    if (n < 7) {
        for (pm = (char *)a + es; pm < (char *)a + n * es; pm += es)
            for (pl = pm; pl > (char *)a && cmp(pl - es, pl) > 0;
                 pl -= es)
                swap(pl, pl - es);
        return;
    }
    pm = (char *)a + (n / 2) * es;
    if (n > 7) {
        pl = a;
        pn = (char *)a + (n - 1) * es;
        if (n > 40) {
            d = (n / 8) * es;
            pl = med3(pl, pl + d, pl + 2 * d, cmp);
            pm = med3(pm - d, pm, pm + d, cmp);
            pn = med3(pn - 2 * d, pn - d, pn, cmp);
        }
        pm = med3(pl, pm, pn, cmp);
    }
    swap(a, pm);
    pa = pb = (char *)a + es;
    pc = pd = (char *)a + (n - 1) * es;
    for (;;) {
        while (pb <= pc && (r = cmp(pb, a)) <= 0) {
            if (r == 0) {
                swap_cnt = 1;
                swap(pa, pb);
                pa += es;
            }
            pb += es;
        }
        while (pb <= pc && (r = cmp(pc, a)) >= 0) {
            if (r == 0) {
                swap_cnt = 1;
                swap(pc, pd);
                pd -= es;
            }
            pc -= es;
        }
        if (pb > pc)
            break;
        swap(pb, pc);
        swap_cnt = 1;
        pb += es;
        pc -= es;
    }
    if (swap_cnt == 0) {  /* Switch to insertion sort */
        for (pm = (char *)a + es; pm < (char *)a + n * es; pm += es)
            for (pl = pm; pl > (char *)a && cmp(pl - es, pl) > 0;
                 pl -= es)
                swap(pl, pl - es);
        return;
    }
    pn = (char *)a + n * es;
    r = min(pa - (char *)a, pb - pa);
    vecswap(a, pb - r, r);
    r = min((size_t)(pd - pc), pn - pd - es);
    vecswap(pb, pn - r, r);
    if ((size_t)(r = pb - pa) > es) {
        cilk_spawn cilkqsort(a, r / es, es, cmp);
        cilk_sync;
    }
    if ((size_t)(r = pd - pc) > es) {
        /* Iterate rather than recurse to save stack space */
        /*a = pn - r;*/
        /*n = r / es;*/
        /*goto loop;*/
        cilk_spawn cilkqsort(a, r / es, es, cmp);
        cilk_sync;
    }
    cilk_sync;
}
```

What am I doing wrong with the new qsort function?

Thanks!

**Quote:**

Lawrence R. wrote: Are the cilk_sort functions parallel drop-in replacements for the C qsort function?

No. Not least because, at least up to C99 (the C standard referenced by C++11), qsort's comparison function must be called serially. C11 does not require that explicitly, but the interface makes no sense otherwise.

**Quote:**

Arch D. Robison (Intel) wrote:

The naive version certainly doesn't, but there are versions that I am pretty certain would scale about as effectively as merge-sort. However, my estimate is that the merge-sort version will be a significant but small constant factor faster than even the best quicksort variant.

I would be interested in hearing about a version of quicksort that has more than O(log N) parallelism.

Parallel mergesort has O(N/log^2 N) parallelism if I recall correctly.

I don't think so. Consider the final merge step. In a more-or-less standard merge-sort, that is run serially and, even in Arch's version, it's only two-way. That means (using the simplistic Knuth-era model of computational cost) you are reducing N log_2 N passes to at least N/2 - that's only O(log N) parallelism.

An algorithm that I invented decades ago but never published (though it has been published since) uses a quicksort variant and scales well (but with a nasty constant factor). Take N threads, and calculate the r/N quantiles in parallel (that is easy in time roughly N log N, using a probabilistic approach). Then split (and copy) the data in parallel. Then sort each segment in parallel. Then merge in parallel. Everything is running fully N-way parallel, and there is only a constant factor more work. QED O(N) scalability!

TANSTAAFL, however. That gives O(sqrt(N)) speedup on a distributed-memory system, which is about as good as you will get, but it hits the memory subsystem of a shared-memory system where it hurts - if I get time, I may code it up and see how it goes, but I wouldn't bet on it being any better than a memory-friendly but less scalable one :-) The point is that the Knuth-era model stopped being reasonable in the early 1970s on mainframes and the mid-1980s on microcomputers.

Incidentally, do you know that there is a near-perfectly vectorisable Shellsort that is O(N log^2 N)? I tested that on a true vector system (Hitachi S-3600) and it was no faster than a serial one because of the memory issues! Very predictable.

Take a look at Cormen, Leiserson, Rivest, and Stein (3rd edition), which shows that mergesort has O(N/log N) parallelism.

The key idea is that a merge of two sorted arrays can be done in parallel. To merge A of length N and B of length M to get C of length N+M, use the following recursive program. Assume, without loss of generality, that N > M and that M > 0 (so this is not the base case of the recursion):

- let x = A[N/2] // the median value in A
- let i = find(x, B); // find the location where x belongs in B. Thus B[i] < x <= B[i+1]. Can be done in O(log M) time.
- spawn merge(A[0:N/2], B[0:i], C[0:N/2+i]);
- spawn merge(A[N/2:N], B[i,M], C[N/2+i:N+M])
- sync

Thus you can, after doing only O(log N) work, convert a single big merge into two parallel merges.

If you work it out, the span is O(log^2 N), and the work is O(N log N), so the parallelism is O(N/log N).
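The recursive merge sketched above can be written out in C. The following is a hypothetical illustration for int arrays (not code from CLRS or Cilkpub); the fallback macros let cilk_spawn and cilk_sync degrade to plain serial recursion when `<cilk/cilk.h>` has not been included, so the sketch compiles with an ordinary C compiler.

```c
#include <stddef.h>

#ifndef cilk_spawn       /* <cilk/cilk.h> defines these; otherwise... */
#define cilk_spawn       /* ...serial fallback: plain recursion */
#define cilk_sync
#endif

/* Binary search: smallest index i in [0, m] with x <= B[i].  O(log m). */
static size_t lower_bound(const int *B, size_t m, int x)
{
    size_t lo = 0, hi = m;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (B[mid] < x)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}

/* Merge sorted A[0:n] and sorted B[0:m] into C[0:n+m]. */
static void par_merge(const int *A, size_t n, const int *B, size_t m, int *C)
{
    if (n < m) {                     /* keep A the longer run (the WLOG above) */
        par_merge(B, m, A, n, C);
        return;
    }
    if (n == 0)                      /* both runs empty, since m <= n */
        return;
    if (n == 1) {                    /* base case: at most two elements left */
        if (m == 1 && B[0] < A[0]) {
            C[0] = B[0]; C[1] = A[0];
        } else {
            C[0] = A[0];
            if (m == 1) C[1] = B[0];
        }
        return;
    }
    size_t h = n / 2;
    int x = A[h];                    /* the median value in A */
    size_t i = lower_bound(B, m, x); /* where x belongs in B */
    /* The two halves write disjoint ranges of C, so they can run in parallel. */
    cilk_spawn par_merge(A, h, B, i, C);
    par_merge(A + h, n - h, B + i, m - i, C + h + i);
    cilk_sync;
}
```

Everything at or before A[h] and before B[i] is at most x, and everything after is at least x, so the two sub-merges are independent.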

I don't know how to do anything near that well for quicksort, and I'd be interested in learning about it if it exists.

It's a little unrelated to Cilk, but since you brought it up, I'll just mention that for distributed memory machines, one of the best sorts for large data is a sample sort. Suppose there are P processors. Find something like (P log P) samples, calculate the ranks of those items, and then send the right data to the right processors. You can read about such programs at http://web.archive.org/web/20140401021149/http://sortbenchmark.org (the sortbenchmark.org web page seems to be down, so you have to use the wayback machine to see it.) (I hold the world championship in the Indy Terabyte category)
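The sample-sort idea can be sketched serially in C for ints: pick splitters from an oversampled subset, bucket every element by splitter, sort each bucket, and concatenate. In the distributed version each bucket would live on its own processor; here the processor count P and the oversampling factor are purely illustrative choices, and the bucket sorts (which would run in parallel) use plain qsort.

```c
#include <stdlib.h>
#include <string.h>

#define P 4                       /* stand-in for the processor count */

static int cmp_int(const void *x, const void *y)
{
    int a = *(const int *)x, b = *(const int *)y;
    return (a > b) - (a < b);
}

static void sample_sort(int *a, size_t n)
{
    if (n < 2 * P) {              /* too small to be worth bucketing */
        qsort(a, n, sizeof *a, cmp_int);
        return;
    }
    /* 1. Take a strided sample, sort it, keep P-1 evenly spaced splitters. */
    size_t s = 4 * P;             /* oversample to balance the buckets */
    int sample[4 * P];
    for (size_t k = 0; k < s; k++)
        sample[k] = a[(k * n) / s];
    qsort(sample, s, sizeof *sample, cmp_int);
    int splitters[P - 1];
    for (size_t k = 0; k < P - 1; k++)
        splitters[k] = sample[(k + 1) * s / P];

    /* 2. Count bucket sizes, then scatter each element into its bucket
     *    (in the distributed version this is the "send" step). */
    int *buckets = malloc(n * sizeof *a);
    size_t count[P] = {0}, start[P], fill[P];
    for (size_t k = 0; k < n; k++) {
        size_t b = 0;
        while (b < P - 1 && a[k] >= splitters[b])
            b++;
        count[b]++;
    }
    start[0] = 0;
    for (size_t b = 1; b < P; b++)
        start[b] = start[b - 1] + count[b - 1];
    memcpy(fill, start, sizeof start);
    for (size_t k = 0; k < n; k++) {
        size_t b = 0;
        while (b < P - 1 && a[k] >= splitters[b])
            b++;
        buckets[fill[b]++] = a[k];
    }

    /* 3. Sort each bucket (in parallel, in the real version); the buckets
     *    are already in splitter order, so concatenation is the answer. */
    for (size_t b = 0; b < P; b++)
        qsort(buckets + start[b], count[b], sizeof *a, cmp_int);
    memcpy(a, buckets, n * sizeof *a);
    free(buckets);
}
```

Every element in bucket b is less than splitters[b] and at least splitters[b-1], so the sorted buckets concatenate into a fully sorted array.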

**Quote:**

Bradley K. wrote: Take a look at Cormen, Leiserson, Rivest, and Stein (3rd edition), which shows that mergesort has O(N/log N) parallelism.

If you work it out, the span is O(log^2 N), and the work is O(N log N), so the parallelism is O(N/log N).

I don't know how to do anything near that well for quicksort, and I'd be interested in learning about it if it exists.

Ah. Cross-purposes. That's using the 1960s model of complexity which, as I said, hasn't corresponded with actual computers since the early 1970s (except for c. 5 years, when it did on microprocessors). But let's stick with that one as an abstract model.

I can't remember the exact equivalent complexity of quicksort, because it is too many decades since I did the analyses, but the key is that it is the dual of merge-sort, and you do precisely the above the other way round! I.e. you calculate the median and split the vector in parallel (see, for example, theory.stanford.edu/~megiddo/pdf/parmedia.pdf, for the former) - and, like that analysis, do the job recursively. In practice, it doesn't help - nor does the algorithm you describe, though the merge-sort version does better. In any case, the complexities of the two are almost identical, which isn't surprising (they being duals of one another).

**Quote:**

Bradley K. wrote: It's a little unrelated to Cilk, but since you brought it up, I'll just mention that for distributed memory machines, one of the best sorts for large data is a sample sort. Suppose there are P processors. Find something like (P log P) samples, calculate the ranks of those items, and then send the right data to the right processors. You can read about such programs at http://web.archive.org/web/20140401021149/http://sortbenchmark.org (the sortbenchmark.org web page seems to be down, so you have to use the wayback machine to see it.) (I hold the world championship in the Indy Terabyte category)

Yes, that's the one I am referring to - and you may well be the one who first published! My unpublished analysis calculated the optimal value for the sample size - I might be able to find it if you are interested. Anyway, its complexity is O(sqrt(N)), using a realistic model of communication cost, and I have never seen a better algorithm for distributed memory. What I don't know is how effective it is relative to a parallel merge-sort on shared memory, and I haven't had time to do both - do you?

**Quote:**

Arch D. Robison (Intel) wrote:

I must get back to my project and test your (and my) code properly! I will take another look at your article, but I don't see that a good variant of samplesort need scale any less well - I fully agree that a naive one could be horrific. TLB issues are very hard to analyse, because of the plethora of different variants of them.

Too often with sorting performance issues, the focus is placed on the time to produce a fully sorted list, when it should perhaps be placed on what you do with the results of the sort. In the case of a merge sort, the last merge phase (be it two-way or n-way) can directly feed the next stage of the problem, eliminating the latency of the final merge phase and possibly eliminating the final buffer for the results. If the results are to be used more than once, then the initial consumer of the two-/n-way final merge can build the results file for re-use later. This decreases the throughput of that phase, but with two benefits: it runs earlier, and it eliminates one read pass over the data.

Jim Dempsey

I'd be interested in an example of the pattern that Jim Dempsey describes. Especially if it were expressed in Cilk.

-Bradley

Even though parallel quicksort has problems scaling as Arch and others have described, I'm not sure if that is actually the problem the original poster is seeing. I see several cases of a cilk_spawn of a function followed immediately by a cilk_sync. It seems like that may be effectively negating any attempt at parallelism...

```c
if ((size_t)(r = pd - pc) > es) {
    /* Iterate rather than recurse to save stack space */
    /*a = pn - r;*/
    /*n = r / es;*/
    /*goto loop;*/
    cilk_spawn cilkqsort(a, r / es, es, cmp);
    cilk_sync;
}
```

Cilk Plus functions have strict fork-join parallelism, i.e., any nested spawns inside cilkqsort will sync before cilkqsort returns. The cilk_sync also syncs all outstanding spawns in the current Cilk Plus function.
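The usual divide-and-conquer shape avoids this: spawn one recursive call, run the other in the current strand, and sync once at the end, so the two halves can actually overlap. Below is a minimal sketch of that pattern applied to a simplified int quicksort with a Lomuto partition (not the full Bentley-McIlroy code from the original post); the fallback macros let it compile and run serially without a Cilk-enabled compiler.

```c
#include <stddef.h>

#ifndef cilk_spawn       /* <cilk/cilk.h> defines these; otherwise... */
#define cilk_spawn       /* ...serial fallback: plain recursion */
#define cilk_sync
#endif

static void cilk_int_qsort(int *a, size_t n)
{
    if (n < 2)
        return;
    /* Simple Lomuto partition around the last element. */
    int pivot = a[n - 1];
    size_t i = 0;
    for (size_t j = 0; j < n - 1; j++) {
        if (a[j] < pivot) {
            int t = a[i]; a[i] = a[j]; a[j] = t;
            i++;
        }
    }
    int t = a[i]; a[i] = a[n - 1]; a[n - 1] = t;

    /* Spawn the left half; the right half runs in this strand, so the
     * two sides can execute in parallel.  A cilk_sync placed immediately
     * after the spawn would serialize the two calls again. */
    cilk_spawn cilk_int_qsort(a, i);
    cilk_int_qsort(a + i + 1, n - i - 1);
    cilk_sync;
}
```

Even with this fix, the partition step itself is serial, which is why quicksort's span stays long compared to a merge sort with a parallel merge.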

Cheers,

Jim