Good question! My flip answer was going to be "implement quicksort using Intel^{®} Cilk™ Plus" but then I remembered that quicksort itself is not a stable sort (and in fact faces worst case performance in sorting presorted lists). One of the attributes of Intel Cilk Plus parallelism is that it automatically preserves order of operation, an important attribute in performing a stable sort. And quicksort's nature as a divide-and-conquer algorithm--subdividing activity in ever smaller regions--makes it good for parallelization. But its tendency to rearrange items even on the small end makes it a problematic choice for ensuring stability.

# Parallel Stable Sort

An easy way to perform a stable sort would be to modify the comparison function. When content of keys are ==, use address of key to specify key of lower address as less than other key. IOW the external results of the comparison will never show ==.

Jim Dempsey

It seems counterintuitive that it would be as simple as modifying the comparison function, because then why would we even be talking about it? I think that quicksort will happily disturb order in equivalence class A while partitioning around an element of equivalence class B even if in-class disturbance is avoided... but such transient preservation of order is later trashed anyway by subsequent partitioning around pivots from different classes. A solution might be to sort an array of indexes/pointers (so the original location is being preserved), and then pick the original values into a result collection. Alternatively to the picking phase, one could sort the original collection alongside a collection of original indexes, simultaneously performing equivalent operations on both. Or transcribe to pairs, sort, and transcribe to elements of the original type. Maybe it even depends on cache effects which of these will be a winner: your choice between trying each of these yourself or doing a bit more research first (I didn't).

Robert, could you expand on the thing with order of operation and its relevance for sorting?

(Correction 2013-06-01) during subdivision->while partitioning, replaced doubt about not disturbing order in same equivalence class as pivot with explicit remark about its ultimate irrelevance

Raf,

In lieu of manufacturing a pointer and carrying it about (assuming result is vector of T as opposed to vector of T*) the == condition could perform a byte by byte comparison of the entirity of each object being sorted (memcmp), the one with the lesser value can be chosen, or in the event both records are identical, it will not make a difference as to which is used (as the result vector is returning a copy of T and you could not tell the difference).

Jim Dempsey

Not only will that not make sort stable, it will make stable sort non-stable, e.g., for "bool f(int x, int y) { return (x%2)<(y%2); }" and input {2,1,0,3} you always get {0,2,1,3} instead of {2,0,1,3}.

(Correction) Hopefully nobody saw the original version...

(Added 2013-05-30) For clarity: I recognise that your suggestion makes sort "deterministic" (result is predictable), but not in the specific way that is meant by "stable" (order of equivalent elements is preserved).

(Correction 2013-06-01) Removed remark about "idempotent" from addition (strike-through didn't work). Note to self: don't add unnecessary comments, especially not without thinking. :-)

Raf, "bool f(int x, int y) { return (x%2)<(y%2); }"

The order returned depends on how you structure the merge phase of the sort paying attention to left-leg or right-leg.

Jim Dempsey

Now I think I've lost you, Jim: there's no merge in quicksort. (The predicate "even before odd" was just to have short example sequences.)

Robert, I don't think there's any indeterminacy related to parallelism at all, only related to pivot choice and partitioning algorithm, and the latter is annoyingly sequential. So you can plug the same strategy into a sequential quicksort, or ones implemented in Cilk Plus or TBB, and expect to get the same outcome for all three.

(Added 2013-06-01) I think that TBB deserves a parallel_stable_sort() implemented either on top of parallel_sort as indicated in an earlier comment or as a parallel merge sort (inherently stable). The former is easy to implement, the latter may well be faster.

(Corrected 2013-06-04) In English, it's "even" and "odd", not "pair" and "unpair" (Dutch: "paar" and "onpaar"). :-)

Raf, you might be right. When I first thought about the issue, the possibility occurred to me that parallel splitting of the work might be another source for instability, much in the same way that parallelization of reductions can sometimes induce instabilities in floating-point results because of varying orderings of the binary ops involved. I don't have a specific case in mind but I wasn't discounting the possibility. And I don't have a specific case in mind, so there may be no "there" there.

Unless you have something that explicitly depends on order of evaluation, like a pseudo-random choice of pivot element, it's always the same subdivision tree. But that's all academic because quicksort isn't stable even in a serial implementation.

(Added 19:40Z) I just completed successful testing of a parallel merge sort implementation based on "Chapter 13: Merge Sort" in "Structured Parallel Programming", the book by Michael McCool, Arch D. Robison and James Reinders featured on the TBB website. I have a feeling that it could be further improved by using multi-way merge (beyond binary merge), because a shallower merge tree means less data motion/movement (?).

(Added 2013-06-06) I just completed successful testing of such a parallel multi-way merge sort implementation. There are still some corner cases that may warrant special attention (empty and mutually disjunct ranges), and there are some values to be tuned, but then I think it'll be ready to be declared a winner (of hearts). :-)

Hi, can I use std::stable_sort on tbb::concurrent_vector without any problem if the concurrent_vector isn't being modified by other threads?

Yes. You might also want to take the opportunity to first compact() the container.

If I don't compact the concurrent_vector, should there be any problem? Just for curiosity.

concurrent_vector trades constant-time access away for the guarantee that elements stay in place (necessary for concurrency, good for cache locality). If you're not absolutely certain that most of the data is hot in cache, you might want to try to gain back some access speed, especially if you are going to be navigating around quite a bit. But by all means do your own tests (and let us know the outcome).

(Added 2013-06-09) More explicitly, you don't have to first compact() for std::stable_sort() to work correctly.

I've found it GCC parallel has a method __gnu_parallel::stable_sort. Problem is it's only on GCC, not portable across paltforms

Hold on, I've just submitted an implementation to TBB, so you can develop using __gnu_parallel::stable_sort() and substitute tbb::parallel_stable_sort()... at an undefined time in the future. :-)

(Added 2013-06-10) Hmm, the timings so far are all over the place. If I look only at 50000 elements and 4 threads (the maximum in test_parallel_sort.exe, running on a machine with 4 hyperthreaded cores), I see speedups over parallel_sort() anywhere between about 0.25 and 5. random->forward->reverse shows decreasing speedup. Best performance is for array of float, possibly because of memmove() optimisation. Food for thought...

(Added) There should also be tests with "almost sorted" sequences, because now the speedup results for "pre-sorted" are narrowly distributed around 1, because the same preliminary test is run each time. It's not even an exceptional situation, so it's not as if I *just* want to see quicksort fall hopelessly behind...

(Added) At least parallel multi-way merge sort (my hunch) shows better results than binary (from a book): over 5x vs. less than 2x speedup over quicksort. I'm not entirely sure they've each been optimised to the same level, but I'm going to spend at least an entire day not worrying about that possibility. :-)

For 500000 elements in the test, a silly bug reared its head (only then!) and was squashed, but also binary merge stagnated at speedups below 2 and multi-way merge (narrowly) exceeded 10, prior to any tuning and adjustments (especially about choice of number of parts in the division). Another thing to consider is to not eagerly swap over subranges that are already sorted but to leave them in place unless they need to be merged, just to take another head start on a race that has not been run yet: sorting a mostly sorted sequence. Still, the results were very uneven again, down to 0.2 for some inputs, mostly concurrent_vector with reverse-sorted input, so if that were realistic you might even want to copy that over to a vector, sort there, and copy back afterwards.

(Added 2013-06-11) Another addition to my mini-blog... The upside of all-over-the-place timings is that if you divided times the wrong way around, you still just get all-over-the-place timings after correcting that mistake, so you don't really have to go hide from the world right away. Still, that glorious factor 10 has now become a terrible slowdown, so I'm not very happy about that. Lazy swapping wasn't what I hoped for, but I still don't have the mostly sorted input data where it might be most useful. Next thing would be to improve the multi-way merge itself, which now still does a full sweep each time, which gets costly as the merge gets wider. But that's for another evening.

(Added 2013-06-14) I've implemented the multi-way merge with a priority_queue, but not (yet?) with the outcome I wanted.

Time to face the truth: even after pulling out all the stops, with type traits, trivial copy, multi-way merge and whatnot, merge sort is not a clear winner on performance alone when stability is not a requirement, or at least I couldn't make it be one. Quicksort does take a few partitioning rounds before getting up to full speed, but only one is sequential, then it fills 2 threads, then 4, 8, etc., and apparently it's not that big of a handicap. Then, as long as swapping is cheap enough and there's no existing order to exploit, it wipes the floor with merge sort. In the results below for 8-way parallelism (not necessarily the width of the merge, which was arbitrarily set to 64 here) and 5000000 elements as input, disregarding the rather expensive concurrent_vector, the only consistent advantage to speak of is reverse-sorted input (but if you know that property in advance you are probably a lot better off with std::reverse). Still, it could have been worse, I guess.

Minimal[] (comp) sin p=8 n=5000000 :0.057377 seconds 0.214162 seconds relative speed 0.267914

Minimal[] (comp) pre-sorted p=8 n=5000000 :0.001647 seconds 0.001834 seconds relative speed 0.898037

Minimal[] (comp) reverse-sorted p=8 n=5000000 :0.073003 seconds 0.042997 seconds relative speed 1.697863

float[] (no comp) sin p=8 n=5000000 :0.088172 seconds 0.394136 seconds relative speed 0.223710

float[] (no comp) pre-sorted p=8 n=5000000 :0.001702 seconds 0.001755 seconds relative speed 0.969801

float[] (no comp) reverse-sorted p=8 n=5000000 :0.089717 seconds 0.050909 seconds relative speed 1.762301

float[] (no comp) mostly sorted separate p=8 n=5000000 :0.080557 seconds 0.053944 seconds relative speed 1.493345

float[] (no comp) mostly sorted mixed p=8 n=5000000 :0.058281 seconds 0.076188 seconds relative speed 0.764963

float[] (comp) sin p=8 n=5000000 :0.088116 seconds 0.377433 seconds relative speed 0.233461

float[] (comp) pre-sorted p=8 n=5000000 :0.001624 seconds 0.001624 seconds relative speed 1.000000

float[] (comp) reverse-sorted p=8 n=5000000 :0.090205 seconds 0.049636 seconds relative speed 1.817330

float[] (comp) mostly sorted separate p=8 n=5000000 :0.078963 seconds 0.050833 seconds relative speed 1.553381

float[] (comp) mostly sorted mixed p=8 n=5000000 :0.057468 seconds 0.076679 seconds relative speed 0.749462

concurrent_vector<float> (no comp) sin p=8 n=5000000 :0.174679 seconds 0.465071 seconds relative speed 0.375596

concurrent_vector<float> (no comp) pre-sorted p=8 n=5000000 :0.004398 seconds 0.004435 seconds relative speed 0.991657

concurrent_vector<float> (no comp) reverse-sorted p=8 n=5000000 :0.266924 seconds 0.081756 seconds relative speed 3.264886

concurrent_vector<float> (no comp) mostly sorted separate p=8 n=5000000 :0.220463 seconds 0.090405 seconds relative speed 2.438615

concurrent_vector<float> (no comp) mostly sorted mixed p=8 n=5000000 :0.147378 seconds 0.114316 seconds relative speed 1.289216

concurrent_vector<float> (comp) sin p=8 n=5000000 :0.177702 seconds 0.465342 seconds relative speed 0.381874

concurrent_vector<float> (comp) pre-sorted p=8 n=5000000 :0.004786 seconds 0.004368 seconds relative speed 1.095696

concurrent_vector<float> (comp) reverse-sorted p=8 n=5000000 :0.259108 seconds 0.082077 seconds relative speed 3.156889

concurrent_vector<float> (comp) mostly sorted separate p=8 n=5000000 :0.216838 seconds 0.078982 seconds relative speed 2.745410

concurrent_vector<float> (comp) mostly sorted mixed p=8 n=5000000 :0.141579 seconds 0.114651 seconds relative speed 1.234869

string[] (no comp) sin p=8 n=5000000 :1.072494 seconds 1.620175 seconds relative speed 0.661962

string[] (comp) sin p=8 n=5000000 :1.435451 seconds 1.717984 seconds relative speed 0.835544

concurrent_vector<Minimal> (comp) sin p=8 n=5000000 :0.135827 seconds 0.274697 seconds relative speed 0.494461

concurrent_vector<Minimal> (comp) pre-sorted p=8 n=5000000 :0.004724 seconds 0.005422 seconds relative speed 0.871265

concurrent_vector<Minimal> (comp) reverse-sorted p=8 n=5000000 :0.247340 seconds 0.078512 seconds relative speed 3.150346

Raf,

I have an observation and a suggestion (if you are up for the task).

Observaton: First pass of Quick Sort is made by making a best guess at median key and using that as pivot point. As you stated, this restricts the first pass to single thread (unless you use leader-follower where leader selects the to-be-swapped elements and follower performs the swap). For second pass you have a better guess at the two pivot points, and two threads can run, then third pass with 4 threads, etc...

Suggestion: First pass to select multiple pivot points such that at the end of the first pass you have N partitons (N= number of threads). At the start of the second pass, each of the threads will know the number of entries in all partitions. Using this information, they then each select an optimal number of sub-sub-partitions. At some point each sub-...-sub-partition is optimally partitioned by 2.

Note, the first pass is still run with single thread (unless effective leader-follower technique can be found). This pass should be slighty slower since you are comparing N keys however the N-1 keys of the next cycle will be in L1 cache.

Jim Dempsey

How would you partition into more than 2 parts? Use lots of external storage? An extra (M-1)*N spaces would allow direct partitioning into 2*M parts, I guess, as a first approach anyway. Otherwise, unless I missed something, the partitioning thread first has to count how many elements will be in each partition before it actually goes about partitioning, and that doesn't seem very efficient, because it creates more total work, unless there's nothing else for the machine to do in the meantime and you don't mind the extra proverbial CO2. Would you really think that two-phase fully in-place multi-way partitioning would already be able to prevail over the current implementation, though? Maybe there's a minimally required degree of parallelism (2 wouldn't work)?

Meanwhile, I'm still with the merge sort, because I may have found a cause why lazy swapping didn't work yet, and a possible workaround.

(Added) I had made a copy&paste mistake in the test, so the supposedly mostly sorted data was actually mostly reverse-sorted, which caused the algorithm to always detect the need to merge, which always required swapping. The only benefit for merge sort was that merging disjunct ranges was optimised, especially for trivially copyable types and pointer iterators. However, when I corrected the mistake... quick sort benefited more than merge sort! So I had to do some more work to allow better lazy swapping and merging (by detecting an already sorted prefix in the current subrange). I'm not unhappy with the outcome, and you can see for yourself:

Minimal[] (comp) sin p=8 n=5000000 :0.049921 s 0.185660 s relative speed 0.268884

Minimal[] (comp) pre-sorted p=8 n=5000000 :0.001697 s 0.001600 s relative speed 1.060625

Minimal[] (comp) reverse-sorted p=8 n=5000000 :0.067058 s 0.040170 s relative speed 1.669355

float[] (no comp) sin p=8 n=5000000 :0.096416 s 0.376344 s relative speed 0.256191

float[] (no comp) pre-sorted p=8 n=5000000 :0.001785 s 0.001505 s relative speed 1.186047

float[] (no comp) reverse-sorted p=8 n=5000000 :0.087632 s 0.045927 s relative speed 1.908072

float[] (no comp) mostly sorted separate p=8 n=5000000 :0.024296 s 0.019774 s relative speed 1.228684

float[] (no comp) mostly sorted mixed p=8 n=5000000 :0.047737 s 0.050908 s relative speed 0.937711

float[] (comp) sin p=8 n=5000000 :0.084850 s 0.335659 s relative speed 0.252786

float[] (comp) pre-sorted p=8 n=5000000 :0.001589 s 0.001535 s relative speed 1.035179

float[] (comp) reverse-sorted p=8 n=5000000 :0.083609 s 0.052413 s relative speed 1.595196

float[] (comp) mostly sorted separate p=8 n=5000000 :0.023327 s 0.019870 s relative speed 1.173981

float[] (comp) mostly sorted mixed p=8 n=5000000 :0.054237 s 0.048402 s relative speed 1.120553

concurrent_vector<float> (no comp) sin p=8 n=5000000 :0.158414 s 0.430017 s relative speed 0.368390

concurrent_vector<float> (no comp) pre-sorted p=8 n=5000000 :0.004316 s 0.004287 s relative speed 1.006765

concurrent_vector<float> (no comp) reverse-sorted p=8 n=5000000 :0.240558 s 0.081098 s relative speed 2.966263

concurrent_vector<float> (no comp) mostly sorted separate p=8 n=5000000 :0.063967 s 0.026681 s relative speed 2.397474

concurrent_vector<float> (no comp) mostly sorted mixed p=8 n=5000000 :0.121318 s 0.064965 s relative speed 1.867436

concurrent_vector<float> (comp) sin p=8 n=5000000 :0.168943 s 0.439503 s relative speed 0.384396

concurrent_vector<float> (comp) pre-sorted p=8 n=5000000 :0.004989 s 0.004517 s relative speed 1.104494

concurrent_vector<float> (comp) reverse-sorted p=8 n=5000000 :0.238984 s 0.080120 s relative speed 2.982826

concurrent_vector<float> (comp) mostly sorted separate p=8 n=5000000 :0.065125 s 0.027061 s relative speed 2.406600

concurrent_vector<float> (comp) mostly sorted mixed p=8 n=5000000 :0.121747 s 0.065032 s relative speed 1.872109

string[] (no comp) sin p=8 n=5000000 :1.060586 s 1.513491 s relative speed 0.700755

string[] (comp) sin p=8 n=5000000 :1.213007 s 1.704814 s relative speed 0.711519

concurrent_vector<Minimal> (comp) sin p=8 n=5000000 :0.125093 s 0.254090 s relative speed 0.492318

concurrent_vector<Minimal> (comp) pre-sorted p=8 n=5000000 :0.004860 s 0.004798 s relative speed 1.012922

concurrent_vector<Minimal> (comp) reverse-sorted p=8 n=5000000 :0.238240 s 0.076140 s relative speed 3.128973

Remarkably, binary merge seems better on random input than 64-way merge, but not on mostly sorted input:

Minimal[] (comp) sin p=8 n=5000000 :0.049962 s 0.057742 s relative speed 0.865263

Minimal[] (comp) pre-sorted p=8 n=5000000 :0.001503 s 0.001596 s relative speed 0.941729

Minimal[] (comp) reverse-sorted p=8 n=5000000 :0.068662 s 0.040166 s relative speed 1.709456

float[] (no comp) sin p=8 n=5000000 :0.093805 s 0.086748 s relative speed 1.081351

float[] (no comp) pre-sorted p=8 n=5000000 :0.001773 s 0.001875 s relative speed 0.945600

float[] (no comp) reverse-sorted p=8 n=5000000 :0.083610 s 0.052436 s relative speed 1.594515

float[] (no comp) mostly sorted separate p=8 n=5000000 :0.023288 s 0.045520 s relative speed 0.511599

float[] (no comp) mostly sorted mixed p=8 n=5000000 :0.047748 s 0.043098 s relative speed 1.107894

float[] (comp) sin p=8 n=5000000 :0.084759 s 0.062855 s relative speed 1.348485

float[] (comp) pre-sorted p=8 n=5000000 :0.001603 s 0.001500 s relative speed 1.068667

float[] (comp) reverse-sorted p=8 n=5000000 :0.084579 s 0.048307 s relative speed 1.750864

float[] (comp) mostly sorted separate p=8 n=5000000 :0.023402 s 0.044682 s relative speed 0.523746

float[] (comp) mostly sorted mixed p=8 n=5000000 :0.046864 s 0.046264 s relative speed 1.012969

concurrent_vector<float> (no comp) sin p=8 n=5000000 :0.156299 s 0.120326 s relative speed 1.298963

concurrent_vector<float> (no comp) pre-sorted p=8 n=5000000 :0.004358 s 0.005489 s relative speed 0.793952

concurrent_vector<float> (no comp) reverse-sorted p=8 n=5000000 :0.230710 s 0.080521 s relative speed 2.865215

concurrent_vector<float> (no comp) mostly sorted separate p=8 n=5000000 :0.065413 s 0.074384 s relative speed 0.879396

concurrent_vector<float> (no comp) mostly sorted mixed p=8 n=5000000 :0.121653 s 0.087962 s relative speed 1.383018

concurrent_vector<float> (comp) sin p=8 n=5000000 :0.155550 s 0.117563 s relative speed 1.323120

concurrent_vector<float> (comp) pre-sorted p=8 n=5000000 :0.004350 s 0.004372 s relative speed 0.994968

concurrent_vector<float> (comp) reverse-sorted p=8 n=5000000 :0.233149 s 0.080685 s relative speed 2.889620

concurrent_vector<float> (comp) mostly sorted separate p=8 n=5000000 :0.063447 s 0.075664 s relative speed 0.838536

concurrent_vector<float> (comp) mostly sorted mixed p=8 n=5000000 :0.117638 s 0.077263 s relative speed 1.522566

string[] (no comp) sin p=8 n=5000000 :1.012791 s 1.638284 s relative speed 0.618202

string[] (comp) sin p=8 n=5000000 :1.220548 s 1.987904 s relative speed 0.613987

concurrent_vector<Minimal> (comp) sin p=8 n=5000000 :0.125427 s 0.115113 s relative speed 1.089599

concurrent_vector<Minimal> (comp) pre-sorted p=8 n=5000000 :0.004465 s 0.004210 s relative speed 1.060570

concurrent_vector<Minimal> (comp) reverse-sorted p=8 n=5000000 :0.232239 s 0.081898 s relative speed 2.835710

Not too shabby:

string[] (no comp) essentially random p=8 n=5000000 :1.214567 s 1.188623 s relative speed 1.021827

string[] (no comp) pre-sorted p=8 n=5000000 :0.126007 s 0.117482 s relative speed 1.072564

string[] (no comp) reverse-sorted p=8 n=5000000 :3.546244 s 0.395822 s relative speed 8.959189

string[] (no comp) mostly sorted separate p=8 n=5000000 :1.485870 s 0.320982 s relative speed 4.629138

string[] (no comp) mostly sorted mixed p=8 n=5000000 :2.114791 s 0.484057 s relative speed 4.368888

string[] (comp) essentially random p=8 n=5000000 :1.493962 s 1.341368 s relative speed 1.113760

string[] (comp) pre-sorted p=8 n=5000000 :0.097033 s 0.081324 s relative speed 1.193166

string[] (comp) reverse-sorted p=8 n=5000000 :3.022029 s 0.411889 s relative speed 7.336999

string[] (comp) mostly sorted separate p=8 n=5000000 :1.293284 s 0.281310 s relative speed 4.597362

string[] (comp) mostly sorted mixed p=8 n=5000000 :2.002046 s 0.473519 s relative speed 4.228016

This is a straight win against quicksort for std::string[] (should be the same for std::vector<std::string>), with the best relative speed easily twice as good as the worst relative speed elsewhere (well, before prospect theory, of course...)!

For your education and entertainment, here's my contribution for parallel_merge() and parallel_stable_sort(). The only thing somebody has to do is take a weed whacker to it to eliminate the dead code, now left there for the benefit of whoever wants to run some comparisons, although it could still happen that one of several variations should be chosen based on some cut-off values.

## 附件:

**Quote:**

jimdempseyatthecovewrote:Suggestion: First pass to select multiple pivot points such that at the end of the first pass you have N partitons (N= number of threads). At the start of the second pass, each of the threads will know the number of entries in all partitions. Using this information, they then each select an optimal number of sub-sub-partitions. At some point each sub-...-sub-partition is optimally partitioned by 2.

Note, the first pass is still run with single thread (unless effective leader-follower technique can be found). This pass should be slighty slower since you are comparing N keys however the N-1 keys of the next cycle will be in L1 cache.

Do you or does anyone have a good experience with such a technique? I started to experiment with this, also based on "Chapter 14: Sample Sort" from the book "Structured Parallel Programming", using concurrent_vector flexible-size "buckets" instead of multiple passes to tally and scatter, but I rarely see an improvement over parallel_quick_sort (mostly looking at 8 threads and 5000000 elements). I verified that distribution is OK (except when most elements are equal). Could it be the redundant storage, the concurrent_vector (used only with push_back() and iterators, and capacity increments of what it needs on average, so fragmentation shouldn't be an issue, I would think)? Do I really have to do the multiple passes to get histogram information and perform a scatter to just-enough storage?

Perhaps you were suggesting something else, because the only sequential initial pass I'm doing is gathering samples (oversample, sort, sample at equal intervals)?

(Correction) After correcting a copy-paste error in one of the overloads (the one that's supposed to call the other with a std::less)... there is more upside visible, but whenever it's really significant it's trumped by parallel_stable_merge()! And it still runs slower than parallel quicksort sometimes. This is now also with a vector instead of a concurrent_vector, though.

For your education and entertainment, here's my contribution for parallel_merge(), parallel_stable_sort(), and now also parallel_sample_sort(), well, sort of.

Maybe somebody would like to try this, perhaps run some benchmarks (with pretty graphs), and provide some feedback (things I may have overlooked, ...)?

(Added 2013-07-14) I thought I'd post this early in the weekend and forget about it, but of course I should have known it wouldn't work out that way. I've already modified the code to be able to use a non-square buckets array (why should it be square at all, because more columns means a better partial sort, right?). And now I'm thinking about modifying the pack-and-sort step to do some splitting of its own, because it seems wasteful not to have at least a binary split at this point (each column packed to two subranges). I've already seen meaningful speedups over parallel quicksort in places (though not across the board), and there's some further progress in store. That's close contact with the evolutionary landscape of algorithms (with a touch of intelligent design)...

## 附件:

A question: what are the expectations that an optimising build will be able to translate std::copy() or std::move() into memcpy()-like territory on its own for various kinds of iterators (pointer, vector iterator, ...) and various trivially copyable types (fundamental types, structs, ...)? At some intermediate level it should become something recognisable for an optimiser, but can you actually trust that it will generate optimal code?

## Parallel Stable Sort

According to http://threadingbuildingblocks.org/docs/help/reference/algorithms/parallel_sort_func.htm.

So I need a parallel stable sort. How can I get that?