# parallel_sort

## parallel_sort

Continued from "Conditional parallelization".

Hmm, apparently parallel_sort is currently implemented with parallel_for and a weird-looking quick_sort_range class that does the partitioning... and has a "grainsize" parameter. That's either really neat, because it follows naturally from the quicksort algorithm and uses only high-level constructs, or problematically naive man-with-a-hammer code, because the worst case may recurse to depth N (or so I hear, see below). The solution for that would be to recurse on the biggest subrange and to iterate on the smallest subrange. This could of course still be implemented in terms of quick_sort_range and intimate knowledge of which side will be delegated to a child task (with the ranges swapped if needed), but that's stretching it a bit (not to call it cheating). And the "grainsize" parameter is still just a constant, inaccessible to the user of parallel_sort.

I read about the worst-case behaviour in "Structured Parallel Programming", shown on the TBB website. The book shows how to put the largest subrange in the continuation in the Cilk Plus implementation, because Cilk Plus has steal-continuation semantics. However, it then goes on to demonstrate TBB continuation-passing style based on this example, with the biggest subrange in the continuation, instead of simply putting the biggest subrange in the child. Is that inappropriate, or did I overlook something?

The direct relevance for the implementation depends of course on how likely or unlikely it is to encounter such a worst case, accidentally or otherwise (DoS attack?).

21 帖子 / 0 全新

>>... The solution for that would be to recurse on the biggest subrange and to iterate on the smallest subrange...

Some time ago I tried to combine QuickSort ( recursive ) and SelectSort ( iterative ) algorithms using that concept and, unfortunately, it didn't increase performance. Why? Because asymptotic complexity of QuickSort algorithm is better compared to asymptotic complexity of SelectSort. Even overhead of recursive calls doesn't make QuickSort algorithm slower. Of course, when input data set is just four values it doesn't make sence to use QuickSort. When data sets are large, for example greater than 64MB, the fastest sorting algorithms are as follows:

1. MergeSort
2. HeapSort
3. QuickSort

and that list is based on my real tests completed in 2012.

The issue here is possible stack exhaustion (rather than performance).

Without inspecting the code (which might not be safe across versions), perhaps we may assume that, since it is documented that the implementation "executes iterations from left to right" in the absense of worker threads, the subrange being constructed by the splitting constructor, which is assumed to be the rightmost one in this sense, is the one being spawned (so to say)? If such is the case, there should probably be a conditional swap after the partitioning to give it the biggest subrange instead of always the rightmost subrange (details too trivial to submit as a contribution).

There's a whole lot of material about quicksort to be found (DoS is even explicitly mentioned as something to defend against), including adaptations like introsort (to observe progress and, if needed, escape to an algorithm with better worst-case behaviour), but it seems that the implementation should address at least the concern in the book already mentioned above, or else motivate why it does not merit even a trivial concession. A response?

(Being able to use the Range this way (with most of the work done as a side effect of the splitting constructor) is granted only by example in the PDF version of the document I'm still using, not explicitly, which would probably be better.)

>>...The issue here is possible stack exhaustion (rather than performance).

In that case an application always crashes. What about implementation complexity? All these classic sorting algorithms are very simple and small when it comes to size ( that is, number of code lines ).

Citazione:

In that case an application always crashes.

And maybe one could in fact construct a sequence that will reproducibly crash the application even if you have so far been lucky enough not to be dealing with the occasional accident, or with an adversary. Perhaps there does exist a "median-of-medians killer", just like a "median-of-3 killer"?

Hmm, I'm not saying that the implementation is definitely wrong, I'm more wondering about the discrepancy with the book, and that gets compounded by what I pick up from my admittedly superficial research.

The book mentions stack exhaustion for the serial implementation. In TBB, work is put aside on the (infinite) task deque instead of the (finite) stack, and I'll assume that new space can be allocated in amortised constant time. Unless I'm mistaken, in a naive fork-join implementation that may still translate to stack exhaustion. Maybe that problem just goes away because of iteration on the non-spawned side? Then that would be a dependency on the implementation of parallel_for that merits at least a mention in the implementation of parallel_sort (I'm one of those crazies who advocates source code documentation). But it would also make the decision taken in the book to go with continuation-passing style even more mysterious than it already is.

I think I should go over this again to see what I may have missed.

Citazione:

What about implementation complexity? All these classic sorting algorithms are very simple and small when it comes to size ( that is, number of code lines ).

>>
>>I think I should go over this again to see what I may have missed.

I have that book and could you give page numbers to look at? I'd like to review a part you're talking about.

That would be pp. 230-238.

On p. 232 it is mentioned that worst case will recurse to depth N with possible stack overflow, which seems a bit of a stretch, compared to just O(N)? Furthermore, it seems that there's an erratum in the suggested solution, which would be the opposite of what is written here.

For TBB 4.1 update 2 (tbb41_20130116oss_src.tgz), I was wondering why it doesn't use std::partition, which is in C++98. Perhaps because of the asserts, but then why would anybody use std::partition? What does "// Loop must terminate since array[l]==*key0." mean (l is not used)? Why doesn't the first subloop run "while( 1<j && !comp( array[j], *key0 ))" and the second "while( i<j && !comp( *key0, array[i] ))" (or so), which would avoid swapping on equivalent elements, of course with adapted asserts and tests? Why don't I see something in test_parallel_sort.cpp about equivalent elements (those sin() values are all going to be different, for sure)? parallel_sort.h:212 is wrong about using "std::less<RandomAccessIterator>", and parallel_sort.h:219-226 is redundant, i.e., should be removed, because this is already handled by the template with RandomAccessIterator and the use of std::iterator_traits.

I may have further remarks later.

The pre-test for existing order takes into account equivalent elements, which is good. Perhaps not using partition() is good for dynamic behaviour because it will split a range of equivalent elements down the middle, but I don't like that it will have furiously swapped many elements to get there. To avoid pathological behaviour, partition() would have to be followed by adjacent_find() (right?), which makes for a lot of redundant tests compared to bespoke code. And shouldn't either solution then also be used to exclude that pivot-equivalent subrange from the subrange being split-constructed and the subrange left over? If you have mostly equivalent elements, except for a few that upset total order and hence force sorting, such an adaptation should provide a noticeable difference with only a trivial extra change, however unlikely the case may be.

I may have further remarks later, but I wouldn't dismiss some feedback already.

(Added) To avoid swapping with the current code, extra predicate tests are needed to avoid skewing the partition, making the code as costly as partition() plus adjacent_find() anyway, right? That means the question is to either keep the existing code, or use partition() plus adjacent_find() plus excluded pivot-equivalent range. The number of swaps in the first case and the cost of adjacent_find() in the second case grows in the same direction (small for typical inputs, larger for many equivalent elements), but probably extra tests are cheaper than extra swaps, so the second case has a double advantage (simpler and cheaper). Well, there will be more tests than there would be swaps, so the predicate has to be cheap enough. Did I overlook anything?

>>...On p. 232 it is mentioned that worst case will recurse to depth N with possible stack overflow...

Even without looking at the page 232 I could tell that a very important technical detail is missing and it is a Default Stack Size for a test application. Every application has its own stack and it is configurable!

In 2010 I studied that subject ( What is A Depth of the Recursion? ) and created a set of simple tests for different C++ compilers and OSs. I could provide some numbers if interested.

I would prefer a solution that also works well on a 32-bit platform, where stack space is more constrained.

BTW, somehow I used adjacent_find() instead of find_if(bind1st()) in my analysis above (if that's the correct one this time). It already sounded weird to me, but it was a rushed change from an arguably more costly upper_bound(). Sorry for the confusion.

(Added) As long as the ordering predicate inherits from std::binary_function, e.g., "class MinimalCompare : public std::binary_function<Minimal, Minimal, bool>" (!), the new version appears to work, with about a 1% total running-time penalty (probably because of additional comparisons) on test_parallel_sort.exe, which does not have test sequences with lots of equivalent elements. If I add those, the existing algorithm's penalty is in the high single digits overall, and roughly 50% on just the additional time. I didn't pry out which portion of that belongs to the serial comparison sort, and I also haven't tried longer sequences (where the difference should be even more pronounced), but these results seem to confirm my expectations.

>>...That would be pp. 230-238.

Read it and I think that there is nothing wrong with it. If we give a task ( implement some algorithm, for example ) to ten software developers we could see ten different implementations. Also, every IT book assumes that a reader does its own home-work ( thinks, analyzes, tests, etc ) and this is what you've done.

Let's agree to disagree about room for improvement.

In case anybody's interested, I've attached my proposed changes relative to TBB 4.1 update 2.

(2014-01-25 Removed source code again, see latest instead.)

In the patch above, the following would be useful to avoid requiring binary_function-provided features on the Comparator. (Note that C++11 deprecates binary_function, bind1st etc., but that isn't a consideration when writing code that should also work with a pre-C++11 compiler.)

I think that it makes the (rest of the) code more readable, although the line with the find_if is a requirement to avoid worst-case linear complexity as well as an improvement over just cutting such a sequence in half (to get the same benefit in the original code would make it even less readable), and the binder-related code in itself is admittedly more than just a few lines.

```// These replacements for std::bind1st and std::bind2nd avoid requiring binary_function-provided features
// when used with comparators (non-bound-argument type same as bound-argument type, result_type always bool).
template <class Operation, typename T>
class comparator_binder1st : public std::unary_function<T, bool> {
private:
Operation m_op;
T m_bound;
public:
comparator_binder1st(const Operation& op, const T& bound) : m_op(op), m_bound(bound) {}
bool operator()(const T& x) const { return m_op(m_bound, x); }
};
template <class Operation, typename T>
inline comparator_binder1st<Operation, T>
comparator_bind1st(const Operation& op, const T& bound)
{
return comparator_binder1st<Operation, T>(op, bound);
}
template <class Operation,typename T>
class comparator_binder2nd : public std::unary_function<T, bool> {
private:
Operation m_op;
T m_bound;
public:
comparator_binder2nd(const Operation& op, const T& bound) : m_op(op), m_bound(bound) {}
bool operator()(const T& x) const { return m_op(x, m_bound); }
};
template <class Operation, typename T>
inline comparator_binder2nd<Operation, T>
comparator_bind2nd(const Operation& op, const T& bound)
{
return comparator_binder2nd<Operation, T>(op, bound);
}
```

Does anybody have an idea whether the following idea would work well?

Instead of aiming for good balancing by way of eager swapping, as TBB does now, including swapping two elements both equivalent to the pivot value (would an extra comparison to prevent that be useful or not?), perhaps successive rounds could use std::partition with elements equivalent to the pivot alternatingly right or left, speculating that, if the input set has a large number of equivalent elements, the algorithm will quickly arrive at a subrange of just elements equivalent to the pivot, and be able to discard them all together instead of having to subdivide further.

I've seen many variations of quick sort during a superficial web search (even ternary in-place partitioning), but I haven't found any references to a scheme like the one I've described. It seems promising for some simplistic inputs, but those could be deceiving of course. If you happen to have looked into this, please feel free to shatter my delusion. Or, if you get a kick out of such things, I could post the source code here for examination.

I keep getting drawn back to this simple algorithm (my excuse is that a lot of CPU time is spent on sorting), but still shying away from an actual in-depth comparative analysis, hoping that somebody else might be into that (working code is available on demand)?

Anyway, the thing is that the current partitioning step swaps eagerly, i.e., also elements equivalent to the pivot value, which feels suboptimal, and it only leaves out the pivot element from the next round. If all elements are equivalent, O(n*log n) useless swaps are performed!

I tried to leave out more elements for subsequent rounds using alternating std::partition and an explicit scan for equivalent elements, but, despite leveraging an existing algorithm, that becomes a bit of a mouthful.

So how about just swapping reluctantly instead of eagerly, and then also allowing i and j to cross each other to discover more equivalent elements? The range becomes partitioned as follows: <=key, ==key, >=key (this is less aggressive than a possible <key, ==key, >key, but also a whole lot cheaper). This means less swapping and possibly less to take to the next round. But perhaps I overlooked something that would make the (to me) less intuitive eager swapping a better choice after all?

(Added 2013-09-18) This is the most important change:

```const RandomAccessIterator array = range.begin;
const size_t array_size = range.size();

// Partition interval [1,array_size) with key *pivot.
size_t i = 1, j = array_size;
while (true) {
while (i < array_size && !comp(*pivot, array[i]  )) ++i; // <=
while (1 < j          && !comp(array[j-1], *pivot)) --j; // >=
if (!( i + 2 <= j )) break; // cannot be just a single element (see assert below)
std::swap(array[i++], array[--j]);
}

// Now we have 3 mutually ordered subranges.
// The central subrange has locally maximal extent,
// also below after the pivot element is moved to it.
__TBB_ASSERT(1 <= j                   , NULL); // <=
__TBB_ASSERT(     j <= i              , NULL); // ==
__TBB_ASSERT(          i <= array_size, NULL); // >=

// Move the pivot element to the central subrange.
if (1 < j--) std::swap(array[j], *pivot);

begin = range.begin + i;
end = range.end;
range.end = range.begin + j;
```

Another thought: limiting recursion depth and a quick parallel ramp-up seem to be mutually interfering goals. For want of instant knowledge of the current load factor, should the algorithm perhaps favour one or the other based solely on algorithm-local recursion depth? And what should be the strategy without such information? Note that quick ramp-up can still be achieved without eager swapping if left and right are grown alternatingly until they meet, so that wouldn't be the objection that it now is. I don't remember reading about any of that before... funny how such a staple algorithm continues bringing surprises. But on to something else now, before I find myself profiling the various alternatives.

Any ideas?

Aha, finally nailed it!

With semi-eager swapping (never swap elements both equivalent to pivot as in the current implementation, but don't be entirely lazy either) and equivalence discovery delegated to the next round, I'm getting fairer division than now, same quick ramp-up, and no waste of knowledge relative to the previous round's pivot (much better than now, without lots of superfluous swapping).

It's only relevant when many elements are mutually equivalent, but that does happen...

Here are the changes. The price is right, but the payoff depends on the prevalence of inputs with lots of mutually equivalent items. One takeaway is that the optimization for parallel execution is essentially different from that for sequential execution: fair distribution of work, and instead of quickly discovering lots of equivalent elements during the current round it pays to delegate that to the next round... I think. There's also some attention for not actualising potential parallelism: the cutoff for potential parallelism is kept at 500 elements, but I don't think it pays to spawn a task if one of the parts is only a small fraction of that. And even with some of those features disabled I think it's an improvement, as described above. Feedback appreciated!

(2014-01-25 Removed source code again, see latest instead.)

Another idea: why not leave the pivot element in place instead of always swapping it with the first element? The goal is zero writes (swaps) for already-sorted subranges, and it seems to work, but I have not timed it yet, so I don't know whether it's actually useful.

I find this amazing: quicksort must have been analysed to death by now, and I've looked at quite a few implementations, so why exactly am I still discovering things I don't remember ever having seen before, with this little cherry on top? Am I just delusional, or really onto something? :-)

BTW, does anyone know how std::sort stacks up? Will it also process an already-sorted sequence without writes?

Hmm, it seems that std::sort algorithmic complexity requirements have been tightened up in C++11, so the next step might be an application of something like introsort here as well...

Feedback welcome!

(2013-12-03 Added) Now with source code.

(2014-01-25 Removed source code again, see latest instead.)

The creative part was all in good fun, but I don't like the measuring part quite so much, even to prove a point. Of course there's also the chance of finding out that it was all for naught, or worse.

But maybe somebody else would like to pick this apart, to see what it does with lots of equivalent elements and preexisting (partial) order?

(2013-12-15 Added) Hmm, strange… That lazy swapping didn't give a noticeable difference (maybe it would with types that are more expensive to swap?), but what did improve performance (well, by a few percent anyway) was reducing the grainsize to crazy-low values (don't forget swapping out the auto_partitioner for a simple_partitioner), until performance finally declined again with grainsize in the single digits. Go figure!

I thought I'd just replace a few earlier versions with the latest.

Hopefully at least the skeleton changes will make it into TBB: avoiding redundant swapping in the main loop, and always swawning the largest piece.