Continued from "Conditional parallelization".

Hmm, apparently parallel_sort is currently implemented with parallel_for and a weird-looking quick_sort_range class that does the partitioning... and has a "grainsize" parameter. That's either really neat, because it follows naturally from the quicksort algorithm and uses only high-level constructs, or problematically naive man-with-a-hammer code, because the worst case may recurse to depth N (or so I hear, see below). The solution for that would be to recurse on the biggest subrange and to iterate on the smallest subrange. This could of course still be implemented in terms of quick_sort_range and intimate knowledge of which side will be delegated to a child task (with the ranges swapped if needed), but that's stretching it a bit (not to call it cheating). And the "grainsize" parameter is still just a constant, inaccessible to the user of parallel_sort.

I read about the worst-case behaviour in "Structured Parallel Programming", shown on the TBB website. The book shows how to put the largest subrange in the continuation in the Cilk Plus implementation, because Cilk Plus has steal-continuation semantics. However, it then goes on to demonstrate TBB continuation-passing style based on this example, with the biggest subrange in the continuation, instead of simply putting the biggest subrange in the child. Is that inappropriate, or did I overlook something?

The direct relevance for the implementation depends of course on how likely or unlikely it is to encounter such a worst case, accidentally or otherwise (DoS attack?).