This article describes a parallel merge sort code, and why it is more scalable than parallel quicksort or parallel samplesort. The code relies on the C++11 “move” semantics. It also points out a scalability trap to watch out for with C++. The attached code has implementations in Intel® Threading Building Blocks (Intel® TBB), Intel® Cilk™ Plus, and OpenMP*.
For more detailed analysis of parallel quicksort, samplesort, and merge sort, see the book Structured Parallel Programming (by Michael McCool, James Reinders, and me). I’ve also provided numerous links in this article to background material.
I wrote the original version of the parallel quicksort that ships with Intel TBB. At the time, I knew that theoretically it did not scale as well as some other parallel sorting algorithms. But in practice, for the small core counts common when Intel TBB first shipped, it was usually the fastest sort, because it was in place. The other contending sorts required temporary storage that doubled the memory footprint. That bigger footprint, and the extra bandwidth that it incurred, clobbered theoretical concerns about scalability. Furthermore, C++11 move semantics were not standardized yet (though many developers had hacked their own idiosyncratic versions), thus moving objects was sometimes expensive, which hurt alternative sorts that required moving objects. Now that more hardware threads are common (indeed an Intel® Xeon Phi™ coprocessor has 240 on a chip), and C++11 move semantics are ubiquitously available, fundamental scalability analysis comes back into play.
Why Parallel Quicksort Cannot Scale
Parallel quicksort operates by recursively partitioning the data into two subsets and sorting the subsets in parallel. The first level partitioning uses one thread, the second level uses 2 threads, the third 4 threads, then 8, 16 and so forth. This exponentially growing parallelism sounds great. But work-span analysis shows that parallel quicksort cannot scale.
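To make the serial bottleneck concrete, here is a minimal sketch of the scheme (illustrative only, using std::async rather than the TBB implementation that ships with the library; the function name and cutoff parameter are mine):

```cpp
#include <algorithm>
#include <cassert>
#include <future>
#include <vector>

// Minimal parallel quicksort sketch: partition serially, sort halves in parallel.
template<typename It>
void parallel_quicksort(It first, It last, long cutoff = 1000) {
    if (last - first < cutoff) {
        std::sort(first, last);        // small case: serial sort
        return;
    }
    auto pivot = *(first + (last - first) / 2);
    // This three-way partitioning pass runs on a single thread and touches
    // all n elements -- exactly why the span of the whole sort is Θ(n).
    It mid1 = std::partition(first, last,
                             [&](const auto& x) { return x < pivot; });
    It mid2 = std::partition(mid1, last,
                             [&](const auto& x) { return !(pivot < x); });
    // Only the two subsorts proceed in parallel.
    auto left = std::async(std::launch::async,
                           [=] { parallel_quicksort(first, mid1, cutoff); });
    parallel_quicksort(mid2, last, cutoff);
    left.wait();
}
```

Elements equal to the pivot land between mid1 and mid2 and need no further sorting, which also guarantees the recursion terminates.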
Work-span analysis is the foundation for analyzing parallel algorithms. Let TP denote the time to execute an algorithm on P threads on an ideal machine (infinite available hardware threads, infinite memory bandwidth). The work for an algorithm is T1, which is the serial execution time. The span is T∞, the execution time on an infinite number of processors. For parallel quicksort, T1 = Θ(n lg n) and T∞ = Θ(n). (Θ is like "big O", except that it denotes asymptotic upper and lower bounds. We need both bounds because we'll be computing asymptotic quotients.) The latter bound arises because the first partitioning step is serial and thus gets no benefit from having more than one thread.
The available parallelism is T1/T∞, in other words the maximum achievable speedup on an ideal machine. Using more threads than T1/T∞ can’t help much, even if you are lucky enough to have an ideal machine. For parallel quicksort, T1/T∞ = Θ(lg n). Thus the parallelism is about 30 if sorting a billion keys. That's well short of the parallelism available on a 240-thread Intel Xeon Phi coprocessor.
Samplesort’s Achilles’ Heel
Parallel samplesort overcomes the shortcoming of parallel quicksort by parallelizing the partitioning operation, and doing a many-way partitioning instead of a two-way partitioning. The analysis is a bit complicated (see the book), but the net result is that parallel samplesort scales nicely on an ideal machine. Unfortunately, on real machines, threads are not the only resource of concern. The memory subsystem can be the limiting resource. In particular, a many-way partitioning generates many streams of data to/from memory. When each stream occupies at least a page (commonly the case for a big sort), each stream will need an entry in the Translation Lookaside Buffer (TLB). Having more streams than entries causes the TLB to thrash. This is not to say that samplesort is hopeless. My experience has been that samplesort does very well as long as enough TLB capacity exists, which it typically does for multi-core machines. Alas, at the extremes of many-core machines such as the Intel Xeon Phi coprocessor, TLB capacity becomes a problem for samplesort.
Parallel Merge Sort
Parallel merge sort works by recursively sorting each half of the data, and then merging the two halves. The two subsorts can be done in parallel, and the merge can be done in parallel too, using a divide-and-conquer tactic. The parallel merge works like this: Given two sorted sequences to merge:
- Let k be the middle key of the longer sequence
- Find k in the shorter sequence, using binary search.
- Split each sequence into two parts, using k as the split point. To get a stable sort, some care has to be taken as to which part k ends up in. For details, see where the code uses std::lower_bound and std::upper_bound.
- Recursively merge the first part of each sequence. Then recursively merge the second part of each sequence. These two sub-merges can be done in parallel.
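The steps above can be sketched in C++ as follows. This is a hypothetical serial skeleton (not the attached code); the two recursive sub-merges at the bottom are the calls a parallel runtime would spawn as independent tasks. The output iterator is assumed to be random-access:

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <vector>

// Divide-and-conquer merge of sorted [xs,xe) and [ys,ye) into zs.
template<typename It, typename OutIt>
void divide_and_conquer_merge(It xs, It xe, It ys, It ye, OutIt zs) {
    const auto nx = std::distance(xs, xe);
    const auto ny = std::distance(ys, ye);
    if (nx + ny <= 8) {                      // small case: plain serial merge
        std::merge(xs, xe, ys, ye, zs);
        return;
    }
    It xm, ym;
    if (nx >= ny) {
        xm = xs + nx / 2;                    // middle key k of longer sequence
        ym = std::lower_bound(ys, ye, *xm);  // y-keys equal to k stay right of
                                             // k, preserving stability
    } else {
        ym = ys + ny / 2;
        xm = std::upper_bound(xs, xe, *ym);  // x-keys equal to k stay left of
                                             // k, preserving stability
    }
    // These two sub-merges are independent; a parallel runtime runs them
    // as two tasks.
    divide_and_conquer_merge(xs, xm, ys, ym, zs);
    divide_and_conquer_merge(xm, xe, ym, ye, zs + (xm - xs) + (ym - ys));
}
```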
Parallel merge sort is attractive because it can be written without any Θ(n) bottlenecks, is cache-oblivious, and has nice memory streaming behavior. It has T1 = Θ(n lg n) and T∞ = Θ(lg³ n), thus its parallelism is T1/T∞ = Θ(n / lg² n). Thus the parallelism is on the order of a million if sorting a billion keys. Even for sorting just a million keys, the parallelism will be on the order of 2500.
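A quick sanity check on the two parallelism figures (the Θ bounds hide constant factors, so these are order-of-magnitude estimates only, computed with the constants taken as 1):

```cpp
#include <cassert>
#include <cmath>

// Available parallelism T1/T∞, with hidden constants taken as 1.
inline double quicksort_parallelism(double n) {   // Θ(lg n)
    return std::log2(n);
}
inline double mergesort_parallelism(double n) {   // Θ(n / lg² n)
    double lg = std::log2(n);
    return n / (lg * lg);
}
```

For a billion keys, quicksort_parallelism gives about 30; for a million keys, mergesort_parallelism gives about 2500, matching the figures quoted above.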
However, there is a scalability trap to avoid if using C++. Parallel merge sort requires a temporary buffer. The buffer objects must be initialized (default-constructed) and destroyed. For types with trivial constructors/destructors (such as int and float), these operations take zero time. But for types such as std::string, these operations take time, and so constructing or destroying the buffer serially raises the span to Θ(n), clobbering parallelism back to the same order as parallel quicksort.
A simple solution is to construct (or destroy) the buffer objects in parallel. But doing so introduces more scheduling overhead, and has poor locality with respect to uses of those objects. A more elegant approach is to construct the buffer objects at the leaves of the merge sort recursions, and destroy them at the leaves of the parallel merge recursions. That way the work for construction/destruction is distributed across the threads, with good locality since the first construction almost immediately precedes first use, and the last use almost immediately precedes destruction.
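One way to realize this (a sketch under my own naming, not necessarily identical to the attached code) is to allocate raw storage without running any constructors, and let the leaves of the recursion construct into it with placement new and destroy the elements when done:

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <new>
#include <string>

// Raw storage for n objects of type T. No T constructors or destructors run
// here; the leaves of the recursion are responsible for both.
template<typename T>
class raw_buffer {
    void* p;
public:
    explicit raw_buffer(std::size_t n) : p(operator new(n * sizeof(T))) {}
    ~raw_buffer() { operator delete(p); }          // frees memory only
    T* get() const { return static_cast<T*>(p); }
    raw_buffer(const raw_buffer&) = delete;
    raw_buffer& operator=(const raw_buffer&) = delete;
};

// At a leaf of the merge-sort recursion, an object is moved into the raw
// storage via placement new (so construction happens in parallel, right
// before first use):
//     new(buf.get() + i) T(std::move(x[i]));
// and destroyed at a leaf of the parallel merge recursion, right after
// last use:
//     std::destroy_at(buf.get() + i);
```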
Ready to Use Code
The attachment has four versions of the code, and a test. The code requires support for C++11 “move” semantics. None of the versions is exception safe if the keys have non-trivial destructors. Making them exception safe would add much complexity, and in most applications the operations on the keys are not going to throw exceptions, particularly since keys are relocated using move operations instead of copy construction or assignment.
The four versions, listed from highest-level to lowest-level expression, are written in:
- High-level TBB – uses tbb::parallel_invoke. It's so easy to translate to Intel Cilk Plus or OpenMP that I've left that as an exercise.
- Cilk Plus – similar to high-level TBB, but it converts a spawned “tail call” to a while loop to save some cycles.
- OpenMP 3.0 – practically a clone of the Cilk Plus version. The caller must invoke it in a parallel region to get any parallelism.
- Low-level TBB – uses tbb::task to get the benefits of Cilk-like “parent-stealing” scheduling.
The test program checks that keys are sorted, the sort is stable, and that no objects are leaked or used incorrectly.
For exposition, all four versions share a common header pss_common.h. The top-level routine is pss::parallel_stable_sort, which has two overloads similar to std::stable_sort. One of the overloads is in the common header. If using the code in a production environment, I suggest choosing one version, incorporating the content of pss_common.h directly into it, and renaming the namespace pss to whatever suits your fancy. You may also want to add a traditional single-inclusion #ifndef guard.
I've left out the obligatory performance/scaling graphs, because the performance is dependent on the hardware and key type. So try it on your own favorite dataset. I've been happy with it, particularly for a sort written in less than 150 lines of code.
Notes on the OpenMP Version
The OpenMP version demonstrates a generally useful trick when using OpenMP tasking. Tasking was grafted onto OpenMP well after it was founded on the notion of parallel regions, which creates a problem for a routine that uses OpenMP tasking, because there are two contexts to consider:
- It's not invoked in a parallel region, and thus needs to create one to get any parallelism.
- It's invoked in a parallel region, and thus should not create a nested one, because one of two bad things would happen:
- Nested parallelism is disabled (default in OpenMP), in which case the parallel region would have no parallelism.
- Nested parallelism is enabled. The sort would end up running in a nested parallel region, which tends to perform poorly in my experience.
The code solves the problem by conditionally creating a parallel region and using the master thread to start the sort, as shown below:
```cpp
if( omp_get_num_threads() > 1 )
    internal::parallel_stable_sort_aux( xs, xe, (T*)z.get(), 2, comp );
else
#pragma omp parallel
#pragma omp master
    internal::parallel_stable_sort_aux( xs, xe, (T*)z.get(), 2, comp );
```
While translating the sort to OpenMP, I discovered (and reported) a bug in the Intel OpenMP implementation of firstprivate for parameters with non-trivial copy constructors. The code has a work-around for the issue (look for __INTEL_COMPILER in openmp/parallel_stable_sort.h to find it).
Andrey Churbanov diagnosed the nature of the Intel OpenMP problem and suggested the work-around. Alejandro Duran pointed out the trick of conditionally creating a parallel region.
*The OpenMP name and the OpenMP logo are registered trademarks of the OpenMP Architecture Review Board