Observations from Parallel Sorting Part I: the subtleties of tbb::parallel_reduce

In this series of postings, I discuss two common sorting algorithms, mergesort and quicksort, and highlight some of the interesting issues that arise when creating parallel implementations using TBB.  In all cases we’ll assume that, like STL’s std::sort, the inputs to the sort are a pair of RandomAccessIterators, begin and end, that denote the beginning and end of the collection to be sorted, and a comparator object compare that provides an operator() for ordering two given elements.

In this post, I’ll give a quick overview of sequential mergesort and then describe a false start in parallelizing it using tbb::parallel_reduce.

Sequential Mergesort

Mergesort, like quicksort, is a recursive algorithm.  A sequential implementation might look something like:

template <typename RandomAccessIterator, typename Compare>
void MergeSort( RandomAccessIterator begin, RandomAccessIterator end,
                const Compare &compare ) {
    if ( end - begin > 1 ) {
        RandomAccessIterator midpoint = begin + (end - begin) / 2;
        MergeSort( begin, midpoint, compare );
        MergeSort( midpoint, end, compare );
        InPlaceMerge( begin, midpoint, end, compare );
    }
}

The stopping condition for the recursion is a single element, which is a trivially sorted collection.  Otherwise, the collection is divided into roughly equally sized halves and each half is recursively sorted with MergeSort.  At each step, the sorted halves are merged into a single sorted collection.  Mergesort has a computational complexity of O(n lg n), since there are O(lg n) levels of recursion and O(n) comparisons are needed to perform the merges at each level of the recursion tree.
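The InPlaceMerge helper used above is left unspecified.  A minimal sketch, assuming it simply forwards to STL’s std::inplace_merge (the name and signature are mine, chosen to match the call site above):

```cpp
#include <algorithm>

// A minimal sketch of the InPlaceMerge helper called by MergeSort,
// assuming it simply forwards to STL's std::inplace_merge.  The two
// already-sorted halves [begin, midpoint) and [midpoint, end) are
// combined into a single sorted range [begin, end).
template <typename RandomAccessIterator, typename Compare>
void InPlaceMerge( RandomAccessIterator begin, RandomAccessIterator midpoint,
                   RandomAccessIterator end, const Compare &compare ) {
    std::inplace_merge( begin, midpoint, end, compare );
}
```

std::inplace_merge performs the O(n) combining step; a hand-rolled merge into a temporary buffer would work equally well.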

A quick example of mergesort for an array {B, C, D, A} is shown below:

The downward phase:

{B, C, D, A} → {B, C}, {D, A} → {B}, {C}, {D}, {A}

The upward merging phase:

{B}, {C}, {D}, {A} → {B, C}, {A, D} → {A, B, C, D}

So how can we parallelize this algorithm using TBB?

It’s clear from the sequential algorithm above that the MergeSort calls at the same level of the recursion tree are independent; they read and write distinct elements in the collection.  For example, the MergeSort of {B, C} is independent of the MergeSort of {D, A}.  There may also be exploitable parallelism within the InPlaceMerge function – we’ll get back to that later.  For now, let’s start with the calls to MergeSort.

Before hand-coding something using the TBB task scheduler, it’s always a good idea to see if one of the existing generic algorithms can be leveraged.  The generic algorithms have been tested for correctness and use scheduling optimizations, like continuation passing and scheduler bypass, to maximize performance.  Using an existing algorithm can often reduce debugging time.

The algorithm that looks like the closest match to mergesort is tbb::parallel_reduce.  Like tbb::parallel_for, tbb::parallel_reduce takes a Range argument and a Body argument.  But in addition to an operator() that processes a sub-range, the tbb::parallel_reduce Body also requires a join method that is used to combine (perhaps merge?) multiple body objects on the way back up the tree.  So, let’s see if we can cast mergesort as a tbb::parallel_reduce…

The Range could be a blocked_range<int>(0, size_of_collection, 1).  The recursive splitting of this Range by tbb::parallel_reduce would create smaller sub-ranges on the way down, stopping at trivial ranges of 1 element.  No work is needed beyond the splitting of the Range on the way down the recursion tree, so the Body for our tbb::parallel_reduce would have an empty operator().

On the way back up the tree, the sub-arrays must be merged. So perhaps we can use the Body’s join method to do the merging. Will this work?

Unfortunately for us, no, it won’t work.  A quick review of the tbb::parallel_reduce description in the Reference manual tells us why:

“When worker threads are available (8.2.1), parallel_reduce invokes the splitting constructor for the body.  For each such split of the body, it invokes the method join in order to merge the results from the bodies… A body is split only if the range is split, but the converse is not necessarily so.”

What this means is that the join method is only called as necessary to merge results from different threads, and cannot be depended upon to merge the results of each sub-range.  For normal applications of a reduction, such as finding a sum or minimum, this is an important optimization.  But it makes using tbb::parallel_reduce a non-starter for an easy implementation of mergesort.

In my next post, I’ll discuss an implementation of mergesort that uses the task scheduler API directly – as well as a tbb::parallel_for in the merge function.