# Observations from Parallel Sorting Part I: the subtleties of tbb::parallel_reduce

By Michael V. (Intel)

In this series of postings, I discuss two common sorting algorithms, mergesort and quicksort, and highlight some of the interesting issues that arise when creating parallel implementations using TBB. In all cases we’ll assume that, like STL’s std::sort, the input to the sort is a pair of RandomAccessIterators, begin and end, that denote the beginning and end of the collection to be sorted, and a comparator object compare that provides an operator() for ordering two given elements.

In this post, I’ll give a quick overview of sequential mergesort and then I’ll begin with a false start in parallelizing it using tbb::parallel_reduce.

**Sequential Mergesort**

Mergesort, like quicksort, is a recursive algorithm. A sequential implementation might look something like:
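As a rough sketch (using std::inplace_merge for the InPlaceMerge step; the exact original listing may differ in detail):

```cpp
#include <algorithm>  // std::inplace_merge
#include <cassert>    // assert, for quick checks

// Recursive mergesort over [begin, end), following the conventions above:
// a pair of RandomAccessIterators and a comparator object.
template <typename RandomAccessIterator, typename Compare>
void MergeSort(RandomAccessIterator begin, RandomAccessIterator end,
               Compare compare) {
    if (end - begin <= 1) return;                  // one element: already sorted
    RandomAccessIterator mid = begin + (end - begin) / 2;
    MergeSort(begin, mid, compare);                // sort the left half
    MergeSort(mid, end, compare);                  // sort the right half
    std::inplace_merge(begin, mid, end, compare);  // merge the sorted halves
}
```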

The stopping condition for the recursion is a single element, which is a trivially sorted collection. Otherwise, the collection is divided into roughly equally sized halves and each half is recursively sorted using MergeSort. At each step the sorted halves are merged into a single sorted collection. Mergesort has a computational complexity of O(n lg n), since there are O(lg n) levels of recursion and O(n) comparisons are needed for the merges at each level of the recursion tree.

A quick example of mergesort for an array {B, C, D, A} is shown below:

The downward phase:

{B, C, D, A} → {B, C}, {D, A} → {B}, {C}, {D}, {A}

The upward merging phase:

{B}, {C}, {D}, {A} → {B, C}, {A, D} → {A, B, C, D}

**So how can we parallelize this algorithm using TBB? **

It’s clear from the sequential algorithm above that the MergeSort calls at the same level of the recursion tree are independent; they read and write distinct elements in the collection. For example, the MergeSort of {B, C} is independent of the MergeSort of {D, A}. There may also be exploitable parallelism within the InPlaceMerge function – we’ll get back to that later. For now, let’s start with the calls to MergeSort.

Before hand-coding something using the TBB task scheduler, it’s always a good idea to see if one of the existing generic algorithms can be leveraged. The generic algorithms have been tested for correctness and use scheduling optimizations, like continuation passing and scheduler bypass, to maximize performance. Using an existing algorithm can often reduce debugging time.

The algorithm that looks like the closest match to mergesort is tbb::parallel_reduce. Like tbb::parallel_for, tbb::parallel_reduce takes a Range argument and a Body argument. But in addition to an operator() that processes a sub-range, the tbb::parallel_reduce Body also requires a join method that is used to combine (perhaps merge?) multiple body objects on the way back up the tree. So, let’s see if we can cast mergesort as a tbb::parallel_reduce…

The Range could be a blocked_range&lt;int&gt;(0, size_of_collection, 1). The recursive splitting of this Range by tbb::parallel_reduce would create smaller sub-ranges on the way down, stopping at trivial ranges of 1 element. There is no work needed beyond the splitting of the Range on the way down the recursion tree, so the Body for our tbb::parallel_reduce would have an empty operator().

On the way back up the tree, the sub-arrays must be merged. So perhaps we can use the Body’s join method to do the merging. Will this work?
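To make the idea concrete, here is a self-contained sketch of what that Body might look like. The tbb::blocked_range and tbb::split types are replaced by minimal stand-ins so the sketch compiles on its own, and toy_reduce is a hypothetical driver that, unlike the real tbb::parallel_reduce, splits the body at every range split and joins every pair of halves:

```cpp
#include <algorithm>  // std::inplace_merge
#include <cassert>    // assert, for quick checks
#include <cstddef>    // std::size_t

// Minimal stand-ins for tbb::blocked_range<size_t> and tbb::split, so
// this sketch is self-contained; real code would use the TBB types.
struct Range { std::size_t lo, hi; };
struct Split {};

// The hypothetical Body: nothing to do on the way down, merging in join().
struct MergeBody {
    char*       base;            // start of the collection being sorted
    std::size_t lo = 0, hi = 0;  // the sub-range this body has covered
    explicit MergeBody(char* b) : base(b) {}
    MergeBody(MergeBody& other, Split) : base(other.base) {}
    void operator()(const Range& r) { lo = r.lo; hi = r.hi; }  // empty work
    void join(MergeBody& rhs) {
        // Merge this body's sorted run with the adjacent run to its right.
        std::inplace_merge(base + lo, base + hi, base + rhs.hi);
        hi = rhs.hi;
    }
};

// A toy driver that, unlike tbb::parallel_reduce, splits the body at
// EVERY range split and joins every resulting pair.
void toy_reduce(const Range& r, MergeBody& body) {
    if (r.hi - r.lo <= 1) { body(r); return; }
    std::size_t mid = r.lo + (r.hi - r.lo) / 2;
    MergeBody right(body, Split{});
    toy_reduce(Range{r.lo, mid}, body);   // left half
    toy_reduce(Range{mid, r.hi}, right);  // right half
    body.join(right);                     // merge the two sorted halves
}
```

With that always-split, always-join driver, the merging in join() does produce a sorted collection; the question is whether the real tbb::parallel_reduce behaves this way.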

Unfortunately for us, no, it won’t work. A quick review of the tbb::parallel_reduce description in the Reference manual tells us why: *“When worker threads are available (8.2.1), parallel_reduce invokes the splitting constructor for the body. For each such split of the body, it invokes the method join in order to merge the results from the bodies… A body is split only if the range is split, but the converse is not necessarily so.”*

What this means is that the join method is only called as necessary to merge results from different threads, and cannot be depended upon to merge the results of each sub-range. For normal applications of a reduction, such as finding a sum or minimum, this is an important optimization. But it makes using tbb::parallel_reduce a non-starter for an easy implementation of mergesort.

In my next post, I’ll discuss an implementation of mergesort that uses the task scheduler API directly – as well as a tbb::parallel_for in the merge function.

## 5 comments

christian.roessel said on Jan 19, 2010

Dear Mr. Voss,

I just read your TBB paper "Ready for non-uniform memory access platforms" http://isdlibrary.intel-dispatch.com/vc/2724/treadingbuildingblocks_110609.pdf where you refer to a Black-Scholes benchmark. Is it possible to get the source code and the datasets that were used for your measurements?

Thanks and best regards,

Christian Rössel

oppal.einstein said on Oct 3, 2009

Can you send me the C++ code for this?

oppal.einstein said on Oct 3, 2009

Sir,

Is it possible to implement merge sort using parallel_reduce?

please reply as soon as possible...

oppal.einstein said on Oct 3, 2009

sir,

Can you tell me how to parallelize "merge sort" using TBB?

please reply as soon as possible.

Anonymous said on Oct 2, 2009

Sir,

As you mentioned, merging in the "join" function will not work, so how will we implement it? I mean, how will we implement merge sort with TBB?
