# Parallel sorts for Cilk Plus

By Arch D. Robison (Intel), published on April 3, 2013

This article describes the parallel sorts in the latest release of “Cilkpub”, an open-source library of utilities for Intel® Cilk™ Plus.

`cilk_sort`

`cilk_sort_in_place`

They are designed to be replacements for `std::sort`

that may provide speedup when sorting many items (on the order of at least 10000). For example:

extern float a[]; cilkpub::cilk_sort( a, a+n );

sorts elements a[0], a[1], ... a[n-1] in ascending order and

cilkpub::cilk_sort( a, a+n, std::greater<float>() )

sorts the elements in descending order.

The comparison and swap/move operations on the items must be safe to call concurrently on separate items from separate threads. These operations will never be called simultaneously from separate threads on the same item. That is usually the case for those operations unless there is hidden shared state.

## Changes from Textbook Versions

The routines were adapted from examples in the book Structured Parallel Programming by Michael McCool, James Reinders, and myself. Since the textbook addressed C programmers too, we generally avoided templates and STL-style iterators in the book examples. But since Cilkpub is for real-world use, Jim Sukha and I revised the code to match STL expectations. In particular, the routines are templatized identically to `std::sort`

. The notable improvements from the textbook versions are:

- sequences are specified by random-access iterators, not just pointers.
- the comparison operation can be a functor, or default to using “<”, as the ordering predicate.
- C++11 rvalue references and "move" operations are exploited if available.

## Guidance on Which Sort to Use

No serial sort is perfect for all occasions. Likwewise the two parallel sorts each have strengths over the other. The routine `cilk_sort`

generally scales better than `cilk_sort_in_place`

as the number of processor cores increases, but `cilk_sort`

requires internal buffer space. For sorting N items of type T, the buffer space is (1+sizeof(T))N bytes. If the space is not available, `cilk_sort`

falls back on an in-place parallel sort. For small number of cores (typically four or fewer), `cilk_sort_in_place`

is often as fast or faster than cilk_sort, because it does not have to copy data and thus has a smaller cache footprint.

The current implementation of `cilk_sort`

is not completely exception safe. When sorting items of type T, it copies the items into the temporary buffer using move constructors. If an exception is thrown during the sort, the space for the copies is deallocated, but the destructors are *not* called for the copies. Hence if T has a non-trivial destructor, and operations on it during sorting might throw an exception, use `cilk_sort_in_place`

instead, or use`cilk_sort`

to sort an array of pointers to the actual items. Or equivalently, sort an array of `std::reference_wrapper`

.

With some additional effort, `cilk_sort`

could be made fully exception safe. Contributions welcome.

## Work-Span Analysis

Work-span analysis lets us estimate the potential speedup on an ideal machine where memory bandwidth is not a limitation.

- The
*work*is the time to execute an algorithm on one processor. - The
*span*is the time on an infinite number of processors.

The parallelism of an algorithm is *work/span*. The book goes into the detailed analysis of the sorts. I'll summarize the key points here. Both routines have worst-case work and span of Θ(N^{2}), and hence O(1) parallelism in the worst case. They differ in the average case. `cilk_sort_in_place`

is a parallel quicksort with average-case work complexity Θ(N lg N). Its average-case span is Θ(N), a consequence of using serial code to do the partitioning step. Hence `cilk_sort_in_place`

has only Θ(lg(N)) parallelism in the average case.

Routine `cilk_sort`

is a sample sort, which avoids the serial partitioning bottleneck. Sample sort's average case work is likewise Θ(N lg N). Its average-case span is a bit messy to analyze, but for practical purposes it's Θ(lg N), which arises from the step that does a binary search to determine which bin to put an item into. Hence for the average case there is plenty of parallelism. As a practical matter, `cilk_sort`

for large N will run into memory bandwidth limitations long before it runs out of parallelism.

If the worst-case Θ(N^{2}) performance is unacceptable, consider using the parallel merge sort described in Chapter 13 of the book, which has worst-case work Θ(N lg N) and worst-case span Θ(lg^{3} N).

## About Cilk Plus

For more information about Intel Cilk Plus, see the website http://cilkplus.org. For questions and discussions about Intel Cilk Plus, see the forum http://software.intel.com/en-us/forums/intel-cilk-plus.