Parallel sorts for Cilk Plus

This article describes the parallel sorts in the latest release of “Cilkpub”, an open-source library of utilities for Intel® Cilk™ Plus. 

  • cilk_sort
  • cilk_sort_in_place

They are designed to be replacements for std::sort that may provide speedup when sorting many items (on the order of at least 10000). For example:

extern float a[];
cilkpub::cilk_sort( a, a+n );

sorts elements a[0], a[1], ... a[n-1] in ascending order and

cilkpub::cilk_sort( a, a+n, std::greater<float>() )

sorts the elements in descending order.

The comparison and swap/move operations on the items must be safe to call concurrently on separate items from separate threads. These operations will never be called simultaneously from separate threads on the same item. That is usually the case for those operations unless there is hidden shared state.

Changes from Textbook Versions

The routines were adapted from examples in the book Structured Parallel Programming by Michael McCool, James Reinders, and myself.  Since the textbook addressed C programmers too, we generally avoided templates and STL-style iterators in the book examples.  But since Cilkpub is for real-world use, Jim Sukha and I revised the code to match STL expectations.  In particular, the routines are templatized identically to std::sort. The notable improvements from the textbook versions are: 

  • sequences are specified by random-access iterators, not just pointers.
  • the comparison operation can be a functor, or default to using “<”, as the ordering predicate.
  • C++11 rvalue references and "move" operations are exploited if available.

Guidance on Which Sort to Use

No serial sort is perfect for all occasions.  Likwewise the two parallel sorts each have strengths over the other.  The routine cilk_sort generally scales better than cilk_sort_in_place as the number of processor cores increases, but cilk_sort requires internal buffer space. For sorting N items of type T, the buffer space is (1+sizeof(T))N bytes. If the space is not available, cilk_sort falls back on an in-place parallel sort. For small number of cores (typically four or fewer), cilk_sort_in_place is often as fast or faster than cilk_sort, because it does not have to copy data and thus has a smaller cache footprint.

The current implementation of cilk_sort is not completely exception safe. When sorting items of type T, it copies the items into the temporary buffer using move constructors. If an exception is thrown during the sort, the space for the copies is deallocated, but the destructors are not called for the copies. Hence if T has a non-trivial destructor, and operations on it during sorting might throw an exception, use cilk_sort_in_place instead, or usecilk_sort to sort an array of pointers to the actual items. Or equivalently, sort an array of std::reference_wrapper.

With some additional effort, cilk_sort could be made fully exception safe. Contributions welcome.

Work-Span Analysis

Work-span analysis lets us estimate the potential speedup on an ideal machine where memory bandwidth is not a limitation. 

  • The work is the time to execute an algorithm on one processor. 
  •  The span is the time on an infinite number of processors.  

The parallelism of an algorithm is work/span.  The book goes into the detailed analysis of the sorts.  I'll summarize the key points here.  Both routines  have worst-case work and span of Θ(N2), and hence O(1) parallelism in the worst case.  They differ in the average case.  cilk_sort_in_place is a parallel quicksort with average-case work complexity Θ(N lg N).  Its average-case span is Θ(N), a consequence of using serial code to do the partitioning step.   Hence cilk_sort_in_place has only Θ(lg(N)) parallelism in the average case.

Routine cilk_sort is a sample sort, which avoids the serial partitioning bottleneck.  Sample sort's average case work is likewise Θ(N lg N).  Its average-case span is a bit messy to analyze, but for practical purposes it's Θ(lg N), which arises from the step that does a binary search to determine which bin to put an item into.  Hence for the average case there is plenty of parallelism.  As a practical matter, cilk_sort for large N will run into memory bandwidth limitations long before it runs out of parallelism.

If the worst-case Θ(N2) performance is unacceptable, consider using the parallel merge sort described in Chapter 13 of the book, which has worst-case work Θ(N lg N) and worst-case span Θ(lg3 N).

About Cilk Plus

For more information about Intel Cilk Plus, see the website http://cilkplus.org.  For questions and discussions about Intel Cilk Plus, see the forum http://software.intel.com/en-us/forums/intel-cilk-plus.

For more complete information about compiler optimizations, see our Optimization Notice.