TBB's parallel_sort is 5x slower on Xeon Phi 5110P, than on E5-2670 ??

TBB's parallel_sort is 5x slower on Xeon Phi 5110P, than on E5-2670 ??

Dear all,

I'm benchmarking accelerated libraries for key-value sorting. Intel mentions parallel_sort in hundreds of tutorials, but they never go further than just mentioning - almost no compilable code examples for MIC, and no perf numbers, except 1 slide here. As result, I had to write my own sample (attached). Surprisingly, for me Xeon Phi 5110P processes 256M elements 5 times slower than Xeon CPU. I doubt it's normal.

May I kindly ask the members of community to try reproducing my timing results and suggest optimization options, if I missed something important. Below are my numbers.

Many thanks!

- D.

$ make sbk_aos_tbb.cpu
icpc -I/apps/dommic/intel/tbb/include -ipo -O3 -no-prec-div -xHost -DCPU sbk_aos_tbb.cpp -o sbk_aos_tbb.cpu -lrt -L/apps/dommic/intel/tbb/lib/intel64/ -ltbb -Wl,-rpath=/apps/dommic/intel/tbb/lib/intel64/ -Wl,-rpath=/apps/dommic/intel/composer_xe_2013_sp1.2.144/compiler/lib/intel64
$ ./sbk_aos_tbb.cpu $((1024*1024*256))
Init time = 0.000000 sec
Load time = 0.000000 sec
Sort time = 4.392968 sec
Save time = 0.000000 sec
$ make sbk_aos_tbb.mic.native
icpc -I/apps/dommic/intel/tbb/include -mmic -openmp -O3 -DCPU sbk_aos_tbb.cpp -o sbk_aos_tbb.mic.native -lrt -L/apps/dommic/intel/tbb/lib/intel64//../mic -tbb -Wl,-rpath=/apps/dommic/intel/tbb/lib/intel64//../mic -Wl,-rpath=/apps/dommic/intel/composer_xe_2013_sp1.2.144/compiler/lib/intel64/../mic
$ ssh mic0
mikushin@dom36-mic0:~$ cd /users/mikushin/forge/kernelgen/doc/hpcac_swiss_2014/gpu_intro/sdk/sort_by_keymikushin@dom36-mic0:~/forge/kernelgen/doc/hpcac_swiss_2014/gpu_intro/sdk/sort_by_key$ LD_LIBRARY_PATH=. ./sbk_aos_tbb.mic.native $((1024*1024*256))Init time = 0.000001 sec
Load time = 0.000000 sec
Sort time = 20.393966 sec
Save time = 0.000000 sec


Downloadtext/x-c++src sbk_aos_tbb.cpp2.43 KB
Downloadtext/plain makefile_1.txt1.24 KB
15 posts / 0 new
Last post
For more complete information about compiler optimizations, see our Optimization Notice.

If in fact there is no vectorization, It's difficult for MIC to match host performance (note VTune screenshot).

Visualizing your case with micsmc-gui, it's apparent that the threaded portion of execution takes about 4 seconds (on KNC B0).

It looks like the server crashes when I attempt to upload the screenshot.


Hi Tim!

That's very interesting! I wonder what the non-threaded part of parallel_sort is doing? Is there something else taking 20 - 4 = 16 sec?

Could you please upload image somewhere, e.g. to Google here: https://plus.google.com/100863165845515570963/posts/gMUqG4CQ1AA

Thanks a lot,

- D.

vtune shapshot indicates allocation of parallel events in tbb library


Downloadimage/png Screenshot-2_1.png244.6 KB

Thanks for your testing, Tim, very insightful!

Looks like the task scheduler is experiencing something very unfortunate. And this is inside libtbb.so compiled by Intel. I wonder if Intel guys could comment on this...

- D.

Picking out individual threads, hoping to find the critical path, the scheduler may not be occupying as much as 50% of critical path time.  Then the time spent in the compilation derived from tbb headers appears to be significant, but gives no evidence of effective vectorization.

It's not surprising to find large event counts associated with idle spinning threads (if that is in fact the case; I'm not familiar with tbb). 

Best Reply

The fundamental problem is that parallel quicksort is not scalable.  For sorting N items, the total work for quicksort is Θ(N lg N) in the average case.  The span (critical path) is Θ(N) in the average case, because the partition operations are serial.  Parallelism is work/span, and hence Θ(lg N) for parallel quicksort.  Speedup suffers if the parallelism is less than the hardware parallelism.  For example, if sorting 256,000,000 items, the parallelism is only lg(256,000,000), about Θ(28).  I'm guessing the Θ hides a small constant factor.

So why did TBB start with parallel quicksort?  Because it was the fastest sort available for the small core counts when TBB came out.  Because parallel quicksort is an in-place sort, the smaller memory footprint trumped other concerns when core counts were small.  It also had requirements identical to std::sort, which made it an easy drop-in replacement.  [Full disclosure: I was the author of the original tbb::parallel_sort.]

Now core counts are much larger, and tbb::parallel_sort should probably be augmented with more scalable sorts that are not in-place and not drop-in replacements for std::sort.  For example, parallel merge sort and parallel sample sort have much better scalability, but require temporary arrays, so they are not drop-in replacements for std::sort because different signatures are required. 

The download page for the book Structured Parallel Programming [I'm one of its authors] has a bunch of sorts written in TBB and Cilk Plus, and even one in OpenMP.  The download link is http://parallelbook.com/sites/parallelbook.com/files/code20131121.zip .  To run the sorts on MIC:

  1. Open up code/config/config.inc in an editor.
  2. Uncomment the second line: "CPLUS = icpc".
  3. On line 17, for the definition "CPLUS_FLAGS = ...", change -xHost to -mmic 
  4. Save code/config/config.inc
  5. cd to code/src/sort/build
  6. Run "make".  It will build ./test_sort.x and then attempt to run it.  Of course it won't run on the host.
  7. Run ./test_sort.x on an Intel(R) Xeon Phi(TM) coprocessor.  The program takes two optional arguments: the number of sorts to run and the length of each sort.  

Below is the result from a sample run that I did this morning on a :Intel(R) Xeon Phi(TM) coprocessor.  The numbers in the table are times in seconds.

[adrobiso@fxe16lin06-mic0 adrobiso]$ ./test_sort.x 2 40000000
Testing for 2 sorts of length 40000000 for uniform distribution of int
                      STL sort  40.71
      Cilk quicksort recursive   4.30
 Cilk quicksort semi-recursive   4.32
                Cilk mergesort   0.59
               Cilk samplesort   0.88
       TBB quicksort taskgroup   4.57
    TBB quicksort continuation   4.58
                 TBB mergesort   0.57
                TBB samplesort   1.00
              OpenMP mergesort   0.67

I'm a bit surprised that the mergesort beat the samplesort.  I'm guessing that it's some quirk of the memory system.  Sample sort generates many streams of data, which could cause TLB issues.

I am not claiming that using a different sort will make the coprocessor beat a big-core machine, merely that a more scalable sort may help the coprocessor.  As Tim mentioned, the lack of vectorization opportunities puts the coprocessor at a relative disadvantage.

- Arch


Tim Prince wrote:

It's not surprising to find large event counts associated with idle spinning threads (if that is in fact the case; I'm not familiar with tbb). 

The top "hotspot" place in the shown picture is indeed due to idle spinning of unoccupied TBB worker threads. The function task_stream::pop() is supposed to get an enqueued task if such exists; however, parallel_sort does not use enqueued tasks.

This is a good example to learn that what works best in one circumstance does not necessarily work best in a different circumstance.

Although I haven't programmed this, conceptually a modified merge sort can be done in place. Typically a merge sort consists of merging saw tooth data sets: N ordered teeth -> N/2 ordered teeth. The teeth can be same size or differ in size (e.g. polyphase merge sort where the sizes vary in a Fibonacci series).The teeth can be individual files (memory buffers) or distinct sections of a single file (or memory buffer). It is simpler to write the algorithm using multiple files (buffers).

With a little extra overhead, one can merge two saw teeth in place. If (when) the storage location in the lower address tooth is occupied by an unmerged item from this pass, there will be available an empty slot (prior merged) in the second (higher addressed) tooth for use as temporary storage. This will induce additional overhead but at the benefit of being an in-place sort (assuming a memory limitation eliminates using multi-buffer).

Jim Dempsey

Dear all,

Thank you for your helpful comments! After switching to mergesort I'm able to sort 256M <float, float> pairs in 3.21 sec on Xeon E5-2670 and in 4.64 sec on Xeon Phi 5110P. The remaining difference is I guess due to the mentioned lack of vectorization. Attached is the new code, just for reference. Btw, something (C++11?) broke the offload mode - only native MIC now works fine...

I hope you will agree that a routine called "parallel_sort" implicitly presumes "the best sorting algorithm on the target architecture". Making it instead to mean "parallel version of STL sort" is quite misleading and should be at least mentioned in documentation.


- D.


Downloadtext/plain makefile_2.txt1.27 KB
Downloadtext/x-c++src sbk_aos_tbb_0.cpp4.41 KB

I've posted an article explaining the scaling issue along with an improved version of the merge sort.  The improvement avoids a serial bottleneck when the key type has a non-trivial constructor or destructor.


I made this request to Kathy Farrel, and I will mention this to you as well, in hopes that it will gain some traction with your website maintainer.

On the top of the IDZ forum (http://software.intel.com/en-us/forums/topic/508194) you have

Development > Tools > Resources

The Resources does NOT include a link to Articles. Can you please have them add a link to the Articles.

Jim Dempsey

From the article: But for types such as std::string, these operations take time, and so constructing or destroying the buffer serially raises the span to Θ(n), clobbering parallelism back to the same order as parallel quicksort.

When sorting objects with non-trivial constructors why sort the objects themselves? Why not produce a sorted list of references of (or pointers to) the objects?

Jim Dempsey


Thanks for the note concerning "Development > Tools > Resources". I'll make sure it gets to the right people.


Leave a Comment

Please sign in to add a comment. Not a member? Join today