4,391 Posts served
10,712 Conversations started
- Academic

- Android

- Art, Music, & Animation

- Embedded Computing

- Events

- Game Development

- Graphics & Media

- Intel SW Partner Program

- Intel® AppUp Developer Program

- Manageability & Security

- Mobility

- Open Source

- Parallel Programming

- Performance and Optimization

- Power Efficiency

- Site News & Announcements

- Software Tools

- Association for Computing Machinery TechNews (ACM)
- Go Parallel! (Dr. Dobbs)
- HPCwire (Tabor Communications, Inc.)
- insideHPC (John West)
- Joe Duffy's Weblog (Microsoft)
- Microsoft Parallel Programming Development Center (Microsoft Germany)
- MultiCoreInfo.com
- scalability.org (Scalable Informatics)
- Software Dev Blog (Intel Germany)
- Soft Talk Blog (Intel United Kingdom)
- The Moth (Microsoft)
All Sorts of Sorts
By Asaf Shelly (31 posts) on April 27, 2009 at 8:28 am
Hi all,
As some of you may already know I am getting married May 11'th. Yes, yes, a very happy occasion. Doesn't leave time for anything... Well, there is enough for a blog post but not enough for writing code and doing some QA.
I actually started with the Radix-Sort Challenge (Threading-Challenge-2009) but couldn't find the time to have it completed, not even talking about tested… So I thought that I might as well share my thoughts about this problem. Philosophically that is, nothing tested, no responsibility, so I can allow myself to wonder off and talk about things that may not even work ;-)
Going over Wikipedia (as briefly as possible) I learned that Radix sort is a type of sort that deals with constant size data types. Sorting algorithms deal massively with data comparisons. Since the input for this challenge was defined as 7 ASCII characters for each data item, it is very simple to convert it to a 64 bit integer. As a collection of 64 bit integers the CPU can compare two items in a single instruction and all new compilers have support for that. So basically the task is to sort a list of integers.
I was asking myself how many cores I should use and whether or not I should use the build in CPU acceleration such as MMX. The simple answer I found was that if the collection of numbers can fit in the fast cache then I should start doing some math… However the problem here deals with somewhere up to 2^31 items, or up to 2G of items. This is way over the size of any cache so the work is done by the CPU directly with memory.
The memory is always slower than the CPU and the operation to perform takes a single CPU instruction. This means that the bottle neck should be the memory and therefore the strategy should start by reducing memory transactions to the minimum.
The simplest and fastest solution I could think of for sorting numbers was to merge sorted lists. For example when we want to sort a list of 8 items we first sort the data as two lists of 4 items and then join the two lists. This manner requires only a single passage over every item in the list for every merge. For example a list of 1,6,3,7,9,2,5,4 is split and sorted into 1,3,6,7 and 2,4,5,9. The sort algorithm only goes forward in the list and has only a single comparison to make. Working in merges it takes 31 passes over every data item to complete 2G of items.
The problem is that reading an item and copying it to an output list handles 2K of memory for every 1K of data to sort. The goal is to reduce memory transaction to the minimum. The solution is to swap the items in the merged lists, so for two merging lists of A1,A2,A3,A4 and B1,B2,B3,B4 the output should be a multi-dimensional array of: A1,B1,A2,B2,A3,B3,A4,B4. The next merge will result in A1,B1,C1,D1,A2,B2,C2… and so on. This will keep memory transactions to the minimum amount and also ensure that the memory is still in CPU cache for as long as possible.
Eventually the merges should be on lists that are too long for the cache to hold them, for example a merge between two lists of 100MB. The only thing we can do to utilize the cache is to keep merging on the same list for as long as possible, instead of methodically merging all the 1KB lists in the system from start to end. Another optimization that might help just a bit is merging forwards and backwards alternatively instead of going forwards all the time. When a forward merge is complete the end of the list is in cache so it is more efficient to start merging from that point downward to the beginning of the list. At the end of a reverse order merge the cache holds the beginning of the list and it is most efficient to merge forwards.
This is what I was going to try and test. Single threaded. That is of course if I had the time… but as you already know, I barely have the time to write about it in a blog post.
Good luck to everyone who did.
If you are reading this and you feel that my guesses are all bad – feel free to say it out loud. You can also reflect about it out loud, or just congratulate me for my eminent wedding… anything goes :-)
Best,
Asaf
Categories: Parallel Programming, Software Tools
Tags: Guest Blog, multi-core
For more complete information about compiler optimizations, see our Optimization Notice.
Comments (27)
| April 27, 2009 11:56 AM PDT
Dmitriy Vyukov
| At least that's what he wanted to do with my solution when I tried to use introsort for small arrays... |
| April 27, 2009 3:01 PM PDT
Clay Breshears (Intel)
| Good luck with the wedding, Asaf. |
| April 27, 2009 3:11 PM PDT
Clay Breshears (Intel)
| Radix sort deals with the keys as strings of bits. The values of the bits determines how to move keyed data. Radix Exchange Sort uses single bits at a time starting from the most significant bit and proceeding down; Straight Radix Sort uses clumps of bits from the least significant bit and proceeding up. In the latter case, the movement of data must be stable, i.e., if two keys have the same bit pattern in the portion of the key being examined, they must remain in the same relative order after movement based on that portion as they were before the data movement. For a 7-character key, Radix Exhange would take 56 data movements while Straight Radix would only need 14 movements if clumps of bits were taken 4 at a time. The latter, of course, has more bookkeeping to get things right. |
| April 27, 2009 4:45 PM PDT
Asaf Shelly
|
Hi Dmitriy and Clay, First of all Thanks, wedding is a big project (and you can't fix and recompile there :) A few things to comment here: 1. I had a feeling that it won't be clean Radix, and yet the point of memory usage is very valid and important to demonstrate. 2. Can I argue that even with full CPU optimization we still want effective memory usage and that would be the one described above? 3. I was never good with unnatural rules and bounds. I can't seem to see them :-) Thanks guys, Asaf |
| April 27, 2009 5:11 PM PDT
Asaf Shelly
|
Now that I think about it, if I get Radix right then it is about taking a pile and breaking it into two piles, repeatedly. This can be done by using the CPU instruction of bit-test - immediate with memory. Loop flow is: Mem: read to CPU CPU: test bit Mem: write from CPU If this is the case then again the CPU is always keeping the memory bus busy even with a single core, assuming a buffer larger than cache, for example 100MB. Asaf |
| April 28, 2009 2:36 AM PDT
Dmitriy Vyukov
|
Radix sort is indeed memory intensive. Another serious problem is L1 DTLB misses provided that one choose 7-bit radix (I guess 7-bit radix was a case in many cases including my). Modern processors feature 32/64 entries, with HT this reduces to 16/32, and we have 96 printable characters. So basically we get L1 DTLB miss on every value (if buckets are large or allocated not densely). I've tried to use large pages, but found out that Windows and Linux both TOOOOOOOOTALLY sucks wrt large pages, making them basically useless. I've found some ways to still allocate several large pages, maybe they would help, don't know, time is too limited. |
| April 28, 2009 2:46 AM PDT
Dmitriy Vyukov
|
I guess one might use basically any kind of sort provided that he is able to "prove" that that sort IS-A radix sort. I used counting sort with a justification that counting sort is not more that a private case of a radix sort, i.e. IS-A radix sort. Counting sort just uses special intermediate representation of buckets. Counting sort still uses the same algorithm skeleton:
// divide
for (int i = 0; i != size; i +=1)
{
int radix = data[i][0];
bucket[i].put(data[i]);
}
// conquer is a no-op
// join
int j = 0;
for (int i = 0; i != bucket_count; i += 1)
{
while (int value = bucket[i].get())
{
output[j++] = value;
}
}
I am writing in so detail, because I have to incline "The Judge" to my point :) |
| April 28, 2009 8:34 AM PDT
Asaf Shelly
|
How did you parallelize this? Can you tell the performance delta for many cores vs. single core? Intel should probably report to the OS when two or more cores are competing over the same memory range so the OS could relocate the threads to the same core, when memory bus is flooded of course. |
| April 28, 2009 10:09 AM PDT
jimdempseyatthecove
|
[Clay] "the movement of data must be stable, i.e., if two keys have the same bit pattern in the portion of the key being examined, they must remain in the same relative order after movement based on that portion as they were before the data movement." In order to maintain the relative order of duplicate keyed (or portion of key) records using N threads each of the N threads would have to work on to order a subset of the total number of records. Thread 0 taking the 1st group of (number of records/N), thread 1 the next, .... After the 1st pass using N threads in parallel then N/2 threads could merge in parallel, then N/4 merge, ... until a single thread performs the last merge. Note, the merge phase is not a record by record merge, rather it is a list by list merge (no comparison, simply list concatination). By using an 11-bit radix on a 32-bit key you experience a good trade-off between bucket size and performance. This requires 3 passes on the data (single core). |
| April 28, 2009 11:05 AM PDT
Dmitriy Vyukov
|
Re: How did you parallelize this? I didn't :) Though it must be not harder than parallelization of the normal radix sort, i.e. input data is split into partitions, each thread processes it's own partition to it's own bucket array. When all partitions are done, data is copied back (this also may be done in parallel since final destinations of each element can be calculated). |
| April 28, 2009 11:08 AM PDT
Dmitriy Vyukov
|
Re: Intel should probably report to the OS when two or more cores are competing over the same memory range so the OS could relocate the threads to the same core, when memory bus is flooded of course. There is such research in the field of NUMA OS schedulers (AFAIR for Linux). It's based on hardware performance counters to determine when thread causes too many cache misses/remote accesses, and then maybe some probing (I am not sure) to determine where the data actually located. Based on this information OS scheduler may decide to migrate the thread. I can't find the reference now. |
| April 28, 2009 11:11 AM PDT
Dmitriy Vyukov
|
Re: By using an 11-bit radix on a 32-bit key you experience a good trade-off between bucket size and performance. This requires 3 passes on the data (single core). This approach may be killed by zillions of L1 DTLB misses. With 11 bit radix you need 2048 entries in DTLB for buckets, while current processors have only 64 (32 with HT). Large pages may help, but they do not, especially if input data in not uniformly distributed. |
| April 28, 2009 3:50 PM PDT
Asaf Shelly
|
Re: "...using N threads in parallel then N/2 threads..." My question is Jim: Is there any point in having more than a thread per core? (or per CPU HT) Moreover: How many threads running in parallel are needed to flood the memory bus? I can only guess that this number should be less than 8 threads... |
| April 28, 2009 3:56 PM PDT
Asaf Shelly
|
Re: "I didn't :) Though it must be not harder than parallelization of the normal radix sort" I am wondering whether or not the synchronization overhead is worth the performance increase in using two or more cores at the same time, especially since memory access may become a bottle-neck if ignored. |
| April 29, 2009 1:34 AM PDT
Dmitriy Vyukov
|
More than a thread per code may somehow hide latency of page faults. If program does not suffer from page faults then there is no point. Number of threads required to flood memory bus is highly platform-dependent. Since contestants are not able to "experiment" on target platform, we may only guess. Xeon 5500 series is completely new platform with QPI, NUMA, 3 memory channels, etc... who knows... I more and more like AMD ThreadFest approach, where contestants was able to run *any* code on target platform and get output back. |
| April 29, 2009 1:40 AM PDT
Dmitriy Vyukov
|
As a one measure to reduce memory pressure I was trying to keep HT sibling threads as close to each other as possible (in term of working set). I.e. both HT sibling threads share the same task queue and both use LIFO order of pushes/pops. ... well... afair I turn off this feature in the final submission because on my Q6600 and P9500 it slow downs the program a bit. How it will affect 2xX5500? It seems that we will not get to know this... |
| April 29, 2009 1:44 AM PDT
Dmitriy Vyukov
|
Re: I am wondering whether or not the synchronization overhead is worth the performance increase... What synchronization overhead? ;) |
| April 29, 2009 3:51 AM PDT
Asaf Shelly
|
One can only assume that with an algorithm split over several threads when one thread invokes a page-fault then the others will too. Interesting. No syncronization overhead? Life is good for you :-) |
| April 29, 2009 5:08 AM PDT
Dmitriy Vyukov
|
Re: One can only assume that with an algorithm split over several threads when one thread invokes a page-fault then the others will too. Yes, that's a kind of software scouting (as opposed to hardware scouting that is used on highly CMT (chip multi-threaded) chips. And that's good because that creates some excess of disk read requests, thus gives more opportunities to disk scheduler, thus increases efficiency. I've observed that 2 threads produces 2 times more page faults per second (read - processing speed is doubled provided that bottleneck was a disk). |
| April 29, 2009 5:14 AM PDT
Dmitriy Vyukov
|
Re: No syncronization overhead? Life is good for you :-) Yeah, my programs do not contain synchronization overhead :) What exactly source of sycnhronization overheads you meant? The only synchronization overhead in the algorithm I've described is:
void partition_completed()
{
if (0 == atomic_dec(number_of_pending_partitions))
spawn_next_phase();
}
If this single modification of shared location per whole partition causes excessive overheads in your program, then this means that you've just chosen too small partition size. If you will increase partition size to let's say 4K values, sycnhronization overheads will instantly drop below 1%. Simple. If my final variant total synchronization overheads (task spawning, task scheduling, memory management, atomic counting) are around some 1-2%. |
| April 30, 2009 8:24 AM PDT
Asaf Shelly
|
I can see how using atomic operations shouldn't have too much overhead. How many threads do you have for this process? How do you pause until all stages of a phase are complete? |
| May 1, 2009 1:56 PM PDT
Dmitriy Vyukov
|
I have 16 threads. On my old Pentium4 HT machine usage of 2 threads give some speedup. To wait for the phase to complete I use PAUSE instruction. However waiting is only used after initial radix split, then it produces 96 independent tasks, so threads may work on their private tasks w/o any waiting. I post my write-up as a blog, currently in "review pending" status. It contains basic description of the parallel decomposition. |
| May 3, 2009 12:52 AM PDT
Asaf Shelly
| Interesting read. I'll wait for it to publish and continue the conversation there.. |
| May 4, 2009 3:43 PM PDT
Clay Breshears (Intel)
|
Radix Exchange Sort is like Quicksort. A Partition function separates keys into two sections depending on whether or not the bit of interest is a zero or one. Typically you start scanning at one end until you find a key that belongs in the other section, then scan from the other end until a key to go to the other section is found, then swap the keys and conitinue until the scans meet. Sort on the next least significant bit within each section, which could be done recursively. For parallel, each section is independent, so store the bounds of one or both sections in a shared queue and have threads pull out sections to sort from the queue. There is only the swap of two keys at a time to overfill the memory bus. Cache use is predicatable through the scans, and if each thread keeps one of the section bounds, part of the data for the section to next be sorted should be in cache already. I'm not sure what everyone was talking about with the "buckets" above. The above version of radix sort can all be done in-place. |
| May 6, 2009 10:27 AM PDT
Dmitriy Vyukov
|
The blog is not live: http://software.intel.com/en-us/blogs/2009/05/06/another-sorts-of-sorts/ |
| May 6, 2009 10:27 AM PDT
Dmitriy Vyukov
|
The blog is NOW live: http://software.intel.com/en-us/blogs/2009/05/06/another-sorts-of-sorts/ |
Trackbacks (10)
- Intel Software Network Blogs » Another Sorts of Sorts
May 6, 2009 9:50 AM PDT - Another Sorts of Sorts
May 6, 2009 11:08 AM PDT - Intel Software Network Blogs » Parallel Programming Talk - Listener Question: Radix Sort Solution
May 6, 2009 11:39 AM PDT - Another Sorts of Sorts - Storage Informer
May 6, 2009 11:46 AM PDT - Radix Sort Solution - Storage Informer
May 6, 2009 1:47 PM PDT - Parallel Programming Talk - Listener Question: Radix Sort Solution
May 6, 2009 4:22 PM PDT - Intel Software Network Blogs » Parallel Programming Talk - Listener Question: Radix Sort Solution
May 12, 2009 4:06 PM PDT - 525i Parts Usa Much, 525i Private
May 19, 2010 10:21 PM PDT - Storm King Art Center Coach Usa, Storm Zone Caught In The Act Full Download
May 20, 2010 8:48 AM PDT - 144a Used Restricted Securities Act Of 1933, 18y0144 Walmart
May 22, 2010 10:55 AM PDT



Dmitriy Vyukov
43,814