| Last Modified On : | October 24, 2007 11:08 AM PDT |
Rate |
|
By Rajiv Parikh
This paper describes how to speed up a commonly used QuickSort algorithm by taking advantage of Hyper-Threading Technology on Intel® Pentium® 4 processors using OpenMP protocol. The techniques shown here can be applied to any algorithm where parallelization is targeted. The performance gains will, of course, vary with the algorithms as well as system configuration.
Most software uses some form of sorting whether using libraries or hand-coded. This paper provides a brief overview of the classic QuickSort algorithm. We assume you are familiar with general C programming and common data structure and algorithm concepts. QuickSort is a great algorithm to re-architect for a Hyper-Threading Technology machine. We'll show you a few simple techniques to parallelize and balance the algorithm for the threaded scenario. We'll also present performance numbers for comparing the original algorithm with the threaded algorithm across a range of list sizes, and the three basic arithmetic element types: integer, single-precision floating point, and double-precision floating point numbers. All performance data presented was collected on a 2.8 GHz Intel® Pentium® 4 processor with Hyper-Threading Technology. The Intel C compiler 7.0 under Microsoft Visual Studio* 6.0 was used for all compilations. All threading was implemented using OpenMP (see OpenMP specification 2.0) directives, as supported by the Intel C compiler, with /Qopenmp flag.
The classic QuickSort algorithm is widely used in many applications. Listing 1 shows the code. Note the statement assigning the value to v, the partition element.
Listing 1. Original Algorithm, QsortST ( Single Threaded)
The basic algorithm goes through the list in three distinguishable steps.
Note, the final value of i could be anything from 1 to N-1.
This algorithm is best illustrated with the following example. Start with a list as shown in Diagram 1. V is set to be the first element in the array. The first while loop searches from the beginning of the list till the value is greater than v. Similarly, the second while loop searches from the end of the list until it finds the value less than v. It swaps the two values, so the value that is less then v is near the beginning of the list and those that are greater than v are towards the end of the list. This search keeps going until the two searches meet...i.e. i and j meet. In our example, the 4 and the 0 were the only numbers out of order. At this point, we have two sub-lists, one containing the first i elements and the second containing remaining N-i elements. Last step is recursion with each of these two sub-lists.
Diagram 1. Search
Diagram 2. Swap
Diagram 3. Divide
The easiest way to split the code is to create a thread for the two calls when it recurses. However, since we don't want to create a thread every time it recourses—just the first—we create a 'wrapper' sort routine, which makes the two recursive calls. Listing 2 below shows this easy split algorithm.
Notice the OpenMP directives and the call to set the number of threads to two. This creates two sw threads, which the OS schedules across the two CPU threads. The Intel compiler recognizes OpenMP pragmas with the /Qopenmp compile flag. Additionally, omp.h needs to be included in your program file. The keywords omp parallel sections informs the compiler that the following compound statement is 'parallelizable' code, and that each work unit (function call, in this case) is identified using the omp section keywords. Refer to the OpenMP specification* for more details on this and other directives.
Listing 2. Easy Split Algorithm, QsortMT (Multi- Threaded)
Easy Modification Results
For early comparisons, the algorithm was exercised for 10 iterations (data sets), using ten million element list of each of the three numeric types. Table 1 below lists the results in time measurements in seconds.
| (Time in sec) | Avg Time QsortST | Std Dev QsortST | Avg Time QsortMT | Std Dev QsortMT | Perf Gain |
| Double | 1.91 | 0.022 | 1.53 | 0.226 | 25% |
| Float | 1.73 | 0.015 | 1.45 | 0.242 | 19% |
| Long | 1.38 | 0.011 | 1.11 | 0.168 | 24% |
As the Standard Deviation measurements show, the variability in the results for QsortMT (threaded algorithm) was rather large compared to the original algorithm. This variability is a direct result of the Qsort algorithm's inherent flaw in the divide and conquer mechanism. The first list division is not even enough. The i value is not near the middle before starting the recursing threads. When it is too close to either end, it results in longer times. When it is near the middle, it results in shorter times. It varies greatly with the data set and the value v. This results in the average performance gain not being as high as it can be. Notice that this is also known as load balancing across the two threads in the data domain. For other type of problems, balancing the workload in the functional domain may be necessary. In the next section, we will discuss a more intelligent list split.
There are several different approaches we can take to split the list more evenly and address the deviation (and performance) issues from the prior experiment. Rather than going through each of these approaches, we will pick the one approach guaranteed to give us the most optimal load balance. In this elegant solution, we will simply split the list in two sub-lists, then make the two recursive, parallel calls to QsortST with the sub-lists, and finally, merge the two sorted sub-lists into the solution list.
Listing 3. Intelligent Split Algorithm, mQsortMT (modified Multi-Threaded)
Here is the quick description of the code, as illustrated in diagrams 4-6.
Copy b in to c, since b is part of the original 'data' array, and going to be overwritten.
Note: This method does require an additional memory of N/2 elements for the c array.
Diagram 4. Split and Search in parallel...
Diagram 5. Create two sorted sub-lists... and copy upper half to c
Diagram 6. Merge the two sub-lists
Intelligent Modification Results
Table 2 shows the performance results for executing ten iterations (data sets), using ten million element list of each of the three numeric types.
Table 2. Intelligent Split Performance Comparison
| (Time in sec) | Avg Time QsortST | Std Dev QsortST | Avg Time QsortMT | Std Dev mQsortMT | Perf Gain |
| Double | 1.91 | 0.022 | 1.37 | 0.016 | 39% |
| Float | 1.73 | 0.015 | 1.09 | 0.011 | 58% |
| Long | 1.38 | 0.011 | 0.88 | 0.009 | 57% |
Key concerns from QsortMT were addressed with mQsortMT. The variability in the results went down significantly... even lower than the original QsortST. This is expected, since the initial array is evenly split. This results in the average performance gain being very impressive across the board over the original QsortST and QsortMT algorithms.
The following three charts illustrate the performance gains for the each of the three element numeric data types across various list sizes. Each data point is an average of only ten data sets. All three charts show the mQsortMT is consistently and significantly outperforming the other methods.
For comparison, a single threaded version of the mQsortMT is also shown, labeled mQsortST. This was derived from mQsortMT by removing the OpenMP pragmas. It shows on average single digit percentage degradation over the original QsortST. This is expected since the non-recursive code in QsortST is O(n) and the code for merging the lists in mQsortxT is also O(n) with a few more comparisons. The two recursive calls are common to both.
Figure 1. Performance Comparison chart – double-precision
Figure 2. Performance Comparison chart – Single-precision
Figure 3. Performance Comparison chart – Long Integer
There are several considerations and assumptions we've made in this paper:
This is only an introduction on how to get an easy performance gain using OpenMP to take advantage of Hyper-Threading Technology on an Intel Pentium 4 processor. The coding techniques presented here are applicable to algorithms where data sets can be partitioned. Performance gains will vary based on cache and other CPU resources utilization as well as on the algorithm. To gain real benefits from Hyper-Threading Technology many times, re-architecting the high level algorithm for load balancing is necessary. Additionally, note that there may be trade off such as higher memory requirement or code complexity. To learn more about Hyper-Threading Technology, and other optimization techniques, see the "Additional References" section below.
Rajiv D. Parikh is a Senior Applications Engineer with Intel Corporation in Folsom, CA. Mr. Parikh has held various technical positions at Intel since joining in 1993. His focus has been multi-processor systems and software engineering. He has a Masters and Bachelors of Science in Computer and Electrical Engineering from Virginia Tech.
| September 30, 2009 10:26 AM PDT
Dave Valentine | Mea culpa. We had an error in our own code; problem solved. |

English | 中文 | Русский | Français
Rajiv Parikh (Intel)
|
Dave Valentine
we are trying to replicate your results in a multicore environment. Our problem is attempting to sort even 1million doubles is generating memory allocation errors, etc. Integers & floats are working okay. We are dynamically generating the array (malloc) and freeing it when done.
Our machine has an Intel Core2 Quad @2.4GHz, winXP and 3GB RAM (we run Ubuntu on the same box).
We tried it in .NET and in Ubuntu with much the same results....
Can you offer us any advice?
Thank you.