By Rajiv Parikh
This paper describes how to speed up a commonly used QuickSort algorithm by taking advantage of Hyper-Threading Technology on Intel® Pentium® 4 processors using OpenMP protocol. The techniques shown here can be applied to any algorithm where parallelization is targeted. The performance gains will, of course, vary with the algorithms as well as system configuration.
Most software uses some form of sorting whether using libraries or hand-coded. This paper provides a brief overview of the classic QuickSort algorithm. We assume you are familiar with general C programming and common data structure and algorithm concepts. QuickSort is a great algorithm to re-architect for a Hyper-Threading Technology machine. We'll show you a few simple techniques to parallelize and balance the algorithm for the threaded scenario. We'll also present performance numbers for comparing the original algorithm with the threaded algorithm across a range of list sizes, and the three basic arithmetic element types: integer, single-precision floating point, and double-precision floating point numbers. All performance data presented was collected on a 2.8 GHz Intel® Pentium® 4 processor with Hyper-Threading Technology. The Intel C compiler 7.0 under Microsoft Visual Studio* 6.0 was used for all compilations. All threading was implemented using OpenMP (see OpenMP specification 2.0) directives, as supported by the Intel C compiler, with /Qopenmp flag.
The classic QuickSort algorithm is widely used in many applications. Listing 1 shows the code. Note the statement assigning the value to v, the partition element.
Listing 1. Original Algorithm, QsortST ( Single Threaded)
The basic algorithm goes through the list in three distinguishable steps.
- Search and swap until i and j meet.
- Create two sub-lists with values less than v and greater than v.
- Then recurse with the two sub-lists...
Note, the final value of i could be anything from 1 to N-1.
This algorithm is best illustrated with the following example. Start with a list as shown in Diagram 1. V is set to be the first element in the array. The first while loop searches from the beginning of the list till the value is greater than v. Similarly, the second while loop searches from the end of the list until it finds the value less than v. It swaps the two values, so the value that is less then v is near the beginning of the list and those that are greater than v are towards the end of the list. This search keeps going until the two searches meet...i.e. i and j meet. In our example, the 4 and the 0 were the only numbers out of order. At this point, we have two sub-lists, one containing the first i elements and the second containing remaining N-i elements. Last step is recursion with each of these two sub-lists.
Diagram 1. Search
Diagram 2. Swap
Diagram 3. Divide
The easiest way to split the code is to create a thread for the two calls when it recurses. However, since we don't want to create a thread every time it recourses-just the first-we create a 'wrapper' sort routine, which makes the two recursive calls. Listing 2 below shows this easy split algorithm.
Notice the OpenMP directives and the call to set the number of threads to two. This creates two sw threads, which the OS schedules across the two CPU threads. The Intel compiler recognizes OpenMP pragmas with the /Qopenmp compile flag. Additionally, omp.h needs to be included in your program file. The keywords omp parallel sections informs the compiler that the following compound statement is 'parallelizable' code, and that each work unit (function call, in this case) is identified using the omp section keywords. Refer to the OpenMP specification* for more details on this and other directives.
Listing 2. Easy Split Algorithm, QsortMT (Multi- Threaded)
Easy Modification Results
For early comparisons, the algorithm was exercised for 10 iterations (data sets), using ten million element list of each of the three numeric types. Table 1 below lists the results in time measurements in seconds.
|(Time in sec)||Avg Time QsortST||Std Dev QsortST||Avg Time QsortMT||Std Dev QsortMT||Perf Gain|
As the Standard Deviation measurements show, the variability in the results for QsortMT (threaded algorithm) was rather large compared to the original algorithm. This variability is a direct result of the Qsort algorithm's inherent flaw in the divide and conquer mechanism. The first list division is not even enough. The i value is not near the middle before starting the recursing threads. When it is too close to either end, it results in longer times. When it is near the middle, it results in shorter times. It varies greatly with the data set and the value v. This results in the average performance gain not being as high as it can be. Notice that this is also known as load balancing across the two threads in the data domain. For other type of problems, balancing the workload in the functional domain may be necessary. In the next section, we will discuss a more intelligent list split.
There are several different approaches we can take to split the list more evenly and address the deviation (and performance) issues from the prior experiment. Rather than going through each of these approaches, we will pick the one approach guaranteed to give us the most optimal load balance. In this elegant solution, we will simply split the list in two sub-lists, then make the two recursive, parallel calls to QsortST with the sub-lists, and finally, merge the two sorted sub-lists into the solution list.
Listing 3. Intelligent Split Algorithm, mQsortMT (modified Multi-Threaded)
Here is the quick description of the code, as illustrated in diagrams 4-6.
- Split the array in two equal parts, a and b with amax and bmax elements.
- Call the Original QsortST for the two equal sub lists on two separate threads.b
- Merge the two sorted sub-lists (a and b)...
- Copy b in to c, since b is part of the original 'data' array, and going to be overwritten.
Copy b in to c, since b is part of the original 'data' array, and going to be overwritten.
Note: This method does require an additional memory of N/2 elements for the c array.
Diagram 4. Split and Search in parallel...
Diagram 5. Create two sorted sub-lists... and copy upper half to c
Diagram 6. Merge the two sub-lists
Intelligent Modification Results
Table 2 shows the performance results for executing ten iterations (data sets), using ten million element list of each of the three numeric types.
Table 2. Intelligent Split Performance Comparison
|(Time in sec)||Avg Time QsortST||Std Dev QsortST||Avg Time QsortMT||Std Dev mQsortMT||Perf Gain|
Key concerns from QsortMT were addressed with mQsortMT. The variability in the results went down significantly... even lower than the original QsortST. This is expected, since the initial array is evenly split. This results in the average performance gain being very impressive across the board over the original QsortST and QsortMT algorithms.
Performance Comparison Charts
The following three charts illustrate the performance gains for the each of the three element numeric data types across various list sizes. Each data point is an average of only ten data sets. All three charts show the mQsortMT is consistently and significantly outperforming the other methods.
For comparison, a single threaded version of the mQsortMT is also shown, labeled mQsortST. This was derived from mQsortMT by removing the OpenMP pragmas. It shows on average single digit percentage degradation over the original QsortST. This is expected since the non-recursive code in QsortST is O(n) and the code for merging the lists in mQsortxT is also O(n) with a few more comparisons. The two recursive calls are common to both.
Figure 1. Performance Comparison chart – double-precision
Figure 2. Performance Comparison chart – Single-precision
Figure 3. Performance Comparison chart – Long Integer
Limitations and Learnings
There are several considerations and assumptions we've made in this paper:
- The re-architected algorithms presented here are only for two threads. This was a conscious decision, since most end users will have systems with a single Pentium 4 processor with Hyper-Threading Technology enabled. A second level wrapper routine can be created for four thread solution (n wrappers for 2^n threads)
- The mQsortMT algorithm requires additional N/2 elements worth of memory.
- Other traditional optimization techniques could also be applied for both original and the re-architected algorithms for some additional gains.
- Instruction level – ie. SSE/SSE2 could be utilized (i.e. for array copying). This would result in all the algorithms showing improvements.
- Algorithmic – i.e. Using a median of multiple values for partitioning QsortXT
- Algorithmic changes have higher effect with threaded code, than on traditional code. However, Parallelizing tight for loops(list merging algorithm) showed no gain in performance, thus not every loop would show performance advantage when parallelized. These algorithmic modifications are left as an exercise for the reader.
- This particular compute application was suited well for workload division in the data domain, where we operate on part of the data on each thread. Other applications may be suited for workload division in the functional domain, i.e. when a sequence of operations is done on a large set of data, in a 'pipeline' fashion, the pipeline can be split across threads.
This is only an introduction on how to get an easy performance gain using OpenMP to take advantage of Hyper-Threading Technology on an Intel Pentium 4 processor. The coding techniques presented here are applicable to algorithms where data sets can be partitioned. Performance gains will vary based on cache and other CPU resources utilization as well as on the algorithm. To gain real benefits from Hyper-Threading Technology many times, re-architecting the high level algorithm for load balancing is necessary. Additionally, note that there may be trade off such as higher memory requirement or code complexity. To learn more about Hyper-Threading Technology, and other optimization techniques, see the "Additional References" section below.
- OpenMP Specification 2.0*
- Hyper-Threading Technology
- Intel® C++ Compiler - Get Help
- Intel® Pentium® 4 Processor Optimization, Reference Manual
About the Author
Rajiv D. Parikh is a Senior Applications Engineer with Intel Corporation in Folsom, CA. Mr. Parikh has held various technical positions at Intel since joining in 1993. His focus has been multi-processor systems and software engineering. He has a Masters and Bachelors of Science in Computer and Electrical Engineering from Virginia Tech.