by Andrew Binstock
How to squeeze the maximum performance from Intel® Hyper-Threading Technology and get the most punch from your processor's execution pipelines.
The release of the 3.06GHz Pentium® 4 chip with Intel® Hyper-Threading Technology (Intel® HT technology) will be remembered as a signal event: with the release announcement, and even a bit before, the design of corporate desktop applications began moving away from the longstanding single-threaded model toward the more efficient multithreaded model. The impact on developers is clear: to make the best use of these processors, programmers must begin writing desktop software with threads. The old linear, single-threaded model is going away.
Hyper-Threading Technology is particularly effective with the tight, high-performance routines that frequently arise in multimedia, imaging, audio, and cryptography software. If you develop such Digital Media applications, this article is for you: it explains what can be done to accelerate multimedia performance.
Data Decomposition Done Right
Developers familiar with multiprocessor applications know that programs are architected for threads using either data decomposition or functional decomposition. In the former, data sets are broken into chunks and processed in parallel on different threads. In functional decomposition, by contrast, program instructions are broken into routines that can be assigned to different threads in the hope of having them execute in parallel. Both forms of decomposition are critical to the correct design of threaded applications. For high-speed routines such as codecs, however, data decomposition is often the more productive form, and it is certainly the easier to do. Let's look at how to do it right.
Quick and easy data decomposition comes in the form of taking a loop that runs through a large data block and converting it into two loops that each run through half the block. So,
int j, k;
float fa[1000], fb[1000], fArray[1000];
// ...some code, then...
for ( j = 0; j < 1000; j++ )
    fArray[j] = fa[j] * fb[j];
can be broken into:
for ( j = 0; j < 500; j++ )
    fArray[j] = fa[j] * fb[j];
for ( k = 500; k < 1000; k++ )
    fArray[k] = fa[k] * fb[k];
Each loop is then assigned to a separate thread and the two loops execute in parallel. This division of labor enables the operation to complete in much less time than the original code. Several aspects of this code bear comment.
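By way of illustration, here is a minimal sketch of that assignment using the portable C++ std::thread interface (any threading API, such as the Win32 thread functions, follows the same pattern). The helper function multiplyRange is our own illustrative name, not code from the original:

#include <thread>

const int N = 1000;
float fa[N], fb[N], fArray[N];

// Multiply one half-range of the arrays; each thread gets its own half.
void multiplyRange(int begin, int end) {
    for (int j = begin; j < end; j++)
        fArray[j] = fa[j] * fb[j];
}

int main() {
    // ...fill fa and fb, then split the work across two threads...
    std::thread t1(multiplyRange, 0, N / 2);
    std::thread t2(multiplyRange, N / 2, N);
    t1.join();   // wait for both halves to complete
    t2.join();
    return 0;
}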
The first is that when HT technology was first announced, developers were strongly urged to place two threads that use similar resources (like these two loops) onto different physical processors. The concern was that if both loops were placed on logical processors residing on the same physical processor, the threads would contend for the same resources (here, the floating-point execution units) and end up blocking each other. While this risk does exist, it occurs far less frequently than originally expected. This is due to the nature of the loop instructions. Consider that the preceding loops actually execute the following operations on every pass:
1 floating-point multiplication
1 floating-point assignment
1 integer increment
1 integer compare
When data fetches are added to this set of instructions, it becomes clear that the two threads are unlikely to need the floating-point units at the same moment with any great frequency. Note that data fetches create inherent execution offsets because cache stripes (the minimum amount of data fetched from memory, generally 64 or 128 bytes depending on the cache and the processor) are fetched sequentially from memory. Hence, even if both threads issue fetches to memory simultaneously, their results cannot be delivered at the same time. Because these effects prevent a constant simultaneous need for the same resources, contention for execution resources is a small problem, and no unusual efforts should be exerted to avoid it.
Divide and Conquer
The typical manner of breaking up loops that process large amounts of data, such as those that occur in video decoding, is to have one master thread divide the work into, say, four chunks. The master thread then wakes three worker threads that, along with the master thread, work on their respective data. When done, the worker threads are put back to sleep.
This approach is particularly effective in video, audio, and imaging applications, where data decomposition is generally more productive than functional decomposition. However, to get the maximum performance from this approach, some caveats should be considered:
Use thread pools. Creating and killing threads can be expensive, whereas waking an existing thread is cheap. Build a pool of threads and reuse them as needed (see the first sketch after this list).
Use only as many threads as there are logical processors. If you use more threads than processors, the benefit of this approach will be severely reduced. So, at startup, determine the number of logical processors on the system (see the second sketch after this list) and allot no more than that number of threads. By the same token, make sure to use all the resources available to you: don't use fewer threads than you have at your disposal.
Watch out for alignment conflicts. Current processors that support HT technology run into conflicts caching data items that are a multiple of 64K bytes apart. Under normal circumstances, this might not happen much, except that Microsoft* Windows* starts each new thread on a 1MB boundary. Since 1MB is an even multiple of 64K (64K x 16 = 1,024K), these alignment conflicts occur immediately. The best way to avoid them is to use the _alloca() function to create irregular offsets (each thread uses an offset of a different size) on the thread's stack before creating thread-specific data items.
Watch out for spin-wait loops. Spin-waits are small, tight loops that spin until some variable they're monitoring changes value. In the present example, the master thread might use a spin-wait loop to test the status of the worker threads to know when they have completed processing their assigned data. Because of the way out-of-order execution is performed on HT technology-capable processors, loops whose only function is to check a variable can flood the execution pipeline with reads of that variable. This situation significantly slows both threads on the physical processor. Avoid it by inserting the pause instruction in spin-wait loops (see the final sketch after this list). This unfortunately named instruction actually speeds the operation of spin-waits and should be considered for all small, tight loops.
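To make the thread-pool caveat concrete, here is a minimal pool sketch in C++; the ThreadPool class and its interface are our own illustration, not code from the article. Workers are created once and then sleep between tasks instead of being destroyed and recreated:

#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class ThreadPool {
public:
    explicit ThreadPool(unsigned n) {
        for (unsigned i = 0; i < n; i++)
            workers.emplace_back([this] { run(); });
    }
    ~ThreadPool() {
        {
            std::lock_guard<std::mutex> lk(m);
            stopping = true;
        }
        cv.notify_all();
        for (std::thread& w : workers)
            w.join();
    }
    // Hand a task to a sleeping worker; no thread is created or killed.
    void submit(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lk(m);
            tasks.push(std::move(task));
        }
        cv.notify_one();
    }
private:
    void run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lk(m);
                cv.wait(lk, [this] { return stopping || !tasks.empty(); });
                if (stopping && tasks.empty())
                    return;
                task = std::move(tasks.front());
                tasks.pop();
            }
            task();  // run outside the lock so other workers can proceed
        }
    }
    std::vector<std::thread> workers;
    std::queue<std::function<void()> > tasks;
    std::mutex m;
    std::condition_variable cv;
    bool stopping = false;
};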
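For the processor-count caveat, the number of logical processors can be read at startup. A minimal sketch for Windows, the platform the article discusses (portable C++ can use std::thread::hardware_concurrency() instead):

#include <windows.h>
#include <stdio.h>

int main() {
    SYSTEM_INFO si;
    GetSystemInfo(&si);
    // dwNumberOfProcessors counts logical processors, so a single
    // HT-enabled physical processor reports 2.
    printf("Logical processors: %lu\n", si.dwNumberOfProcessors);
    return 0;
}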
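And for the spin-wait caveat, a minimal sketch of a pause-assisted spin-wait, assuming a shared counter named workersDone that each worker increments as it finishes (both names are ours):

#include <atomic>
#include <immintrin.h>   // _mm_pause

std::atomic<int> workersDone(0);

// Master thread: spin until every worker reports completion. The
// _mm_pause intrinsic emits the pause instruction, which keeps the
// read loop from flooding the execution pipeline.
void waitForWorkers(int workerCount) {
    while (workersDone.load() < workerCount)
        _mm_pause();
}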
Getting the Jump on the Data
The core inside the Pentium 4 and Intel® Xeon® processors has a built-in hardware prefetch mechanism for loading data blocks from memory. When it recognizes a pattern of memory accesses, it anticipates the next stripe to be used and preloads it into cache. The goal is to avoid a cache miss, which significantly slows thread performance. The one handicap this mechanism suffers from is that it does not kick in until it recognizes a pattern of fetches. This means that the first few accesses to an array, say, will result in a series of cache misses. One way to eliminate these misses is to manually prefetch the data into cache. To do this, use the _mm_prefetch intrinsic, which is part of the Streaming SIMD Extensions (SSE) that first appeared with the Pentium III processor. This intrinsic is accessible via the Intel® C++ compiler and the C/C++ compilers in Microsoft* Visual Studio* and VS.NET.
Ideally, several stripes should be loaded, so that the hardware prefetch kicks in before the next one is needed. Intel suggests that the manual prefetch instructions be placed about 100 clocks before the data is first accessed.
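Here is a minimal sketch of such a manual prefetch placed ahead of a processing loop; the function, the 64-byte stripe size, and the _MM_HINT_T0 hint are illustrative assumptions to be tuned for the target processor:

#include <xmmintrin.h>   // _mm_prefetch, _MM_HINT_T0

void scaleArray(const float* src, float* dst, int n) {
    // Warm the cache with the first couple of stripes before the
    // hardware prefetcher has seen enough accesses to spot the pattern.
    _mm_prefetch((const char*)src, _MM_HINT_T0);
    _mm_prefetch((const char*)src + 64, _MM_HINT_T0);

    for (int i = 0; i < n; i++)
        dst[i] = src[i] * 2.0f;
}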
Prefetching need not be limited to these situations. Any time a developer can tell that a data item is about to be accessed for the first time, prefetching should be considered. Prefetching too much, however, can compromise performance, especially if the fetches occur too early. Every fetch into cache displaces an item already there. If the displaced item is needed while the prefetched data is sitting idle, the operation has hurt performance by creating a cache miss. Therefore, use prefetching where it makes sense and not too long before the fetched item is needed. Exactly when and how much to prefetch must be ascertained by trial and error. (Intel's Pentium® 4 and Intel® Xeon® Processor Optimization Manual presents a formula for computing how early to prefetch data. The formula is fairly complicated and requires specific knowledge of the execution platform, but it gives a general sense of how much lead time is advisable. To download the manual, see the Resources section at the end of this article.)
Cache Sizes and Blocks
When dividing data into blocks for individual threads, the blocks should be small enough that each fits entirely into cache. If a block is bigger than this, portions of it must be swapped in and out of cache, a situation that significantly reduces the benefit of threading. Intel recommends that on HTT-enabled processors the block size be between one-quarter and one-half of the physical cache on the processor. Determining the exact right size is best done through trial and error.
On non-HTT processors, the recommended block size is one-half to three-quarters of the size of the processor cache. The reason for the difference is that two HTT threads share the physical processor's cache. This sharing gives rise to another optimization opportunity: if two threads need access to the same data blocks, place them on the same physical processor. They can then both avail themselves of the same data in cache, meaning that the number of data fetches will be halved, a potentially significant improvement.
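As a rough sketch of the sizing arithmetic, assuming a 512KB L2 cache (the size on the first HT-enabled Pentium 4 processors) and the one-quarter fraction as a starting point; both numbers should be refined against the target processor by measurement:

#include <stddef.h>

// Illustrative sizing only; the cache size must come from the target
// processor, and the fraction from trial and error.
const size_t kCacheBytes     = 512 * 1024;       // assumed L2 cache size
const size_t kBlockBytes     = kCacheBytes / 4;  // 1/4 to 1/2 of cache on HTT
const size_t kFloatsPerBlock = kBlockBytes / sizeof(float);

// Number of per-thread blocks needed to cover an array, rounded up.
size_t blocksFor(size_t elements) {
    return (elements + kFloatsPerBlock - 1) / kFloatsPerBlock;
}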
The introduction of HT technology on today's processors will initiate many developers into the world of multithreaded programming, where the problems and their solutions are known and well established. However, due to the unique aspects of the technology's implementation, even thread-savvy developers need to be aware of certain ins and outs, so as to avoid inadvertently vitiating the advantages the technology confers. This article has shown what you can do to improve performance using HT technology. The following resources provide supplementary information.
- Multi-Core Developer Community
- Digital Media Developer Center
- Threading Methodology: Principles and Practices, Intel Technical Report 9/2002. A good and thorough overview of threading with coverage of HTT-specific issues. Good coverage of how Intel's development tools can assist in optimizing threading implementations.
- The Software Optimization Cookbook, Richard Gerber, 2002, Intel Press. A lucid overview of what can be done to make code faster. Contains many code samples and approachable explanations of how to use Intel-specific features such as SIMD and threading. HTT, as such, receives instructive although limited coverage.
- Details on using individual Intel-specific instructions are best referenced in the programming manuals for the subject processors. For Pentium 4 and Xeon processors, these manuals can be downloaded from: http://developer.intel.com/products/processor/manuals/index.htm. This URL also contains an extensive manual on optimization techniques.
About the Author
Andrew Binstock is the principal analyst at Pacific Data Works LLC. Previously, he was the director of PricewaterhouseCoopers' Global Technology Forecasts. His book "Practical Algorithms for Developers," co-written with John Rex, is in its 12th printing at Addison-Wesley and in use at more than 30 university computer-science programs. Binstock can be reached at email@example.com.