by Henry Ou
Hyper-Threading Technology enables multithreaded applications to take advantage of thread-level parallelism by providing more than one logical processor within a physical processor package. Intel has announced it will be available on desktop platforms in the near future, but for now, Intel® Xeon® processors are the first implementation of Hyper-Threading Technology. Each physical processor has two logical processors. The logical processors each have their own independent architectural state, but share nearly all other resources on the physical processor, such as caches, execution units, branch predictor, control logic and buses.
Since the majority of microarchitectural resources in a physical processor are shared between the logical processors, efficient sharing of resources between concurrent threads is an important software performance issue for multithreaded applications on processors with Hyper-Threading Technology. Multithreaded applications use synchronization techniques to ensure correct operation. Thread synchronization techniques that are improperly implemented can severely reduce performance when the applications are run on a processor with Hyper-Threading Technology. One common synchronization technique is a long duration spin-wait loop. On a processor with Hyper-Threading Technology, long duration spin-wait loops may consume a significant portion of the execution bandwidth of the processor without producing useful work, thus reducing performance.
This paper addresses the performance impact of a long duration spin-wait on multithreaded applications running on processors with Hyper-Threading Technology. We discuss techniques for avoiding performance impact by long duration spin-wait loops and recommend replacing a long duration spin-wait loop with an operating system thread-blocking application programming interface (API). The paper also provides methodologies for detecting a long duration spin-wait loop that is wasting processor resources.
Performance Issues for Spin-wait Loops
A spin-wait loop is a technique used in multithreaded applications whereby one thread waits for other threads, for example to protect a critical section, at a barrier, or for other synchronization. From a software perspective, spin-wait loops can be characterized as short duration or long duration, depending on how long the thread may need to wait before it synchronizes with other threads. If the waiting thread needs fast responses, it uses a short duration spin-wait loop, often called a fast spin-wait loop. Typically, a fast spin-wait loop consists of a loop that compares a synchronization variable with a predefined value, as shown in sample code 1.
Sample Code 1: Structure of a fast spin-wait loop
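A minimal sketch of such a loop, written in portable C++ as an illustration rather than as the note's own listing. The name sync_var follows the text; its atomic type, the helper name spin_until, and the returned spin counter are assumptions made for this example.

```cpp
#include <atomic>

// sync_var follows the name used in the text; the atomic type is an assumption.
std::atomic<int> sync_var{0};

// Hypothetical helper: spins until sync_var equals `expected`.
// The variable is re-read on every iteration with no delay in between.
// Returns the number of comparisons that failed before the match.
long spin_until(int expected) {
    long spins = 0;
    while (sync_var.load(std::memory_order_acquire) != expected) {
        ++spins;  // nothing else in the body: the loop re-checks immediately
    }
    return spins;
}
```

Another thread would end the wait by storing the expected value into sync_var.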
A synchronization variable is signified by sync_var in code. Its value is repeatedly compared with a predefined value until another active thread updates it to the desired value. On a processor with a super-scalar speculative execution engine, a fast spin-wait loop results in the issue of multiple read requests by the waiting thread as it rapidly goes through the loop. These requests potentially execute out of order. When the processor detects a write by one thread to data for which another thread has a read in progress, the processor must guarantee that no violations of memory order occur. To ensure the proper order of outstanding memory operations, the processor incurs a severe penalty. The penalty from memory order violations can be reduced significantly by inserting a PAUSE instruction in the loop, which prevents multiple loop iterations from being in the pipeline at once.
Sample Code 2: fast spin-wait loop with PAUSE instruction
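A minimal sketch of the same loop with a PAUSE instruction, again in portable C++ and not the note's own listing. It assumes a compiler that provides the _mm_pause intrinsic through <immintrin.h> on x86 targets; the CPU_PAUSE macro, the no-op fallback for other architectures, and the helper name are illustrative assumptions.

```cpp
#include <atomic>

#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
#include <immintrin.h>
#define CPU_PAUSE() _mm_pause()   // emits the PAUSE instruction on x86
#else
#define CPU_PAUSE() ((void)0)     // assumption: plain no-op on other targets
#endif

// sync_var follows the name used in the text; the atomic type is an assumption.
std::atomic<int> sync_var{0};

// Hypothetical helper: spins until sync_var equals `expected`,
// executing PAUSE once per failed comparison.
long spin_until_with_pause(int expected) {
    long spins = 0;
    while (sync_var.load(std::memory_order_acquire) != expected) {
        CPU_PAUSE();  // hint to the processor that this is a spin-wait loop
        ++spins;
    }
    return spins;
}
```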
The following sections focus on the performance impact of long duration spin-wait loops on processors with Hyper-Threading Technology enabled. For more detail on the performance impact of spin-wait loops on conventional multiprocessor systems, refer to Application Note AP-949 "Using Spin-Loops on Intel® Pentium® 4 Processor and Intel® Xeon® processor".
If the active threads are expected to remain busy for a significant length of time, frequent checking of the synchronization variable is not only unnecessary, but it also degrades performance, because the waiting thread ties up a processor for a full operating system time quantum, when another thread could potentially be using the processor. A popular technique to avoid frequent checking of the synchronization variable is to insert an operating system timing service call, such as Sleep(N), in the loop, as shown in sample code 3.
Sample Code 3: Structure of a longer period spin-wait loop
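A minimal sketch of this loop structure in portable C++, not the note's own listing. Because Sleep is a Windows call, the sketch substitutes a hypothetical sleep_ms helper: std::this_thread::yield stands in for Sleep(0) and std::this_thread::sleep_for for Sleep(N) with N greater than zero. The helper names and the atomic type of sync_var are assumptions.

```cpp
#include <atomic>
#include <chrono>
#include <thread>

// sync_var follows the name used in the text; the atomic type is an assumption.
std::atomic<int> sync_var{0};

// Hypothetical stand-in for the Windows Sleep(N) call:
// n == 0 gives up the rest of the time slice, n > 0 suspends for n ms.
void sleep_ms(unsigned n) {
    if (n == 0) {
        std::this_thread::yield();
    } else {
        std::this_thread::sleep_for(std::chrono::milliseconds(n));
    }
}

// Checks sync_var, and calls sleep_ms(n) between checks.
// Returns how many times the thread slept or yielded before the match.
long wait_with_sleep(int expected, unsigned n) {
    long checks = 0;
    while (sync_var.load(std::memory_order_acquire) != expected) {
        sleep_ms(n);
        ++checks;
    }
    return checks;
}
```

With n set to zero, the loop keeps running whenever no other thread is ready to take the processor.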
A spin-wait loop with an operating system timing services API to delay the checking of the synchronization variable (as in sample code 3) is characterized as a long duration spin-wait. Argument N in the Sleep function specifies the number of milliseconds that the waiting thread is suspended. A zero value is a popular choice for many multithreaded applications. The zero value causes the waiting thread to yield the processor to other active threads until another processor is available. On a conventional multiprocessor system, this technique is sufficient to prevent the waiting thread from contending for processors with active threads. However, this technique (as we will discuss in the next section) fails to prevent the waiting thread from contending for processor resources on a processor with Hyper-Threading Technology. Multithreaded applications with long duration spin-wait loops need to be modified to gain maximum performance on processors with Hyper-Threading Technology.
Avoid Coding Pitfalls of Longer Duration Spin-wait Loops
Given that the majority of microarchitecture resources in a physical processor are shared between the logical processors in a Hyper-Threading Technology enabled system, synchronization between multiple threads must be designed and implemented with proper care to prevent unnecessary consumption of processor resources. Although using a long duration spin-wait loop structure (such as sample code 3) may not be a large performance issue in a conventional multiprocessor system, it may cause poor performance on a Hyper-Threading Technology enabled system. In certain circumstances, the waiting thread may consume a significant portion of the execution bandwidth of the physical processor, severely impacting the performance of the other logical processor that is doing useful work. For example, if the number of active threads is less than the number of processors in the system, a Sleep call with a zero value argument will not put the waiting thread to sleep. Instead, the waiting thread will run on an available processor and operate as a rapidly spinning loop. One of the other active threads may then have to share processor resources with the waiting thread if both are running on the same physical processor.
Sample code 4 below is pseudo code of a common threading model that uses a long duration spin-wait loop for synchronization. The task of the code is to operate on N blocks of data sequentially. The master thread spawns a team of worker threads to do the work in parallel. For each block, the master divides the block of data into smaller subsets, assigns the workloads to worker threads, signals the worker threads to perform the work, then waits for the completion of all worker threads before starting the next block.
Sample Code 4: A threading model that uses long duration spin-wait loop for synchronization
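A minimal sketch of this threading model in portable C++, not the note's own pseudo code. The function name process_blocks_spin and the variables block_id, done_count, quit, and work_done are all illustrative assumptions. std::this_thread::yield stands in for Sleep(0), and the per-block work is reduced to incrementing a counter.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Hypothetical master/worker model: the master releases one block at a time
// and then polls, yielding between checks, until every worker has finished.
// Returns the total number of work units performed.
long process_blocks_spin(int num_blocks, int num_workers) {
    std::atomic<int> block_id{-1};   // block currently released to the workers
    std::atomic<int> done_count{0};  // workers finished with the current block
    std::atomic<bool> quit{false};
    std::atomic<long> work_done{0};

    auto worker = [&] {
        int seen = -1;  // last block this worker processed
        for (;;) {
            // Wait for the master to release a new block, or to ask us to stop.
            while (block_id.load() == seen) {
                if (quit.load()) return;
                std::this_thread::yield();
            }
            seen = block_id.load();
            work_done.fetch_add(1);   // placeholder for work on this subset
            done_count.fetch_add(1);  // report completion to the master
        }
    };

    std::vector<std::thread> team;
    for (int w = 0; w < num_workers; ++w) team.emplace_back(worker);

    for (int b = 0; b < num_blocks; ++b) {
        done_count.store(0);
        block_id.store(b);  // signal the workers to start on block b
        // Long duration spin-wait: the master keeps polling and yielding.
        while (done_count.load() != num_workers) {
            std::this_thread::yield();  // stand-in for Sleep(0)
        }
    }

    quit.store(true);
    for (auto& t : team) t.join();
    return work_done.load();
}
```

The master's polling loop is the part that can keep a logical processor busy while the workers are still running.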
The long duration spin-wait loop in this threading model seldom causes a performance problem on conventional multiprocessor systems. But it may introduce a severe penalty on a system with Hyper-Threading Technology because processor resources can be consumed by the master thread while it is waiting on the worker threads. Sleep(0) in the loop may suspend the execution of the master thread, but only when all available processors have been taken by worker threads during the entire waiting period. This condition requires all worker threads to complete their work at the same time. In other words, the workloads assigned to worker threads must be balanced. If one of the worker threads completes its work sooner than others and releases the processor, the master thread can still run on one processor.
On a conventional multiprocessor system this doesn't cause performance problems because no other thread uses the processor. But on a system with Hyper-Threading Technology the processor the master thread runs on is a logical one that shares processor resources with one of the other worker threads.
The nature of many applications makes it difficult to guarantee that workloads assigned to worker threads are balanced. A multithreaded 3D application, for example, may assign the tasks for transformation of a block of vertices from world coordinates to viewing coordinates to a team of worker threads. The amount of work for a worker thread is determined not only by the number of vertices but also by the clipped status of each vertex, which is not predictable when the master thread divides the workload among the worker threads.
A non-zero argument in the Sleep function forces the waiting thread to sleep N milliseconds, regardless of the processor availability. It may effectively block the waiting thread from consuming processor resources if the waiting period is set properly. But if the waiting period is unpredictable from workload to workload, then a large value of N may make the waiting thread sleep too long, and a smaller value of N may cause it to wake up too quickly.
Therefore the preferred solution to avoid wasting processor resources in a long duration spin-wait loop is to replace the loop with an operating system thread-blocking API, such as the Microsoft Windows* threading API, WaitForMultipleObjects. This call causes the operating system to block the waiting thread from consuming processor resources.
Sample Code 5: Replacing the long duration spin-wait loop with a thread-blocking API
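On Windows the call takes the form WaitForMultipleObjects(np, threadDoneEvents, TRUE, dwMilliseconds). The sketch below is a rough portable C++ analogue of that call with an infinite time-out, not the note's own listing. The Event class, wait_for_all, and run_block are hypothetical names, and a mutex with a condition variable stands in for a Windows event object.

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for a Windows event object that stays signaled.
class Event {
public:
    void set() {
        {
            std::lock_guard<std::mutex> lock(m_);
            signaled_ = true;
        }
        cv_.notify_all();
    }
    void wait() {
        std::unique_lock<std::mutex> lock(m_);
        // The thread is blocked by the operating system here; it does not poll.
        cv_.wait(lock, [this] { return signaled_; });
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    bool signaled_ = false;
};

// Rough analogue of WaitForMultipleObjects(np, threadDoneEvents, TRUE, INFINITE):
// returns only after all np events have been signaled.
void wait_for_all(int np, Event* threadDoneEvents) {
    for (int i = 0; i < np; ++i) threadDoneEvents[i].wait();
}

// Illustrative driver: np workers each store a result, then signal their event.
// Returns the sum of the results, which is 1 + 2 + ... + np.
int run_block(int np) {
    std::vector<Event> threadDoneEvents(np);
    std::vector<int> results(np, 0);
    std::vector<std::thread> workers;
    for (int i = 0; i < np; ++i) {
        workers.emplace_back([&, i] {
            results[i] = i + 1;         // placeholder for real work
            threadDoneEvents[i].set();  // signal completion to the master
        });
    }
    wait_for_all(np, threadDoneEvents.data());  // master blocks, no spinning
    int sum = 0;
    for (int r : results) sum += r;
    for (auto& t : workers) t.join();
    return sum;
}
```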
Here np, the first argument of WaitForMultipleObjects, is the number of threads being waited for, and threadDoneEvents is a pointer to an array of event objects. The threads being waited upon signal the event objects when their work is completed. A value of TRUE in the third argument ensures that the operating system blocks the master thread from accessing processor resources until all threads being waited for complete their work. dwMilliseconds specifies the time-out interval, in milliseconds; if dwMilliseconds is INFINITE, the function never times out.
The operating system thread-blocking API ensures that the waiting thread relinquishes the processor during the entire waiting period. So this technique is sufficient to prevent wasting processor resources on systems with Hyper-Threading Technology. The thread-blocking API may introduce synchronization overhead on a conventional multiprocessor system, but for a long duration spin-wait loop the overhead is insignificant.
Indicators of Wasted Processor Resources in a Long Duration Spin-wait Loop
Two indicators may help to point out where processor resources are being wasted in a long duration spin-wait loop. The first one is the event Instructions Retired as measured by the Intel® VTune™ Performance Analyzer. When a waiting thread runs on a processor it is actually executing instructions as it waits (as we discussed previously), thus introducing extra instructions. We can collect Instructions Retired events across the application's performance workloads, both with and without Hyper-Threading Technology enabled, and compare the total number of instructions retired.
The number of instructions retired in a multiprocessor system is slightly higher than in a single processor system, normally within 5%, due to the overhead of partitioning, synchronization, and so forth. However, if the number of instructions retired is significantly higher with Hyper-Threading Technology enabled, it is possible that part of the increase may come from extra instructions executed by the waiting thread in a long duration spin-wait loop.
The second method to detect the condition is to use Performance Monitor*. The counter Processor Time of the performance object Thread is another metric to pinpoint wasted processor resources in a long duration spin-wait loop. Compare the counter's Processor Time for all threads of an application between two runs, one run on a single processor system and the other on a multiprocessor system. If a certain period is identified when processor time is low for a thread on a single processor system, but processor times are high on a system with Hyper-Threading Technology during the same period, then it is a strong indicator that this thread is consuming processor resources needlessly during that period.
Figures 1 and 2 are screen shots of Performance Monitor showing a period of processor time for threads in a real-world application that uses the threading model described in sample code 4. Figure 1 shows the threads running on a single processor system without Hyper-Threading Technology. In Figure 2, the threads run on a system with Hyper-Threading Technology. During this period in Figure 1, thread A is idle while thread B is executing. In Figure 2, by comparison, thread A and thread B each take 100% of the processor time on their respective logical processors. Assuming equal thread priority, we should see some sharing of the processor between thread A and thread B in Figure 1 if thread A is doing useful work. Since we don't observe any processor sharing between the two threads in Figure 1, it is a strong hint that thread A is doing a spin-wait on thread B. In Figure 1, thread A is yielding the processor to thread B. In Figure 2, thread A does not need to yield because it has its own available processor.
Figure 1. Performance Monitor Counters: Processors' Time for Threads - A period of application running on a system without Hyper-Threading Technology (one processor)
Figure 2. Performance Monitor Counters: Processors' Time for Threads - A period of application running on a system with Hyper-Threading Technology enabled (two logical processors)
There are two cautions about using these indicators. Although wasting processor resources results in a higher number of Instructions Retired with Hyper-Threading Technology, we cannot draw the converse conclusion that a higher number of Instructions Retired is necessarily evidence of wasting processor resources. The higher rate may be due to other overhead, such as multiprocessor partitioning and synchronization. We need further confirmation with the indicator Processor Time of Threads or, if source code is available and spin-wait loops are found in the code, we may try to replace the spin-wait loops with operating system thread-blocking APIs and observe the performance difference.
A similar caution applies to the comparison of Processor Time of Threads between Hyper-Threading Technology and single processor runs. If the processor time for a thread is high with Hyper-Threading Technology but not on a single processor, you have strong evidence of wasted processor resources. But if the runs do not show this disparity, you cannot conclude for certain that resources are not being wasted. For example, if the master thread in the threading model of sample code 4 also serves as a worker thread and operates on a portion of the block data like the other worker threads, Processor Time of Threads may fail to show the disparity. The effectiveness of the two indicators varies with applications and workloads. In most cases, however, the methods do serve as good indicators pointing to the presence of long duration spin-wait loops that cause performance degradation.
Conclusion
In the first implementation of Hyper-Threading Technology, there are two logical processors on each physical processor. The logical processors share almost all of the physical execution and hardware resources. Efficient use of processor resources between concurrent threads becomes an important software optimization issue for multithreaded applications to achieve their best performance on processors with Hyper-Threading Technology.
A long duration spin-wait loop is a synchronization technique used in multithreaded applications. This technique works efficiently on a conventional multiprocessor system. But a long duration spin-wait loop may cause unnecessary consumption of processor resources and reduce performance on processors with Hyper-Threading Technology.
The preferred solution to avoid wasting processor resources in a long duration spin-wait loop is to replace the loop with an operating system thread-blocking API. The event Instructions Retired in the Intel VTune Performance Analyzer and the counter Processor Time of Threads of Performance Monitor are two good indicators for determining the presence of long duration spin-wait loops that are causing performance degradation on a Hyper-Threading Technology enabled system.
References
The following documents are referenced in this application note, and provide background or supporting information for understanding the topics presented in this document.
Intel® 64 and IA-32 Architectures Optimization Reference Manual
Using Spin-Loops on Intel® Pentium® 4 Processor and Intel® Xeon® Processor, Intel Application Note AP-949
About the Author
Henry Ou is an Applications Engineer working for Intel's Software and Solutions Group.