by Brian J. Welch
Intel® processors with Hyper-Threading Technology promote maximum system resource utilization by allowing multiple logical processors to share physical resources within the same physical processor package. To applications, logical processors appear to be no different than the processors in systems without Hyper-Threading Technology. However, to maximize application and system performance, it can be helpful for developers to understand the relationship between application threads, logical processors, physical processors, and performance.
Before the introduction of Hyper-Threading Technology, all processors in an IA-32 multiprocessor system were identical to each other, and each processor was capable of the same performance as the other processors in the system. With Hyper-Threading Technology, all physical processors in the system are still identical to each other. However, each physical processor contains multiple logical processors, and each of those logical processors shares resources with the others. A side effect of resource sharing is that the performance of one logical processor depends on the resources being used by the other logical processor(s) within the same physical processor. Essentially, not all logical processors in a system with Hyper-Threading Technology have equal performance at any given time.
Load imbalance is a condition where the system workload is not evenly distributed across all physical processors in the system. Therefore, resource sharing between logical processors may be much higher on some physical processors. Load imbalance may cause inconsistent, sub-optimal, or degraded performance on systems with Hyper-Threading Technology. This paper describes the effects of load imbalance on processors with Hyper-Threading Technology and proposes some techniques for balancing workloads across physical processors.
Examples of Load Imbalance
When imbalance occurs, the active threads are executing on logical processors that do not have the highest performance potential at that time. Load imbalance can occur only when there are fewer active threads in the system than there are logical processors in the system (an active thread is a thread that is currently executing).
Consider the system in Figure 1 with two physical processors: A and B; four logical processors: A0, A1, B0, and B1; and two active threads. At any given time, the two threads may be executing on any one of twelve different combinations of the four available logical processors, as expressed in Table 1.
Figure 1: Simple diagram of a system with two physical processors and four logical processors.
Table 1: Possible processor combinations for two threads executing on four logical processors
In this example, optimal performance is attained when each of the two threads is executing on a different physical processor because each thread has exclusive use of all of the shared resources on the processor. Conversely, a load imbalance occurs when both active threads are executing on the same physical processor. When this happens, the two threads are sharing the resources of one physical processor while the other physical processor is idle. The combinations where load imbalance occurs are highlighted in Table 1.
Consider the same system in Figure 1 but with three active threads. At any given time, two threads will be executing on one physical processor while one thread is executing on the other physical processor. If the three threads all do the same type of work, then the thread that executes by itself on one physical processor will achieve higher performance than the two threads executing together on the other processor. This is because the two threads executing together must share some processor resources while the thread executing by itself has exclusive use of that physical processor's shared resources.
It may appear that this example has a load imbalance since one thread may execute faster than the other two. However, the three threads are executing on processors with the highest performance potential, and there is not an ideal configuration that would achieve higher overall performance on the same system with the same number of threads.
Similar examples can be made for 4-, 8-, and 16-processor systems. In general, load imbalance may occur if the difference between the total number of logical processors and the total number of active threads is greater than one.
Load imbalance possible:
nLogicalProcessors - nActiveThreads > 1
Load imbalance is occurring when the difference between the maximum number of threads and minimum number of threads executing on any physical processor is greater than one. Load imbalance cannot occur on a system with only one physical processor.
Load imbalance guaranteed:
maxThreadsPerPhysicalCPU - minThreadsPerPhysicalCPU > 1
Simply put, load imbalance occurs when the active threads are not distributed as evenly as possible across all physical processors.
The reason that any one of the twelve combinations in Table 1 is possible at any given time is because of thread migration. Thread migration occurs when a thread is running on one processor, then is preempted by the operating system and later rescheduled on a different processor. Thread migration causes some performance degradation on systems without Hyper-Threading Technology because threads incur extra cache misses every time they are scheduled on a different processor. However, on systems with Hyper-Threading Technology, if the operating system schedules threads without regard to the load on each physical processor, then thread migration can cause processor resource sharing between threads to vary from optimal to worst case.
Avoiding Load Imbalance
Ideally, the operating system scheduler would automatically balance active threads between physical processors and developers would not have to worry about load imbalance. Until this is the case for all operating systems, developers of performance-critical applications can use techniques to avoid load imbalance.
The simplest way to avoid load imbalance is to make certain the number of active threads in the system is always greater than or equal to the number of logical processors in the system. Ideally, a multi-threaded application would detect the number of logical processors in the system and spawn the same number of active threads. This technique allows the application to take advantage of all of the processor resources on any system while eliminating load imbalance on systems with Hyper-Threading Technology. However, this method may not be practical or optimal for all applications.
For multi-threaded applications that cannot have as many active threads as logical processors, one method of avoiding load imbalance is setting processor affinity. Setting processor affinity allows a thread to specify exactly which processor (or processors) the operating system may select when it schedules the thread for execution. When an application specifies the processor affinity for all of its active threads, it can guarantee that load imbalance will not occur among its threads, and it eliminates thread migration as well. For example, if an application has two active threads and is running on the system in Figure 1, the application might set the affinity of Thread 1 to logical processor A0 and the affinity of Thread 2 to logical processor B0. Thread 1 and Thread 2 would always execute on different physical processors and never share resources, thereby achieving optimal performance. (See the Appendix for source code to implement this simplified example.)
Unfortunately, setting processor affinity is not a general solution to processor load imbalance. Some systems may have multiple applications with active threads. If the total number of active threads for all applications is greater than or equal to the number of logical processors in the system, then setting processor affinity is not required because load imbalance does not occur. However, if the number of active threads is less than the number of logical processors, or if multiple applications running on the same system each set processor affinity for their threads, then multiple threads might end up tied to one processor while other processors are idle. In this case, setting processor affinity may actually cause load imbalance. Blindly setting processor affinity is usually not an acceptable solution to load imbalance. Processor affinity is more suitable for applications that will run in a fixed environment.
Some applications rely on the execution timing between threads: if an application starts multiple threads and gives each thread the same amount of work, it expects them to complete their work at the same rate. We know from the three-thread example above that this may not be the case on systems with Hyper-Threading Technology, so applications should not depend on it. In particular, applications should not use busy-wait loops for synchronization between threads, even though the busy-waiting would theoretically be very short on systems without Hyper-Threading Technology. Nor should applications report an error when two threads do not complete their work at the same rate, even if they would complete at the same rate on a system without Hyper-Threading Technology.
Sometimes optimal performance can be achieved on processors with Hyper-Threading Technology by setting thread affinity based on the functionality of each thread. Balancing threads by function in this way is beyond the scope of this paper, but the topic is worth mentioning. For example, an application might have two threads, T1 and T2, that perform mostly floating-point operations and very few memory accesses, and two more threads, T3 and T4, that perform mostly integer operations and many memory accesses. To schedule these four threads for maximum performance, the application could set the processor affinity of the threads for minimum processor resource contention. In the system in Figure 1, this might be accomplished by setting the processor affinities so that threads T1 and T3 execute only on logical processors A0 and A1, and threads T2 and T4 execute only on logical processors B0 and B1.
Hyper-Threading Technology enables multiple logical processors to share resources on a single physical processor. Resource sharing causes logical processors to have variable performance compared to processors without Hyper-Threading Technology. One of the side effects of variable performance is load imbalance. Load imbalance can degrade performance when there are fewer active threads in a system than there are logical processors. Until operating systems are available that evenly balance active threads between physical processors, performance-critical applications can avoid load imbalance by spawning at least as many threads as there are logical processors in the system. Alternatively, applications can use processor affinity to tie threads to specific logical processors and to balance threads between physical processors, but only when enough is known about the particular environment to rule out possible negative impacts of thread affinity.
Appendix: Sample Code
This sample code illustrates how an application can use processor affinity to tie each of its threads to a specific logical or physical processor.