Benefitting Power and Performance Sleep Loops

by Joe Olivas, Mike Chynoweth, & Tom Propst
 

Abstract

To take full advantage of today’s multicore processors, software developers typically break their work into manageable sizes and spread the work across several simultaneously running threads in a thread pool. Performance and power problems in thread pools can occur when the work queue is highly contended by multiple threads requesting work or when threads are waiting on work to appear in the queue.

While there are numerous algorithmic solutions to this problem, this paper discusses one in particular that we have seen as the most commonly used. We also provide some simple recommendations to improve both performance and power consumption without having to redesign an entire implementation.

Overview of the Problem

The popular method for solving performance and power problems is to have each thread in the thread pool continually check for work in a queue and split off to process once work becomes available. This is a very simple approach, but often developers run into problems with the methodologies to poll the queue for work or deal with issues when the queue is highly contended. Issues can occur in two extreme conditions:

  1. The case when the work queue is not filling fast enough with work for the worker threads and they must back-off and wait for work to appear.
  2. The case when many threads are trying to get work from the queue in parallel, causing contention on the lock protecting the queue, and the threads must back off the lock to decrease contention on the lock.

Popular thread implementations have a some pitfalls, yet by making a few simple changes, you'll see big differences in both power and performance.

To start, we make a few assumptions about the workload in question. We assume we have a large dataset, which is evenly and independently divisible in order to eliminate complications that are outside the scope of this study.
 

Details of the Sleep Loop Algorithm

In our example, each thread is trying to access the queue of work, so it is necessary to protect access to that queue with a lock, in order that only a specified number of threads can concurrently get work.

With this added complexity, our algorithm from a single thread view looks like the following:
 


 

Problems with this Approach on Windows* Platforms

Where simple thread pools begin to break down is in the implementation. The key is how to back off the queue when there is no work available or the thread fails to acquire the lock to the queue. The simple approach is to constantly check, otherwise known as a “busy-wait” loop, shown below in pseudo code.

while (!acquire_lock() && work_in_queue);
get_work();
release_lock();
do_work();

Busy Wait Loop

The problem with the implementation above is if a thread cannot obtain the lock or there is no work in the queue, the thread continues checking as fast as possible. Actively polling consumes all of the available processor resources and has very negative impacts on both performance and power consumption. The upside is that the thread will enter the queue almost immediately when the lock is available or when work appears.

Sleep and SwitchToThread

The solution that many developers use for backing off checking the queue, or locks that are highly contended, is typically to call Sleep(0) or SwitchToThread() from the Win32 APIs. According to MSDN Sleep Function documents, calling Sleep(0) allows the calling thread to give up the remaining part of its time slice if and only if a thread of equal or greater priority is ready to run. 

Similarly, SwitchToThread() allows the calling thread to give up the remaining part of its time slice, but only to another thread on the same processor. This means that instead of constant checking, a thread only checks if no other useful work is pending. If you want the software to back off more aggressively, use a Sleep(1) call, which always gives up the remaining time slice, and context switch out, regardless of thread priority or processor residency. The goal of a Sleep(1) is to wake up and recheck in 1 millisecond.

while (!acquire_lock() || no_work_in_queue)
{
  Sleep(0);
}
get_work();
release_lock();
do_work();

Sleep Loop

Unfortunately, a lot more is going on under the hood that can cause some serious performance degradations. The Sleep(0) and SwitchToThread() calls require overhead since they involve a fairly long instruction path length, combined with an expensive ring3 to ring 0 transition costing about 1000 cycles. The processor is fooled into thinking that this “sleep loop” is accomplishing useful work. In executing these instructions, the processor is being fully utilized, filling up the pipeline with instructions, executing them, trashing the cache, and most importantly, using energy that is not benefiting the software.

An additional problem is that a Sleep(1) call probably does not do what you intended if the Windows’ kernel’s tick rate is at the default of 15.6 ms. At the default tick rate, the call is actually equivalent to a sleep that is much larger than 1 ms and can wait as long as 15.6 ms, since a thread can only wake up when the kernel wakes it. Such a call means the thread is inactive for a very long time while the lock could become available or work placed in the queue.

Another issue is that immediately giving up a time slice means the running thread will be context switched out. A context switch costs something on the order of 5,000 cycles, so getting switched out and switched back in means the processor has wasted at least 10,000 cycles of overhead, which is not helping the workload get completed any faster. Very often, these loops lead to very high context switch rates, which are signs of overhead and possible opportunities for performance gains.

Fortunately, you have some options for help mitigating the overhead, saving power, and getting a nice boost in performance.

Spinning Out of Control

If you are using a threading library, you may not have control over the spin algorithms implemented.  During performance analysis, you may see a high volume of context switches, calls to Sleep or SwitchToThread, and high processor utilization tagged to the threading library.  In these situations, it is worth looking at alternative threading libraries to determine if their spin algorithms are more efficient.

Resolving the Problems

The approach we recommend in such an algorithm is akin to a more gradual back off. First, we allow the thread to spin on the lock for a brief period of time, but instead of fully spinning, we use the pause instruction in the loop. Introduced with the Intel® Streaming SIMD Extensions 2 (Intel® SSE2) instruction set, the pause instruction gives a hint to the processor that the calling thread is in a "spin-wait" loop. In addition, the pause instruction is a no-op when used on x86 architectures that do not support Intel SSE2, meaning it will still execute without doing anything or raising a fault. While this means older x86 architectures that don’t support Intel SSE2 won’t see the benefits of the pause, it also means that you can keep one straightforward code path that works across the board.

Essentially, the pause instruction delays the next instruction's execution for a finite period of time. By delaying the execution of the next instruction, the processor is not under demand, and parts of the pipeline are no longer being used, which in turn reduces the power consumed by the processor.

The pause  instruction can be used in conjunction with a Sleep(0) to construct something similar to an exponential back-off in situations where the lock or more work may become available in a short period of time, and the performance may benefit from a short spin in ring 3. It is important to note that the number of cycles delayed by the pause instruction may vary from one processor family to another. You should avoid using multiple pause instructions, assuming you will introduce a delay of a specific cycle count.  Since you cannot guarantee the cycle count from one system to the next, you should check the lock in between each pause to avoid introducing unnecessarily long delays on new systems. This algorithm is shown below:

ATTEMPT_AGAIN:
  if (!acquire_lock())
  {
    /* Spin on pause max_spin_count times before backing off to sleep */
    for(int j = 0; j < max_spin_count; ++j)
    {
      /* pause intrinsic */
      _mm_pause();
      if (read_volatile_lock())
      {
        if (acquire_lock())
        {
          goto PROTECTED_CODE;
        }
      }
    }
    /* Pause loop didn't work, sleep now */
    Sleep(0);
    goto ATTEMPT_AGAIN;
  }
PROTECTED_CODE:
  get_work();
  release_lock();
  do_work();

Sleep Loop with exponential back off
 

Using pause in the Real World

Using the algorithms described above, including the pause instruction, has shown to have significant positive impacts on both power and performance. For our tests, we used three workloads, each of which had longer periods of active work. The high granularity means the work was relatively extensive, and the threads were not contending for the lock very often. In the low granularity case, the work was quite short, and the threads were more often finishing and ready for further tasks.

These measurements were taken on a 6-core, 12-thread, Intel® Core™ i7 processor 990X  equivalent system. The observed performance gains were quite impressive. Up to 4x gains were seen when using eight threads, and even at thirty-two threads, the performance numbers were approximately 3x over just using Sleep(0).



Performance using pause

Performance using pause

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance.

As mentioned before, using the pause instruction allows the processor pipeline to be less active when threads are waiting, resulting in the processor using less energy. Because of this, we were also able to measure the power differences between the two algorithms using a Fluke NetDAQ*.

Power Consumption with optimization.

Knowing that your software is saving 0.73W over a standard implementation means it is less likely to be a culprit for draining laptop battery life. Combining reduced energy consumption with the gains in performance can lead to enormous power savings over the lifetime of the workload

Conclusions

In many cases, developers may be overlooking or simply have no way of knowing their applications have hidden performance problems. We were able to get a handle on these performance issues after several years of investigation and measurement.

We hope that this solution is simple enough to be retrofitted into existing software.  It follows common algorithms, but includes a few tweaks that can have large impacts. With battery life and portable devices becoming more prevalent and important to developers, a few software changes can take advantage of new instructions and have positive results for both performance and power consumption. 

About the Authors

Joe Olivas is a Software Engineer at Intel working on software performance optimizations for external software vendors, as well as creating new tools for analysis. He received both his B.S. and M.S. in Computer Science from CSU Sacramento with an emphasis on cryptographic primitives and performance. When Joe is not making software faster, he spends his time working on his house and brewing beer at home with his wife.
Mike Chynoweth is a Software Engineer at Intel, focusing on software performance optimization and analysis. He received his B.S. in Chemical Engineering from the University of Florida. When Mike is not concentrating on new performance analysis methodologies, he is playing didgeridoo, cycling, hiking or spending time with family.

 

Tom Propst is a Software Engineer at Intel focusing on enabling new use cases in business environments. He received his B.S. in Electrical Engineering from Colorado State University. Outside of work, Tom enjoys playing bicycle polo and tinkering with electronics.

 

For more complete information about compiler optimizations, see our Optimization Notice.