Benefitting Power and Performance Sleep Loops

by Joe Olivas & Mike Chynoweth

Abstract

In order to take full advantage of today's multicore processors, it has become commonplace for software developers to break their work into manageable sizes and spread it across the simultaneously running threads of a thread pool. Performance and power problems in thread pools typically occur when the work queue is highly contended by multiple threads requesting work, or when threads are waiting for work to appear in the queue. While there are numerous algorithmic approaches to this problem, we look at the one we have seen most commonly used and make simple recommendations that improve both performance and power consumption without requiring a redesign of the entire implementation.

Overview of the Problem

The popular method of solving this problem is to have each thread in the thread pool continually check a queue for work, and split off to process it once work becomes available. This approach is simple, but developers often run into problems with how the queue is polled for work and how contention on the queue is handled. Issues occur in two extreme conditions:

  1. The work queue is not filling with work fast enough for the worker threads, and they must back off and wait for work to appear.
  2. Many threads attempt to get work from the queue in parallel, causing contention on the lock protecting the queue, and the threads must back off the lock to reduce that contention.

It is the goal of this paper to describe the pitfalls of the popular thread pool implementations, and how a few simple changes can make big differences in both power and performance.

To start, we make a few assumptions about the workload in question. For this paper, we assume a large dataset that is evenly and independently divisible, in order to eliminate complications outside the scope of this study.

Details of the Sleep Loop Algorithm

In our example, each thread is trying to access the queue of work, so it is necessary to protect access to that queue with a lock, ensuring that only a specified number of threads can get work concurrently.

With this added complexity, the algorithm, as seen from a single thread's point of view, looks like the following:
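
The flowchart for this view is not reproduced here; in its place is a minimal pseudocode sketch of one worker thread's loop, using the same placeholder helpers (acquire_lock(), no_work_in_queue(), and so on) as the listings that follow:

/* One worker thread's view of the pool (pseudocode sketch).
   acquire_lock() is a non-blocking attempt on the queue lock;
   the other helpers are placeholders used throughout this paper. */
for (;;)
{
    if (acquire_lock())
    {
        if (!no_work_in_queue())
        {
            get_work();      /* dequeue one item               */
            release_lock();
            do_work();       /* process it outside the lock    */
            continue;
        }
        release_lock();      /* queue empty, nothing to do yet */
    }
    back_off();              /* how to wait here is the subject of this paper */
}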



Problems with this Approach on Windows* Platforms

Where simple thread pools begin to break down is in the implementation. The key question is how to back off from the queue when there is no work available, or when the thread fails to acquire the lock protecting the queue. The simple approach is to check constantly, otherwise known as a "busy-wait" loop, shown below in pseudocode.

/* Busy-wait loop (pseudocode): poll at full speed until the lock is
   acquired and work is available */
while (!acquire_lock() || no_work_in_queue())
    ; /* spin; a real implementation must also release the lock if it
         was acquired while the queue was empty */
get_work();
release_lock();
do_work();

Table 1 Busy Wait Loop

The problem with the implementation above is that if a thread cannot obtain the lock, or there is no work in the queue, it continues checking as fast as possible. Actively polling consumes all of the available processor resources and has a very negative impact on both performance and power consumption. The upside is that the thread will enter the queue almost immediately once the lock is available or work appears.

The solution that many developers use for backing off from checking the queue, or from locks that are highly contended, is typically to call Sleep(0) from the Win32 API. According to the MSDN documentation (Microsoft Corporation, 2010), calling Sleep(0) causes the calling thread to give up the remainder of its time slice if, and only if, a thread of equal or greater priority is ready to run. This means that instead of checking constantly, a thread only checks when no other useful work is pending. If the software needs to back off more aggressively, developers tend to use a Sleep(1) call, which always gives up the remainder of the time slice and context switches out. The goal of a Sleep(1) is to wake up and recheck in 1 millisecond.

/* Sleep loop (pseudocode): yield the remainder of the time slice
   instead of spinning at full speed */
while (!acquire_lock() || no_work_in_queue())
{
  Sleep(0);  /* give up the slice if an equal- or higher-priority thread is ready */
}
get_work();
release_lock();
do_work();

Table 2 Sleep Loop

Unfortunately, there is a lot more going on under the hood that can cause serious performance degradation. The Sleep(0) call carries overhead, since it involves a fairly long instruction path combined with an expensive ring 3 to ring 0 transition, costing about 1,000 cycles. The processor is fooled into thinking that this "sleep loop" is accomplishing useful work. In executing these instructions, the processor is fully utilized: filling the pipeline with instructions, executing them, trashing the cache, and most importantly, using energy that does not benefit the software. As an additional problem, a Sleep(1) call probably does not do what the developer intended if the Windows kernel's tick rate is at the default of 15.6 ms. At the default tick rate, the call is actually equivalent to a much larger sleep, and can wait as long as 15.6 ms, since a thread can only wake up when the kernel wakes it. Such a call means the thread is inactive for a very long time while the lock may have become available or work may have been placed in the queue.
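
The tick-rate effect is easy to observe directly. The small probe below (our own illustrative sketch, not code from the original text) times a few Sleep(1) calls with QueryPerformanceCounter; on a default-configured system it typically reports durations well above 1 ms:

/* Probe the real duration of Sleep(1) (illustrative sketch). */
#include <windows.h>
#include <stdio.h>

int main(void)
{
    LARGE_INTEGER freq, t0, t1;
    QueryPerformanceFrequency(&freq);

    for (int i = 0; i < 5; ++i)
    {
        QueryPerformanceCounter(&t0);
        Sleep(1);                      /* request a 1 ms sleep */
        QueryPerformanceCounter(&t1);
        printf("Sleep(1) took %.2f ms\n",
               (t1.QuadPart - t0.QuadPart) * 1000.0 / freq.QuadPart);
    }
    return 0;
}

Raising the system timer resolution (for example, via timeBeginPeriod(1) from the multimedia API) shortens these sleeps, but it also increases power consumption system-wide, so it is not a free fix.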

Another issue is that immediately giving up a time slice means the running thread will be context switched out. A context switch costs on the order of 5,000 cycles, so getting switched out and switched back in means the processor has wasted at least 10,000 cycles of overhead, which does not help the workload complete any faster. These loops often lead to very high context switch rates, which are a sign of overhead and a possible opportunity for performance gains.
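
One quick way to check for this symptom is Windows' typeperf utility, which can sample the system-wide context-switch counter (a sample invocation, polling once per second):

typeperf "\System\Context Switches/sec" -si 1

Sustained rates far above the machine's norm while the thread pool is supposedly idle suggest that worker threads are cycling through Sleep-based loops.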

Fortunately, there are alternatives that help mitigate the overhead, save power, and give a nice boost in performance.

Resolving the Problems

The approach we recommend for such an algorithm is akin to a more gradual back-off. First, we allow the thread to spin on the lock for a brief period of time, but instead of fully spinning, we use the pause instruction in the loop. Introduced with the SSE2 instruction set (Intel Corporation, 2010), the pause instruction gives a hint to the processor that the calling thread is in a "spin-wait" loop. An additional benefit of the pause instruction is that on x86 architectures that do not support SSE2, it executes as a no-op, meaning it still executes without doing anything or raising a fault. While this means older x86 architectures without SSE2 won't see the benefits of the pause, it also means developers can keep one straightforward code path that works across the board. Essentially, this instruction delays the execution of the next instruction for a finite period of time. Because execution is delayed, the processor is not under demand, and parts of the pipeline are no longer used, which in turn reduces the power consumed by the processor. The pause can be used in conjunction with Sleep(0) to construct something similar to an exponential back-off in situations where the lock or more work may become available in a short period of time, and performance may benefit from a short spin in ring 3. This algorithm is shown below:

/* Requires <windows.h> for Sleep() and <emmintrin.h> for _mm_pause(). */
ATTEMPT_AGAIN:
  if (!acquire_lock())
  {
    /* Spin on pause max_spin_count times before backing off to sleep */
    for (int j = 0; j < max_spin_count; ++j)
    {
      /* pause intrinsic */
      _mm_pause();
      /* Re-read the lock before retrying the atomic acquire, so the
         spin stays in the cache (test, then test-and-set) */
      if (read_volatile_lock())
      {
        if (acquire_lock())
        {
          goto PROTECTED_CODE;
        }
      }
    }
    /* Pause loop didn't work, sleep now */
    Sleep(0);
    goto ATTEMPT_AGAIN;
  }
PROTECTED_CODE:
  get_work();
  release_lock();
  do_work();

Table 3 Sleep Loop with exponential back-off.
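
The listing in Table 3 spins a fixed number of times before every Sleep(0). Where a back-off that actually grows under sustained contention is preferred, a small variation can double the pause count each failed round up to a cap. The sketch below is our own illustration under the same placeholder helpers, not code from the measured implementation; min_spin_count and max_spin_count are hypothetical tuning bounds:

/* Variation on Table 3: the spin count doubles after each failed round,
   approximating an exponential back-off. min_spin_count and
   max_spin_count are hypothetical tuning parameters. */
int spin = min_spin_count;
while (!acquire_lock())
{
  for (int j = 0; j < spin; ++j)
  {
    _mm_pause();
    if (read_volatile_lock() && acquire_lock())
    {
      goto PROTECTED_CODE;
    }
  }
  if (spin < max_spin_count)
    spin *= 2;            /* back off more each round          */
  else
    Sleep(0);             /* cap reached, yield the time slice */
}
PROTECTED_CODE:
  get_work();
  release_lock();
  do_work();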

Using pause in the Real World

Using the algorithms described above, including the pause instruction, has been shown to have a significant positive impact on both power and performance. For our tests, we used three workloads, each with a different period of active work. High granularity means the work was relatively extensive, so the threads did not contend for the lock very often. In the low granularity case, the work was quite short, and the threads were more often finished and ready for further tasks.

These measurements were taken on a 6-core, 12-thread system equivalent to a Core i7-990X. The observed performance gains were quite impressive: up to 4x when using eight threads, and even at thirty-two threads the numbers were approximately 3x over just using Sleep(0).



Table 4 Performance using pause



Table 5 Performance using pause

As mentioned before, using the pause instruction allows the processor pipeline to be less active while threads are waiting, so the processor uses less energy than it otherwise would. Because of this, we were also able to measure the power difference between the two algorithms using a Fluke NetDAQ.



Table 6 Power Consumption with optimization

Knowing that your software saves 0.73 W over a standard implementation means it is less likely to be a culprit in draining laptop battery life. As a rough illustration, 0.73 W saved over one hour of runtime is about 2.6 kJ of energy. Combining the lower active power with the gains in performance, which let the processor return to idle sooner, can lead to enormous energy savings over the lifetime of the workload.

Conclusions

In many cases like the one discussed in this paper, there are hidden performance problems that developers may overlook or simply have no way of knowing about. We were able to get a handle on these performance issues only after several years of investigation and measurement.

We hope that this solution is simple enough to be retrofitted into existing software. It follows common algorithms, but includes a few tweaks that can have a large impact. With battery life and portable devices becoming ever more important to developers, a few small considerations allow software to take advantage of these instructions, with positive results for both performance and power consumption.

About the Authors

Joe Olivas is a Software Engineer at Intel working on software performance optimizations for external software vendors, as well as creating new tools for analysis. He received both his B.S. and M.S. in Computer Science from CSU Sacramento with an emphasis on cryptographic primitives and performance. When Joe is not making software faster, he spends his time working on his house and brewing beer at home with his wife.
Mike Chynoweth is also a Software Engineer at Intel, focusing on software performance optimization and analysis. He received his B.S. in Chemical Engineering from the University of Florida. When Mike is not concentrating on new performance analysis methodologies, he is playing didgeridoo, cycling, hiking, or spending time with family.

For more information about compiler optimizations, see the optimization notice.