Benefitting Power and Performance Sleep Loops

by Joe Olivas, Mike Chynoweth, & Tom Propst
 

Abstract

To take full advantage of today’s multicore processors, software developers typically break their work into manageable sizes and spread the work across several simultaneously running threads in a thread pool. Performance and power problems in thread pools can occur when the work queue is highly contended by multiple threads requesting work or when threads are waiting on work to appear in the queue.

While there are numerous algorithmic solutions to this problem, this paper discusses one in particular that we have seen as the most commonly used. We also provide some simple recommendations to improve both performance and power consumption without having to redesign an entire implementation.

Overview of the Problem

A popular design is to have each thread in the thread pool continually check a queue for work and split off to process it once work becomes available. This is a very simple approach, but developers often run into problems with how to poll the queue for work or how to deal with the queue becoming highly contended. Issues can occur in two extreme conditions:

  1. The case when the work queue is not filling with work fast enough for the worker threads, and they must back off and wait for work to appear.
  2. The case when many threads are trying to get work from the queue in parallel, causing contention on the lock protecting the queue, and the threads must back off the lock to reduce that contention.

Popular thread pool implementations have some pitfalls, but by making a few simple changes, you'll see big differences in both power and performance.

To start, we make a few assumptions about the workload in question. We assume we have a large dataset, which is evenly and independently divisible in order to eliminate complications that are outside the scope of this study.
 

Details of the Sleep Loop Algorithm

In our example, each thread is trying to access the queue of work, so it is necessary to protect access to that queue with a lock, so that only one thread at a time can take work from it.

With this added complexity, our algorithm, viewed from a single thread, looks like the following:
[Figure: flow chart of the sleep loop algorithm from a single thread's point of view]
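For readers who prefer code to a flow chart, here is a minimal sketch of that single-thread view, using a Win32 CRITICAL_SECTION as the lock. The queue is reduced to a hypothetical item counter, and the names (g_queue_lock, g_queue_depth, worker_thread) are placeholders of ours, not code from the original study.

#include <windows.h>

static CRITICAL_SECTION g_queue_lock;   /* protects the work queue     */
static int g_queue_depth;               /* hypothetical: items waiting */
static volatile LONG g_shutting_down;   /* set nonzero to stop workers */

static DWORD WINAPI worker_thread(LPVOID arg)
{
    (void)arg;
    while (!g_shutting_down)
    {
        int have_item = 0;

        EnterCriticalSection(&g_queue_lock);    /* acquire the lock   */
        if (g_queue_depth > 0)                  /* work in the queue? */
        {
            --g_queue_depth;                    /* get_work()         */
            have_item = 1;
        }
        LeaveCriticalSection(&g_queue_lock);    /* release the lock   */

        if (have_item)
        {
            /* do_work() would run here, outside the lock */
        }
        /* else: back off; how to do that well is the subject below */
    }
    return 0;
}

Worker Thread Sketch

Note that EnterCriticalSection blocks in the kernel when the lock is contended; the examples that follow use a raw lock word instead, which makes the back-off behavior explicit.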

Problems with this Approach on Windows* Platforms

Where simple thread pools begin to break down is in the implementation. The key is how to back off from the queue when there is no work available or the thread fails to acquire the queue's lock. The simplest approach is to check constantly, otherwise known as a “busy-wait” loop, shown below in pseudo code.

while (!acquire_lock() || no_work_in_queue)
    ; /* spin at full speed until the lock is held and work is queued */
get_work();
release_lock();
do_work();

Busy Wait Loop

The problem with the implementation above is that if a thread cannot obtain the lock, or there is no work in the queue, the thread continues checking as fast as possible. Actively polling like this consumes all of the available processor resources and severely hurts both performance and power consumption. The only upside is that the thread will enter the queue almost immediately when the lock becomes available or work appears.
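To make the pseudo code concrete, here is one way the busy-wait might look with a real lock word. The names acquire_lock(), release_lock(), and g_work_in_queue are our own illustrative choices, and a real queue check would replace the flag.

#include <windows.h>

static volatile LONG g_lock = 0;            /* 0 = free, 1 = held      */
static volatile LONG g_work_in_queue = 0;   /* nonzero when work waits */

/* One attempt to take the lock; returns nonzero on success. */
static int acquire_lock(void)
{
    return InterlockedCompareExchange(&g_lock, 1, 0) == 0;
}

static void release_lock(void)
{
    InterlockedExchange(&g_lock, 0);
}

static void busy_wait_for_work(void)
{
    for (;;)
    {
        if (acquire_lock())
        {
            if (g_work_in_queue)
                break;              /* lock held and work available */
            release_lock();         /* nothing to do; keep spinning */
        }
        /* no pause, no sleep: this loop burns a full core */
    }
    /* get_work(); release_lock(); do_work(); */
}

Busy Wait Loop in C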

Sleep and SwitchToThread

The solution that many developers use for backing off from checking the queue, or from locks that are highly contended, is typically to call Sleep(0) or SwitchToThread() from the Win32 API. According to the MSDN documentation for the Sleep function, calling Sleep(0) allows the calling thread to give up the remaining part of its time slice, if and only if a thread of equal or greater priority is ready to run.

Similarly, SwitchToThread() allows the calling thread to give up the remaining part of its time slice, but only to another thread on the same processor. This means that instead of checking constantly, a thread only checks when no other useful work is pending. If you want the software to back off more aggressively, use a Sleep(1) call, which always gives up the remainder of the time slice and context switches out, regardless of thread priority or processor residency. The goal of a Sleep(1) is to wake up and recheck the queue in one millisecond.

while (!acquire_lock() || no_work_in_queue)
{
  Sleep(0);
}
get_work();
release_lock();
do_work();

Sleep Loop

Unfortunately, a lot more is going on under the hood that can cause serious performance degradation. The Sleep(0) and SwitchToThread() calls carry overhead: they involve a fairly long instruction path, combined with an expensive ring 3 to ring 0 transition costing about 1,000 cycles. The processor is fooled into thinking that this “sleep loop” is accomplishing useful work. In executing these instructions, the processor is fully utilized, filling the pipeline with instructions, executing them, thrashing the cache, and, most importantly, using energy that does not benefit the software.

An additional problem is that a Sleep(1) call probably does not do what you intended if the Windows kernel's tick rate is at its default of 15.6 ms. At the default tick rate, the call is actually equivalent to a much larger sleep and can wait as long as 15.6 ms, since a thread can only wake up when the kernel wakes it. Such a call leaves the thread inactive for a very long time during which the lock could become available or work could be placed in the queue.
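As an aside of ours, not a recommendation from the original text: if a Sleep(1) really is required, the kernel tick can be shortened with the multimedia timer API so that Sleep(1) means roughly one millisecond rather than up to 15.6 ms. Be aware that raising the tick rate increases interrupt overhead and power draw system-wide.

#include <windows.h>
#include <mmsystem.h>               /* timeBeginPeriod / timeEndPeriod  */
#pragma comment(lib, "winmm.lib")

void sleep_about_one_ms(void)
{
    timeBeginPeriod(1);             /* request a 1 ms kernel tick       */
    Sleep(1);                       /* now wakes in ~1 ms, not 15.6 ms  */
    timeEndPeriod(1);               /* always pair with timeBeginPeriod */
}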

Another issue is that immediately giving up a time slice means the running thread will be context switched out. A context switch costs something on the order of 5,000 cycles, so getting switched out and switched back in means the processor has wasted at least 10,000 cycles of overhead, which does not help the workload complete any faster. Very often, these loops lead to very high context switch rates, which are a sign of overhead and a possible opportunity for performance gains.

Fortunately, you have some options for mitigating the overhead, saving power, and getting a nice boost in performance.

Spinning Out of Control

If you are using a threading library, you may not have control over the spin algorithms implemented.  During performance analysis, you may see a high volume of context switches, calls to Sleep or SwitchToThread, and high processor utilization tagged to the threading library.  In these situations, it is worth looking at alternative threading libraries to determine if their spin algorithms are more efficient.

Resolving the Problems

The approach we recommend in such an algorithm is akin to a more gradual back off. First, we allow the thread to spin on the lock for a brief period of time, but instead of fully spinning, we use the pause instruction in the loop. Introduced with the Intel® Streaming SIMD Extensions 2 (Intel® SSE2) instruction set, the pause instruction gives a hint to the processor that the calling thread is in a "spin-wait" loop. In addition, the pause instruction is a no-op when used on x86 architectures that do not support Intel SSE2, meaning it will still execute without doing anything or raising a fault. While this means older x86 architectures that don’t support Intel SSE2 won’t see the benefits of the pause, it also means that you can keep one straightforward code path that works across the board.

Essentially, the pause instruction delays the next instruction's execution for a finite period of time. By delaying the execution of the next instruction, the processor is not under demand, and parts of the pipeline are no longer being used, which in turn reduces the power consumed by the processor.
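In C or C++, the instruction is reached through the _mm_pause() intrinsic. As a minimal illustration (the flag and function name here are hypothetical), a pause-based spin on a shared flag looks like this:

#include <emmintrin.h>      /* _mm_pause */

/* Spin until *flag becomes nonzero, hinting to the processor on
   each iteration that this is a spin-wait loop. */
static void spin_until_set(volatile long *flag)
{
    while (*flag == 0)
        _mm_pause();        /* emits the pause instruction */
}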

The pause instruction can be used in conjunction with a Sleep(0) to construct something similar to an exponential back-off in situations where the lock or more work may become available in a short period of time, and performance may benefit from a short spin in ring 3. It is important to note that the number of cycles delayed by the pause instruction may vary from one processor family to another. Avoid chaining multiple pause instructions on the assumption that they will introduce a delay of a specific cycle count; since the cycle count cannot be guaranteed from one system to the next, check the lock between each pause to avoid introducing unnecessarily long delays on newer systems. This algorithm is shown below:

ATTEMPT_AGAIN:
  if (!acquire_lock())
  {
    /* Spin on pause max_spin_count times before backing off to sleep */
    for(int j = 0; j < max_spin_count; ++j)
    {
      /* pause intrinsic */
      _mm_pause();
      if (read_volatile_lock())
      {
        if (acquire_lock())
        {
          goto PROTECTED_CODE;
        }
      }
    }
    /* Pause loop didn't work, sleep now */
    Sleep(0);
    goto ATTEMPT_AGAIN;
  }
PROTECTED_CODE:
  get_work();
  release_lock();
  do_work();

Sleep Loop with Exponential Back-off
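For reference, the back-off above can be packaged as a reusable lock-acquire function. This is a sketch under the same assumptions as the earlier snippets (a raw interlocked lock word), and max_spin_count is a tuning parameter whose best value is workload-dependent.

#include <windows.h>
#include <emmintrin.h>                     /* _mm_pause          */

static volatile LONG g_lock = 0;           /* 0 = free, 1 = held */

static int try_lock(void)
{
    return InterlockedCompareExchange(&g_lock, 1, 0) == 0;
}

/* Acquire the lock, spinning briefly with pause before backing
   off to Sleep(0), as in the algorithm above. */
static void lock_with_backoff(int max_spin_count)
{
    for (;;)
    {
        if (try_lock())
            return;

        for (int j = 0; j < max_spin_count; ++j)
        {
            _mm_pause();                   /* cheap, low-power delay    */
            if (g_lock == 0 && try_lock()) /* peek before the atomic op */
                return;
        }
        Sleep(0);                          /* yield, then start over    */
    }
}

static void unlock(void)
{
    InterlockedExchange(&g_lock, 0);
}

Reading g_lock before retrying the interlocked operation mirrors the read_volatile_lock() step above: it keeps the cache line in a shared state while spinning instead of bouncing it between cores with failed atomic writes.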
 

Using pause in the Real World

Using the algorithms described above, including the pause instruction, has been shown to have significant positive impacts on both power and performance. For our tests, we used three workloads, each with a different granularity, that is, a different length of active work per queue item. High granularity means the work was relatively long, so the threads were not contending for the lock very often. In the low granularity case, the work was quite short, and the threads were more often finishing and ready for further tasks.

These measurements were taken on a 6-core, 12-thread system equivalent to an Intel® Core™ i7 processor 990X. The observed performance gains were quite impressive: up to 4x with eight threads, and even at thirty-two threads, performance was approximately 3x better than just using Sleep(0).



[Figures: Performance using pause (two charts comparing the pause-based loop against Sleep(0) across thread counts)]

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance.

As mentioned before, using the pause instruction allows the processor pipeline to be less active when threads are waiting, resulting in the processor using less energy. Because of this, we were also able to measure the power differences between the two algorithms using a Fluke NetDAQ*.

[Figure: Power consumption with optimization]

Knowing that your software saves 0.73 W over a standard implementation means it is less likely to be the culprit in draining laptop battery life. Combining the reduced energy consumption with the gains in performance can lead to enormous power savings over the lifetime of the workload.

Conclusions

In many cases, developers may be overlooking, or simply have no way of knowing about, hidden performance problems in their applications. We were able to get a handle on these performance issues only after several years of investigation and measurement.

We hope that this solution is simple enough to be retrofitted into existing software. It follows common algorithms, but includes a few tweaks that can have a large impact. With battery life and portable devices becoming more prevalent and important to developers, a few software changes can take advantage of new instructions and deliver positive results for both performance and power consumption.

About the Authors

Joe Olivas is a Software Engineer at Intel working on software performance optimizations for external software vendors, as well as creating new tools for analysis. He received both his B.S. and M.S. in Computer Science from CSU Sacramento with an emphasis on cryptographic primitives and performance. When Joe is not making software faster, he spends his time working on his house and brewing beer at home with his wife.
Mike Chynoweth is a Software Engineer at Intel, focusing on software performance optimization and analysis. He received his B.S. in Chemical Engineering from the University of Florida. When Mike is not concentrating on new performance analysis methodologies, he is playing didgeridoo, cycling, hiking or spending time with family.

 

Tom Propst is a Software Engineer at Intel focusing on enabling new use cases in business environments. He received his B.S. in Electrical Engineering from Colorado State University. Outside of work, Tom enjoys playing bicycle polo and tinkering with electronics.

 

For more complete information about compiler optimizations, see our Optimization Notice.

Comments

Verdy, Philippe:

Note also that if you optimize a system offering many worker threads, so that most of the time only very few are really active and the others are in a mostly idle state (paced down using mm_pause), then your system is even more easily attackable by timing attacks, because it will react faster and with a much more predictable response time.

So it could be more exposed to Meltdown/Spectre attacks being performed in one of its active threads against the other active threads they want to spy on.

One idea to protect the system is to add, on the system itself, a pseudo-thread running pseudo-code that uses a significant part of the current total workload: if the system reports that it is 5% active across all threads (except the pseudo-thread), one of those active threads (the attacker) may be measuring time and needs a timing precision of +/-50% of its own timeslots to infer private data from the hidden state of the other active thread. To cancel this possibility, it is enough that the additional pseudo-thread consume a randomized percentage of the total system workload, with a variation above the 50% needed by the attacker (it would also be enough if there were two active attackers at the same time, because they would compete with each other for their precision).

So just add to the (existing) system "idle" thread a random workload between 25% and 50% of the current system workload used by the most demanding thread on the system. That system "idle" thread should implement the sleep loop using "mm_pause" (or its equivalent on other, non-SSE2 processors). The exact amount of workload to generate in the "idle" system thread must be adapted to the system architecture (multicore or not, hyperthreading or not, NUMA...) and may account for other resources outside the CPU itself (notably the external memory caches, or some critical high-speed buses).

Note that the same should also be applied to the workload of GPUs if applications on the system can delegate part of their workload to GPUs (the same remark applies to other kinds of "APUs", including "VPUs" in game consoles): performance can be measured on coprocessors as well, but I'm not sure they all even have the concept of "mm_pause". If they don't have some "NOP" or "PAUSE" in their own instruction set to enter a low-energy mode, and their response time for performing any processing is variable and depends on the data they process, they are attackable as well; notably if they also use out-of-order or speculative execution, or if they use external caches shared with CPUs, for example to access shared memory, or go through the same bus (controlled by the same bus controller or DMA controllers, which also use out-of-order or speculative scheduling techniques) to access the same external devices (such as network interfaces).

Verdy, Philippe:

Note: some mitigations proposed by kernels or by APIs and libraries against Spectre or Meltdown are very inefficient: they insert costly tight loops doing nothing, just to delay some responses by a randomized time, and they drain the battery too. These mitigations should use the "mm_pause" intrinsic if they don't want to "yield()" immediately to other, potentially dangerous threads measuring their own performance!

As well, the idea of reducing the available resolution of high-performance counters to mitigate these attacks has now been proven completely wrong: not only does it not solve the problem (an equivalent high-performance counter can still be reconstructed using aggressive tight loops in attacking threads), but it harms everything else that needs exact performance measurements (notably realtime events in media players, where dropped display frames, or missing audio frames creating very audible noise, are not acceptable; the experience is just as poor in games, or in mission-critical apps like 3D reality in medical or transportation systems, or in systems monitoring dangerous industrial processes).

Timing attacks like Meltdown and Spectre will soon start to affect other things than just CPUs: they will soon affect other multiprocessing environments as well, notably cloud computing (due to the rapidly growing performance of the Internet and its response times, which can now be timed reliably enough to get information about the normally hidden and protected state of specific connected systems), but also multi-tier application servers communicating over network grids, even if they are interconnected by very secure encrypted VPNs and enforce the best access restrictions while allowing other network participants onto the same hardware systems for work other than that of the "secured" computing grid.

This is probably already happening on systems connected to the Internet at Gigabit/s speeds or higher (which now applies to Internet at home with FTTx, and so already happens for many businesses using the same kind of Internet access). This already affects very critical components of the Internet: notably many routers on the Tier-1 backbone, or the worldwide DNS and PKI infrastructures, including servers for the root DNS zone, and various well-known software distribution servers for critical OS components of Windows, Linux, MacOS, mobile OSes, and various appliances, or for automatic updates of antimalware and security suites (all of them using cloud computing or P2P distribution: rsync, BitTorrent, Windows P2P delivery... all depending on a few essential core servers, which are now easily attackable by timing attacks, at very cheap cost for attackers and with great difficulty detecting them on the servers, simply because of their very fast Internet connections, which are for now only semi-protected against typical DoS attacks).

Verdy, Philippe:

Also, I hope that the intrinsic function mm_pause() (or the equivalent SSE instruction) does not prevent a thread from being preempted or interrupted: after the kernel services the interrupt in the ISR (which may then yield to other competing threads ready to run), the OS should discard the mm_pause and not re-execute it.

(Between the context switch caused by the interrupt, the time the ISR runs, the other concurrent threads that may have run, and the context switch back to the interrupted thread, many cycles will have passed, far more than the mm_pause itself would have spent idling the execution pipeline in low-power mode. So there is no advantage at all to reissuing the same mm_pause when switching back: the kernel should discard the interrupted mm_pause and resume execution at the instruction just after it.)

So instead of using SSE2-only code in applications, just use the appropriate OS API or the popular pthread library: it will be more portable, notably on the many mobile devices where ARM processors are now prevalent and saving energy is extremely important. (This kind of optimization should already be implemented in the standard APIs for Android, iOS, and mobile Windows: just look at how to wait for async I/O events, and notably completion.)

As well, waiting for a mutex should be available as a blocking call (with an optional timeout parameter to still allow waking up after a reasonable delay, even if the mutex has not been acquired), and should be implemented in your favorite library for mutexes.

This optimization should also already be performed internally in all standard blocking I/O functions (like "open()", "read()", "write()", "ioctl()", "flush()", "close()"...) that internally use async I/O with some mutex lock and an optimized spinning loop, and that yield smartly to other competing threads without needing to be aggressively preempted by the OS kernel very late, after a large timeslot spent only on costly spinning.

So forget "mm_pause()" in most apps (and notably within spinning loops).

Use it only as an optimization hint given by the app to the OS (or to a debugger) that a thread does not need or want to run so fast and can be paced down. (The "mm_pause()" intrinsic should be replaced by an API or library call like "pace_down()", whose implementation may look at performance counters to see if it is called too often, in order to slow down the calling thread and save the energy it uses, or to yield more often to other competing threads that don't use "pace_down()" as often, so that their running priority can be boosted.)

Then any thread can use "mm_pause()" anywhere in algorithms that are known to run aggressively for long periods of time (e.g. very long computations) and can easily drain a battery too fast. Internally it will conditionally "yield()" as needed, as measured by performance counters or other info or hints given by the OS.

"yield()" (itself using the "mm_pause" intrinsic or similar instructions, and other hints given by the OS) may also apply some security measures against time attacks by applying randomized delays (e.g. against CPU's optimizations like in pipelines and caches : out-of-order execution, i.e. the Meltdown attacks, and speculative execution, i.e. the Spectre attacks) which are visible each time there are multiple threads competiung to get some timeslots of execution. And this could help threads to secure themselves against the same attacks (notably those worker threads handling responses to restricted services or handling large volume or private data, or servicing lot of competing network requests on servers)

Verdy, Philippe:

There is another possibility:

- on Windows, you can use the "WaitForMultipleObjects()" API to wait for a job or any signal to wake up a worker thread. It can also be used to wait for asynchronous I/O to complete, or to wait for an incoming socket connection (sockets can be bound to the local host, not necessarily to any network address), or to wait with a maximum timeout, or to wait for the completion of some thread (i.e. joining with it), or to wait for an exclusive lock on a mutex, and for any other kind of rendez-vous.

- on Unix/Linux, you can use the "select()" API to wait for an incoming socket connection as well (some Linux systems also allow waiting on async I/O via other standard file handles, or you can use the async I/O API to do the same thing).

A normally updated version of the "pthread" library should provide such a facility ("pthread_join()" also uses a spinning loop, waiting for thread completion as an additional event that causes the calling thread to yield without taking many global resources, and to wake up rapidly).

Internally, all these APIs (or user libraries) should already be implemented using "_mm_pause()" or some "Sleep()" equivalent in their optimized spinning loops!

Egorushkin, Maxim:

Now that the latency of the PAUSE instruction has increased from around 10 cycles to 140 cycles on Skylake, it would be interesting to see the benchmark results on Broadwell and Skylake.

MPL3D:

Very useful.

Many thanks for publishing these tips; we look forward to testing them in our projects.

All the best,

MPL3D Team


Jdeprat:

Interesting! Thanks!
