Synchronization techniques help threads perform their work without undesired interference from other threads. However, there are important differences between different types of primitives, especially as regards performance. So, what are the best ways to synchronize threads?
As multicore technology pushes desktop software into the world of multithreaded parallel programming, developers are confronted with a new set of coding issues. Chief among these is thread synchronization. The good news is that there are many ways to synchronize threads. So much so, that part of the skill of parallel programming is determining which synchronization primitive to use.
Synchronization primitives-the key building blocks of process and thread management-protect access to a resource, by blocking access by more than one thread at a time. Much of the skill of writing parallel applications comes from the ability to use synchronization primitives optimally. Best practices have typically focused on choosing an appropriate primitive and finding ways to lock an item for as short a period as possible. However, one aspect that is often overlooked in developing efficient code is the processing overhead of the chosen primitive. The cost of synchronization primitive overhead varies widely on most operating systems today; and so developers looking to optimize performance need to consider carefully the performance implications of their choices. This factor is most apparent on Microsoft Windows* operating systems due to the remarkable wealth of threading APIs and primitives-whatever the thread’s needs there is likely a specific solution available.
This article examines some of the most used synchronization primitives on the Windows platform (with occasional references to Linux*/Pthreads equivalents) and provides recommendations regarding which to choose when performance is an important consideration. I assume a basic knowledge of synchronization primitives. If you need more information on how Windows handles threads, the titles in the Resources section at the end of this article can help.
Synchronization primitives vary widely by the amount of overhead processing they require. I’ll examine three common primitives that involve decreasing amounts of overhead: the mutex, which is the heaviest (especially on Windows); the critical section, which is much lighter; and atomic primitives, which are very light.
The most commonly used lock is a mutex (found in all major threading interfaces such as Windows threads and Linux/Pthread), which blocks access to a resource, such as a data item, or a piece of code. While mutexes are commonly found, there are other similar locking mechanisms. The semaphore is a lock that can keep track of the number of times it has been locked and so manage thread access when it is unlocked. It is found on Windows, Linux (although not strictly a part of Pthreads) and most other operating systems.
A lock with less overhead is the critical section. The term “critical section” means slightly different things in Windows and Linux/Pthreads. In the latter, it means a section of code, whose access can be locked. In Windows, i t is the lock itself.
Finally, atomic primitives are synchronization functions that are limited to small, quick activities that can be performed atomically, that is, as they occurred in a single operation. Typically, they are used to increment or decrement counters or change the value of a variable. In Windows, they are implemented through what are known as “interlocked” functions.
Mutexes are heavyweight locks in Windows. They are kernel operations and so require a system call to create (and lock) and an additional system call to release. This implementation has benefits and drawbacks. The big benefit is that mutexes are visible across processes–that is, across multiple programs–in Windows (Note: This is not so in Linux.). So, if a Windows resource can be accessed by two different processes, the mutex is the right way to lock it. The downside of the mutex is that the system call to the kernel is very expensive, performance-wise.
The cost of mutexes increases when a lock is heavily contended, that is, when many threads are waiting for it to release access to a resource. The specific costs are presented in Table 1. It is safe to say that mutexes should be avoided on Windows, unless needed for cross process locking. As much as possible critical sections should be preferred whenever chunks of code or data objects need to be accessed.
Critical sections are similar to mutexes, but they operate within the user space. That is, on Windows they do not require a switch to kernel mode. As a result they tend to be an order of magnitude faster than mutexes. Sometimes more. However, they are no longer visible across processes. In many cases, this factor is not a limitation.
Sometimes, however, all that needs to be locked is access to a single numeric variable. In this case, a critical section-while much faster than a mutex-is still overkill. Windows provides a series of APIs that permit a variable’s value to be modified as if it occurred in a single atomic action. These enable the value to be changed very quickly; and during the update, Windows will assure that no other thread accesses the variable. These atomic APIs or interlocked functions are an excellent performance option. (See Resources section for links to the specific APIs.) They tend to take about half the time of critical sections and represent the fastest synchronization primitives.
In Gerber and Binstock (see Resources), timings were done for accessing 10 million long integers, while using a mutex, a critical section, and interlocks. The timings were also compared with access to the integers without any locking mechanism at all. The results, from a Dual-Core Intel® Xeon® processor workstation, converted to milliseconds are shown in Table 1. (The code for this test can be found at http://pacificdataworks.com/pub/WindowsTimings.cpp)
|Critical sections||1,889 ms.|
|Interlocked functions||944 ms.|
|No locks||70 ms.|
Table 1. Timings of locking mechanisms without contention
These numbers show the heavy cost of mutexes and the comparative benefits of critical sections and interlocked functions on Windows. However, in the real world, the numbers are even more skewed because the numbers above were done with no lock contention. When locks are contended by many threads, example test numbers on various x86 platforms running Windows show the ranges presented in Table 2.
|Critical sections with contention:||50x – 100x interlocks|
|Mutexes with contention:||100+ x interlocks|
Table 2. With contention: performance relative to interlocked functions
The results are clear and compelling: use interlocks where you can, critical sections everywhere else unless you need interprocess locking, in which case, you should use mutexes. In all cases, however, try to avoid anything but interlocks in a tight loop.
The performance of your specific code may vary significantly from the numbers shown in this article. For example, Microsoft changed the way locks are allocated in Vista*. It uses a more performance-oriented algorithm than in earlier versions of Windows. So, timings that appear one way in Windows might be very different in Vista.
Threading performance is sensitive to specific application design and to the underlying platform, so it’s important to test the performance of your code yourself and find and resolve the hot spots in your application. The Intel VTune Performance Analyzer and other threading tools are particularly helpful in this.
As the results in Table 1 attest, the use of an overly heavy lock can also cause important delays. By working with different locking mechanisms, it becomes increasingly easy to identify which is the lightest solution for the job at hand. This experience can also spill over into algorithm design: you might choose different data structures or inter-thread communication mechanisms so as to benefit from interlocked functions, for example.
In all cases, the more you bear in mind the cost of your synchronization primitives, the faster your code will be.
Jeffrey Richter, Programming Applications for Microsoft Windows, Fourth Edition, ISBN 1-57231-996-8 is considered by many to be the definitive introduction to thread ing APIs on Windows. Start here.
Richard Gerber and Binstock, Andrew, Programming with Hyper-Threading Technology: How to Write Multithreaded Software for Intel IA-32 Processors, ISBN 0-9717861-4-3. This book describes Windows and Linux/Pthreads APIs in detail in the context of Hyper-Threading, but the discussion of threading and synchronization primitives applies equally to multicore.
MSDN, Interlocked Variable Access, http://msdn2.microsoft.com/en-us/library/ms684122.aspx A good overview of the specific interlocked functions in the Windows API, written by Microsoft’s technical staff.
Srinivasan S. Muthuswamy and Varadarajan, Kavitha, "Port Windows IPC apps to Linux, Part 3: Mutexes, critical sections, and wait functions" http://www-128.ibm.com/developerworks/linux/library/l-ipc2lin3.html is a good comparison of Windows threading functions and their counterparts on Linux.
Andrew Binstock is the principal analyst at Pacific Data Works LLC, where he frequently writes technical white papers on software-development topics. His blog is at: http://binstock.blogspot.com
Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.