Implementing Scalable Atomic Locks for Multi-Core Intel® EM64T and IA32 Architectures

by Michael Chynoweth and Mary R. Lee

Abstract

There are several different methods of atomically locking code and data on the Microsoft Windows platform. The primary purpose of this whitepaper is to give developers a brief introduction to the different methods of locking in Windows and the performance costs associated with each. This information is particularly applicable since future architectures will be multi-core.

Introduction

Multi-threading software applications is critical for increasing performance on Intel's Core architecture. Locking code is frequently among the most heavily executed code in a multi-threaded application, so determining which locking methodology to utilize can be as critical as identifying the parallelism within an application. The primary purpose of this whitepaper is to give developers a brief introduction to the different methods of locking in Windows and the performance costs associated with each. Some Windows locking APIs have the possibility of jumping into the operating system kernel; this whitepaper details which ones enter the kernel and under which conditions.

Two different locking kernels are used to demonstrate the impact of locks of different granularities. The first locking kernel emulates locking and unlocking a dynamically linked list of the kind commonly used by a memory manager to maintain a freelist. In a multi-threaded application, a thread must first lock the list before it attempts to malloc or free memory. The second locking kernel represents a more granular lock: it simply obtains the ThreadID of the thread holding the lock, updates a global variable, and then releases the lock. The performance of the different locks under low- and high-contention scenarios is tested by scaling the number of threads from 1 to 64. Each thread obtains the lock 10,000,000 times, performs an operation, and then releases the lock. For these experiments the Windows XP OS timer resolution was changed from 10 ms to 1 ms.
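As a rough illustration of the first locking kernel (not the authors' actual test harness; LOCK, UNLOCK, and the list manipulation below are hypothetical stand-ins for whichever primitive is being measured), each thread repeatedly acquires the lock, touches the shared freelist, and releases the lock:

#define ITERATIONS 10000000

typedef struct node { struct node *next; } node_t;

static node_t *g_freelist;         // shared list protected by the lock under test

void thread_body(void)
{
    for (int i = 0; i < ITERATIONS; i++)
    {
        LOCK();                    // the locking primitive being measured
        node_t *n = g_freelist;    // pop a node from the freelist...
        if (n != NULL)
        {
            g_freelist = n->next;
            n->next = g_freelist;  // ...and push it back on
            g_freelist = n;
        }
        UNLOCK();
    }
}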

WaitForSingleObject and EnterCriticalSection

The two most popular methods of locking on the Microsoft Windows platform are WaitForSingleObject and EnterCriticalSection. WaitForSingleObject is an overloaded Microsoft API which can be used to check and modify the state of a number of different objects such as events, jobs, mutexes, processes, semaphores, threads, or timers. One disadvantage of WaitForSingleObject is that it will always obtain a kernel lock, so it enters privileged mode (ring 0) whether the lock is achieved or not. This API also enters the Windows kernel even if a 0 timeout is specified. Another disadvantage of this method of locking is that it can only handle 64 threads attempting to place a lock on an object at once. The advantage of WaitForSingleObject is that it operates on objects that are visible across processes, which enables this API to be used for synchronization between processes. It also has the advantage of giving the OS knowledge of the locking object, allowing for fairness and priority-inversion handling.
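A minimal sketch of this style of locking with a mutex object follows; the handle name and the absence of error handling are illustrative only. Note that both the wait and the release transition into the kernel:

#include <windows.h>

HANDLE g_mutex;   // created once, e.g. g_mutex = CreateMutex(NULL, FALSE, NULL);

void update_shared_state(void)
{
    // Enters the kernel on every call, contended or not
    if (WaitForSingleObject(g_mutex, INFINITE) == WAIT_OBJECT_0)
    {
        // ... critical section ...
        ReleaseMutex(g_mutex);
    }
}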

EnterCriticalSection can be used by surrounding the critical-section code with EnterCriticalSection and LeaveCriticalSection API calls. This API has the advantage over WaitForSingleObject in that it will not enter the kernel unless there is contention on the lock. If there is no contention on the lock, then the API will obtain the lock in user space and return without entering privileged mode. If there is contention, then it will follow very similar paths as WaitForSingleObject within the kernel. Under circumstances of low contention, EnterCriticalSection is a much cheaper lock since it does not enter the kernel.

The disadvantages are that EnterCriticalSection cannot be used across processes, and there is no guarantee on the order in which threads obtain the lock. EnterCriticalSection is a blocking call, meaning that it will not return until a thread can gain access to the critical section. Windows has introduced TryEnterCriticalSection, which is non-blocking and will return immediately whether the lock was achieved or not. EnterCriticalSection also allows the developer to initialize the critical section with a spin count, the number of times a thread will spin attempting to achieve the lock before backing off. This is accomplished by using the API InitializeCriticalSectionAndSpinCount. The spin count can be set in this call or in the registry, so the spin can be tuned to different operating systems and their different thread quanta.
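A minimal sketch of this usage follows; the spin count of 4000 is an arbitrary illustration, not a tuned recommendation:

#include <windows.h>

CRITICAL_SECTION g_cs;

void init_lock(void)
{
    // Spin up to 4000 times in user mode before waiting in the kernel
    InitializeCriticalSectionAndSpinCount(&g_cs, 4000);
}

void update_shared_state(void)
{
    EnterCriticalSection(&g_cs);   // stays in ring 3 unless the lock is contended
    // ... critical section ...
    LeaveCriticalSection(&g_cs);
}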

Both EnterCriticalSection and WaitForSingleObject will enter the kernel if there is contention on the lock. The transition from user mode to privileged mode is costly if performed excessively.

The EnterCriticalSection and WaitForSingleObject API calls usually will not impact performance when guarding operations that cost thousands of cycles; in these cases the cost of the locking call itself is not prominent. Negative performance impact arises with granular locking, where the lock is acquired and released within hundreds of cycles. In these cases it can be beneficial to utilize a user-level lock.

To demonstrate the cost of WaitForSingleObject vs. EnterCriticalSection under low contention, we ran the memory management locking kernel with 1 and 2 threads. At low contention the speedup (WaitForSingleObject_Time / EnterCriticalSection_Time) is around 5x. At 2 threads with constant contention, the difference between EnterCriticalSection and WaitForSingleObject is minimal. The difference in performance under low contention is due to WaitForSingleObject entering the kernel on every call, whereas EnterCriticalSection enters the kernel only if there is contention on the lock.

Number of Threads            EnterCriticalSection Time (ms)    WaitForSingleObject Time (ms)    Speedup
1 Thread (No Contention)     1781                              9187                             5.2
2 Threads (Contention)       53594                             58156                            1.1


Figure 1: Shows the memory management kernel for 1 and 2 threads for EnterCriticalSection and WaitForSingleObject. EnterCriticalSection is much faster under 1 thread (no contention) since it will not jump into the kernel (privileged mode) if the lock is achieved.

Figure 2 below demonstrates the cost of EnterCriticalSection and WaitForSingleObject under high-contention scenarios, scaling from 1 to 64 threads. In this experiment we lock and unlock a dynamically linked list while pushing and popping values from the list, to mimic a memory allocator's freelist that is frequently locked for malloc or free. The kernel is aggressive about context-switching out the threads vying for the lock, so the average CPU load is 22% in both experiments. It is clear that under a high-contention scenario, the costs of utilizing EnterCriticalSection and WaitForSingleObject are very similar.



Figure 2: Shows the memory management kernel for 1 to 64 threads for EnterCriticalSection and WaitForSingleObject. WaitForSingleObject and EnterCriticalSection have similar costs associated with them under high contention scenarios.

Collecting the clockticks event via event-based sampling with the Intel® VTune™ Analyzer can be useful in determining how much contention is occurring on EnterCriticalSection and WaitForSingleObject. Before attempting this experiment, a developer needs to ensure that the correct kernel symbols have been downloaded. Instructions for downloading kernel symbols can be found on the Microsoft Developer Network (MSDN).

Any time spent in the kernel (ntoskrnl.exe or ntkrnlpa.exe) when utilizing EnterCriticalSection is a sign that contention is occurring. When there is no contention, EnterCriticalSection and LeaveCriticalSection calls spend most of their time in ntdll.dll, in the functions RtlEnterCriticalSection and RtlLeaveCriticalSection respectively. NTDLL.DLL is a dynamically linked library running in ring 3 (non-privileged mode) of the processor; it contains much of the runtime library (RTL) code utilized by an application and should not be confused with the OS kernel. Once EnterCriticalSection has encountered contention, it follows very similar paths to WaitForSingleObject. Without going into the details of these kernel functions, it is possible to tell whether there is high contention on EnterCriticalSection by looking at the functions contained within hal.dll, ntdll.dll, and ntkrnlpa.exe (or ntoskrnl.exe) shown in Figure 3.



Figure 3: Shows the hot functions within the Windows OS kernel, ntdll.dll, and hal.dll under high contention for EnterCriticalSection and WaitForSingleObject.

WaitForSingleObject will always jump into the Windows OS kernel, but it is still possible to determine whether locks utilizing this call are under high contention. WaitForSingleObject follows different paths within the kernel, ntdll.dll, and hal.dll when the locks are under high contention. These paths are associated with the kernel context-switching out the thread because it is not able to obtain the required lock. In particular, KiDispatchInterrupt (OS kernel), ZwYieldExecution (OS kernel), HalRequestIpi (hal.dll), and HalClearSoftwareInterrupt (hal.dll) are good functions to watch; time in them may be a sign of contention on WaitForSingleObject or EnterCriticalSection in the kernel.

Figure 4: Shows the hot functions within the Windows OS kernel, ntdll.dll, and hal.dll under no contention and high contention for WaitForSingleObject call.

Since both EnterCriticalSection and WaitForSingleObject enter the kernel when there is contention, a user-level lock is preferable for locks guarding very granular operations and for locks under high contention.

User Level Atomic Locks

User-level locks involve utilizing the atomic instructions of the processor to atomically update a memory space. The atomic instructions involve utilizing a lock prefix on the instruction and having the destination operand assigned to a memory address. The following instructions can run atomically with a lock prefix on current Intel processors: ADD, ADC, AND, BTC, BTR, BTS, CMPXCHG, CMPXCHG8B, DEC, INC, NEG, NOT, OR, SBB, SUB, XOR, XADD, and XCHG. EnterCriticalSection utilizes atomic instructions to attempt to get a user-land lock before jumping into the kernel. For most instructions the lock prefix must be explicitly used, except for the xchg instruction, where the lock prefix is implied when the instruction involves a memory address.

In the days of Intel 486 processors, the lock prefix asserted a lock on the bus, at a large cost in performance. Starting with the Intel Pentium Pro architecture, the bus lock is transformed into a cache lock. On the most modern architectures, a lock will still be asserted on the bus if the locked memory resides in uncacheable memory or if the lock extends beyond a cache line boundary, splitting cache lines. Both of these scenarios are unlikely, so most lock prefixes will be transformed into a cache lock, which is much less expensive.

Figure 5 contains a simple lock written in a few lines of assembly to demonstrate the use of an atomic instruction with a lock prefix for obtaining a lock. In this example, the code tests the memory location pointed to by edx in an attempt to get the lock. If the memory location contains a 1, another thread has already obtained the lock; if it contains a 0, the lock is free. The atomic xchg instruction is used to attempt an exchange of 1 with the contents of the memory location. If eax contains 0 after the xchg instruction, the lock was achieved by the current thread. If eax contains a 1 after the atomic xchg instruction, another thread already has the lock.

// Note: the edx register contains the address of the lock variable

mov eax, 1                        // move 1 into the eax register
lock xchg DWORD PTR [edx], eax    // atomically exchange 1 with the value at [edx]
test eax, eax                     // test whether the previous value was zero
jne Target                        // jump if not zero (another thread holds the lock)

Figure 5: Shows the assembly of a simple mutex lock.

It is not necessary to write assembly to take advantage of user-land locks which utilize the lock prefix. Microsoft provides access to the most frequently used atomic instructions for synchronization through the "Interlocked" APIs: InterlockedExchange, InterlockedIncrement, InterlockedDecrement, InterlockedCompareExchange, and InterlockedExchangeAdd. These APIs all reside in kernel32.dll, which is loaded by the application in ring 3 (non-privileged mode). Many developers mistake kernel32.dll time for kernel time due to its confusing name, but this DLL runs entirely in ring 3 (user mode); it merely serves as a gateway for APIs that jump into the Windows kernel. The Interlocked functions have no possibility of jumping into the Windows kernel (privileged mode).
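As a sketch of how these APIs map onto the assembly above, the following hypothetical lock uses InterlockedExchange (which compiles to a lock xchg) in place of hand-written assembly; a production version would also need the back-off and dirty-read spin discussed later in this paper:

#include <windows.h>

volatile LONG g_lock = 0;   // 0 = free, 1 = held

void acquire(void)
{
    // Atomically write 1 and get the previous value; 0 means we got the lock
    while (InterlockedExchange(&g_lock, 1) != 0)
    {
        // spinning directly on the atomic is shown here for brevity only
    }
}

void release(void)
{
    InterlockedExchange(&g_lock, 0);   // publish the release atomically
}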

To demonstrate the potential cost of utilizing an expensive lock such as WaitForSingleObject vs. a user-level lock, the memory management kernel was run with a simple user-level spin-wait lock with back-off. Under both high- and low-contention scenarios, the user-level spin-wait is several orders of magnitude cheaper. For this reason a user-level lock is preferable for frequently called granular locks.



Figure 6: Cost of an inlined user-level spin-wait lock vs. WaitForSingleObject on the memory management locking kernel.

Spin-Wait Loops on User Level Atomic Locks

One disadvantage of utilizing user-mode locks is that the kernel cannot be used to provide the spin-wait. This means that an application attempting to use xchg or cmpxchg to get a lock must either move on to do other work if the lock is not received or spin on the lock. When a programmer wants to implement a spin-wait loop in user mode, it is preferable to utilize TryEnterCriticalSection with an initialized spin count, or to use third-party locking libraries such as those provided in Intel's Threading Building Blocks. Ideally, it is always preferable to have the thread move on to do other work when the lock cannot be achieved instead of using any spin-wait at all.
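A brief sketch of the non-blocking pattern, assuming g_cs was initialized with InitializeCriticalSectionAndSpinCount as shown earlier:

#include <windows.h>

extern CRITICAL_SECTION g_cs;

void do_work(void)
{
    if (TryEnterCriticalSection(&g_cs))   // returns immediately, lock or no lock
    {
        // ... critical section ...
        LeaveCriticalSection(&g_cs);
    }
    else
    {
        // lock unavailable: move on to other useful work instead of spinning
    }
}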

To better understand the construct of a spin-wait loop, one was created for the locking kernels. In a spin-wait loop, an atomic instruction is first used to attempt to get the lock (usually cmpxchg or xchg with a lock prefix). If the lock is not obtained, the code spins on a read of the lock's memory location in an attempt to determine when the lock is freed by another thread. This read is referred to as a "volatile read" since it would involve a volatile type in the C programming language. Volatiles have several rules associated with them, including that they cannot be cached in registers and must be manipulated through their memory address. In the example below, we first attempt to get the lock; an achieved lock on the mutex shows a 0 as the result of the xchg operation.

If the lock is not obtained, the code spins on a dirty read of the volatile data type. A test is then done to see whether the variable is 0, meaning the lock is open. If the read returns 0, the thread again attempts to atomically obtain the lock and, if successful, jumps to the protected code to execute.

if (GETLOCK(lock) != 0)
{
    while (VARIOUS_CRITERIA)
    {
        _asm pause;                          // pause instruction in spin-loop
        if (*((volatile int *)lock) == 0)    // spin on dirty read, not on lock
        {
            if (GETLOCK(lock) == 0)
            {
                goto PROTECTED_CODE;
            }
        }
    }
    BACK_OFF_LOCK(lock);                     // back off lock or do other work
}
PROTECTED_CODE:


Figure 6: Efficient spin-wait loop for an atomic lock. Note the back-off code and the spin on a volatile read instead of spinning on the lock.

Backing Off Locks with Spin-Wait Loops

Utilizing user-land spin-wait loops is dangerous for several reasons. The OS has no way of knowing that the processor is not doing useful work in the spin loop, so it will allow the thread to continue spinning until it uses up its quantum. A quantum is the timeslice given to each thread running on a processor. This issue becomes worse on Windows server platforms, where the thread quantum is higher. Not only will spinning greatly increase CPU utilization and the power wasted by the processor, it will also negatively impact the performance of the application, since the thread holding the lock may not be able to acquire a processor in order to release the lock. When designing a spin-wait loop, it is important to give each thread the ability to back off the lock. This ensures that there are not too many threads aggressively pursuing the same lock at once.

Backing off of a lock can be accomplished in different ways. The simplest method is a counted lock acquire, where an application tries to obtain the lock a certain number of times and then backs off the lock by requesting a context switch. The attempt count can easily be adjusted, but this method usually does not scale with the number of hardware threads, and the wall-clock time of the spin changes with processor frequency as well, which makes this code difficult to maintain. An exponential back-off tries to obtain the lock frequently at first but assumes that the longer the wait, the less likely the thread is to obtain the lock soon. Exponential back-off spin-waits are more scalable. This second method is commonly used inside of Windows networking [1].
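The following hypothetical sketch combines the try-lock shown earlier with an exponential back-off; the cap of 1024 pauses and the use of SwitchToThread to request a context switch are illustrative choices, not tuned values:

#include <windows.h>

extern volatile LONG g_lock;   // 0 = free, 1 = held

void acquire_with_backoff(void)
{
    unsigned int delay = 1;
    while (InterlockedExchange(&g_lock, 1) != 0)   // try-lock failed
    {
        while (g_lock != 0)                        // dirty read until it looks free
        {
            for (unsigned int i = 0; i < delay; i++)
                YieldProcessor();                  // emits the pause instruction
            if (delay < 1024)
                delay *= 2;                        // exponential back-off
            else
                SwitchToThread();                  // give up the rest of the quantum
        }
    }
}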

Figure 7 shows the impact of an inlined assembly xchg lock with and without a back-off in its spin-wait loop. Notice that the lock containing a back-off greatly outperforms the lock with no back-off. This is largely due to the occasional occurrence where a thread is context switched out while holding the lock and has to wait for its turn on a processor before it can release the lock.



Figure 7: Lock with back-off vs. a lock without back-off on the ThreadID locking kernel.

Spinning on a Volatile Read vs. Spinning on the Lock Attempt

One common mistake made by developers writing their own spin-wait loops is spinning on an atomic instruction instead of spinning on a volatile read. Spinning on a dirty read consumes less time and fewer resources, and allows an application to attempt to acquire the lock only when it is free.

Figure 8 shows the difference in time when spinning on the lock versus spinning on a volatile read for xchg.



Figure 8: Spinning on the lock vs. spinning on a volatile read on the ThreadID locking kernel.

One common myth is that a lock utilizing the cmpxchg instruction is cheaper than a lock utilizing the xchg instruction. The reasoning is that cmpxchg will not attempt to take the cache line in exclusive mode, since the compare must pass first. Figure 9 shows, however, that cmpxchg is just as expensive as the xchg instruction.
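For reference, a cmpxchg-style try-lock can be expressed with InterlockedCompareExchange; this hypothetical snippet succeeds only if the lock word was 0:

#include <windows.h>

extern volatile LONG g_lock;   // 0 = free, 1 = held

int try_lock_cmpxchg(void)
{
    // cmpxchg: store 1 only if the current value equals 0;
    // returns the previous value, so 0 means we acquired the lock
    return InterlockedCompareExchange(&g_lock, 1, 0) == 0;
}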



Figure 9: Comparing xchg and cmpxchg on the ThreadID locking kernel.

Conclusion

Utilizing the appropriate lock in various scenarios is important to ensure performance and scalability in the Microsoft Windows environment. Utilizing WaitForSingleObject results in an expensive call into the Windows kernel even when there is no contention on the lock. This type of lock should be avoided when it will be called frequently and there is little contention on the lock; WaitForSingleObject can therefore be dangerous for granular locking. EnterCriticalSection attempts to obtain the lock at user level first and jumps into the kernel only if there is contention. To help developers avoid writing their own spin-wait loops, Microsoft added the TryEnterCriticalSection call, which will attempt to get the user-level lock in a spin before jumping into the kernel.

To help developers understand user-level spin-wait loops, this whitepaper has presented their basic construct along with performance recommendations. Backing off and spinning on a volatile read are critical to spin-wait performance.

References

[1] Microsoft Windows 2000 TCP/IP Implementation Details*

[2] Chynoweth, Michael, Implementing Scalable Atomic Locks for Intel® EM64T or IA32 Architectures.

