Intel recently released the 4th Generation Intel® Core™ processors, which have Intel® Transactional Synchronization Extensions (Intel® TSX) enabled. Intel TSX can improve the performance of applications that use lock-based synchronization to protect data structure updates. This feature allows multiple non-conflicting lock-protected changes to data to occur in parallel.
I’ve found over the years that one of the best ways to improve parallel performance is to get rid of coarse-grained locks, replacing them with multiple fine-grained locks that each protect part of the structure being modified. While this technique can yield better performance, it is difficult to get right, and it increases the number of locks that must be acquired to perform an operation. There may also be subtle performance problems if multiple locks occupy the same cache line: though the changes themselves may not overlap, locking adjacent mutexes in the same cache line will result in false sharing.
Transactional memory addresses the problem a different way, by allowing multiple threads to access or update the protected data, and guaranteeing the updates appear atomically to all other threads. This gives some of the benefits of fine-grained locking without having to make changes to the code beyond replacing the locks.
The two interfaces for Intel TSX are Hardware Lock Elision (HLE) and Restricted Transactional Memory (RTM).
For more details on this technology please see the Intel Architecture Instruction Set Extensions Programming Reference.
In the Intel® Threading Building Blocks (Intel® TBB) 4.2 release, a new mutex class, speculative_spin_mutex, is implemented using the HLE interface. This lock is very simple to use: when a speculative_spin_mutex is locked, HLE causes the processor to start a transaction. The visible state of the lock is not modified, and any memory locations that are modified will not appear changed to other processors until the mutex is unlocked. When the unlock happens, all the changes made under the lock are committed atomically, i.e. they all appear in memory simultaneously as far as other processors can tell. If a conflict occurs, either because the code under the lock sees a change to data it has read or written, or because the lock is taken non-speculatively by another thread, the changes are discarded and execution starts over at the point the lock was taken. In a simple use of HLE the hardware would then take the lock non-speculatively, but our story is a little different.
Lock acquisition in speculative_spin_mutex is implemented as a fetch-and-store. If a transaction is aborted, the thread goes back to the lock acquisition step and attempts to acquire the lock non-speculatively; if the acquire succeeds, it continues executing non-speculatively, i.e. all other threads see the lock as unavailable. This is the standard hardware behavior. However, the thread may find that another thread already holds the lock. In that case, the Intel TBB implementation begins a spin-wait on the lock, and when the lock becomes available the thread re-attempts speculative execution.
The HLE interface does not require an explicit fallback path; the code protected by the lock can be executed speculatively or under a "taken" lock, and the results will be the same. This also imposes the limitation that the state of the lock must be the same after the lock is released as before it was taken. This is the case for a spin_mutex. Other kinds of locks, such as ticket locks, are not reset to their pre-lock state on release, so they are difficult to implement with HLE.
One concern for speculative mutexes is false sharing. Concurrent writes to the cache line occupied by a speculative lock will abort its transactions. This is true even if the write is to some other variable that shares the cache line with the lock.
To avoid this problem, speculative_spin_mutex occupies two cache lines, because memory allocators and stack placement do not guarantee where within a cache line an object will fall. The spin_mutex used by speculative_spin_mutex is guaranteed to fall at the start of a cache line, and no other object will occupy that cache line.
For processor architectures that do not support HLE, speculative_spin_mutex is implemented as a tbb::spin_mutex with the cache-line padding just described.
As always with a new feature, your mileage may vary. Careful performance testing will tell whether speculative_spin_mutex is right for you.
The new mutex class with lock elision begins the story of transactional memory support in Intel TBB. Future updates will add more; for example, we soon plan to release the RTM-based speculative_spin_rw_mutex as a Community Preview Feature. I'll talk about speculative_spin_rw_mutex in another post once it is available.
For help optimizing your program with Intel TSX, you should consult the Intel® 64 and IA-32 Architectures Optimization Reference Manual, Chapter 12.
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804