Coarse-grained locks and Transactional Synchronization explained

Coarse-grained locks, and the importance of transactions, are key concepts that motivate why Intel Transactional Synchronization Extensions (TSX) is useful.  I’ll do my best to explain them in this blog.

In my blog "Transactional Synchronization in Haswell," I describe new instructions (Intel TSX) that will improve the performance of coarse-grained locks.  Understanding coarse-grained locks and the concept of transactions are both key to understanding why Intel TSX matters.

Intel TSX may enhance performance of mutual exclusion other than simple coarse-grained locks, but I will focus on coarse-grained locking because it is common and Intel TSX allows highly concurrent accesses using only a simple locking mechanism.

An example

To motivate by illustration, let's consider a simple hash table. Hash tables are used to map a key to a key and value pair in linear time. Two key operations are add (insert) and lookup (retrieve). Resizing and deletion are two additional operations of general interest also, but I will leave them for another time.

Designing a highly concurrent hash table is a non-trivial task, and there are many approaches to allow high levels of concurrency. All these approach add complexity to the program, and often to the data structures themselves.

The simplest approach is a single lock approach. In such an approach, every operation on the hash table starts by obtaining the lock for the table and concludes by releasing the lock. While the lock is held for the operation, no other task on the system can obtain the lock and therefore no hash table operation is allowed to proceed.

Considering Figure 1, no concurrent operations are allowed, so each of the five operations shown would occur one at a time.

Figure 1: Five hash table operations requested


A common solution is to break the hash table into smaller regions, and have locks that apply to regions. While this can reduce contention, it still can create needless delays and it definitely complicates the coding and the data structure.

Such an approach is a prime example of taking a coarse-grained lock (a single lock for the entire hash table) and working to make it a finer grained lock (multiple locks for smaller table sections). Coarse-grained locks are easier to use, easier to understand and easier to debug. The only disadvantage is that they tend to impede performance in a multithreaded environment. Multicore processors are increasing the likelihood of this being a problem, and help motivate new hardware assistance so that programming has a chance to stay simple more often than without assistance.

Transactional Synchronization (Intel TSX) as a solution

What would be ideal, is to use the single lock (coarse-grained locking) because it is easy and not very error prone, but still have the performance of a fine-grained implementation. In our Figure 1 example, only one operation conflicts with another. This example does have more conflicts that would be expected in a real world example.

Considering this example, three of the operations have no collision with the other operation so the use of HLE (part of Intel TSX) on the single lock will completely elide the lock. In other words, the performance is very close to the performance of the code if no locking or unlocking code was present. The key however is that the operations are protected by the Intel TSX hardware, which has silently ensured that the protection intended by the lock is indeed assured.

The two operations that map to the same hash table entry will need to be staggered. This will occur even if we are unlucky enough to have them happen at the same time. In such a case, the Intel TSX will detect that the lock was indeed needed and some locking overhead will be incurred. What would actually happen in such a case, is that the colliding tasks will proceed into the protected code until the processor detects the conflict. As such a point, both updates will abort their protected code (also called the transaction). The most common solution then is to have each task proceed but actually enforce the lock on the second try. This means that one task will win, and delay the other, until the operation is complete. The precise decision on how to handle the collision is either up to the processor implementation with HLE, or the programmer with RTM. The processor implementation for HLE will also be fairly simple and conservative, in order to preserve the semantics of the original lock and hence compatibility with processors that lack Intel TSX.


For a hash map, Intel TSX allows for the right things to occur without losing the protection that the locks need to give. Intel TSX ensures the same results as the coarse-grained lock guarantees, but allows unrelated operations to proceed without delays that the coarse-grained locks would have caused. For more information on Transactional Synchronization, see my blog on Intel TSX.

Please check out the specification and stay tuned for information about supporting tools from Intel and others in the coming months.

Image icon slide1.png156.72 KB
For more complete information about compiler optimizations, see our Optimization Notice.



As I understand what you said, TSX can make coarse grained hashmap as fast as Java lock-striping concurrent hashmap that was implemented by Doug Lee.
However, I don't agree with you, because TSX may need mfence on XRELEASE in order to check conflict.
The coarse grained hashmap using TSX always calls mfence so I think we will need to implement lock-striping concurrent hashmap as ever despite of TSX.

JimDempseyAtTheCove's picture

One of the advantages of HLE and RTM locks not highlighted in your post, or in the reference specification, is this reduces (if not eliminates) the need for Wait-Free programming. Use of HLE or RTM removes the nasty side effect of preempting a thread holding a lock, which is principally the reason for writing code in Wait-Free format. (Wait-Free avoids lock held for duration of preemption.)

In the reference manual:
8.3.3 Requirements for HLE Locks
An XRELEASE prefixed instruction must restore the value of the elided lock to the
value it had before the lock acquisition.

While following this rule different sections of a protected structure can be concurrently updated, as shown in your example (provided non-conflicting updates).

Why not permit an XRELEASE performed to a different value to cause the first such release to win and all subsequent XRELEASE (XEND) to abort?

An example of this might be an XACQUIRE that sets a lock flag in the lsb of a pointer, then using the pointer to the object to reference the object. After processing, the XRELEASE may need to restore the pointer to a different value (say different pointer or NULL or different state of lock). This may occur in linked list management. While you could say in these cases do not use HLE/RTM then this exposes the problem of thread preemption.

jamesreinders's picture

Correct, if a collision happens and the lock must actually be taken (not elided) then stalls occur. That will happen a certain percentage of time and the trick is to keep that percentage low. Reading the root pointer is safe if reading is all that is happening. Resizing, yeap... future blog. :-)

Add a Comment

Have a technical question? Visit our forums. Have site or software product issues? Contact support.