Transactional Synchronization in Haswell

We have released details of Intel® Transactional Synchronization Extensions (TSX) for the future multicore processor code-named “Haswell”. The updated specification (Intel® Architecture Instruction Set Extensions Programming Reference) can be downloaded.

In this blog, I’ll introduce Intel TSX and provide a little background. Please refer to the Transactional Synchronization Extensions chapter (Chapter 8) in the manual for additional information. These new synchronization extensions are useful in shared-memory multithreaded applications that employ lock-based synchronization mechanisms.

In a nutshell, Intel TSX provides a set of instruction set extensions that allow programmers to specify regions of code for transactional synchronization. Programmers can use these extensions to achieve the performance of fine-grain locking while actually programming using coarse-grain locks. I have written a simple illustrative example in my blog “Coarse-grained locks and Transactional Synchronization explained.”

Locks are a low-level programming construct (close to the hardware), so any discussion of Intel TSX will be low level too. How Intel TSX might affect higher-level programming methods, or enable new programming models, is beyond the scope of my blog, but I will briefly comment on it at the end.

Why is this useful?

With transactional synchronization, the hardware can determine dynamically whether threads need to serialize through lock-protected critical sections, and perform serialization only when required. This lets the processor expose and exploit concurrency that would otherwise be hidden due to dynamically unnecessary synchronization.

At the lowest level with Intel TSX, programmer-specified code regions (also referred to as transactional regions) are executed transactionally. If the transactional execution completes successfully, then all memory operations performed within the transactional region will appear to have occurred instantaneously when viewed from other logical processors. A processor makes architectural updates performed within the region visible to other logical processors only on a successful commit, a process referred to as an atomic commit.

These extensions can help achieve the performance of fine-grain locking while using coarser grain locks. These extensions can also allow locks around critical sections while avoiding unnecessary serializations. If multiple threads execute critical sections protected by the same lock but they do not perform any conflicting operations on each other’s data, then the threads can execute concurrently and without serialization. Even though the software uses lock acquisition operations on a common lock, the hardware is allowed to recognize this, elide the lock, and execute the critical sections on the two threads without requiring any communication through the lock if such communication was dynamically unnecessary.
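To make the scenario concrete, here is a minimal C++ sketch of two threads that take the same coarse-grained lock but never touch each other's data. The names (Slot, CounterTable, run_disjoint_updates) are hypothetical, and the 64-byte padding is an assumption about cache line size. With a plain std::mutex, as written here, the threads simply take turns; the point is only that nothing in the data itself requires them to.

```cpp
#include <array>
#include <cstddef>
#include <mutex>
#include <thread>

// Assumption: 64-byte cache lines. Padding each slot to its own line keeps
// the two threads' data physically disjoint as well as logically disjoint.
struct alignas(64) Slot {
    long value = 0;
};

// Hypothetical shared structure guarded by one coarse-grained lock.
struct CounterTable {
    std::mutex lock;
    std::array<Slot, 8> slots{};

    void add(std::size_t index, long amount) {
        std::lock_guard<std::mutex> guard(lock);  // every caller serializes here
        slots[index % slots.size()].value += amount;
    }
};

// Two threads update different slots. Their critical sections never conflict,
// yet the shared lock still forces them to run one at a time.
long run_disjoint_updates(CounterTable& table, int iterations) {
    std::thread first([&table, iterations] {
        for (int i = 0; i < iterations; ++i) table.add(0, 1);
    });
    std::thread second([&table, iterations] {
        for (int i = 0; i < iterations; ++i) table.add(1, 1);
    });
    first.join();
    second.join();
    return table.slots[0].value + table.slots[1].value;
}
```

This is exactly the kind of dynamically unnecessary serialization that lock elision is meant to remove: the only thing the two threads share is the lock.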

Intel TSX Interfaces

Intel TSX provides two software interfaces. The first, called Hardware Lock Elision (HLE), is a legacy-compatible instruction set extension (comprising the XACQUIRE and XRELEASE prefixes) that is used to specify transactional regions. HLE is compatible with the conventional lock-based programming model. Software written using the HLE hints can run on both legacy hardware without TSX and new hardware with TSX.

The second, called Restricted Transactional Memory (RTM), is a new instruction set interface (comprising the XBEGIN, XEND, and XABORT instructions) that allows programmers to define transactional regions in a more flexible manner than is possible with HLE. Unlike the HLE extensions, but just like most new instruction set extensions, the RTM instructions will generate an undefined instruction exception (#UD) on older processors that do not support RTM. RTM also requires the programmer to provide an alternate code path for when the transactional execution is not successful.
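As a rough illustration of the RTM requirements above, the following C++ sketch checks for RTM support before using it and falls back to an ordinary lock when the transaction does not commit. It assumes GCC or Clang on x86 (for `<cpuid.h>` and the `_xbegin`/`_xend`/`_xabort` intrinsics in `<immintrin.h>`); the names FallbackLock, try_transactional_add, and add_with_fallback, the retry count of 3, and the abort code 0xff are all choices made for this example, not anything prescribed by the specification.

```cpp
#include <atomic>

#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
#include <cpuid.h>
#include <immintrin.h>
#define SKETCH_X86 1
#endif

// RTM support is reported in CPUID.(EAX=7,ECX=0):EBX bit 11.
inline bool cpu_has_rtm() {
#ifdef SKETCH_X86
    unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) return false;
    return (ebx & (1u << 11)) != 0;
#else
    return false;
#endif
}

// Hypothetical spin lock used as the non-transactional fallback path.
struct FallbackLock {
    std::atomic<bool> held{false};
    void acquire() { while (held.exchange(true, std::memory_order_acquire)) {} }
    void release() { held.store(false, std::memory_order_release); }
    bool is_held() const { return held.load(std::memory_order_relaxed); }
};

#ifdef SKETCH_X86
// One transactional attempt. Returns true only if the region committed.
__attribute__((target("rtm")))
inline bool try_transactional_add(FallbackLock& lock, long& counter, long amount) {
    unsigned status = _xbegin();
    if (status == _XBEGIN_STARTED) {
        // Reading the lock puts it in the read set: if another thread takes
        // the fallback path, this transaction aborts instead of racing it.
        if (lock.is_held()) _xabort(0xff);
        counter += amount;
        _xend();  // atomic commit
        return true;
    }
    return false;  // aborted; execution resumed here with the abort status
}
#endif

inline void add_with_fallback(FallbackLock& lock, long& counter, long amount) {
#ifdef SKETCH_X86
    static const bool has_rtm = cpu_has_rtm();  // avoids #UD on older CPUs
    if (has_rtm) {
        for (int attempt = 0; attempt < 3; ++attempt) {
            if (try_transactional_add(lock, counter, amount)) return;
        }
    }
#endif
    // Alternate code path: conventional locking.
    lock.acquire();
    counter += amount;
    lock.release();
}
```

The result is the same whichever path runs; on a processor without RTM the code never executes XBEGIN and always takes the lock.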

In summary: “Intel Transactional Synchronization Extensions (Intel TSX) comes in two flavors: HLE and RTM. Hardware Lock Elision (HLE) is legacy compatible. Restricted Transactional Memory (RTM) offers flexibility but requires the programmer to provide an alternative code path for when transactional execution is not successful.”

The specification describes these extensions in detail and outlines various programming considerations to get the most out of them.

Intel TSX Applicability

Intel TSX targets a certain class of shared-memory multi-threaded applications; specifically multi-threaded applications that actively share data. Intel TSX is about allowing programs to achieve fine-grain lock performance without requiring the complexity of reasoning about fine-grain locking.
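To show what "the complexity of reasoning about fine-grain locking" can look like in practice, here is a small C++ sketch that implements the same update twice, once under a single coarse lock and once with a lock per element. The Accounts type and its method names are hypothetical, and the two methods are shown side by side only for comparison: they are alternatives and must not be mixed on the same object.

```cpp
#include <array>
#include <cstddef>
#include <mutex>

struct Accounts {
    static constexpr std::size_t kCount = 8;
    std::mutex whole;                         // coarse: one lock for everything
    std::array<std::mutex, kCount> per_item;  // fine: one lock per element
    std::array<long, kCount> balance{};

    // Coarse-grained: easy to get right, but unrelated transfers serialize.
    void transfer_coarse(std::size_t from, std::size_t to, long amount) {
        std::lock_guard<std::mutex> guard(whole);
        balance[from] -= amount;
        balance[to] += amount;
    }

    // Fine-grained: unrelated transfers can run in parallel, but the code now
    // has to handle from == to and take two locks without deadlocking.
    void transfer_fine(std::size_t from, std::size_t to, long amount) {
        if (from == to) return;  // locking the same mutex twice would deadlock
        std::unique_lock<std::mutex> first(per_item[from], std::defer_lock);
        std::unique_lock<std::mutex> second(per_item[to], std::defer_lock);
        std::lock(first, second);  // deadlock-avoiding acquisition of both
        balance[from] -= amount;
        balance[to] += amount;
    }
};
```

The fine-grained version is only a few lines longer, but each of those lines is a place where a later change can introduce a deadlock or a race.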

However, if there is high data contention the algorithm would need to change in order to have an opportunity for high scalability. There are no magic bullets that can solve the problem, since true high data contention implies that the algorithm is effectively serialized.

Transactional Programming?

How Intel TSX might affect higher-level programming methods, or enable new programming models, is beyond the scope of my blog. Several experimental compiler implementations of transactional constructs, not related specifically to Intel TSX, are available, and gcc 4.7 will include an experimental implementation. We can expect language standards committees to review proposals on how to add transactional models at the language level (Intel has supported the creation of the Draft Specification of Transaction Language Constructs for C++). Intel TSX may enable more efficient implementations of some transactional models than would be possible without it. Much work remains to study real-world usages and applications in order to develop and refine future usage. Good luck to all involved!

While Intel TSX may enable efficient implementations of new programming models, it does not require a new programming model and does not propose a new programming model. Intel TSX provides hardware-supported transactional-execution extensions to ease the development and improve the performance of existing programming models.

Summary

Intel TSX provides extensions that allow programmers to specify regions of code for transactional execution. Programmers can use these extensions to achieve higher performance with less effort, for example by achieving fine-grain locking performance while programming with coarser-grain locks. This is a big help and therefore big news for programmers.

Please check out the specification and stay tuned for information about supporting tools from Intel and others in the coming months.

Comments


Hi,

I ran a simple test case with the XBEGIN and XEND instructions provided by Haswell. The test case just touches 20 KB of contiguous memory inside the RTM-protected region (which is much smaller than the L1 cache size). When using the sdk, the test completes without any abort event. But when I run it on a real Haswell machine, it incurs a number of capacity aborts and succeeds only after a number of retries. On the real machine, what kinds of events can cause a capacity abort other than cache misses?

Thanks


The Intel C++ Compiler

The Intel C++ Compiler version 13.1 (and later) supports Haswell's TSX with intrinsics. You can read the documentation online (see http://software.intel.com/sites/products/documentation/doclib/iss/2013/compiler/cpp-lin/). Look for the topic “Intrinsics for Intel® Transactional Synchronization Extensions (Intel® TSX).” The STM compiler was an experimental compiler implementing more advanced constructs for a transactional programming model. The experiment was very helpful, and it informed our work with others on a specification; that effort can be joined online at http://groups.google.com/group/tm-languages. This has been taken up by the C++ standards committee for discussion and consideration. C++14 information can be found at http://isocpp.org – look for “SG5” (Transactional Memory: Exploring transactional memory constructs for potential future addition to the C++ language). We do not expect to make additional changes to the STM experimental compiler, nor to implement our experiments in the product. We prefer to work with the community to refine and push for standardization before we make it a product.


Will TSX allow for nested transactions, and what are the semantics of this? I guess I should go read the manual...


This is indeed an interesting hardware feature. I have been working on software transactional memory for in-memory databases. And now, having read about Intel's h/w TM, I have many questions ...
1. I wonder if there is any limitation on the usage of RTM. For example, is there a limit on the amount of memory used per transaction, or a restriction on the number of threads that can run simultaneously? I assume there is a reasonable cost associated with each memory location accessed under a transaction, and the same goes for each transactional thread.
2. I also found that features like "nested" transactions are supported by RTM. Who defines the semantics of such transactions? Is there any standard that has been followed?
3. Is there any performance impact (side effect) of using such instructions in our program, mainly on the threads that do not use transactional memory at all?
4. Does using RTM increase the memory usage of the system, and if yes, by how much? If no, then where does it store the uncommitted transaction data (CPU cache)?

//itin/


James, could you comment on a potential unintended beneficial side effect of TSX?

At first glance this may seem odd, but consider a single-threaded application that enters a transaction (XACQUIRE or XBEGIN). From that point on until the exit (XRELEASE or XEND), all memory reads and writes (up to your buffering capacity) are effectively "sticky" with respect to the internal copies. IOW these locations behave similarly to non-evictable L1 cache locations (be they in L1 or elsewhere). Further, the writes and read/modify/writes are write combining with respect to the internal storage, as opposed to in RAM via the memory bus. What I am suggesting is that TSX can be used to improve the performance of code that has no thread contention. Is this a valid assessment?

www.quickthreadprogramming.com


Some comments on comments...

(1) Are you advocating the use of TM where coarse grained locks are being used? "These extensions can help achieve the performance of fine-grain locking while using coarser grain locks..."

I'm advocating using it when coarse-grained locks are used, when the contention is due to lock contention and not data collision. If data collision is high, then fine-grained locking would not have helped, so this is not as likely to help. This coarse-grained vs. fine-grained way of thinking about it is a little simplistic, because it understates how effective it can be, as it is useful in other cases too.

I think of it this way: As a programmer, I'd like to use a LOCK around the WHOLE data structure that I want to update and UNLOCK when I'm done. This is easy and foolproof (assuming my whole update/transaction is in that data structure). If this creates contention, I may rewrite to use fine-grained locking around parts of the data structure to get good performance/scaling. Rewriting code like that is error prone and more work. It is also harder to document, and harder to make sure the next person who changes my code won't introduce a bug. With transactional synchronization, I simply LOCK the whole data structure and let the TSX hardware magically reduce the contention without me having to resort to recoding to use finer-grained locking. Easier to write, easier to be bug free, easier to stay bug free. In general, this magic will not occur if I could not have gotten better performance by using finer-grained locking.

(2) I was asked if I could give "hints on the size of the buffer" because this is critical.

This is an insightful question - yes, hardware will always have limits in how much "magic" it can provide. The size is an implementation detail that I can't elaborate on in great detail (detailed specifications are not ready for publication yet), and which will change from implementation to implementation. I can say it is big enough to be useful, and small enough to not do everything. I would not call it a tiny buffer.

(3) Finally - a concern "for software engineers it's another dimension of complexity!"

It does add to the bag of tricks a programmer has, but it solves a problem with less complexity than is needed without it. Getting both good performance and correctness is challenging. Transactional synchronization gives a programmer a key way to reduce contention without requiring more complex coding.


Certainly a great moment in technology as TM is a paradigm shift. And the most popular processor vendor is going to release it for general purpose use!

But for software engineers it's another dimension of complexity! Coarse grain, fine grain, atomic, tile based, and now transactional parallel algorithms. If I choose the best technology, how do I hire the skilled engineer?


It's a really great improvement toward easier parallel programming. I looked at the updated instruction manual, but it doesn't explicitly mention the size of the internal buffer for RTM. Although it's implementation specific, could you give me some hints on the size of the buffer? This is important for figuring out how long and how large a transaction RTM can support.

