Using HLE and RTM with older compilers with tsx-tools

To use HLE or RTM to improve lock scalability, the lock library has to be enabled for elision. If you already use an enabled lock library, like glibc on Linux, you can simply keep using its normal locking interface. If your lock library does not support elision, or you implement your own locks, the lock code itself needs to be enabled, as discussed in Roman's blog post. Enabling a lock implementation means using the RTM instructions or the HLE prefixes. Newer compilers -- like gcc 4.8 -- provide intrinsics for HLE and RTM.


 #include <immintrin.h>

 if (_xbegin() == _XBEGIN_STARTED) {
     /* transaction */
     _xend();
 } else {
     /* fallback path -- take lock */
 }

 while (__atomic_exchange_n(lock, 1, __ATOMIC_ACQUIRE|__ATOMIC_HLE_ACQUIRE) != 0) {
     int val;
     /* Wait for lock to become free again before retrying. */
     do {
         /* Abort speculation */
         val = __atomic_load_n(lock, __ATOMIC_CONSUME);
     } while (val == 1);
 }
 /* ... critical section ... */
 /* Free lock, ending lock elision */
 __atomic_store_n(lock, 0, __ATOMIC_RELEASE|__ATOMIC_HLE_RELEASE);

Older compilers do not support these intrinsics directly. tsx-tools provides compatibility headers that implement them for older gcc and compatible compilers (clang, icc, Sun cc, etc.). This allows using TSX without updating the compiler.


rtm.h: Provides the standard TSX intrinsics _xbegin(), _xend(), _xtest(), _xabort().


Same interface as rtm.h (compatibility name).


An alternative, unofficial RTM intrinsics implementation for gcc 4.6 with asm goto support (Fedora) or gcc 4.7+. This saves a few instructions for every transaction setup by exposing the jump to the abort handler to the programmer. Useful for people who care about micro-optimizations.


An emulation of the gcc 4.8+ HLE atomic intrinsics. These are similar in spirit, but do not fully match the intrinsics.

gcc 4.8+ implements HLE as an additional memory-ordering flag for the C11-style atomic intrinsics. gcc has its own flavour of these builtins, similar to the C11 atomics but with a different naming convention. We cannot directly emulate the full memory model.

So the operations are mapped to __hle_acquire_ and __hle_release_ without an explicit memory model parameter.

The other problem is that the C11-style atomics use argument overloading to support different types. While that would be possible to emulate, it would generate very ugly macros. We instead append the type size as a suffix.

So for example:

 int foo;
 __atomic_add_fetch(&foo, 1, __ATOMIC_ACQUIRE|__ATOMIC_HLE_ACQUIRE);

becomes

 __hle_acquire_add_fetch4(&foo, 1);

Also, C11 has some operations that do not map directly to x86 atomic instructions. Since HLE requires that a single instruction starts a transaction, we omit those: nand, xor, and, or (in their value-returning _fetch forms). While they could be mapped to CMPXCHG, that would require a spin loop, which is better not done implicitly. There is also no HLE load.

x86 supports HLE prefixes on all atomic instructions, but not all of them can currently be generated in this scheme, because many x86 operations do not return the fetched value.

A real compiler could generate them by detecting that the fetched value is not used, but we do not have this luxury. For this we provide non-_fetch variants. These also support and, or, xor (but not nand) as an extension.

Intrinsics for sbb, adc, neg, btr, bts, btc are not supported.

We also don't implement the non _n generic version of some operations.

Available operations

(The 8-byte variants are only valid on 64-bit.)

__hle_{acquire,release}_{add,sub,or,xor,and}{1,2,4,8} (extension)
__hle_{acquire,release}_test_and_set{1,2,4,8} (sets to 1)

gcc 4.8 atomic documentation


An emulation of the Microsoft compiler HLE intrinsics for gcc.




(I know that the post is now quite old)

For RTM it should be _XBEGIN_STARTED (not _XBEGIN_START).

Please try it on a real application with real critical sections that do some work. I don't think your test is testing anything useful.

thx for sharing the above post!

I wrote a quick test with a loop over lock;xchgl and movl with and without HLE prefixes.
To my surprise, the version with HLE prefixes seems to be ~50% slower?
Is the test invalid/irrelevant for some reason?
Am I doing something wrong or is this expected?


The test was run on a MacBook Air with an i7-4650U 1.7 GHz (Haswell) CPU

tsx-tools reports:
Rolfs-MacBook-Air:tsx-tools ran$ ./has-tsx
RTM: Yes
HLE: Yes
Rolfs-MacBook-Air:tsx-tools ran$

code snippet enclosed below, compiled with:
Rolfs-MacBook-Air:ran ran$ clang -O4 -o tt tt.c -lc

Rolfs-MacBook-Air:ran ran$ time ./tt 1 100000000

real 0m1.616s
user 0m1.612s
sys 0m0.004s
Rolfs-MacBook-Air:ran ran$ time ./tt 2 100000000

real 0m1.063s
user 0m1.061s
sys 0m0.002s
Rolfs-MacBook-Air:ran ran$

Source code for tt.c:



#include <stdlib.h>

typedef unsigned int u32;

#define __HLE_ACQUIRE ".byte 0xf2 ; "
#define __HLE_RELEASE ".byte 0xf3 ; "

static inline void __hle_xchg (volatile u32* addr)
{
    u32 value = 1;

    asm volatile (__HLE_ACQUIRE "lock; xchgl %1,%0"
                  : "+r" (value), "+m" (*addr)
                  :: "memory");
}

static inline void __hle_move (volatile u32* addr)
{
    asm volatile (__HLE_RELEASE "movl $0,%0"
                  : "+m" (*addr) :: "memory");
}

static inline void __raw_xchg (volatile u32* addr)
{
    u32 value = 1;

    asm volatile ("lock; xchgl %1,%0"
                  : "+r" (value), "+m" (*addr)
                  :: "memory");
}

static inline void __raw_move (volatile u32* addr)
{
    asm volatile ("movl $0,%0"
                  : "+m" (*addr) :: "memory");
}

static void do_hle (int count)
{
    u32 data = 0;
    int p1;

    for (p1 = 0; p1 < count; ++p1) {
        __hle_xchg (&data);
        __hle_move (&data);
    }
}

static void do_raw (int count)
{
    u32 data = 0;
    int p1;

    for (p1 = 0; p1 < count; ++p1) {
        __raw_xchg (&data);
        __raw_move (&data);
    }
}

int main (int argc, char* argv[])
{
    int test = argc < 2 ? 1 : atoi (argv[1]);
    int count = argc < 3 ? 1 : atoi (argv[2]);

    switch (test) {
    case 1: do_hle (count); break;
    case 2: do_raw (count); break;
    }
    return 0;
}