Exploring Intel® Transactional Synchronization Extensions with Intel® Software Development Emulator

Intel® Transactional Synchronization Extensions (Intel® TSX) is perhaps one of the most non-trivial extensions of instruction set architecture introduced in the 4th generation Intel® Core™ microarchitecture code name Haswell. Intel® TSX implements hardware support for a best-effort “transactional memory”, which is a simpler mechanism for scalable thread synchronization as opposed to inherently complex fine-grained locking or lock-free algorithms. The extensions have two interfaces: Hardware Lock Elision (HLE) and Restricted Transactional Memory (RTM). 

In this blog I will show how you can write your first RTM code and execute it in an emulated environment now, without waiting until the 4th generation Intel® Core™ processors become available for purchase.

Before diving in, please make sure you have a basic understanding of the new RTM instructions. I refer you to this blog as an introduction. Check out also the Intel Developer Forum’12 presentation by Ravi Rajwar&Martin Dixon discussing the details of Intel TSX implementation in Haswell hardware and a presentation by Andi Kleen on adding lock elision (also using RTM) to Linux.

My plan was to write a toy bank account processing application using popular C++ thread-unaware data structures from STL with concurrent access to bank records managed by Intel TSX. This way the implementation should be very simple, thread-safe and scalable.

Development Environment

For this experiment one needs the newest version (later than 5.3.1) of Intel® Software Development Emulator (Intel® SDE) and a compiler that can generate RTM instructions (via intrinsics or direct machine code). Please note that performance measurements with Intel SDE running RTM are of limited value because the overhead of emulating TM in software instead of using real hardware is huge, but as you will see later Intel SDE can already demonstrate important points for RTM usage for concurrency library developers and application programmers.

Since my laptop runs Windows I decided to try Intel SDE/RTM on Windows. I have chosen the C++ compiler from “Microsoft Visual Studio 2012 for Windows Desktop” (there is a free “Express” version that works for my purpose too). With a few clicks I quickly setup a console application project and included immintrin.h header the main .cpp file to use RTM intrinsics.

The Test

As a bank account structure the simple std::vector<int> from C++ standard template library has been chosen. “Accounts[i]” stores current account balance for account number i. This is very simple and popular but thread-unsafe data structure which must be protected by concurrency control mechanisms for parallel access. Usually locks/mutexes are used to limit the number of threads accessing the structure simultaneously. However, for parallel write accesses the whole data structure usually is locked exclusively even if distinct parts of it have to be updated. Intel TSX should help here since it can optimistically execute writes, and if there is no real data conflict happening, the writes are committed without serializing.

To simplify the operations on the accounts I wanted to implement an easy-to-use C++ wrapper for protecting the current C++ scope from unsafe concurrent access to the data:


{

        std::cout << "open new account" << std::endl;

        TransactionScope guard; // protect everything in this scope

        Accounts.push_back(0);

}

{

        std::cout << "open new account" << std::endl;

        TransactionScope guard; // protect everything in this scope

        Accounts.push_back(0);

}

{

        std::cout << "put 100 units into account 0" <<std::endl;

        TransactionScope guard; // protect everything in this scope

        Accounts[0] += 100; // atomic update due to RTM

}

{

        std::cout << "transfer 10 units from account 0 to account 1 atomically!" << std::endl;

        TransactionScope guard; // protect everything in this scope

        Accounts[0] -= 10;

        Accounts[1] += 10;

}

{

        std::cout << "atomically draw 10 units from account 0 if there is enough money"<< std::endl;

        TransactionScope guard; // protect everything in this scope

        if(Accounts[0] >= 10) Accounts[0] -= 10;

}

{

        std::cout << "add 1000 empty accounts atomically"<< std::endl;

        TransactionScope guard; // protect everything in this scope

        Accounts.resize(Accounts.size() + 1000, 0);

}

Legacy applications implement such guards using a lock that allows only a single writer to execute the critical section (read-write locks are more complicated to handle and also do not make much sense here in our case because all accesses are writes/updates):


class TransactionScope

{

        SimpleSpinLock & lock;

        TransactionScope(); // forbidden

public:

        TransactionScope(SimpleSpinLock & lock_): lock(lock_) { lock.lock(); }

        ~TransactionScope() { lock.unlock(); }

};

Implementing and Testing with RTM

A naive RTM implementation for TransactionScope (handling both read/lookup and write/update accesses transparently) would be (changed lines are marked with ):


class TransactionScope

{

public:

        TransactionScope()

{

█               int nretries = 0;

█               while(1)

█               {

█                       ++nretries;

█                       unsigned status = _xbegin();

█                       if(status == _XBEGIN_STARTED) return; // successful start

█                       // abort handler

█                       std::cout << "DEBUG: Transaction aborted "<< nretries <<

█                          " time(s) with the status "<< status << std::endl;

█               }

        }

█       ~TransactionScope() { _xend(); }

};

 

I have successfully compiled this code and tried to run it through Intel SDE:


./sde-bdw-external-5.31.0-2012-11-01-win/sde.exe -hsw -rtm-mode full -- ./ConsoleApplication1.exe

open new account

DEBUG: Transaction aborted 1 time(s) with the status 0

DEBUG: Transaction aborted 2 time(s) with the status 0

DEBUG: Transaction aborted 3 time(s) with the status 0

DEBUG: Transaction aborted 4 time(s) with the status 0

DEBUG: Transaction aborted 5 time(s) with the status 0

DEBUG: Transaction aborted 6 time(s) with the status 0

DEBUG: Transaction aborted 7 time(s) with the status 0

DEBUG: Transaction aborted 8 time(s) with the status 0

DEBUG: Transaction aborted 9 time(s) with the status 0

DEBUG: Transaction aborted 10 time(s) with the status 0

DEBUG: Transaction aborted 11 time(s) with the status 0

DEBUG: Transaction aborted 12 time(s) with the status 0

DEBUG: Transaction aborted 13 time(s) with the status 0

DEBUG: Transaction aborted 14 time(s) with the status 0

DEBUG: Transaction aborted 15 time(s) with the status 0

DEBUG: Transaction aborted 16 time(s) with the status 0

and so on…

The program went into infinite loop always aborting on the first transaction. The RTM debug log from Intel SDE (emx-rtm.txt) also confirmed that (used option “-rtm_debug_log 2”). Well, a general rule is that failure is more or less expected for any implementation that ignores specification… Intel® Architecture Instruction Set Extensions Programming Reference explicitly mentions that “the hardware provides no guarantees as to whether an RTM region will ever successfully commit transactionally”. Because of that the software using RTM must provide (non-transactional) fall-back path that is executed if (many) aborts are happening (By the way: HLE provides the fall-back automatically, since on the first abort, the same critical section is executed non-transactionally).

Implementing Fall-Back

Here is our second attempt that acquires a fall-back spin lock non-transactionally after specified number of retries.


LONGLONG naborted = 0; // global abort statistics, alternatively use “–rtm_debug_log 2” Intel SDE option

 

class TransactionScope

{

█       SimpleSpinLock & fallBackLock;

        TransactionScope(); // forbidden

public:

█       TransactionScope(SimpleSpinLock & fallBackLock_, int max_retries = 3) :

█               fallBackLock(fallBackLock_)

        {

                int nretries = 0;

                while(1)

                {

                        ++nretries;

                        unsigned status = _xbegin();

                        if(status == _XBEGIN_STARTED)

                        {

█                               if(!fallBackLock.isLocked())

█                                         return; // successfully started transaction

█                               /* started transaction but someone is executing 

█                                  the transaction section non-speculatively (acquired

█                                  the fall-back lock) -> aborting */

█                               _xabort(0xff); // abort with code 0xff

                        }

                        // abort handler

                        InterlockedIncrement64(&naborted); // do abort statistics

                        std::cout << "DEBUG: Transaction aborted "<< nretries <<

                              " time(s) with the status "<< status << std::endl;

█                       // handle _xabort(0xff) from above

█                       if((status & _XABORT_EXPLICIT) && _XABORT_CODE(status)==0xff

█                            && !(status & _XABORT_NESTED))

█                       {       // wait until the lock is free

█                               while(fallBackLock.isLocked()) _mm_pause();

█                       }

█                       // too many retries, take the fall-back lock

█                       if(nretries >= max_retries) break;

                }

█               fallBackLock.lock();

        }

        ~TransactionScope()

        {

█               if(fallBackLock.isLocked())

█                       fallBackLock.unlock();

█               else

                        _xend();

        }

};

The output looks much better now:


open new account

DEBUG: Transaction aborted 1 time(s) with the status 0

DEBUG: Transaction aborted 2 time(s) with the status 0

DEBUG: Transaction aborted 3 time(s) with the status 0

open new account

put 100 units into account 0

transfer 10 units from account 0 to account 1 atomically!

atomically draw 10 units from account 0 if there is enough money

add 1000 empty accounts atomically

 

One can see that all transaction except the first one succeeded on the very first attempt. The first one took the fall-back lock after three attempts. It was special since it had to reserve and touch new memory for the vector from the operating system. This is a very complex process involving system calls, privilege ring transitions (ring 3 [application]->ring 0 [OS]), page faults and initialization/zeroing of very big chunks of memory which may not fit into the transactional buffer. All this may cause aborts according to the Intel® Architecture Instruction Set Extensions Programming Reference.

Leveraging RTM Abort Status Bits

A further optimization that I came up with is leveraging the abort status information: in case of such “hard” aborts the “retry” bit (position 1) in the abort status is not set. The bit is set if hardware thinks the transaction may succeed on retry. I added the line marked below in the abort handler to implement it:

 


 // handle _xabort(0xff) from above

 if((status & _XABORT_EXPLICIT) && _XABORT_CODE(status)==0xff

      && !(status & _XABORT_NESTED))

 {

        while(fallBackLock.isLocked()) _mm_pause(); // wait until lock is free

 

█} else if(!(status & _XABORT_RETRY)) break; /* take the fall-back lock

    if the retry abort flag is not set */

 

The output:


open new account

DEBUG: Transaction aborted 1 time(s) with the status 0

open new account

put 100 units into account 0

transfer 10 units from account 0 to account 1 atomically!

atomically draw 10 units from account 0 if there is enough money

add 1000 empty accounts atomically

 

Now we see that the program makes faster progress by taking the fall-back lock sooner in the case of a “hard” abort.

As you may notice, the changes so far were isolated within some synchronization interface, TransactionScope. The application code was not changed. As generally available TSX software infrastructure evolves in future you should look for a proven existing library that has (scope) locks with RTM support to avoid pitfalls in your synchronization primitives (we will talk about pitfalls in application code in future blogs). For example a TSX-enabled pthread library for Linux is already available. On the other hand, it is not uncommon for existing applications to use an extended or custom synchronization interfaces, converting them to take advantage of TSX is not a complicated task either if done with care.

Concurrent Accesses from Several Threads Managed by Intel TSX

 

After basic debugging the time has come to see the real power of Intel TSX: run two worker threads doing random concurrent updates to the central account data structure:


unsigned __stdcall thread_worker(void * arg)

{

        int thread_nr = (int) arg;

        std::cout << "Thread "<< thread_nr<< " started." << std::endl;

        // create thread-local TR1 C++ random generator from <random>

        std::tr1::minstd_rand myRand(thread_nr); 

        long int loops = 10000;

 

        while(--loops)

        {

                {

                        TransactionScope guard(globalFallBackLock);

                        // put 100 units into a random account atomically

                        Accounts[myRand() % Accounts.size()] += 100;

                }

 

                {

                        TransactionScope guard(globalFallBackLock);

                        /* transfer 100 units between random accounts 

                           (if there is enough money) atomically */

                        int a = myRand() % Accounts.size()

                        int b = myRand() % Accounts.size();

                        if(Accounts[a] >= 100)

                        {

                                Accounts[a] -= 100;

                                Accounts[b] += 100;

                        }

                }

        }

        std::cout << "Thread "<< thread_nr<< " finished." << std::endl;

        return 0;

}

 

I built Release build without DEBUG output and see that there are only about 100-300 aborts for the total of 20000 transactions. Debug output says that the abort flag status is 6: retry and “memory access conflict” bits are set. This is exactly what I expected from Intel TSX: almost all updates are done in parallel and only a few have been rolled back due to a conflict.

To double check if my conclusions are right and emulator works as I expected I added an increment/update of a global counter in the transactions to introduce a huge number of conflicting accesses. And yes, it worked: with that change I have seen about 5-15K aborts. Although the absolute numbers obtained from the RTM emulator are not able to exactly predict the execution metrics on future hardware, the orders of magnitude should still indicate possible issues with RTM usage.

Last Words

These were my experiences with RTM and the new Intel® Software Development Emulator. Get prepared for Haswell and check out how your software can use Restricted Transactional Memory with Intel SDE now!

--

Roman 

(the complete source code is attached to the article)

Для получения подробной информации о возможностях оптимизации компилятора обратитесь к нашему Уведомлению об оптимизации.