Fun with Intel® Transactional Synchronization Extensions

By now, many of you have heard of Intel® Transactional Synchronization Extensions (Intel® TSX). If you have not, I encourage you to check out this page (http://www.intel.com/software/tsx) before you read further. In a nutshell, Intel TSX provides transactional memory support in hardware, making life easier for developers who need to write synchronization code for concurrent and parallel applications. It comes in two flavors: Hardware Lock Elision (HLE) and Restricted Transactional Memory (RTM). If you haven’t read the background, go and do so now, since from here on I assume that you have that basic knowledge.

I have been developing a Pin-based emulator for Intel TSX for the past few years; it is now integrated into the Intel Software Development Emulator (Intel SDE). Development brought plenty of grins and grimaces with respect to HLE/RTM, and I would like to share three particularly memorable incidents.

The Incidents

Example 1.

The following codelet is part of a test program written by a colleague of mine who wanted to learn how to use RTM. With the array ‘data’ containing integer values and the array ‘group’ mapping data’s elements to slots in the array ‘sums’, the test program tries to store the sum of the data belonging to each group in the corresponding slot of ‘sums’. Since multiple threads may access the same slot simultaneously, each addition is performed in an RTM transaction. When a transaction aborts, the thread re-executes the addition in the critical section along the fallback path (i.e., the ‘else’ branch). Do you think it is correct? If you don’t, can you spot what is wrong?

#pragma omp parallel for
    for(int i = 0; i < N; i++){
        int mygroup = group[i];
        if(_xbegin()==-1) {
            sums[mygroup] += data[i];
            _xend();
        } else {
            #pragma omp critical
            {
                sums[mygroup] += data[i];
            }
        }
    }

Example 2.

I was taught that code reuse is important when I was in school (sorry, not in kindergarten ;^)). So I decided to put what I learned to work when a need arose to write an RTM test. The test was similar to the one in Example 1, except that it alternates RTM and HLE transactions. (Notice that the test does not have the non-speculative fallback path required for the RTM transaction. Having no fallback path makes the test UNSAFE, because Intel TSX does not guarantee forward progress; i.e., it can abort RTM transactions forever.) The test has two addition statements: one protected with RTM and the other protected with HLE. Quite a feat, eh? I felt proud of myself ;-) ... until I started to run the test. It occasionally printed incorrect sums. I panicked at first, because the test was simple and looked almost identical to other tests, leading me to believe, however briefly, that the emulator had a nasty bug that had gone unnoticed for a long time. But after a closer look, I realized the test had a flaw. Can you see what I did wrong?

    #define PREFIX_XACQUIRE ".byte 0xF2; "
    #define PREFIX_XRELEASE ".byte 0xF3; "
 
    class mutex_elided {
      uint8_t flag;
      inline bool try_lock_elided() {
        uint8_t value = 1;
        __asm__ volatile (PREFIX_XACQUIRE "lock; xchgb %0, %1"
                : "=q"(value),"=m"(flag):"0"(value),"m"(flag):"memory" );
        return uint8_t(value^1);
      }
    public:
      inline void acquire() {
        for(;;) {
            exponential_backoff backoff;
            while((volatile unsigned char&)flag==1)
                backoff.pause();
            if(try_lock_elided())
                return;
            __asm__ volatile ("pause\n" : : : "memory" );
        }
      }
 
      inline void release() {
        __asm__ volatile (PREFIX_XRELEASE "movl $0, %0"
               : "=m"(flag) : "m"(flag) : "memory" );
      }
    };
    ...
 
 
      mutex_elided m;
#pragma omp parallel for
    for(int i = 0; i < N; i++) {
        int mygroup = group[i];
        if( (i&1) ) {
            while(_xbegin()!=-1) ;
            // must have a fallback path
            sums[mygroup] += 1;
            _xend();
        } else {
            m.acquire();
            sums[mygroup] += 1;
            m.release();
        }
    }

Example 3.

A colleague of mine tried to use RTM to improve the performance of a benchmark. (I changed function names for clarity.) The following fragment of the benchmark permutes an array of IDs by swapping each ID with that of a randomly picked partner. In the fallback path, elements i and j are exclusively acquired in increasing order of their indices and then written back in the reverse order. He ran it on the emulator and came back to me with an occasional hang. Can you come up with a sequence of events that leads to an indefinite wait?

bool pause( volatile int64_t* l ) {
    __asm__ __volatile__( "pause\n" : : : "memory" );
    return true;
}
 
int64_t read_and_lock( volatile int64_t* loc ) {
    int64_t val;
    while(1) {
        while( pause( loc ) )
            if( empty_val != (val = *loc) )
                break;
        assert( val!=empty_val );
        if ( __sync_bool_compare_and_swap( loc, val, empty_val ) )
            break;
    }
    assert( val!=0 );
    return val;
}
 
void write_and_release( volatile int64_t* loc, int64_t val ) {
    while( pause( loc ) )
        if( __sync_bool_compare_and_swap( loc, empty_val, val ) )
            break;
    return;
}
 
...
#pragma omp parallel for num_threads(16)
    for (int i=0; i<n; i++) {
        int j = (int64_t) ( n * gen_rand() );
 
        if( _xbegin()==-1 ) {
            if(i != j) {
                const vid_t tmp_val = vid_values[i];
                vid_values[i] = vid_values[j];
                vid_values[j] = tmp_val;
            }
            _xend();
        } else {
            if (i < j) {
                const vid_t tmp_val_i = read_and_lock( &vid_values[i] );
                const vid_t tmp_val_j = read_and_lock( &vid_values[j] );
                write_and_release( &vid_values[j], tmp_val_i );
                write_and_release( &vid_values[i], tmp_val_j );
            } else if (j < i) {
                const vid_t tmp_val_j = read_and_lock( &vid_values[j] );
                const vid_t tmp_val_i = read_and_lock( &vid_values[i] );
                write_and_release( &vid_values[i], tmp_val_j );
                write_and_release( &vid_values[j], tmp_val_i );
            }
        }
    }

Analysis

Example 1.

The fallback path races with the code in the RTM path. For example, the following interleaving may happen. (Always keep in mind that one should not make any assumptions about the relative speeds of threads!)

    Thread 1                           Thread 2
    --------                           --------
    start critical section
    read sums[mygroup]
                                       execute an RTM transaction that
                                       updates sums[mygroup]
    write sums[mygroup]

As a result, the example occasionally loses the increment done in the RTM transaction.

Example 2.

Don’t let the HLE transaction fool you. When an HLE transaction aborts, the hardware re-executes the critical section non-speculatively, actually acquiring the mutex. The RTM transaction never reads the lock variable, so it does not notice the acquisition, and the situation becomes effectively identical to Example 1: a non-speculative read-modify-write of sums[mygroup] racing with the RTM update.

Example 3.

Again, one should not make any assumptions about the relative speeds of concurrently executing threads. Even though the fallback path is race free on its own, it races with the code in the RTM path. For example, the following sequence of events may occur.

    Thread 1                                  Thread 2
    --------                                  --------
    read_and_lock( vid_values[i] )
                                              execute an RTM transaction that
                                              swaps vid_values[i] and vid_values[k]
                                              and makes vid_values[i] non-zero again
    read_and_lock( vid_values[j] )
    write_and_release( vid_values[j] )
    write_and_release( vid_values[i] ):
    waits forever for vid_values[i]
    to become 0

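To see that the fallback path really is race free on its own, here is a hypothetical stand-alone driver (mine, not from the benchmark; empty_val is taken to be 0, consistent with the code above) that exercises only the locking protocol: every slot holds a non-zero ID, both slots are always acquired in increasing index order, and the permutation survives concurrent swapping intact.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical driver exercising only the fallback protocol (no RTM path).
// empty_val (0) marks a slot as "locked"; every real ID is non-zero.
static const int64_t empty_val = 0;

int64_t read_and_lock(volatile int64_t* loc) {
    int64_t val;
    for (;;) {
        while ((val = *loc) == empty_val) { /* slot locked elsewhere: spin */ }
        if (__sync_bool_compare_and_swap(loc, val, empty_val))
            return val;                       // we now own the slot
    }
}

void write_and_release(volatile int64_t* loc, int64_t val) {
    while (!__sync_bool_compare_and_swap(loc, empty_val, val)) { /* spin */ }
}

// swap two slots, always locking the lower index first (no deadlock cycle)
void swap_slots(volatile int64_t* a, int lo, int hi) {
    int64_t vlo = read_and_lock(&a[lo]);
    int64_t vhi = read_and_lock(&a[hi]);
    write_and_release(&a[hi], vlo);
    write_and_release(&a[lo], vhi);
}

// returns true iff the array is still a permutation after concurrent swaps
bool permutation_survives(int nthreads, int nswaps) {
    const int N = 64;
    std::vector<int64_t> ids(N);
    for (int i = 0; i < N; ++i) ids[i] = i + 1;   // all IDs non-zero
    std::vector<std::thread> workers;
    for (int t = 0; t < nthreads; ++t)
        workers.emplace_back([&ids, t, nswaps, N] {
            uint32_t r = 12345u + (uint32_t)t;    // tiny LCG, demo only
            for (int k = 0; k < nswaps; ++k) {
                r = r * 1664525u + 1013904223u; int i = (int)(r % N);
                r = r * 1664525u + 1013904223u; int j = (int)(r % N);
                if (i != j)
                    swap_slots(ids.data(), std::min(i, j), std::max(i, j));
            }
        });
    for (auto& w : workers) w.join();
    std::vector<int64_t> sorted(ids.begin(), ids.end());
    std::sort(sorted.begin(), sorted.end());
    for (int i = 0; i < N; ++i)
        if (sorted[i] != i + 1) return false;
    return true;
}
```

The hang in the benchmark arises only once the RTM path is added, because the transaction can overwrite a slot that a fallback thread has already locked (set to empty_val), so that thread's final write_and_release never finds the 0 it waits for.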
Possible Fixes

Now that we have a concrete diagnosis for each of the examples, the fixes are straightforward.

Example 1.

Replacing ‘omp critical’ with an atomic increment such as __sync_add_and_fetch fixes the problem, since the fallback’s read-modify-write then becomes a single indivisible memory operation that either conflicts with (and aborts) the transaction or serializes cleanly with it. I.e.,

    __sync_add_and_fetch( &sums[mygroup], data[i] );
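As a quick sanity check, the following hypothetical snippet (my own toy driver, not the original test) sums values into group slots from several threads using only __sync_add_and_fetch and lets you verify that no increment is lost:

```cpp
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical driver: every thread adds its share of values into per-group
// slots with __sync_add_and_fetch, so concurrent read-modify-writes to the
// same slot cannot lose updates.
std::vector<int64_t> parallel_group_sums(int nthreads) {
    const int N = 100000, NGROUPS = 4;
    std::vector<int64_t> sums(NGROUPS, 0);
    std::vector<std::thread> workers;
    for (int t = 0; t < nthreads; ++t)
        workers.emplace_back([&, t] {
            for (int i = t; i < N; i += nthreads) {
                int mygroup = i % NGROUPS;          // stand-in for group[i]
                // the whole read-add-write is one atomic step
                __sync_add_and_fetch(&sums[mygroup], (int64_t)i);
            }
        });
    for (auto& w : workers) w.join();
    return sums;
}
```

Running it with any thread count produces the same sums as a single-threaded run, which is exactly what the ‘omp critical’ version above cannot guarantee once the RTM path is in play.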

A more general solution is to use a mutex in the fallback path and add it to the read set of the RTM transaction, forcing the transaction to abort whenever the mutex is acquired by another thread.

mutex fallback_mutex;
 
...
#pragma omp parallel for num_threads(8)
    for(int i = 0; i < N; i++){
        int mygroup = group[i];
        if(_xbegin()==-1) {
            if( !fallback_mutex.is_acquired() ) {
                sums[mygroup] += data[i];
            } else {
                _xabort(1);
            }
            _xend();
        } else {
            fallback_mutex.acquire();
            sums[mygroup] += data[i];
            fallback_mutex.release();
        }
    }
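The ‘mutex’ class above is left unspecified in this post; a minimal sketch of it (my own, not from the original test) could be a test-and-set spin lock whose is_acquired() simply reads the flag. Called inside an RTM transaction, that read puts the lock word into the transaction's read set, so a later non-speculative acquire() by another thread aborts the transaction.

```cpp
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical sketch of the 'mutex' used in the fix above.
class mutex {
    volatile int flag = 0;
public:
    void acquire() {
        // spin until we swap 0 -> 1; the builtin has acquire semantics
        while (__sync_lock_test_and_set(&flag, 1))
            while (flag) { /* spin while held */ }
    }
    void release() { __sync_lock_release(&flag); }  // store 0, release semantics
    bool is_acquired() const { return flag != 0; }  // plain read of the flag
};

static mutex fallback_mutex;
static long counter = 0;

// demo driver: concurrent increments stay exact under the lock
long locked_increments(int nthreads, int per_thread) {
    counter = 0;
    std::vector<std::thread> workers;
    for (int t = 0; t < nthreads; ++t)
        workers.emplace_back([per_thread] {
            for (int k = 0; k < per_thread; ++k) {
                fallback_mutex.acquire();
                ++counter;                  // protected read-modify-write
                fallback_mutex.release();
            }
        });
    for (auto& w : workers) w.join();
    return counter;
}
```

Any lock implementation works here as long as is_acquired() performs an actual memory read of the lock word; that read is what ties the fallback path and the transaction together.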

Example 2.

Similarly, we may extend mutex_elided with an is_acquired() method. Since the lock variable is read inside the RTM transaction, any non-speculative execution of the HLE path that makes a change to the lock variable visible will abort the transaction.

    mutex_elided m;
#pragma omp parallel for num_threads(8)
    for(int i = 0; i < N; i++) {
        int mygroup = group[i];
        if( (i&1) ) {
            while(_xbegin()!=-1) // having no fallback path is
                ;                // UNSAFE
            if( !m.is_acquired() )
                sums[mygroup] += data[i];
            else
                _xabort(0);
            _xend();
        } else {
            m.acquire();
            sums[mygroup] += data[i];
            m.release();
        }
    }

Example 3.

We can also apply the mutex-based approach to this example. Another approach is to read the two ID values inside the RTM transaction and check whether either of them contains empty_val (0 in this test). If so, we abort the transaction and force the thread to follow the fallback path.

#pragma omp parallel for num_threads(16)
    for (int i=0; i<n; i++) {
        int j = (int64_t) ( n * gen_rand() );
        if( _xbegin()==-1 ) {
            if(i != j) {
                const vid_t tmp_val_i = vid_values[i];
                const vid_t tmp_val_j = vid_values[j];
                if( tmp_val_i==empty_val || tmp_val_j==empty_val )
                    _xabort(0);
                vid_values[i] = tmp_val_j;
                vid_values[j] = tmp_val_i;
            }
            _xend();
        } else {
            if (i < j) {
                const vid_t tmp_val_i = read_and_lock( &vid_values[i] );
                const vid_t tmp_val_j = read_and_lock( &vid_values[j] );
                write_and_release( &vid_values[j], tmp_val_i );
                write_and_release( &vid_values[i], tmp_val_j );
            } else if (j < i) {
                const vid_t tmp_val_j = read_and_lock( &vid_values[j] );
                const vid_t tmp_val_i = read_and_lock( &vid_values[i] );
                write_and_release( &vid_values[i], tmp_val_j );
                write_and_release( &vid_values[j], tmp_val_i );
            }
        }
    }

Conclusions

So, what have I learned from these examples? As you may have noticed, all of them are related to the ‘restricted’ part of RTM. Intel TSX has great potential for improving the performance of concurrent and parallel applications, but the synchronization between the speculative code inside an RTM transaction and its non-speculative fallback path needs to be managed carefully, since the interactions are subtle. I gather most programmers won’t need to worry too much about this, because higher-level abstractions in supporting libraries should hide most of the agonizing synchronization details. But for those who are willing to get their hands dirty to squeeze out the last drop of performance, it always pays to keep a watchful eye on the interactions between an RTM code path and its non-speculative fallback. (And we have many tools, such as Intel SDE, to assist you.)

Disclaimer: The opinions expressed in this blog are the author's own and do not reflect those of his employer or his colleagues.
