TSX anti patterns in lock elision code

Lock elision is a new way to scale programs. It requires following some rules for correctness and good performance. The basic rules are documented in the SDM, chapter 15, and the OPT-GUIDE, chapter 11.

After mastering the basic principles there are some standard mistakes in implementing lock elision that can dramatically lower performance or even affect correctness.

This document is not an introduction to TSX, but assumes the basics are already well understood (if not see ACM article or TSX-resources first).

The problems documented here are only for the actual lock elision code. If you use an existing lock library that implements lock elision you may not need to worry about this (assuming the lock library designer got everything right) It can be a concern for applications when writing custom locking code or custom elision wrappers in an application.

Of course application code still may need some extra tuning to minimize unnecessary transactional aborts (see the OPT-GUIDE, chapter 11, for details), but any problems there are localized and typically less severe. But getting the basic locking code right is important as it will affect all elided locks. I may write about application level problems some other time, this is just for the lock implementation.

The lemming effect

Elided locks enforce an invariant that either all users of a lock are eliding or all take the lock for real. There are some special cases that can avoid this invariant, but the common case always does this. The invariant makes sure that the locking semantics are correctly enforced against any thread that is holding the lock for real.

Elision enforces this by putting the lock variable into the read set of the transaction. Any thread taking the lock for real will write to the lock variable and force all threads currently eliding on the lock to abort. These threads will then either retry the transaction or try to take the lock for real. When multiple threads are doing this in parallel they can get in a state where they continuously prevent each other from getting back into the elided state, stopping lock elision for a longer time. This is called the "Lemming effect" (Dice et.al "Applications of the adaptive transactional memory test platform"), after the fictional lemmings who follow each other in jumping from cliffs (real rodent lemmings are not suicidial)

Non suicidial rodent lemming

Non suicidial rodent lemming

The lemming effect is only a problem when the abort rate is getting high. If the abort rate is very low it is not visible. Essentially lemming effect makes it more costly to recover from aborts. There are a range of design choices in the elided lock code that can dramatically increase the lemming effect. With some precautions the problem can be minimized.

Many of the anti patterns below deal with the lemming effect, in particular lock implementation choices that accelerate it.

Anti-Pattern 1: No PAUSE for HLE

Here is a simple (incorrect) HLE enabled spin lock (written using tsx-tools hle-emulation.h functions):

/* lock == 1 means unlocked */
void hle_spin_lock(int *lock)
{
    while ((int)__hle_acquire_sub_fetch4(&lock, 1) < 0) {
        /* spin until the lock is free before retrying */
        while (*lock != 1)
            __sync_synchronize(); /* re-read "lock" */
    }
}

(wrong)

The lock is initially 0 meaning free. The compare exchange tries to set the lock to 1, if it was 0. Otherwise it spins for the lock to become free before trying again. The compare exchange has a XACQUIRE HLE prefix, which starts a transaction to elide the lock. This example does not do a PAUSE after a lock acquire failed.

The HLE hardware does not know if a compare exchange was successfully acquiring the lock or not. It always starts the transaction unconditionally. But if the lock was locked the elision cannot elide successfully, as it would violate the lock-or-elision invariant.

HLE lemming effect without PAUSE

HLE Lemming effect without PAUSE

Now consider what happens when multiple threads are attempting to acquire a locked lock. They all start a transaction and start spinning inside the transaction. Eventually the original lock holder unlocks the lock and aborts all the other threads spinning in a transaction. They immediately go back to trying to acquire the lock non transactionally, which prevents successful elision taking place again. This continues for some time until all threads trying to acquire the lock have cleared the queue. This is the lemming effect in action.

A better way is to add a PAUSE inside the spin loop (this is a good idea for other non TSX reasons anyways). PAUSE forces an abort, so all the spinning will be done outside transactions. When the loop see the lock getting free it go back to the HLE instruction and try to re-speculate at the same time, which has a good chance of working. Also if one still needs to take the lock the others have a chance of successfully re-speculating later, instead of being doomed to take the fallback lock.

This prevents the cascading effect of multiple aborts and the recovery from the fallback is much faster.

HLE fallback with PAUSE

HLE fallback recovery with PAUSE

Corrected example with PAUSE:

void hle_spin_lock(int *lock)
{
    while ((int)__hle_acquire_sub_fetch4(&lock, 1) < 0) {
        /* spin until the lock is free before retrying */
        while (*lock != 1)
            _mm_pause(); /* Re-read lock and abort. */
    }
}

Essentially it's good to retry elision in parallel, but bad to retry the locking fallback in parallel. If you think of lock elision like a state machine that runs on every CPU the "take fall back lock" state needs to be desynchronized.

Always add a PAUSE into a HLE spin loop.

Anti-Pattern 2: Retrying RTM too fast

Another similar problem is retrying too fast in a RTM transaction. Adding some retries to RTM lock elision is usually good for performance, as aborts (such as conflicts) are often transient and can be resolved with a retry. Also when the lock is hold for real it may be freed quickly and the the transaction would succeed, without needing to fall back to a real lock.

#define RETRIES 3
int i;
unsigned status;
for (i = 0; i < RETRIES; i++) {
    if ((status = _xbegin()) == _XBEGIN_STARTED) {
        if (lock is free)
            return; /* Execute critical region */
        _xabort(0xff); /* lock busy */
    }
    if (!(status & _XABORT_RETRY) &&
        !((status & _XABORT_EXPLICIT) &&
            _XABORT_CODE(status) == 0xff))
        break;
}
/* Take fallback lock */

This example retries immediately when it detects an abort. Now consider CPU A is holding the lock for longer time. CPU B is trying to acquire the lock and sees it being busy. It starts retrying. The retry count (3) on B would be used up very quickly, without accomplishing anything useful, as A keeps holding on to the lock. Eventually CPU B will fall back to acquiring the lock for real. If multiple CPUs are doing this in parallel they also can get into a lemming pattern. They queue behind each other waiting for the real lock and do not get back quickly to the elision state.

The fix is to not use up the retry count too quickly. This can be done by inserting a wait for the lock becoming free before doing the retry. This way the retries for lock busy are far more effective, as they actually have a chance to succeed, and the recovery from one CPU taking the lock can be much faster.

Here's a modified code example that implements this wait.

#define RETRIES 3
for (i = 0; i < RETRIES; i++) {
    if ((status = _xbegin()) == _XBEGIN_STARTED) {
        if (lock is free)
            return; /* Execute critical region */
        _xabort(0xff); /* lock busy */
    }
    if ((status & _XABORT_EXPLICIT) && 
        _XABORT_CODE(status) == 0xff) {
        Wait for lock to become free
    } else if (!(status & _XABORT_RETRY))
        break;
}

An open implementation detail is how to implement the wait for the lock. For spinlocks it is very easy. Just spin until the lock becomes free. For sleeping locks it would need a primitive that waits for the lock to become free without acquiring it. This is usually not provided in OS interfaces (and could potentially have thundering herd effects). One compromise is to spin on the lock but only for a limited time, similar to an adaptive mutex.

When aborting due to lock busy wait for the lock to become free before retrying.

Anti-Pattern 3: Retrying RTM too often

Adding retries to RTM elision usually improves performance. Increasing the maximum retry count more and more can be tempting because it will improve performance in some workloads. However when the retry count is too high there is a risk of massively accelerating the lemming effect on other workloads. For example consider two CPUs A and B both attempting speculation. They run their critical section for a while and then after a while reliably each causes a conflict that aborts the other. Then they retry and the same happens, for a long time. The only way to get out of this situation is to go to the fallback lock. But when the retry count is too high they will waste a lot of time mis-speculating until they finally decide to fall back.

Avoid too many retries in RTM elision.

Anti-Pattern 3: Retrying RTM forever

An extreme case of the RTM retry pattern would be to retry forever (for example if no fall back lock exists). This is not only a performance problem, but also just incorrect. TSX does not guarantee any transaction succeeding, so this may lead to program hangs.

This particular mistake is unfortunately very common in introductionary slides.

A standard example where a hang could happen is at program start up. A memory page used inside the transaction needs to be faulted in the first time it is accessed. The page fault cannot be processed inside a transaction and always causes an abort. With normal lock elision it is executed while holding the fallback lock. But if there is no fallback lock (or other fallback path) the page fault will never succeed and the program will not make progress.

Other similar cases can happen with dynamic linkers, debuggers, page dirty bits, other exceptions, or some other execution patterns.

Always limit the number of retries.

Anti-Pattern 4: Not putting lock into read set

Lock elision relies on the lock variable itself being in the read-set to guarantee full synchronization between speculation and real lock holders. This requires reading the lock variable at least once in the transaction. HLE does this automatically if the lock only uses a single lock variable. It is possible to do it wrong even with HLE when the lock has multiple lock variables.

Consider an naive implementation of elision using RTM in an existing program

if (_xbegin() == _XBEGIN_STARTED) {
    shared_var++;
    _xend();
} else {
    lock(&lockvar);
    shared_var++; /* read-modify-write */
    unlock(&lockvar);
}

(wrong)

CPU A is going to the fall back path. It reads does a read-modify-write to increment shared_var. After A read shared_var, B comes in an does a transaction that also increases shared_var. A is still stuck doing its read and the write from A is not visible before B commits. We would lose an increment.

Another similar case is the fallback path starting after a transaction started. In this case the transaction may also not notice any update if it happens after the commit, but the transaction state may still be inconsistent.

If we include lockvar into the read set and check if it is free these problems are avoided. Any write to the lock variable causes an abort and transaction and fallback path are properly synchronized:

if (_xbegin() == _XBEGIN_STARTED) {
    if (lock var is locked)
        _xabort(0xff); /* Lock busy */
    shared_var++;
    _xend();
} else {
    lock(&lockvar);
    shared_var++;
    unlock(&lockvar);
}

Wooyoung Kim has some more examples of this problem.

For RTM always include the lock into the read-set and check if it is free.

Handling locking libraries that do not expose lock variable

A lot of locking libraries (e.g. pthreads or omp critical) may not expose lockvar directly to the application, but hide it in an abstract data type or in the runtime system.

The best option of course is to add the elision to the lock library itself.

In theory try_lock could also be used for checking the lock variable. (if it succeeds the transaction would abort, so the locking is undone). However this has some drawbacks: it does not work correctly if the lock is recursive. Also some lock libraries that have been TSX enabled abort trylock to guarantee identical answers to non TSX locking. If the transaction used such a try_lock for a lock check it would never succeed. TSX implementations of lock_is_locked() often have similar problems. This implies that an TSX enabled application would conflict with an TSX enabled lock library.

An alternative in this case is to use a second lock variable that is always written both by fallback path and read by the transaction before doing anything else. When doing this make sure the compiler cannot move the write in the fallback path by adding appropriate memory barriers and that all possible paths correctly handle this second variable.

Lock variables in reader/writer locks

Another question is what is the lock variable for a reader writer lock. These often have multiple lock variables, separated for the read and write cases. For writers it is always both the read and the write lock variable For readers it can be in principle only the write variable to let readers speculate independently, if we can put the two variables on different cache lines (also see next section for some caveats). The simplest and safest implementation is to always check both reader and writer. For HLE this means they may need to be combined into a single variable.

Anti-Pattern 5: Using xtest instead of lock test at end

For RTM lock elision and when the critical section is shared with the fall back the unlock normally needs to know whether it was running transactional or with the fall back block. One tempting way to write this is like

void rtm_unlock(int *lock)
{
    if (_xtest())
        _xend();
    else
        unlock lock;
}

(wrong)

However this breaks with nested locks. The outer look may be elided, but the inner lock not elided for some reason (e.g. the locking was adaptive and the adaptive algorithm decided to not try the transaction, or some other library used transactions independently). In this case the inner lock unlock will commit the wrong transaction when it only checks _xtest().

A better way is to check the lock variable is free. In a transaction it should be always free (see previous section)

void rtm_unlock(int *lock)
{
    if (lock is free)
        _xend();
    else
        unlock lock;
}

It may be tempting to ignore this when you know there is no adaptive lock and no other nested transactional lock libraries. However adaption can have very nice performance benefits longer term for complex workloads, and it is a mistake to prohibit it in the basic design. Also it is bad style to not interoperate with other code using transactions.

The only drawback of the "is lock free" technique is when the unlock is ever called on a free lock then _xend() will cause an exception (or commit the wrong transaction) This is usually a programing error and should be fixed. If the previous non-elided lock library allowed this behavior it may be an option to handle it too for compatibility (at some extra fast path overhead cost).

/* Tolerant of unlocking freed non-nested locks */
void rtm_unlock_tolerant(int *lock)
{
    if (lock is free && _xtest())
        _xend();
    else
        unlock lock;
}

When guarded by a lock variable check, the _xtest() is correct. However when called on a free lock it may still commit the wrong transaction in the nested case.

Safe unlocking in reader writer locks

Finally how do we handle this on reader writer locks? The simple case is checking both reader and writer in all unlock cases. In principle it is possible to let readers speculate independently from other readers, as described previously. But if we do that we cannot test the reader for being free, to see if we are in a transaction, because it may be locked legitimately by another independent reader. To make this work we would need to pass the transaction state from the lock to the unlock, requiring new arguments to the lock interface. A simple per thread counter unfortunately also doesn't work, as it could also mis-handle nested locks.

In some languages (like C++) it is possible to use scoped locks where the "in transaction" state can be kept in the stack frame (see for example Thread Building Blocks for an implementation)

The safest and simplest way is to not allow readers speculate independently by always including the reader and the writer variable in the read-set.

Anti-Pattern 6: Using the RTM abort code as a profiler

_xbegin() returns a status code indicating why the transaction aborted. The status code is designed for adaptive locks, that is locks that use past behavior to predict whether the next transaction will likely be successful. It can be tempting to use this status word as a simple profiler by adding some global accounting to the program.

The status word has much less functionality than the full profiling support in TSX. The profiler allows to check for more conditions, gives much better information (for example aborted cycles) and allows to sample the actual abort point using the performance counter hardware. At some point in TSX tuning it is typically needed to use a full profiler anyways (see TSX resources or TSX on Linux perf). Don't bother cluttering up the program with an inferior home grown one.

Nähere Informationen zur Compiler-Optimierung finden Sie in unserem Optimierungshinweis.