Often programmers starting with Intel transactional Synchronization Extensions (TSX) don't fully appreciate that TSX transactions are best effort and always need a fallback path. So you cannot just use _xbegin() as a "super CMPXCHG". TSX always needs fallback code that guarantees progress without using transactions, otherwise the code may fail to make progress. Just putting a retry loop around the transaction is not enough.
The simplest case of course is lock-elision. Simply use a lock for fallback. If that is done the program doesn't even need to know about TSX, it can just use locks, and the lock library takes care of the transactions and the fallback. Chapter 12.3 in the optimization manual describes how to add TSX to an existing lock library. This is expected to be the most common model, as it fits best to common existing software.
With the non-blocking "fast path" offered by the transaction it is also often possible to use more coarse grained locks to get good enough performance, while simplifying the program. As a extreme case it may be possible to use a global lock (or very few locks), which avoids lock ordering problems and simplifies the program. Of course for best performance the program may still need a few changes to improve elision success.
Other fallbacks for transactions are also possible. For example gcc 4.8+ with -fgnu-tm supports falling back to a software transactional memory code path. The programmer uses the __transaction_relaxed keyword to specify transactional regions (see the specification) The compiler generates two code paths for transactional code: one un-instrumented, and an instrumented one where every global memory access calls into the libitm library. On execution the un-instrumented path is first executed as a TSX transaction, and when the transaction fails the instrumented path together with the software transactional (STM) libitm library is used as a fallback.
The STM library would typically use (fine grained) locks internally, but to the program these locks are not visible.
The instrumentation usually has a high single-thread performance overhead, so the fallback path may be somewhat parallel, but still be relatively slow.
Some other programing languages, like Clojure or Haskell GHC implement similar STM models. Neither is currently using TSX as a fast path, but I expect they eventually will.
Another case would be to use an existing non-blocking (lock less) algorithm as a fallback path. Essentially the fallback for the "super CMPXCHG" XBEGIN would be real CMPXCHG (or other atomic constructs). Of course to do this you need a non blocking algorithm first. And the non blocking algorithm has to have at least one path that does not rely on a transaction, but only on traditional atomic operations. The literature has many such algorithms, many of them complex and non trivial (when reading a non-blocking paper, always look for the errata first). For example there are many versions of lock less hash tables and lock less FIFOs.
If you already have a good lockless algorithm implemented, in many cases it may be fast enough by itself, so you don't even need a transactional fast path. But in other cases the lock less implementation may have high overhead, so transactions may still win. Or it may have a complex trade-off (for example commonly between reader and writer performance) and transactions may help to reduce the penalties for some of these cases. The important part is just that there is always a non transactional fallback path to guarantee progress.
The performance trade-offs for all these alternatives are different. TSX also changes the picture considerably, as it adds a non-blocking fast path. In many cases a simpler implementation (e.g. a simple coarse grained lock) may be competitive or even better now than more complex alternatives. But it always depends on the exact workload, so benchmarks of the actual application (not micro benchmarks) are needed.
Of course in many cases it's academic: if the program is already using locks the effort needed to implement and properly test a non locking fallback path may be considerable. If the performance with simple lock elision (with a little tuning to avoid common aborts) is good enough there is not a good reason to spend the time on something more fancy.
So for a practical implementation it's usually advisable to just use the simplest approach that works fast enough.
In any case it's important that the fallback code correctly interoperates with the TSX transaction. The standard and simplest way to do this is to enforce the invariant that either the fallback path or the TSX path is operating on all CPUs, but never both. With locks this is typically done by just putting the lock variable into the read-set. Any fallback lock will write to the lock variable and then abort all parallel transactions.
Another common mechanism for such synchronization when not using lock is a sequence lock: increment counters before and after changing any state (with appropriate memory barriers) in the fallback and put both counters into the read-set of each TSX transaction. This may however need careful partitioning to avoid unnecessary cache line transfers between CPU cores on the fast path. Other techniques are also possible.
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804