I just got my hands on a Haswell system and tried the new TSX extensions, hoping to boost the performance of my multi-threaded app.
But what I found was rather shocking. The numbers below are execution times in microseconds:
A) 29122 - App running with a single thread and without any locking
B) 42762 - Same as A) above, but with an empty XBEGIN/XEND pair (nothing in between) added at each critical section. So even though no real transactional work is done yet, the code takes 46% longer to execute. That's much more than I had expected.
C) 50410 - Like B) above, but now the XBEGIN/XEND actually wraps the critical section, and a lock is acquired if the transaction fails (i.e. that's now the way it's meant to be played, still running a single thread). Locking is done with a pause/lock cmpxchg spinloop.
D) 47591 - Dropping the XBEGIN/XEND and just using the old-fashioned pause/lock cmpxchg spinloop to protect each critical section. So TSX in C) is slower than plain lock acquisition even with a single thread.
E) 10697 - Like C), using TSX but running 8 threads
F) 9935 - Like D), using old-fashioned locking and running 8 threads
Summary: whether I use 1 or 8 threads, TSX takes roughly 6-8% longer than normal locking. And of course I am not talking about "beginner's sissy locking", where you have one single global lock, but about performance-tuned fine-grained locking.
Performance with the XACQUIRE/XRELEASE prefixes (i.e. HLE, hardware lock elision) was even worse, BTW.
So the question: has any of you managed to improve performance with TSX over old-fashioned (but fine-grained) locking?