Although this is directly related to "Additions to atomic" I decided to spin it off into a thread of its own (please forgive the word play).
I have been confused about the best way to spin either for changing an atomic variable or for acquiring a mutex. Notice the different characteristics in the duration of the critical section (extremely short resp. the programmer thought it would be fairly short). Another characteristic that may be relevant is the level of contention, and this is currently being considered dynamically by the use of a backoff mechanism. Then there's the consideration for possible other work in another thread on the same core, so there's not just bandwidth to be considered when spinning. And at the hardware side: do all platforms in scope use write-and-update rather than write-and-invalidate, and what happens on each one for a failed CAS, and for the whole algorithm?
TBB itself recently made a change (details omitted here) that would indicate that its current approach needs further vetting. In my most recently contributed change for atomic I made a big mistake (quickly retracted) that alerted me to the fact that I'm not quite the expert on the matter (either?).
When I was wondering about AtomicBackoff/ExponentialBackoff (they're actually almost identical, so that refactoring could make one a subclass or typedef of the other, doubling the previous delay each time pause() is called), I thought that perhaps atomics/mutexes should be approached like what is done on a shared medium like Ethernet (the original shared coaxial cable anyway). But there are some differences in the techniques employed. On Ethernet, the network card first listens for traffic, and when it detects a silence it waits for a random amount of time before it will send anything (the level of congestion affects the mean value of the delay, but it's still random). On TBB, there's no listening step (which may lead to decreased overall performance due to increased chatter), and there's no randomness (which introduces unfairness against, potentially up to the point of starvation of, threads that have been waiting the longest).
Unfortunately, any introduction of complexity where contention is low will probably needlessly increase latency, and for lack of the resources and dedication that a scholar might have I'm leaving out the randomness here, except for the consideration that medium delay should probably be bounded so as to avoid gross unfairness and starvation (TBB's LOOPS_BEFORE_YIELD is not the same because it will increase unfairness even more, and it won't even do anything useful unless there are other users than TBB on the system), and also wondering whether __TBB_Pause() should not have a default implementation other than __TBB_Yield(). Likewise, I don't know how useful and difficult it would be to base the initial pause on the last pause of a previous acquisition.
So, without further ado, here's a proposal for your consideration about a first possible improvement, only for mutex at this time:
for( AtomicBackoff b; test_and_set()==HELD; b.pause() );
For reference, the TBB status quo is:
If you're not already familiar with the issues, remember that, individually, a thread can get better service by never reading or pausing, but, like on Ethernet (the original one anyway) or Internet (where you can install some evil device on your computer to browse faster), overall throughput will suffer, especially if everybody does this. If you are an expert, please chime in, because this may well be just reinventing the wheel.