As is well known, atomic read-modify-write operations on x86 are implicitly sequentially consistent, so where it is possible it may pay, performance-wise, to avoid them altogether, e.g. by using extra storage and making two passes instead of simply using atomics.
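To make concrete what I mean by "extra storage and two passes", here is a minimal sketch (the function names and the shape of the code are mine, not from any particular library): the first version has every thread hammer one shared atomic, the second gives each thread a plain slot in pass one and sums the slots in pass two, with no atomic RMW at all.

```cpp
#include <atomic>
#include <numeric>
#include <thread>
#include <vector>

// Shared-atomic version: every increment is a lock-prefixed RMW on x86,
// even with memory_order_relaxed.
long count_shared(int nthreads, long per_thread) {
    std::atomic<long> total{0};
    std::vector<std::thread> workers;
    for (int t = 0; t < nthreads; ++t)
        workers.emplace_back([&] {
            for (long i = 0; i < per_thread; ++i)
                total.fetch_add(1, std::memory_order_relaxed);
        });
    for (auto& w : workers) w.join();
    return total.load();
}

// Two-pass version: pass one uses an ordinary per-thread slot (no lock
// prefix, no fence); pass two sums the slots after the join, which
// already provides the needed synchronisation.
long count_two_pass(int nthreads, long per_thread) {
    std::vector<long> slot(nthreads, 0);
    std::vector<std::thread> workers;
    for (int t = 0; t < nthreads; ++t)
        workers.emplace_back([&, t] {
            for (long i = 0; i < per_thread; ++i)
                ++slot[t];  // plain increment on thread-private storage
        });
    for (auto& w : workers) w.join();
    return std::accumulate(slot.begin(), slot.end(), 0L);
}
```

In real code the slots would of course be padded to cache-line size to avoid false sharing; I left that out to keep the sketch short.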
But has anybody done meaningful testing of such atomic operations, with and without compiler fences, in cases where such a transformation is not possible and atomics have to be used? Even though copy-on-write std::string implementations are no longer relevant (C++11 forbids them), there are other situations where thread-safe reference counting is used, global counters, and so on. The default atomic increment presumably implies sequential consistency, as it does in TBB, but that operates at a different level than the underlying machine operation because of the implied compiler fences.
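For the reference-counting case, the pattern I have in mind is the textbook one (the names here are mine): on x86 both the relaxed increment and the acquire/release decrement compile to the same lock-prefixed instruction, so the memory_order argument changes only what the compiler may reorder around them, which is exactly the compiler-fence distinction I am asking about.

```cpp
#include <atomic>

struct RefCounted {
    std::atomic<long> refs{1};
};

// Taking a reference needs no ordering: the caller already owns one.
void add_ref(RefCounted* p) {
    p->refs.fetch_add(1, std::memory_order_relaxed);
}

// Releasing needs acquire/release so the thread that performs the
// delete observes all writes made before other threads released
// their references. Returns true if this call destroyed the object.
bool release(RefCounted* p) {
    if (p->refs.fetch_sub(1, std::memory_order_acq_rel) == 1) {
        delete p;  // this was the last reference
        return true;
    }
    return false;
}
```

Benchmarking this with the orderings strengthened to seq_cst would, I think, isolate the cost of the extra compiler fences on x86, since the generated RMW instructions stay the same.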
x86 is lighter on registers than some other architectures, where the cost of vacating and restoring live registers around an atomic would be more obvious, and in a loop the locations would probably be exclusively owned by the local cache after the first iteration; but perhaps it would still make a noticeable difference in other circumstances?
There are many other things to worry about, let alone investigate, but I still wanted to tap the collective wisdom to at least satisfy my curiosity.
Also, perhaps a smart compiler could analyse which registers can safely stay live across even a non-relaxed atomic operation (not necessarily just a read-modify-write or sequentially consistent one): has anybody heard of any research in that area?