atomic xor,or,nand

atomic xor,or,nand

Is there a specific reason that tbb::atomic only supports fetch_and_add/sub but not the xor,or,nand ops supported by gcc atomic intrinsics?

Under which circumstances would the fetch_and_store opration be usefully?


8 posts / 0 nouveau(x)
Dernière contribution
Reportez-vous à notre Notice d'optimisation pour plus d'informations sur les choix et l'optimisation des performances dans les produits logiciels Intel.

Hear, hear, especially since x86 can implement them directly. But C++11 also omits them.

Unconditional swap can be used for spin locks. I can't imagine why TBB doesn't (see default implementation of __TBB_TryLockByte()), but I haven't looked at this code for a while.

Fetch-and-store is useful forlock-free manipulation ofsingly-linked lists. For example, you can grab an entire list by doing fetch-and-store on the location holding the root, and storing a NULL pointer in exchange for the root of the list. In the TBB sources, search for "return_list" in src/tbb/scheduler.cpp to see an instance of this technique. The other common idiom for fetch-and-store is appending an item to a singly-linked queue. See method mailbox_outbox::push in src/tbb/mailbox.h for an example. The appending trick also shows up in qkueuing_mutex::scoped_lock::acquire (in src/tbb/queuing_mutex.cpp).

As far as I know, fetch-and-nand has no real-world use. I'd like to hear from others about any use case they have found for it. Note that C++11 does not have fetch-and-nand either.

As for fetch_and_xor and fetch_and_or, we had to stop somewhere, and I originally drew the line at where the Intel hardware stopped having support. That was in the first year when TBBwas aproprietary product. Now that it's cross-platform, perhaps we should be less provincial and include fetch_and_or and fetch_and_xor, or perhaps even better,include ageneric fetch-and-op that takes an arbitraryfunctor for the"op".

Is there any advantage of CAS over fetch-and-store for try-lock, like cheaper failure (acquire on success, relaxed on failure)? Not on x86, I would think, so LOCK XCHG seems cheaper, at first sight, than LOCK CMPXCHG.

C++11 doesn't seem to have any of those fetch-and-bitop operations (maybe I overlooked them?).

Aren't fetch_and_{and,or,xor} all supported (by combining the basic operation with LOCK)? Unlike nand, they're elementary operations in C++, so it doesn't seem far-fetched to have them for the sake of orthogonality, even if the general case isn't supported...

(Of course anything can be built from CAS.)

In theory, CAS does not have to dirty the cache line on failure. In practice,I'm not sure if it does dirty the cache line or not. I recall timing the alternatives years ago on a Pentium 4 when TBB was first being developed, but forget the results. On big-core machines most of the cost is the fence-related issues, so the particular instruction probably does not make a big difference.

See table 147 in 29.6.4. It lists add, sub, and, or, xor. The last three are bitwise.

x86 supports atomic {and,or,xor} via the lock prefix. Alas the instructions don't give the programmer the value that was read from the memory location, so they can't do the "fetch" part. That's the reason xadd was added to the instruction set even though there was already "lock add".

"In theory, CAS does not have to dirty the cache line on failure."
Really? Like an implicit test-and-test-and-set? I'm all for it, but then why would the more elaborate pattern even be a pattern? Meditate on this, I will.

"See table 147 in 29.6.4. It lists add, sub, and, or, xor. The last three are bitwise."
Of course (maybe I was searching for how TBB would name the operation?)...

"Alas the instructions don't give the programmer the value that was read
from the memory location, so they can't do the "fetch" part."
At the risk of another mistake (looking at N3242, not the definitive version): it seems unfortunate (or maybe not so much, but orthogonalty makes me happy) that C++11 doesn't give access to those fetch-less atomic operations that are directly supported by the dominant architecture.

Yes, so far hardware has been dumb about doing test-and-test-and-set automatically for CAS (sigh).

A an optimizing compiler should have no trouble optimizing the fetch-and-op forms into the fetchless forms when the result is not used.I tried the following example with icc 12.1.3 and gcc 4.4.5:

int x;

int y;
int foo() {

    int z = __sync_fetch_and_xor(&x,42);

    return __sync_fetch_and_xor(&y,42);


and both compilers optimized the first xor into "lock xor", since z is unused, and used a compare-exchange loop for the second xor only.

Pretty neat (disregarding of course that it's not portable, requires optimisation and even then is not guaranteed)!

Laisser un commentaire

Veuillez ouvrir une session pour ajouter un commentaire. Pas encore membre ? Rejoignez-nous dès aujourd’hui