Data Structures with Atomic Operations

AJ's picture

I remember Arch saying something along the lines of: implementing a data structure with atomic operations can sometimes provide no better performance than just wrapping a sequential data structure with a mutex.

Before I repeat the TBB team's past work, I'm looking at wait-free data structures and the penalty of using atomic operations to implement them. Any general suggestions or observations on the performance of wait-free data structures?

In particular I'm looking at data structures in the "Art of Multiprocessor Programming" book to implement. Before I spend a lot of time doing this, any words of wisdom?

Thanks!

AJ

robert-reed (Intel)'s picture

So-called "lock-free" parallel data structures rely on the conceit that data structure updates can be handled as atomic operations. You could argue that a mutex is itself a "lock-free" parallel data structure, in that it uses atomic operations to update its data without locking in order to implement... well, a lock. Restricting data structure updates to atomic operations does limit the variety of data structures that can be implemented in a lock-free manner.

I don't know what quote of Arch's that you may be referring to, but I do know that there are data structures with so many internal dependencies that implementations that try to use internal locks/atomics are not likely to generate better performance than alternatives that just wrap the data structure accesses in a global lock.

AJ's picture

Thank you for the response.

I have been fortunate that the book "The Art of Multiprocessor Programming" has a collection of wait-free algorithms that I can implement directly using TBB in C++. Unfortunately they all have different consistency guarantees, but proper documentation can address that.

My previous post on atomic operation performance on Intel hardware gave me a general idea of the performance to expect on highly contested data structures. What I didn't post, although perhaps I should, is that if you do a loop like for (int var = 0, i = 0; i < 500; ++i) var++; the operations scale just fine... I take this to mean that the cache lines have had a chance to update.

I see two ways to implement scalable wait-free data structures:

1) Use atomic operations only.

2) Use thread-local storage to perform updates on thread-local parts of a data structure, combined with a sprinkling of atomic operations which are seldom called. For instance, an unordered_vector could work by giving each thread a contiguous segment of memory to fill with elements when it calls insert(); when that segment is filled, the thread gets another one, which results in an atomic increment of the number of segments available.
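Approach 2 can be sketched in a few lines. This is my own illustration, not code from the thread or from TBB; the names (Segment, unordered_vector, kSegmentSize) are hypothetical, segment memory is deliberately leaked, and the thread_local pointer is shared across container instances, all of which a real container would have to address. The point is only that the insert fast path touches no shared cache line, and the shared atomic is hit once per segment, not once per element.

```cpp
#include <atomic>
#include <cstddef>

constexpr std::size_t kSegmentSize = 1024;  // assumed segment capacity

struct Segment {
    int data[kSegmentSize];
    std::size_t used = 0;
};

class unordered_vector {
public:
    void insert(int value) {
        // Per-thread segment; note it is shared across all instances of
        // this class, which is fine for a one-container sketch only.
        thread_local Segment* seg = nullptr;
        if (seg == nullptr || seg->used == kSegmentSize) {
            seg = new Segment;  // ownership tracking omitted in this sketch
            // The only shared write: one atomic increment per segment.
            segments_.fetch_add(1, std::memory_order_relaxed);
        }
        seg->data[seg->used++] = value;  // contention-free fast path
    }
    std::size_t segment_count() const {
        return segments_.load(std::memory_order_relaxed);
    }
private:
    std::atomic<std::size_t> segments_{0};
};
```

With kSegmentSize of 1024, a thread inserting 1025 elements performs exactly two atomic operations.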

After a great deal of thought, neither approach seems perfect by itself, so I have decided to implement internal structures to support both. I should stress that my intention is to never let the user see these low-level details; they will simply be given the interface to the data structure, and it will just work as advertised.

Thoughts?

Raf Schietekat's picture

A simple observation: "aj.guillon" writes about "wait-free" and Robert Reed responds regarding "lock-free". The former is stronger than the latter, and a mutex is never wait-free even if it is lock-free (lock-free at a high level: on Alpha and POWER/PowerPC the minimum is optimistic locks at the machine-language level, which I think is a RISCy thing to do, others will be locking somewhere further down, I suppose, although sometimes also higher up, like on PA-RISC, which doesn't make a good impression here). In another thread it has become clear that even if the machine-language level mentions "bus locking", it doesn't have to be as bad as all that, but I'm still sceptical (also because I'm still learning).

It would be interesting to hear about this straight from the horse's mouth, even if only in the form of links to Arch's previous writings on the subject: have results that discredit the relative efficiency of lock-free/wait-free algorithms been obtained on all sorts of hardware? If such algorithms don't work better on either Alpha or POWER/PowerPC, I suppose it could only be the programmer's lack of ability or motivation, not the hardware designer's, but I "hesitate" to accept results obtained on any other machine as final (in particular about skip list maps). Are these hardware dependency assumptions valid?

robert-reed (Intel)'s picture

Wait-free, lock-free, it seems easy to get lost in semantics here. Raf suggests, and I would agree more generally, that in the case of two HW threads trying to modify the same data structure, there's no such thing as a wait-free algorithm: the best anyone can expect is to manage access so that the contentious delay of a second modifying thread is no worse than the time it takes to flush an atomic variable from the first thread and establish ownership of it by the second thread. But it's still a wait.

I'm afraid I did not follow the argument regarding the scaling of the cited for-loop. That loop could easily be optimized by a compiler to touch no cache lines (var and i could both be assigned to registers), or a smart compiler could eliminate the loop entirely. How does that prove scaling?

AJ's picture

Sorry, let me clarify.

I built a function:
void instructionDelay(int cycles)
{
    for (int i = 0; i < cycles; ++i);
}

I then compiled this function in a separate object file with -O0 to disable optimizations. This produces the following (using objdump -dS on the .o file):

0000000000000000 <_Z16instructionDelayi>:
0: 55 push %rbp
1: 48 89 e5 mov %rsp,%rbp
4: 48 83 ec 10 sub $0x10,%rsp
8: 89 7d f8 mov %edi,0xfffffffffffffff8(%rbp)
b: c7 45 f0 00 00 00 00 movl $0x0,0xfffffffffffffff0(%rbp)
12: 8b 45 f0 mov 0xfffffffffffffff0(%rbp),%eax
15: 8b 55 f8 mov 0xfffffffffffffff8(%rbp),%edx
18: 3b c2 cmp %edx,%eax
1a: 7d 0e jge 2a <_Z16instructionDelayi+0x2a>
1c: 83 45 f0 01 addl $0x1,0xfffffffffffffff0(%rbp)
20: 8b 45 f0 mov 0xfffffffffffffff0(%rbp),%eax
23: 8b 55 f8 mov 0xfffffffffffffff8(%rbp),%edx
26: 3b c2 cmp %edx,%eax
28: 7c f2 jl 1c <_Z16instructionDelayi+0x1c>
2a: c9 leaveq
2b: c3 retq

So I do a call to instructionDelay() with some integer argument *then* update the atomic value. I wanted to see just how much work had to be done to break even on the cache-line transfer time on my system.

I agree that accessing the same region of data will create a contested segment of memory, the cache will have to deal with that, and you get bad performance. That may or may not be avoidable in all algorithms.

Raf Schietekat's picture

Answer to Robert Reed's post:

Still, might it be that, if the wait is at a sufficiently low level (below the programming language) on a suitable processor, a "wait-free" algorithm can outperform one that isn't? I'd prefer to let the adversarial process run its course.

I thought that "aj.guillon" made "var" an int by accident and really meant an atomic. But it's obviously only a necessary condition for good scalability.

Arch D. Robison (Intel)'s picture

For the terminology of wait-free versus lock-free versus obstruction-free, the Wikipedia entry looks good. It references "On the Inherent Weakness of Conditional Synchronization Primitives", which shows that wait-free algorithms are inherently inefficient if compare-and-swap or load-linked/store-conditional are the only synchronization primitives available, and that other operations like fetch-and-add are necessary. Alas, no commercial architecture that I'm aware of has a scalable fetch-and-add (it can be done scalably via combining networks).

My opinion is that non-blocking algorithms are important in situations where lock preemption (arising from oversubscription) is an issue. But when oversubscription is not an issue, classic locked algorithms, combined with designing to avoid lock contention, will generally outperform non-blocking algorithms on current Intel hardware. I don't know about other hardware, but observe that:

  1. Non-blocking algorithms generally use more atomic operations than it takes to acquire a lock.
  2. Atomic operations in non-blocking algorithms generally need a memory fence.
  3. Memory fences incur an inherent penalty on out-of-order processors or processors with caches.

So on an in-order processor without any cache (e.g. Tera MTA, system-on-a-chip), non-blocking algorithms might be competitive even in the absence of lock preemption.

The reality is that high performance processors are looking more and more like distributed memory machines; i.e., with performance dependent on very careful attention to cache locality.
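Point 1 above can be made concrete with a minimal sketch (my own illustration, not code from this thread): the lock-based update pays roughly one atomic RMW to acquire the lock and does its work with plain loads and stores, while the non-blocking equivalent pays at least one fenced CAS per attempt, and more under contention.

```cpp
#include <atomic>
#include <mutex>

std::mutex mtx;
long counter_locked = 0;

// Locked version: one atomic RMW (the acquire), then plain operations.
void locked_increment() {
    std::lock_guard<std::mutex> g(mtx);
    ++counter_locked;  // ordinary store, protected by the lock
}

std::atomic<long> counter_lockfree{0};

// Non-blocking version: at least one fenced CAS per attempt; each
// failed attempt under contention costs another atomic operation.
void lockfree_increment() {
    long old = counter_lockfree.load(std::memory_order_relaxed);
    // compare_exchange_weak refreshes 'old' on failure, so no reload needed.
    while (!counter_lockfree.compare_exchange_weak(
               old, old + 1, std::memory_order_seq_cst)) {
        // retry
    }
}
```

Both produce the same count; the difference is in how many fenced atomic operations each performs along the way.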

Intel-specific note: For processors with more than one hardware thread per core (e.g. Pentium 4, Nehalem), a delay loop should have a pause instruction, otherwise it may unnecessarily starve the other hardware thread. TBB has this functionality in its internal routine __TBB_Pause( int32_t n ). On x86 machines, it executes n pause instructions.
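A hedged sketch of that advice (my own, not TBB's actual __TBB_Pause implementation): a portable delay loop that issues pause on x86 and degrades to a no-op elsewhere, so a sibling hardware thread on the same core is not starved.

```cpp
#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
#include <immintrin.h>
#define CPU_PAUSE() _mm_pause()  // x86 'pause': yields pipeline resources briefly
#else
#define CPU_PAUSE() ((void)0)    // assumption: no equivalent; ARM could use 'yield'
#endif

// Execute n pause instructions, in the spirit of __TBB_Pause(int32_t n).
inline void pause_delay(int n) {
    for (int i = 0; i < n; ++i)
        CPU_PAUSE();
}
```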

- Arch

Raf Schietekat's picture

Ah, some light evening reading, thanks... Now will "aj.guillon" be the champion for wait-free algorithms? I would relish a good fight between, e.g., a lock-free (wait-free?) skip list map like Java apparently has, and anything with locks. Maybe I should ask its author for his current opinion and a C++ implementation?

Oh yes, did anyone at TBB ever get around to timing my latest concurrent_hash_map submission, after the application of an attempted cure for the obvious (performance) problem with the naive approach to erasing items? I think that even comparable performance would make it a preferable implementation, because it is easier to reason about (won't block except when so requested for exclusive element access, minimised contention, concurrent segment reallocation, no zombie elements, ...), but I wouldn't be altogether surprised if now it were also faster.

(Added) The relevance of that here would be that it might also give better relative performance on something like POWER/PowerPC with relaxed atomic operations (and explicit fences where needed), so the ultimate test would be there and on top of my atomics submission.

(Added) And please make sure that the performance comparison is fair, and not just a micro-benchmark on zombie element creation (one thread thinking it's using an element in the map, while another thread can simultaneously remove it and a third even insert another element with the same key!), which was a negative goal in my implementation (preserving TBB's previous behaviour).

Anton Malakhov (Intel)'s picture

Oh. Raf, sorry, not yet. It was difficult to run the same performance tests for the latest version as before, due to your modifications to spin_rw_mutex and our code-freeze time. I do believe your version has good points and good performance, and that you are on the right track. We finished our release activities a few weeks ago, and I'm ready to continue container development now. But first of all, I'd like to rewrite the concurrent unordered (hash) map from the ground up. I hope to come up with a draft implementation in a few weeks or so.

Raf Schietekat's picture

I don't know what the performance tests do, but spin_rw_mutex is just a drop-in replacement with a little bit extra functionality only for internal use, so...

jimdempseyatthecove's picture

For my 2 cents worth on the Lock-Free / Wait-Free

Lock-Free: Code that can use CAS or DCAS, but does not use CAS or DCAS to set a mutex (software lock) for the purpose of holding a resource (e.g. modifying a structure). CAS and DCAS are interlocked (atomic) operations, and unfortunately Intel chose LOCK as the name of the prefix that makes an operation atomic. This gives some the impression that a software lock is being performed when it is not. One potential problem with using a software lock (mutex) is that the thread holding the mutex can be interrupted by the O/S and suspended for an extended period of time. Although the work intended to be performed while holding the software lock is short, the O/S intervention causes the time to perform the operation to become excessively long.

Wait-Free: Code that uses CAS and DCAS without software locks to transition the state of a structure, generally requiring several instructions, and expressly where the change sequence is made visible during the change operation, whereby other threads can advance the changes initiated by the first thread to begin the state change. Even for simple operations, bug-free wait-free code is extremely difficult to write.

I consider myself an experienced programmer (40 years), and I have written a Lock-Free / Wait-Free simulation of DCAS on a system that supports only a CAS instruction (e.g. an Opteron 270 in 64-bit mode, to simulate a 128-bit DCAS). I then used this to produce a Lock-Free / Wait-Free FIFO using singly linked nodes with head and tail pointers. I got this relatively simple process (a FIFO) to run Error-Free / Lock-Free / Wait-Free through a stress-test verification program using 1-16 threads on the 4 available cores, on both the 2x Opteron 270 (without 128-bit atomic compare-and-exchange) and an Intel Q6600 (with 128-bit compare-and-exchange). Writing this piece of code took a good chunk of time (several months).

Extending Lock-Free / Wait-Free to something more complicated than this relatively "simple" FIFO would be daunting.

The easiest way to avoid having to write Lock-Free / Wait-Free code would be to eliminate the requirement for writing this type of code. My suggestion is to re-introduce a quirk (feature) that was available on the PDP-8 minicomputer: a user-mode application could temporarily inhibit interrupts (and O/S context switches) without using a privileged Interrupt Off instruction. Although the particular instruction was designed for a different purpose, realtime programmers made good use of it to perform atomic operations on structures (e.g. updating a FIFO list and head/tail).

Jim Dempsey

www.quickthreadprogramming.com
AJ's picture

I'm glad that I've stimulated a good conversation here.

First, to Raf... I prefer AJ to "aj.guillon"; seeing my name in quotes delimited by a period just seems strange to me :-) . I thought I had changed a setting for that in the forums... I'll have to see what I can do to fix how it displays.

So far as being the champion of wait-free or lock-free algorithms, I am very interested in implementing a few data structures to measure their performance against, say, STL containers with a lock. The reason I haven't jumped right into coding is that I'm still in the reading phase.

Arch's comment that multi-core systems really require distributed data structures is very relevant... and distributed data structures are my next reading topic. This is partially where my comment about thread-handles comes in: the idea is to have a special interface that threads use to obtain a handle to a data structure. This is hidden from the user and ideally "just works" without the user seeing the details. The thread then proceeds to work on its piece of the data structure without contention... atomic operations are used for synchronization as little as possible. This approach would require special data structures, of course...

What would be really nice, is if hardware transactional memory just sprung into being and we didn't have to worry about any of this. Although I'm not sure what the real performance of such hardware would be.

Raf Schietekat's picture

Hey, DCAS, that's CAS2 on Motorola's 68k series! Long time no see (I never used it, though). The code example in Appendix D of my 68020 User's Manual seems like the work of at most a few hours, not a few months? According to Wikipedia, the only other implementation with some visibility will be Sun's new Rock (along with transactional memory).

"A user mode application could temporarily inhibit interrupts": If there is no enforced upper time limit, I suppose that that could be used to compromise the system (DoS or something), otherwise I'm curious what the error handling provisions would be. I suppose Alpha/POWER/PowerPC's optimistic lock is lost on interrupts (otherwise there's an ABA problem, right?), although it only seems to work to protect a particular memory location (I should check again how many simultaneous locks there are here on each processor): maybe it could be used as a (non-nestable?) device to avoid convoying even for broader purposes than implementing atomics, if avoiding convoying is the only rationale for lock-free/wait-free algorithms?

(Corrected) What was I thinking?

Arch D. Robison (Intel)'s picture

Disabling interrupts creates atomicity only on uniprocessors.

Transactional memory in hardware is an intriguing approach for writing non-blocking algorithms, but proposed ways to implement it in hardware do not make it an easy way to write non-blocking algorithms, because of the lack of guarantees. For example, one common idea for hardware TM is to use the cache to hold a transaction. But that limits transactions, in the worst case, to the cache's associativity. Because architectures (so far!) do not promise a minimum cache associativity, code on such a machine can never rely solely on hardware transactions, but must always have a backup plan based on architectural guarantees (e.g. atomic operations).

See Section 2.1 of http://www.unine.ch/transact08/papers/Moir-Adaptive.pdf for a sketch of Rock's TM capabilities.

jimdempseyatthecove's picture

Adrobiso,

You missed the point about temporarily disabling interrupts.

My suggestion is a lock combined with temporary disabling of maskable interrupts. This lets a user application take a software lock with the assurance that, for the next n instruction fetches, those instructions will execute without the possibility of the operating system context-switching the thread out and inducing a long-duration ownership of the software lock.

Sure, other threads (on other cores) can contend for the data structure but the data structure is protected by a software lock. And it is known by the other threads that the wait duration will be very short.

It would be the user's responsibility to ensure the code to be performed does so without faulting (e.g. divide by 0 or page fault). Page faults would be a nuisance, but manageable.

The driving reason to code Wait-Free is to work around the operating system interrupting a relatively short piece of code and thereby cascading long waits in other threads attempting to get the software lock. The temporary interrupt inhibit is an easy means to avoid inopportune operating-system interruption of a short code section.

The instruction could be defined something like

IINCOBT imm8

Inhibit Interrupts N instruction fetch Cycles Or until Branch Taken (N=imm8)

This would generally be followed by a LOCK CMPXCHG to attempt to acquire the mutex. Failure to obtain the mutex (in a well-coded routine) results in a branch to a retry location, which thus re-enables interrupts. If the mutex is obtained, the code falls through, continuing the uninterruptible zone.

The fall-through code would issue VERR and VERW to assure the code still has access to resident virtual memory; if not, it performs a conditional move to release the mutex, then branches to code that touches the memory and retries. If the VERR/VERW succeeds, the conditional move fails to reset the mutex, and the code falls through the branch-to-retry on to the next VERR/VERW as needed; then, when all addresses are known to be addressable, you perform the change to the structure (e.g. adding a node to a linked list). I will try to work up an example.

Although this code is harder to write than "let them wait", it is orders of magnitude simpler to write than Lock-Free/Wait-Free programming. You would have only two coding issues to resolve: 1) write the changes to the data structures in straight-line code (no branching), and 2) verify you have access to the addresses you intend to manipulate.

(By the way, CAS and DCAS are pseudo-code; LOCK CMPXCHG would be used on Intel platforms to perform a CAS.)

Jim Dempsey

www.quickthreadprogramming.com
Arch D. Robison (Intel)'s picture

Now I see what you mean. Yes, being able to protect against interrupts over tiny critical sections could get the practical advantage of a non-blocking algorithm without the hassle.

Raf Schietekat's picture

Strange that Rock's transactions are so very "shy", much like the optimistic locks in Alpha and the POWER family, which are like single-memory-location transactions. But the decision whether to roll back or let the other guy wait a while probably depends on how big the transaction will be, which would need to be declared or cached. Still, it could also decide to, e.g., let the transaction continue for a time related to how long it has already been running (its own investment).

I wonder how that capability will be reflected in high-level programming languages like C++, which is still trying to cope with multithreading...

P.S.: Please disregard part of an earlier post of mine, as corrected above.

Arch D. Robison (Intel)'s picture

If you want to play with transactions in C++, try out the Intel C++ STM Compiler. It's described on http://software.intel.com/en-us/articles/intel-c-stm-compiler-prototype-edition-20

jimdempseyatthecove's picture

>>Now I see what you mean. Yes, being able to protect against interrupts over tiny critical sections could get the practical advantage of a non-blocking algorithm without the hassle.<<

I would call it a short-term blocking algorithm.

The technique I proposed was for the critical section to terminate on the earlier of

branch taken
number of instruction fetches
fault

It might be advisable to add to the fault list any attempt to re-issue the pause-interrupts instruction while already within one (DoS avoidance).

It would have to be determined whether it would be advisable to also add REP, REPE, and REPNE, as well as other instructions (or prefixes) that may be used in a nefarious manner. With SSE3/SSE4.1 you can copy large blocks of memory using in-line code and a relatively low number of instructions, so REPxx might not be needed.

Jim Dempsey

www.quickthreadprogramming.com
Dmitry Vyukov's picture
robert-reed:

Wait-free, lock-free, it seems easy to get lost in semantics here. Raf suggests, and I would agree more generally, that in the case of two HW threads trying to modify the same data structure, there's no such thing as a wait-free algorithm: the best anyone can expect is to manage access so that the contentious delay of a second modifying thread is no worse than the time it takes to flush an atomic variable from the first thread and establish ownership of it by the second thread. But it's still a wait.

Wait-free doesn't mean that there are no "waits". Wait-free means that every operation will be executed in a finite, bounded amount of time (regardless of the behavior of other threads).

For example, the following operation is wait-free:
void rc_acquire(rc_object* o)
{
    o->rc.fetch_add(1, std::memory_order_acquire);
}

And the following is only lock-free (not wait-free):
void lifo_push(std::atomic<node*>& head, node* n)
{
    node* cmp = head.load(std::memory_order_relaxed);
    for (;;)
    {
        n->next.store(cmp, std::memory_order_relaxed);
        // compare_exchange_weak refreshes 'cmp' on failure
        if (head.compare_exchange_weak(cmp, n, std::memory_order_release))
            break;
    }
}

Informally, a "CAS loop" means lock-free, and "FetchAndAdd" or "AtomicExchange" means wait-free.
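That rule of thumb can be illustrated with a companion to the push above (my own sketch, not Dmitriy's code): a consumer that detaches the entire LIFO with a single atomic exchange completes in one bounded step regardless of what other threads do, so it is wait-free even though lifo_push is only lock-free. The node type here is illustrative.

```cpp
#include <atomic>

struct node {
    int value;
    std::atomic<node*> next{nullptr};
};

// Wait-free: one fetch-and-store, no retry loop. Returns the detached
// list (in reverse insertion order) and leaves the shared head empty.
node* lifo_pop_all(std::atomic<node*>& head) {
    return head.exchange(nullptr, std::memory_order_acquire);
}
```

This "pop everything at once" idiom is common in lock-free queues precisely because the exchange gives a wait-free consume path.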

Dmitriy V'jukov

All about lock-free algorithms, multicore, scalability, parallel computing and related topics: http://www.1024cores.net
Dmitry Vyukov's picture
JimDempseyAtTheCove:

My suggestion is to re-introduce a quirk (feature) that was available on the PDP-8 minicomputer: a user-mode application could temporarily inhibit interrupts (and O/S context switches) without using a privileged Interrupt Off instruction. Although the particular instruction was designed for a different purpose, realtime programmers made good use of it to perform atomic operations on structures (e.g. updating a FIFO list and head/tail).

IBM z/Architecture currently has an interesting instruction, PLO (PERFORM LOCKED OPERATION).
See IBM z/Architecture: Principles of Operation:
http://publibz.boulder.ibm.com/epubs/pdf/dz9zr002.pdf
Appendix A -> Multiprogramming and Multiprocessing Examples -> PERFORM LOCKED OPERATION

PERFORM LOCKED OPERATION (PLO)

The PERFORM LOCKED OPERATION instruction can be used in a multiprogramming or multiprocessing environment to perform compare, load, compare-and-swap, and store operations on two or more discontiguous locations that can be words or doublewords. The operations are performed as an atomic set of operations under the control of a lock that is held only for the duration of the execution of a single PERFORM LOCKED OPERATION instruction, as opposed to across the execution of multiple instructions. Since lock contention is resolved by the CPU and is very brief, the program need not include a method for dealing with the case when the lock to be used is held by a program being executed by another CPU. Also, there need be no concern that the program may be interrupted while it holds a lock, since PERFORM LOCKED OPERATION will complete its operation and release its lock before an interruption can occur.

Dmitriy V'jukov

All about lock-free algorithms, multicore, scalability, parallel computing and related topics: http://www.1024cores.net
Dmitry Vyukov's picture
aj.guillon@gmail.com:
2) Use thread-local storage to perform updates on thread-local parts of a data structure, combined with a sprinkling of atomic operations which are seldomly called. For instance, an unordered_vector could work by giving each thread a segment of memory to fill with elements when it calls insert() (each segment is contiguous), when that segment is filled it gets another one which results in an atomic increment in the number of segments available.

This will work faster than a lock-based implementation (provided a careful implementation).
With big private chunks and careful data placement, you can get up to a 100x speed-up.

Forget about atomic operations and locks. They play only a secondary role in performance/scalability. The primary thing is data sharing. Basically, count the number of accesses to different shared cache lines per operation, not the number of atomic operations. (But don't count accesses to read-only shared cache lines.)

Yes, atomic operations have some fixed price (by the way, on the latest Intel processors atomics became much cheaper), and thus affect performance. But data sharing is much more expensive, and it affects not only the performance but also the scalability of the system. Note that atomic operations can be (and usually are) combined with data sharing, which is the worst scenario.
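The "count shared cache lines, not atomic ops" advice can be sketched as follows (my own illustration, with an assumed 64-byte line size): give each thread its own counter on its own cache line and only sum them on read. The atomic operations remain, but no two writers ever touch the same line, so the writes stay cheap.

```cpp
#include <atomic>
#include <cstddef>

constexpr std::size_t kCacheLine = 64;  // assumption: 64-byte lines

// One counter per thread, padded to a full cache line so that
// concurrent writers never share (and therefore never bounce) a line.
struct alignas(kCacheLine) PaddedCounter {
    std::atomic<long> value{0};
};

template <std::size_t NThreads>
struct ShardedCounter {
    PaddedCounter shard[NThreads];

    void add(std::size_t tid, long n) {
        // Atomic op on a line private to this thread: cheap.
        shard[tid].value.fetch_add(n, std::memory_order_relaxed);
    }
    long read() const {
        // Readers pay the sharing cost instead, summing all shards.
        long sum = 0;
        for (std::size_t i = 0; i < NThreads; ++i)
            sum += shard[i].value.load(std::memory_order_relaxed);
        return sum;
    }
};
```

Both the sharded and the single-atomic counter perform one atomic op per add; the difference in scalability comes entirely from who shares which cache line.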

Dmitriy V'jukov

All about lock-free algorithms, multicore, scalability, parallel computing and related topics: http://www.1024cores.net
Dmitry Vyukov's picture
aj.guillon@gmail.com:

I have been fortunate, that the book "The Art of Multiprocessor Programming" has a collection of wait-free algorithms which I can implement directly using TBB in C++. Unfortunately they all have different consistency guarantees, but proper documentation can address this.

I was under the impression that in this book the authors rely entirely on a Garbage Collector, because I don't see even a mention of the Safe Memory Reclamation problem in the book's contents. So I was thinking that the algorithms are not directly implementable in C++.
Do they provide a solution for memory reclamation in lock/wait-free algorithms? Or are you going to use some existing memory reclamation scheme (SMR, RCU, proxy-collector, etc.)?

Dmitriy V'jukov

All about lock-free algorithms, multicore, scalability, parallel computing and related topics: http://www.1024cores.net
