Concurrent stack?

alexandre.hamez

Hello everybody,

I was wondering if there is any plan to add a concurrent stack to TBB? A concurrent_queue is almost always the right way to pass data from one thread to another, but sometimes we would like to fetch the most recently added data.

Thanks!

Arch D. Robison (Intel)

This is the first time anyone has requested a concurrent_stack.


If you have multiple producers and a single consumer, it's simple to build a concurrent stack using a linked list and atomic operations on the root. Alas, if there are multiple consumers, the ABA problem comes into play.
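A minimal sketch of such a multiple-producer/single-consumer stack, written here with C++11 atomics (which postdate this thread); the type and member names are illustrative, not TBB API:

#include <atomic>

template<typename T>
class mpsc_stack {
    struct node {
        T value;
        node* next;
    };
    std::atomic<node*> head{nullptr};
public:
    void push(T v) {  // any number of producers
        node* n = new node{v, nullptr};
        n->next = head.load(std::memory_order_relaxed);
        // Retry until no other producer slipped in; a failed CAS
        // refreshes n->next with the current head.
        while (!head.compare_exchange_weak(n->next, n,
                                           std::memory_order_release,
                                           std::memory_order_relaxed))
            ;
    }
    bool pop(T& out) {  // exactly ONE consumer
        node* h = head.load(std::memory_order_acquire);
        // Safe: only this thread deletes nodes, so h stays valid and
        // no other popper can cause ABA on head.
        while (h && !head.compare_exchange_weak(h, h->next,
                                                std::memory_order_acquire,
                                                std::memory_order_acquire))
            ;
        if (!h) return false;
        out = h->value;
        delete h;
        return true;
    }
};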


In the general multiple-producer multiple-consumer case, it may be hard to beat a plain STL stack with a spin lock around it, unless support for avoiding ABA is available. (Alas, early Opteron and Intel x64 processors are missing the critical double-width CAS that avoids ABA.) Now that I've said it is hard, someone can contribute a concurrent_stack that proves me wrong :-)


Raf Schietekat

That's not even a challenge... :-)

(Added) No takers? To clarify, I meant that implementing such a Building Block is more of a chore than a challenge if, like for atomics, it is allowed to default to locking where specific support is not available (I would not know how to do this in general without such hardware support). Support can be had in the form of atomic operations beyond the size of a pointer as already indicated (how many of those early 64-bit x86 processors that don't provide that still exist, assuming that they're running 64-bit-pointer programs?), or primordial transactions from Alpha/ARM/MIPS/POWER/PowerPC (this is something they can easily handle), or ESA/390-z/Architecture PLO I suppose (I have not looked into it yet). Note that assembler-level programming is required in most cases; it probably cannot be efficiently done in general in terms of C++-level atomics.

Dmitry Vyukov
Raf_Schietekat: That's not even a challenge... :-)
Support can be had in the form of atomic operations beyond the size of a pointer as already indicated (how many of those early 64-bit x86 processors that don't provide that still exist, assuming that they're running 64-bit-pointer programs?)



I was thinking about this some time ago. I concluded that:
(1) There is a relatively small number of such lagging processors, and it constantly decreases over time.
(2) Those processors were not multicore, so it's not very disastrous to use a lock on them.
So I decided that it's beneficial to use algorithms relying on double-word CAS anyway.


Dmitry Vyukov
Raf_Schietekat:
or ESA/390-z/Architecture PLO I suppose (I have not looked into it yet).



z/Arch PLO is capable of performing 6 types of operations:
- compare and load
- compare and swap
- double compare and swap
- compare and swap and store
- compare and swap and double store
- compare and swap and triple store
on 32, 64 and 128 (!) bit operands.
So the maximum you can get is 4 compares and 12 stores!

For a stack you can use a compare on the 'version' field and stores to the 'version' and 'next' fields.

Arch D. Robison (Intel)

I agree that the double-width compare-and-swap should be used if available; however, there must be a backup route for older hardware. On x86 this requires a run-time check of the hardware capabilities. Perhaps an abstract atomic (pointer+tag) can be provided, and on hardware without double-width CAS, the tag can be one bit less than a full word, and the missing bit used internally for a spin lock.
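To make that fallback concrete, here is a rough sketch (an illustration, not TBB code) of a pointer+tag atomic whose tag gives up its low bit to an internal spin lock when double-width CAS is unavailable; C++11 atomics are assumed:

#include <atomic>
#include <cstdint>

struct atomic_tagged_ptr {
    void* ptr = nullptr;            // guarded by the lock bit below
    std::atomic<uintptr_t> tag{0};  // bit 0: spin lock; bits 1..: tag

    // Emulates a double-width CAS on the (ptr, tag) pair.
    bool compare_and_swap(void* old_ptr, uintptr_t old_tag,
                          void* new_ptr, uintptr_t new_tag) {
        uintptr_t t = tag.load(std::memory_order_relaxed);
        do {
            t &= ~uintptr_t(1);     // expect the unlocked tag word
        } while (!tag.compare_exchange_weak(t, t | 1,
                                            std::memory_order_acquire));
        bool success = ptr == old_ptr && (t >> 1) == old_tag;
        if (success) ptr = new_ptr;
        // Unlock by publishing the (possibly new) tag with bit 0 clear.
        tag.store(success ? new_tag << 1 : t, std::memory_order_release);
        return success;
    }
};

A real implementation would also have to route plain loads of the pair through the same lock bit, and would select this path only after the run-time capability check mentioned above.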


One other consideration: if the container really is just a double-width compare-and-swap on modern hardware, then memory allocation/freeing overheads will dominate the cost. Thus an intrusive container interface might be more appropriate. Here's what the interface might look like (adapted from old KAI C++ internals):



typedef implementation-defined concurrent_intrusive_stack_link;


template<typename T> class concurrent_intrusive_stack;


Intrusive queues and stacks are something I've wanted in TBB for a while, but we've never had time to implement/test/document them. Furthermore, having gone that far away from classic STL, we probably want to make them non-blocking too, at least if hardware permits.
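A speculative sketch of how such an intrusive interface might look in use (nothing like this ever shipped in TBB; here the link is simply an embedded next pointer, and pop is left single-consumer for the ABA reasons discussed below):

#include <atomic>

// T must expose a public 'T* next' member for the stack's use.
template<typename T>
class concurrent_intrusive_stack {
    std::atomic<T*> head{nullptr};
public:
    void push(T* item) {            // never allocates: caller owns nodes
        item->next = head.load(std::memory_order_relaxed);
        while (!head.compare_exchange_weak(item->next, item,
                                           std::memory_order_release))
            ;
    }
    T* pop_if_present() {           // single consumer only (ABA!)
        T* item = head.load(std::memory_order_acquire);
        while (item && !head.compare_exchange_weak(item, item->next,
                                                   std::memory_order_acquire))
            ;
        return item;
    }
};

struct work_item {                  // user type embedding the link
    work_item* next;
    int payload;
};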


- Arch

Dmitry Vyukov
MADadrobiso:

One other consideration: if the container really is just a double-width compare-and-swap on modern hardware, then memory allocation/freeing overheads will dominate the cost. Thus an intrusive container interface might be more appropriate. Here's what the interface might look like (adapted from old KAI C++ internals):

typedef implementation-defined concurrent_intrusive_stack_link;

template<typename T> class concurrent_intrusive_stack;

Intrusive queues and stacks are something I've wanted in TBB for a while, but we've never had time to implement/test/document them. Furthermore, having gone that far away from classic STL, we probably want to make them non-blocking too, at least if hardware permits.



I completely agree that (1) intrusive containers are the choice for high-performance software/libraries, and (2) "single-threaded interfaces" are not suitable for high-performance concurrent collections. Ok, some interfaces are suitable, but not in the general case.

Btw, there are specialized memory allocators for some types of concurrent collections. For example, this one is a very efficient allocator for fifo-queues. Allocation and deallocation involve neither atomic RMW operations nor heavy memory fences.
http://groups.google.com/group/lock-free/browse_frm/thread/2c6f6250762385e1

Raf Schietekat

Why a tag/lock combination, even if combined in one word?

Maybe it's just me, but I'd rather eat a lemon than use an intrusive type (they're so... intrusive!). How about instead internally refurbishing the temporary storage: how much additional performance would then be left untapped, and would it be relevant considering Amdahl's law?

For completeness: Itanium has a somewhat curious lopsided instruction cmp8xchg16 (it's even a misnomer, because it's not really exchanging 16 bytes, it's exchanging 8 and storing an additional 8) that may also do the trick (I have not yet checked whether a simple 8-byte tag write will properly synchronise with cmp8xchg16, but I presume it will, for this instruction to be meaningful), so the only ones left out from the architectures currently targeted by the "Additions to atomic" patch are (pre-Rock) SPARC (for which something with indexed storage or so can still be devised, I suppose) and PA-RISC (which will always need a lock to do anything).

Dmitry Vyukov
MADadrobiso:

If you have multiple producers and a single consumer, it's simple to build a concurrent stack using a linked list and atomic operations on the root. Alas, if there are multiple consumers, the ABA problem comes into play.



ABA is not a serious problem here. The real problem in a concurrent stack is safe memory reclamation.
Even if double-word CAS is used to solve the ABA problem, an access violation (SIGSEGV) is possible in the pop() function.


Dmitry Vyukov
Raf_Schietekat: That's not even a challenge... :-)



How are you going to solve the safe memory reclamation problem in a stack?
Even if double-word CAS is used to solve the ABA problem, an access violation (SIGSEGV) is possible in the pop() function.

Raf Schietekat

Do elaborate why a SIGSEGV would be so inexorable...

Arch D. Robison (Intel)

Safe memory reclamation was the issue that we ran into when I had a summer intern, Prahabjan Kambadur, looking into lock-free algorithms for TBB back in 2005. Hazard pointers seemed like the way to go, though that would require that each thread using the container would have to register itself with a hazard-pointer overseer. The summer ended before we could get around to implementing hazard pointers. Also, at the time I thought that requiring registration was too much of a nuisance, but I've since changed my mind since no better alternative has come up.


For the ABA issue, we ended up using a tagged pointer and the following primitive, inspired by Itanium's cmp8xchg16:



serial_number __TBB_TaggedCompareAndSwap( location* address,
                                          ptrdiff_t new_sn, void* new_ptr,
                                          ptrdiff_t old_sn, void* old_ptr ) {
    serial_number result;
    atomic {  // pseudocode: the block executes atomically
        result = address->sn;  // observed tag; equals old_sn iff the swap happened
        if( address->sn==old_sn ) {
            assert( address->ptr==old_ptr );
            address->ptr = new_ptr;
            address->sn = new_sn;
        }
    }
    return result;
}


Machines with double-width compare-and-swap can easily implement this instruction, as can recent Itanium processors with cmp8xchg16. The irony was, to our chagrin, that Itaniums of the time did not implement cmp8xchg16, even though it was in the architecture manuals. A separate section of the manual had the fine print saying the instruction might not be available :-(


For machines without an obvious instruction, the special tag -1 was used internally to denote "locked". That let us implement the primitive using a spin lock for the "atomic" part.
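As a hypothetical illustration of how the primitive would be used (not code from the thread), a pop() could bump the serial number on every successful swap, which is what defeats ABA; note that the head->next read is still unsafe without the memory reclamation discussed below:

struct node { node* next; };
struct location { ptrdiff_t sn; void* ptr; };

node* pop( location* top ) {
    for (;;) {
        ptrdiff_t sn = top->sn;        // snapshot tag, then pointer
        node* ptr = (node*)top->ptr;
        if ( !ptr ) return NULL;
        node* next = ptr->next;        // hazardous if ptr was freed!
        // sn+1 ensures a recycled ptr cannot match a stale snapshot.
        if ( __TBB_TaggedCompareAndSwap( top, sn+1, next, sn, ptr ) == sn )
            return ptr;
    }
}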


- Arch

Raf Schietekat

So no tag/lock combination after all?

What interface would it be that causes a safe memory reclamation problem?

Dmitry Vyukov
Raf_Schietekat: Do elaborate why a SIGSEGV would be so inexorable...



Consider the following stack implementation:

T stack::pop()
{
    node_t* head = m_head;
    for (;;)
    {
        if (0 == head)
            return T();
        node_t* next = head->next; // (***)
        if (CAS(&m_head, &head, next))
        {
            T val = head->value;
            delete head;
            return val;
        }
    }
}


ABA aside for a moment: at line (***) a SIGSEGV is possible if a concurrent popper pops the node and deletes it.

Alexey Kukanov (Intel)
Raf_Schietekat: Do elaborate why a SIGSEGV would be so inexorable...


In pop(), for a stack implemented as a singly-linked list, the first operation would be to read the head, and the second would be to read head->next. There is a timing gap between the two; if the thread's execution is preempted at exactly this point, it might happen that by the time the thread is back, the node pointed to by head has been popped by another thread and destroyed. This is how you get the SIGSEGV, I believe.


I have a case that is much less restrictive than the usual semantics of the stack. In my case, it is guaranteed that the objects (nodes) are alive all the time, and moreover the exact LIFO order is not important. Even in this "relaxed" problem, there seems to be no way to overcome ABA without DCAS.


Transactional memory would come in handy, but I'm afraid it will be uselessly slow for a highly contended case like this one, where all operations are done on the head.

Dmitry Vyukov
MADadrobiso:

Safe memory reclamation was the issue that we ran into when I had a summer intern, Prahabjan Kambadur, looking into lock-free algorithms for TBB back in 2005. Hazard pointers seemed like the way to go, though that would require that each thread using the container would have to register itself with a hazard-pointer overseer. The summer ended before we could get around to implementing hazard pointers. Also, at the time I thought that requiring registration was too much of a nuisance, but I've since changed my mind since no better alternative has come up.



Personally I consider SMR (Hazard Pointers) the worst GC technique for lock-free algorithms. SMR has one very strong advantage: it puts an upper limit on the number of pending (unreclaimed) objects. But it comes at a very high price: a store-load style fence for every acquisition. And it can't acquire more than one object at a time, so if you need to traverse a list, you have to acquire/release every node in the list (i.e. N store-load memory fences).
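For reference, a minimal sketch of the acquisition loop that incurs that per-node fence (illustrative names, C++11 atomics; the scan-and-reclaim side of SMR is not shown):

#include <atomic>

struct node { std::atomic<node*> next; };

std::atomic<node*> head;
thread_local std::atomic<node*> my_hazard;  // assumed registered with a reclaimer

node* acquire_head() {
    for (;;) {
        node* h = head.load(std::memory_order_acquire);
        // Publish the hazard, then re-validate. The seq_cst store is
        // the store-load fence paid on EVERY acquisition.
        my_hazard.store(h, std::memory_order_seq_cst);
        if (head.load(std::memory_order_acquire) == h)
            return h;  // h cannot be reclaimed until my_hazard is cleared
    }
}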

Oh! And don't forget that SMR is patented by Sun!

There are a couple of other GC techniques:
RCU. Has very low overheads. Some implementations are patented by IBM.
vZOOM. Has very low overheads. Also patented.
Proxy-collector algorithm. Really very good for lock-free collections. It seems that some implementations are patented. Here is one of my implementations of proxy-collector:
http://groups.google.com/group/lock-free/browse_frm/thread/18a9f43b93a6b3f0
And there is also Scalable Distributed Reference Counting. I hope that there are no patents on it:
http://groups.google.com/group/lock-free/browse_frm/thread/46d00a0f543a758e


Raf Schietekat

My reasons for refurbishing storage: not intruding on the user's type, probably avoiding some (de)allocation overhead, avoiding reading from memory that has disappeared. Is it really imperative to come back from a high-water mark before the end of the concurrent_stack's lifetime?

(Removed related question.)


Dmitry Vyukov
MADadrobiso:

Safe memory reclamation was the issue that we ran into when I had a summer intern, Prahabjan Kambadur, looking into lock-free algorithms for TBB back in 2005. Hazard pointers seemed like the way to go, though that would require that each thread using the container would have to register itself with a hazard-pointer overseer. The summer ended before we could get around to implementing hazard pointers. Also, at the time I thought that requiring registration was too much of a nuisance, but I've since changed my mind since no better alternative has come up.



As for registration: yes, most GC techniques require some ugly things like thread registration/deregistration, or periodic activity from all registered threads, or some extremely OS-dependent code... or are quite expensive.

Btw, the original non-amortized proxy-collector algorithm:
http://groups.google.com/group/comp.programming.threads/msg/d911c93ffa8d...
doesn't require any registration/deregistration, but is quite expensive: 2 atomic RMW operations to acquire/release a batch of nodes.
My amortized proxy-collector algorithm is extremely efficient, but does require periodic activity from all threads:
http://groups.google.com/group/lock-free/browse_frm/thread/18a9f43b93a6b3f0


Dmitry Vyukov
MADadrobiso:

For the ABA issue, we ended up using a tagged pointer and the following primitive, inspired by Itanium's cmp8xchg16:




If you use some form of safe memory reclamation, then the ABA problem usually just doesn't exist.
If a thread holds 'A', then another thread can change the structure's anchor to 'B', but CANNOT change it back to 'A', because the first thread holds it, so 'A' can't be reclaimed.
And some form of safe memory reclamation is usually needed anyway, so in practice, in most cases, ABA is not a problem and you do not need double-word CAS.


Dmitry Vyukov
Raf_Schietekat:
But I would be interested in a safe way to read arbitrary memory locations where no such guarantee exists, because I've been struggling with that question this week...




There are a number of techniques available: SMR, ROP, RCU, VZOOM, PC, DRC.
Almost all of them are patented, but one can use slightly modified versions of the patented algorithms.
If you are interested in some technique I can provide links.

Raf Schietekat

Hmm... Thread 1 and 2 see A, thread 1 takes it, thread 3 takes B, thread 1 pushes A, thread 2 wreaks havoc by not allowing for a pretty straightforward ABA? This can even happen with memory reclamation; it would just be less likely that some thread would recycle the memory and push it on the stack just at the wrong time, but not impossible. What am I missing?

Raf Schietekat

That's what happens without synchronisation: R asks a question (reading arbitrary memory locations), D sees it, R removes it, D still answers it.

Thanks all the same, but the idea was to directly probe some location from an unmanageable number (many many thousands), and I don't suppose any of these techniques would improve over simply using pthread_sigmask() (the thing was that I had not looked beyond process-global signal-related operations)? But I may still need to dig deeper if that function is too expensive.

Dmitry Vyukov
Raf_Schietekat: Hmm... Thread 1 and 2 see A, thread 1 takes it, thread 3 takes B, thread 1 pushes A, thread 2 wreaks havoc by not allowing for a pretty straightforward ABA? This can even happen with memory reclamation; it would just be less likely that some thread would recycle the memory and push it on the stack just at the wrong time, but not impossible. What am I missing?



You are missing that it's *safe* memory reclamation, not just memory reclamation :)
If thread 2 sees A, then it has acquired A, so A can't be recycled through the memory reclamation system, and thread 1 can't push A back onto the stack.
Safe memory reclamation gives the same guarantees as the GC in modern languages, i.e. a thread can't re-allocate an object while it's still in use.

Dmitry Vyukov
Raf_Schietekat: That's what happens without synchronisation: R asks a question (reading arbitrary memory locations), D sees it, R removes it, D still answers it.

Thanks all the same, but the idea was to directly probe some location from an unmanageable number (many many thousands), and I don't suppose any of these techniques would improve over simply using pthread_sigmask() (the thing was that I had not looked beyond process-global signal-related operations)? But I may still need to dig deeper if that function is too expensive.



They DO improve!

You are missing one more point. You can use pthread_sigmask() (or __try/__except) to solve the problem ONLY IN the STACK algorithm. You can't solve the problem with pthread_sigmask() in, for example, a queue algorithm, or in other algorithms. That's because in a queue algorithm there are 2 kinds of hazards: (1) a thread accesses potentially freed memory, and (2) a thread MODIFIES potentially freed memory. With pthread_sigmask() you can solve only problem (1), not problem (2), so you will end up with threads sometimes modifying random memory locations! In the stack algorithm there is only problem (1).

Safe memory reclamation techniques (other names are PDR and GC) solve BOTH problems.
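To see hazard (2) concretely, consider the enqueue step of a Michael-Scott style queue (a fragment with illustrative names, not a complete algorithm):

#include <atomic>

struct node { std::atomic<node*> next{nullptr}; };

// tail_snapshot was read earlier from the queue's tail.
bool enqueue_step(node* tail_snapshot, node* new_node) {
    node* expected = nullptr;
    // If a concurrent dequeuer already freed tail_snapshot, this CAS
    // WRITES into freed (possibly reused) memory -- hazard (2). No
    // signal trick can help: the write may succeed and corrupt data.
    return tail_snapshot->next.compare_exchange_strong(expected, new_node);
}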

Raf Schietekat

Merging two subthreads in one reply:

If thread 1 has to get clearance first before reusing A (most likely getting a different piece of memory instead), that would indeed work.

I didn't want to show my cards yet before I was reasonably certain they could form a winning hand, but my purpose with random memory probing is to look for magic values in hypothetical headers (relates to "hazard 1") and so amortise any expensive disambiguation operations needed before making a final decision (relates to "hazard 2"), unrestricted to any particular sequence of events. It seems unlikely, even before timing the calls, but maybe some insight will come out of playing with the idea some more.

Dmitry Vyukov
Raf_Schietekat:
If thread 1 has to get clearance first before reusing A (most likely getting a different piece of memory instead), that would indeed work.




Indeed. The main question is how to do this.

Raf_Schietekat:

I didn't want to show my cards yet before I was reasonably certain they could form a winning hand, but my purpose with random memory probing is to look for magic values in hypothetical headers (relates to "hazard 1") and so amortise any expensive disambiguation operations needed before making a final decision (relates to "hazard 2"), unrestricted to any particular sequence of events. It seems unlikely, even before timing the calls, but maybe some insight will come out of playing with the idea some more.



It's difficult to completely eliminate the window of inconsistency between the probe and the final CAS.
STMs can do this, but they have their own problems and limitations.

Btw, you can use Relacy Race Detector:
http://groups.google.com/group/relacy
to validate your algorithm. Relacy is designed exactly for such tasks. I've validated several GC algorithms with Relacy, and it always finds extremely subtle algorithmic errors and incorrectly placed memory fences.

Dmitry Vyukov
Raf_Schietekat:
But I would be interested in a safe way to read arbitrary memory locations where no such guarantee exists, because I've been struggling with that question this week...




Btw, in the near future I'm going to post here my new proxy-collector algorithm and a concurrent hash map algorithm which uses proxy-collector reclamation.
Preliminary benchmark results on a Q6600 are as follows. On a modest write workload the hash map beats TBB's hash map by a factor of 20. On a mostly read-only workload it beats TBB's hash map by a factor of 50. At least up to 4 cores, scaling is linear, as opposed to TBB's hash map.
One of the reasons for such great scaling is exactly the amortized proxy-collector algorithm.


Raf Schietekat

That's ridiculous. You are hereby forbidden to be that much faster.

Dmitry Vyukov
Raf_Schietekat: That's ridiculous. You are hereby forbidden to be that much faster.



Well, until I post the code, you can't refute my point... and I can't prove my point either :)
I have to make some minor changes, conduct alpha tests, and make final benchmarks. Then I will post the code, and we will be able to resolve the dispute ;)
Frankly, until this moment I was testing the hash map w/o memory reclamation, i.e. memory was just not freed at all. Now I am bolting on the proxy-collector, but I hope it will have only a minor influence on the fast path.


Raf Schietekat

Alexandre, just out of curiosity: how would you use this stack, and what are the desired performance characteristics, considering that it may be implemented as non-blocking on most architectures but will never scale well if implemented naively to always push or pop just one item (it may even become a bottleneck for your application)?

Dmitry Vyukov
Raf_Schietekat: Alexandre, just out of curiosity: how would you use this stack, and what are the desired performance characteristics, considering that it may be implemented as non-blocking on most architectures but will never scale well if implemented naively to always push or pop just one item (it may even become a bottleneck for your application)?



Alexandre, also take into account that many things that are not relevant in a single-threaded context become relevant in a multi-threaded context. For example: how many writer threads are there? how many threads access the 'head' of the structure? how many threads access the 'tail' of the structure? do you need to know the size of the structure? etc.
Adding (removing) any such requirement can dramatically decrease (increase) the performance/scalability of the data structure.

Dmitry Vyukov
randomizer:
Alexandre, also take into account that many things that are not relevant in a single-threaded context become relevant in a multi-threaded context. For example: how many writer threads are there? how many threads access the 'head' of the structure? how many threads access the 'tail' of the structure? do you need to know the size of the structure? etc.
Adding (removing) any such requirement can dramatically decrease (increase) the performance/scalability of the data structure.



For example, here you can see an extremely efficient implementation of a single-producer/single-consumer lifo stack. It uses neither atomic RMW operations nor heavy memory fences in push() and pop():
http://groups.google.com/group/lock-free/browse_frm/thread/4a410afb609e2ee1

Well, it's as efficient as a lifo structure can be in principle. A lifo structure will always be substantially less scalable than a fifo structure, simply because it by definition implies constant contention on one end.

Dmitry Vyukov
Raf_Schietekat: That's ridiculous. You are hereby forbidden to be that much faster.



You are free to test it:
/en-us/forums/showthread.php?t=60494

alexandre.hamez

My needs are quite simple. All I want is a concurrent container that favors a first-in/last-out order, as it will avoid some unnecessary work for the project I'm currently working on. This means that I don't care if it doesn't strictly follow this order.

For now, I can use the concurrent_queue and my application behaves correctly, but it could be better.

As for the performance characteristics, to be honest, I'm just an end user; I want it to have minimal overhead and to perform as fast as possible :-)

Arch D. Robison (Intel)

Is the same set of threads doing the pushing and the popping? Or are there separate pusher and popper threads?


- Arch

Dmitry Vyukov
alexandre.hamez: My needs are quite simple. All I want is a concurrent container that favors a first-in/last-out order, as it will avoid some unnecessary work for the project I'm currently working on. This means that I don't care if it doesn't strictly follow this order.
For now, I can use the concurrent_queue and my application behaves correctly, but it could be better.
As for the performance characteristics, to be honest, I'm just an end user; I want it to have minimal overhead and to perform as fast as possible :-)



The devil is in the details!
A full-fledged multi-producer/multi-consumer stack is inherently slow.
If a multi-producer/single-consumer stack is enough, then it can be implemented more efficiently.
If a single-producer/single-consumer stack is enough, then it can be implemented even more efficiently.
If you don't need a size() function, and is_empty() is enough, then it can be implemented even more efficiently.
If you are going to throw a full load of threads at it, then a completely different algorithm must be employed:
http://www.cs.tau.ac.il/~shanir/nir-pubs-web/Papers/Lock_Free.pdf
And so on...

alexandre.hamez

The details:

- Multiple producers/consumers (2-8 threads)

- pop_if_present() and push() are all I need.

- The size will grow to at most 500-1000 elements

I didn't think a concurrent stack was so complicated. I thought it was just a matter of adapting a concurrent queue.

Dmitry Vyukov
alexandre.hamez: The details:
- Multiple producers/consumers (2-8 threads)
- pop_if_present() and push() are all I need.
- The size will grow to at most 500-1000 elements

I didn't think a concurrent stack was so complicated. I thought it was just a matter of adapting a concurrent queue.



Hmm... I think that in this situation it's difficult to create something substantially better than a plain old mutex-based stack...
Potentially the producer side can be made lock-free, while the consumer side will still be mutex-based. I am not sure whether it's worth doing.
If strict lifo is not required, then the stack can be made distributed: for example, 4 different lifo stacks, where producers and consumers choose a stack at random on every operation. This will improve scaling under high load. But it's definitely not a general-purpose solution suitable for the library...
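A rough sketch of that distributed idea (illustrative names; strict lifo order is deliberately given up):

#include <mutex>
#include <random>
#include <stack>

template<typename T, unsigned K = 4>
class distributed_stack {
    struct shard {
        std::mutex m;
        std::stack<T> s;
    };
    shard shards[K];
    static unsigned pick() {
        static thread_local std::mt19937 rng{std::random_device{}()};
        return rng() % K;
    }
public:
    void push(const T& v) {
        shard& sh = shards[pick()];
        std::lock_guard<std::mutex> lk(sh.m);
        sh.s.push(v);
    }
    bool pop_if_present(T& out) {
        // Start at a random shard; probe all K so we only fail when
        // every shard looked empty at probe time.
        unsigned start = pick();
        for (unsigned i = 0; i < K; ++i) {
            shard& sh = shards[(start + i) % K];
            std::lock_guard<std::mutex> lk(sh.m);
            if (!sh.s.empty()) {
                out = sh.s.top();
                sh.s.pop();
                return true;
            }
        }
        return false;
    }
};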


Raf Schietekat

Even without being substantially better (under favourable circumstances), a self-respecting concurrent stack building block would use atomic operations instead of a mutex to prevent convoying.

Dmitry Vyukov
Raf_Schietekat: Even without being substantially better (under favourable circumstances), a self-respecting concurrent stack building block would use atomic operations instead of a mutex to prevent convoying.



How are you going to solve the problem of safe memory reclamation wrt the stack's pop() function?
You can end up not "even without being substantially better" but being substantially worse, because all current state-of-the-art bullet-proof memory reclamation algorithms require one or more atomic RMW/#StoreLoad-style membars per operation. Compare that with the lock-based stack, which has only 1 XCHG per operation and is devilishly simple.


Raf Schietekat

Why would a lock-based stack have any less of a memory management burden than an atomic-based stack?

Anyway, since std::vector is not expected to shrink back from its high-water mark either, I would just push popped elements onto an internal reuse stack.

Dmitry Vyukov
Raf_Schietekat: Why would a lock-based stack have any less of a memory management burden than an atomic-based stack?



Because in a lock-based solution you can't get a SIGSEGV while dereferencing a pointer.

Raf_Schietekat:
Anyway, since std::vector is not expected to shrink back from its high-water mark either, I would just push popped elements onto an internal reuse stack.



Ok, it's a kind of memory management scheme too. So you end up having 2 atomic RMW... 2 atomic double-word RMW operations per operation, whereas in a lock-based stack you would have 1 XCHG per operation.
Also, how are you going to organize the internal reuse stacks? Will they be per user stack, per thread, per processor, or global?

Btw, this solution will work only for the stack algorithm, not for queues and others.

Raf Schietekat

Thanks for teaching me that atomic operations are more expensive than locking and that locking implementations can do dynamic memory management for free.

Dmitry Vyukov
Raf_Schietekat: Thanks for teaching me that atomic operations are more expensive than locking and that locking implementations can do dynamic memory management for free.



Sorry.
I've just been there, done that, so to say. There are "lock-free" algorithms that are better than locking ones wrt performance/scalability; the mpmc stack is just not one of them.

Btw! Are you aware of the mpmc lock-free stack implementation in Win32's InterlockedSList API?
It really has good performance... but some people say that it's way too dirty :)
It's an intrusive stack which always knows its own size. On 32-bit machines it uses double-word CAS; on 64-bit machines it uses single-word CAS. On a 32-bit machine the anchor layout looks like this:
struct stack_anchor_32
{
    void* head;
    unsigned size : 16;
    unsigned aba_counter : 16;
};

On a 64-bit machine it looks this way:

struct stack_anchor_64
{
    uint64_t head : 39; // extremely compressed pointer
    unsigned size : 16;
    unsigned aba_counter : 9;
};

To prevent accesses to freed memory they use the following trick:

void* pop()
{
    void* result;
    int av_counter = 0;
retry:
    __try
    {
        result = pop_as_usual();
    }
    __except (if_access_violation)
    {
        if (++av_counter < 2000)
            goto retry;
        else
            rethrow_exception_to_user;
    }
    return result;
}


As practice shows, this algorithm has good performance compared to other stacks which do not use the "SEH trick".
But it's heavily OS-dependent; in a Unix environment you would have to use signals here.


Raf Schietekat

Sorry, but I don't follow you anymore. Now you say CAS is slow, now you don't... BTW, catching SIGSEGV is exactly what I'm proposing with __TBB_USE_MAGIC in the latest memory allocator-related "Additions to atomic" patch (rest of the story pending validation of that technique because it will probably be a decidedly non-trivial amount of work).

Dmitry Vyukov
Raf_Schietekat:
Sorry, but I don't follow you anymore. Now you say CAS is slow, now you don't...




Sometimes I can't follow myself either :)
Assume we have 2 stack implementations. The former is blocking, but uses 1 atomic RMW per operation. The latter is wait-free, but uses 2 atomic RMW per operation. I can say that the former is better, and I can say that the latter is better; I can say that the former has some drawbacks compared to the latter, and that the latter has some drawbacks compared to the former. And all these statements are true :) Which is better in the general case? I don't know. I'm not even sure that such a general case wrt multi-threading exists...
But besides run-time properties, lock-free algorithms also come with inevitably much greater implementation complexity. This fact further complicates the problem of determining "the best" implementation.


Dmitry Vyukov
Raf_Schietekat:
BTW, catching SIGSEGV is exactly what I'm proposing with __TBB_USE_MAGIC in the latest memory allocator-related "Additions to atomic" patch (rest of the story pending validation of that technique because it will probably be a decidedly non-trivial amount of work).



Provided that it requires a decidedly non-trivial amount of work and works only for the stack, it looks like it's not worth doing. Or are you going to somehow make it work for other data structures (queues, lists, maps etc.) too? If so, I am very curious how exactly. Can you elaborate, please?


Raf Schietekat

Just have a look at the latest "Additions to atomic" patch. I'm looking for somebody to validate or dismiss test_malloc_compliance.exe as representative of a real program (I have not yet taken the time myself): if it can be reproduced that __TBB_USE_MAGIC has no significant overhead in a real program (anyone from Intel?), I can be more confident not to be wasting my time tying up the loose ends, which would save a not altogether insignificant amount of real and/or virtual memory on top of any benefit from the other proposed changes.

Dmitry Vyukov
Raf_Schietekat: Just have a look at the latest "Additions to atomic" patch. I'm looking for somebody to validate or dismiss test_malloc_compliance.exe as representative of a real program (I have not yet taken the time myself): if it can be reproduced that __TBB_USE_MAGIC has no significant overhead in a real program (anyone from Intel?), I can be more confident not to be wasting my time tying up the loose ends, which would save a not altogether insignificant amount of real and/or virtual memory on top of any benefit from the other proposed changes.



test_malloc_compliance looks more like a synthetic benchmark than a representative real program.
But if a component is better in a synthetic benchmark, then with substantial probability it will be better in a real program too. For memory allocators, though, there are such things as fragmentation. I don't know.
If you want real programs, then I think you can take some of the standard benchmarks for memory allocators; a number of such benchmarks must be out there. Or you can take the benchmarks used for other allocators; every academic paper on an allocator contains a description of a mini-benchmark. For example:
http://people.freebsd.org/~jasone/jemalloc/bsdcan2006/jemalloc.pdf

BUT I still don't see how those magic numbers can help lock-free algorithms. What is it all about?

Alexey Kukanov (Intel)
Raf_Schietekat: I'm looking for somebody to validate or dismiss test_malloc_compliance.exe as representative of a real program


Sorry, I missed this request during the initial reading. test_malloc_compliance is just a test; it does not represent a real program, and is not even a synthetic microbenchmark.

Raf Schietekat

"test_malloc_compliance looks more like synthetic benchmark, not as representative of a real program." It is obvious that a real program will behave very differently, but I don't see any performance penalty whatsoever, which is rather surprising.

"BUT I still don't see how those magic numbers can help lock-free algorithms. What it's all about?" Saving memory.

Oh well, if not right now, soon enough somebody will be interested enough to steal this task from me.
