Lock-Free using cmpxchg8b...

Lock-Free using cmpxchg8b...

imagem de Community Admin

Your threading tips are sound. The thread aware memory pools are nice. =)


Here are some of my advanced threading tips: =)


You should spice up your threading site by adding a:

" Rock-Solid Lock-Free Algo's /w cmpxchg8b & cmp8xchg16b Tutorials " additon to your site.


You Intel guys really need to promote the cmpxchg8b and the cmp8xchg16b ( for Merced I think ) opcodes. When you code lock-free algo's using these, they make advanced threading a work of art. They are not just for creating locks you know. =)


I have implemented a 100% ABA proof lock-free IBM FreeList with your cmpxchg8b opcode:

FreeList.c


Joe Seigh and I have developed A 100% ABA proof " real " lock-free atomic pointer and I did a lock-free fifo queue with it:

AtomicPtrABATest.zip

The AtomicPtr combines 100% ABA proof lock-free algos with the ability for a thread to reference a shared pointer without having to already own a marshaled copy!


Also, look at my advanced fifo queue code:

Queue.c

This code is very " thread " friendly. Put this code up against a " normal " fifo shared queue, and it will blow it away.

You should add some similar code examples on the Intel threading sites, or just add a link to my page. ;)


What do you threading masters think about it?

;)

Thank you.

31 posts / 0 new
Último post
Para obter mais informações sobre otimizações de compiladores, consulte Aviso sobre otimizações.
imagem de netdevil

Hmm,

I've used locks (as in "lock xchg" and "lock inc", et al), but I'm not familiar with cmpxchg8b and the likes. Can someone explain them to me in a little more detail, and how best to use them?

Thanks!
-Dale

imagem de Community Admin

This paper describes some rock-solid lock-free collections ( LIFO & FIFO ) that can be used for very SMP friendly inter-thread & process communication.

Lock-Free Collection PDF

After reading that, you should be on track.

Now? How to implement those lock-free alog?s for Pentium?s, Xeon?s, and Itanium processors?

That?s where the cmpxchg8b and cmp8xchg16b opcodes come in to play.

They are CAS operations that swap memory sizes double that of the system pointer size. Very usefull indeed.

You can use this to swap a pointer in a fashion that avoids the ABA problem that the paper talks about. You do this by atomically setting a read ABA id on the target shared memory, and comparing the read id on any CAS on the shared memory. It turns CAS into a LL/SC.

In fact, the Windows API has SList's. They happen to use the exact same lifo algo the paper describes. So Microsoft does use lock-free stuff in there kernals, and even export one for the " general " public. =)

Here are some 64-bit compare and swap's from my library you can use to implement the lock-free algos for 32-bit systems:

Atomic.h

The #ifdefs should be self-explanitory.

A 64-bit version ( Itanium ) would use the cmp8xchg16b opcode.

Lock-Free are what SMP systems thrive on. ;)

PowerPC's can use the algo's with LL(reserved)/SC, any processor with a LL(reserved)/SC can do lock-free collections.

imagem de Clay Breshears (Intel)

lockfree -

Thanks for posting a pointer to the paper. It helped quite a bit in understanding what you have been proposing. I had not heard of the "ABA problem" before, but this gave a good example. Of course, gaining understanding in one aspect of a topic usually brings up new questions and I'd like to post those here for anyone to answer.

But before I get to that, let me first get nitpicky. The first sentence of the paper starts with the definition: "A shared data structure is lock-free if its operations do not require mutual exclusion". I would amend that by saying "...do not require separate mutual exclusion objects to control access" since the shared queue and stack data structures clearly need mutual exclusion to get proper functionality. However, the methods that are used in the paper don't need any additional synchronization object to enforce mutual exclusion.

Back in the old days, when I was first starting my computer science education, we learned about the test and swap instruction. The compare and swap (CAS) atomic operation described in the paper is a generalization of that with the ability to compare against any value and assign any value into the memory location, rather than testing for zero and setting a '1'. I did wonder about the CAS2 operation described to assign two values atomically. Should there not be a second memory location as one of the parameters? Otherwise, where does the instruction know where it must assign the second value? Or have I missed something in the description?

My second question is whether these methods could be used in something other than a linear data structure like the LIFO and FIFO examples? Could we use them to protect elements from an array of shared structs? It seems to me that the technique is a double check algorithm that uses the fact that only a single element (a head or next pointer) needs to be modified in the existing structure. If I needed to modify several parts of an object (e.g., x-, y-, and z-coordinates, speed, plus other physical properties of something moving through space), could a lock-free mechanism be created to protect access to this struct?

I haven't really devoted any time trying to come up with an answer. I was hoping that someone more familiar with the ideas of lock-free techniques would have already done something like this and can share the more general method(s) with all of the Threading Forum participants.

-- clay

imagem de Community Admin

Lock-free algo's do make threads contend, but the overhead of lock-free contention is not even close to the likes of a critical section. ;)

I consider a shared object lock-free if:

1. It doesn't use any OS provided critical sections.

2. It doesn't make contending threads wait in a line.


The cmpxchg8b instruction swaps into a 64-bit destination. So, if you pass cmpxchg8b two 32-bit values to swap into the destination. One will be the low, and one will be the high part of the 64-bit destination.


It is useful in swapping pointers... If it was an array of pointers then you could CAS the cells to new ( updated ) objects. I haven?t worked with lock-free algo?s that implement a dynamic-array yet.


Here is a paper that I find really useful for accessing large collections ( arrays, whatever ) concurrently:

MSDN Magazine, Concurrent Access

I use the papers rock-solid method to protect data when I can?t think up a lock-free solution. =)


The lock-free collections that I presented are very good at inter-thread comm.


I don?t work really work with Gaming Servers, but?

Say you had a simple SMP Geometry Server, which had two thread pools. A Geometry and a Socket thread pool.

The Socket Pool received incoming data about client character moves,and other various stuff. The Socket Pool parses the messages, and sends them to the Geometry Pool via. lock-free FIFO queue for some math.

The Geometry Pool does some collision detection, updates world objects around the player moves, ect? The sends back world information to the Socket Pool via. lock-free collection.

The Socket Pool then distributes the updated world data to it?s connected tiers.

Well, that would beat the pants off ANY other SMP system that used ? normal, blocking ? FIFO queues for updating world objects under load.

=)

Does that sound useful for a game server at all?

imagem de Community Admin

Mr. lockfree,
I read your article, but looks like your code examples are not thread-safe. Here is the lifo-pop:

lifo-pop (lf: pointer to lifo): pointer to cell
SC1: loop
SC2: head = lf->top # get the top cell of the lifo
SC2: oc = lf->ocount # get the pop operations count
SC3: if head == NULL
SC4: return NULL # LIFO is empty
SC5: endif
SC6: next = head->next # get the next cell of cell
SC7: if CAS2 (&lf->top, head, oc, next, oc + 1) # try to change both the top of the lifo and pop count
SC8: break
SC9: endif
SC10: endloop
SC11: return head
Figure 9: lifo-pop catching the ABA problem

It looks like the line SC6 can generate exception in multithreaded environment, if some other thread will assine NULL to the head between SC3 and SC6. If you will fix the example by putting the line SC6 inside try..catch construct, it will work for sure without loosing the performance.

The same problem exactly I found in your fifo-pop example.

imagem de Community Admin

One more note for Mr. lockfree:
Proposed algorithm can work only on SMP systems with shared data cache. I think this also should be noted in your work.

imagem de Community Admin

You have no idea about what you are talking about.

You need to learn about lock-free algo's.

You need to learn about how lock-free algo use memory.

You need to learn about how to use lock-free algo's correctly.

You need to learn why you can't free nodes in lock-free algo's without atomic_ptr or SMR.

IBM SMR Algo

AtomicPtr.c


The algos ARE THREAD SAFE period, Microsoft API exposes a lock-free LIFO stack, the SList API:



Here are both the algos in the paper i cited in working code:

FreeList.c

FreeQueue.c

You don't need a SMP box to runs these, what in the world made you think that bulls%I^?

Before you trash a set of algo's that have been working for many years, you must do some homework. Or you risk looking like a fool.

imagem de Community Admin

Whoops, I forgot to add the Microsoft SList link...

Interlocked SList

Disassemble the SList API on an IA-32 processor, and you will see that it is lock-free. It also uses the exact algo in the paper I cited.

I am sorry if I came across as an ass. I think I did.

I thought that you said there was a bug for sure, instead you were actually asking me if there might me a bug.

Sorry for misreading your post.

imagem de Community Admin

To Mr. lockfree

As one that already coded industry-level multithreaded algorithms for many-many years

:-) :-) :-)

imagem de Community Admin

By the way, did you check your algorithms with Intel Thread Checker ?

imagem de Community Admin

> What do you threading masters think about it?

Take a C++ class, Sender. ;-)

regards,
alexander.

imagem de Community Admin

How ya doin Alex. =)

You should post a link to your portable reference counting standards here.

I?ve been using C for ever ( I even use it for light-weight COM from time to time ), and my C++ is a bit odd I think. However I do use C++ to quickly get a project concept up and running from time to time. Then I re-write the production code with C.

;)


To kdmitry:

? thread checkers ? and lock-free algos don't go to well together. ;)

For one, they will all probably detect the error you think you found. When in fact, its completely normal behavior for a lock-free algo. Read the first couple of pages on the SMR algo, and it will try to explain how lock-free algos use memory.

Basically? If one thread deletes a node, there may be other threads using that node inside their CAS sections. So a thread checker would not like that. =)

You can use SMR or AtomicPtr to overcome this side effect of most lock-free algo?s.

Again, I am sorry for my earlier post.

imagem de bronx

I've rapidly checked your examples, it appears that in order to serialize the access, threads wait in a spin-loop until the extra "anti ABA" count data member is OK (I'll call this very bevahior a *lock* but that's another story).

With many threads competing for the same data structure these spin-loops are likely hotspots at runtime and probably using the PAUSE instruction will lead to speedups when runing on HT capable P4s. For forthcoming Prescott targets have a look at MONITOR/MWAIT in PNIs they should allow you to optimize your code without wasting all these cycles waiting in loops to aquire the lock (or "counter" if you prefer this name).

btw you use a mfence instruction in the SSE compilation path, it has potentially major side effects on other applications that make heavy use of streaming stores or write to WC/non-cachable memory like a gfx card framebuffer, do you really need these memory barriers ? if so, it looks like a major drawback of your code

imagem de jseigh2

The defintion of lock-free is that a thread can make forward progress
without requiring any forward progress by other threads. If a thread
is not waiting for a specific value to be set by another thread then
it is not blocking. There may some looping due to contention but
usually the hardware synchronization guarantees forward progress
in the presence of contention up to a certain level, and you can
control the level of contention by back offs if that becomes an issue.

What is the MONITOR/MWAIT stuff? Pause until a memory location
is changed? I've always thought it would be useful to have alpha
like addition to its LL/SC, a pause condtional that would wait until
the reserve was broken by a store by another processor. It would
be interruptible of course and should optionally raise an interrupt
in problem state so that a time slicing supervisor could intercept on
it and dispatch another thread.

Joe Seigh

imagem de bronx


>If a thread
> is not waiting for a specific value to be set by
> another thread then
> it is not blocking.

sure, in the french paper posted by lockfree they fix what they call the "ABA problem" by adding new data members ("ocount", "icount") tentatively set using atomic compare&swap along with the pointers stuff. Until count and pointer are updated successfully, threads wait in a loop. The major problem here (particularly on SMP systems) is that at each iteration of the loop a snoop request is issued in order to check if the value has been changed by the other CPU. It leads to 128-byte L2 cache lines exchanged accross the system bus like mad. Inserting a PAUSE instruction in such cases is advised and should provide some speedups.


> What is the MONITOR/MWAIT stuff? Pause until a
> memory location
> is changed?

yes, exactly. Information is still scarce,

(see a ref. here :) PNIs

but we can guess that instead of looping wasting execution slots and bus bandwith, the waiting threads will sleep until some criterion is met. They will be woken up immediately, just in time without the penalty of exiting a loop (quite high on P4s with a 20-stage pipeline)


Message Edited by intel.software.network.support on 12-09-2005 02:09 PM

imagem de Clay Breshears (Intel)

> sure, in the french paper posted by lockfree they fix
> what they call the "ABA problem" by adding new data
> members ("ocount", "icount") tentatively set using
> atomic compare&swap along with the pointers stuff.
> Until count and pointer are updated successfully,
> threads wait in a loop. The major problem here
> (particularly on SMP systems) is that at each
> iteration of the loop a snoop request is issued in
> order to check if the value has been changed by the
> other CPU. It leads to 128-byte L2 cache lines
> exchanged accross the system bus like mad. Inserting
> a PAUSE instruction in such cases is advised and
> should provide some speedups.

When I first read the french paper, I too was concerned about spin-waits affecting performance and thought to recommend using PAUSE for HT implementations. Of course, there may be some unwritten requirement that the data structures don't have much contention in the first place. That is, even with hundreds or thousands of threads, there may only be two or three at any time trying to update the data structures.

In any event, one should hope that the lock-free objects would have low contention to justify the few spin-waits that would be needed while not paying the heavy cost of using some synchronization object and attendant API. With most everything in life there would be some tradeoff point where use of a sync object would yield better throughput than the lock-free implementation. I'm guessing that this would be some function of number of threads, size of objects, contention ratio, etc.

-- clay

P.S. Thanks for all remaining (mostly) calm. These lock-free objects are new ideas for most people and those actively espousing their virtues will need a bit of patience until the rest of us catch up with their level of understanding. :-)

imagem de Clay Breshears (Intel)

I'm embarrassed to admit it (esp. since I work for Intel), but I'm having a bit of trouble locating any documentation on the cmp8xchg16b instruction as proposed by lockfree. I've seen some contradictions in some of the papers that have been pointed to in this thread and was hoping to see what the details were on this instruction that might confirm some assumptions that I've made on this topic.

I haven't dealt with assembly language in any depth since I was programming 6502-based Apple IIs way, way back. ("In my days we didn't have any fancy text editors or hoity-toity Windows. To correct a program back then you had to punch new holes in Hollerith cards with a dull pocket knife...and we liked it!" :-) So, I'm not up on the latest documentation for assembly language instructions.

Thanks in advance.

-- clay

imagem de jseigh2

cmp8xchg16 is documented in the Itanium architecture software developers manual. It's basically a compare 8 bytes, swap 16 bytes. There's some question as to why cmpxchg8b was not ported to Itanium as compare 16 bytes, swap 16 bytes. atomic_ptr uses a doubleword compare and swap. While I can port atomic_ptr to Itanium, there's two or three things I can do, it's giving lockfree(aka sender) fits as he's trying to put into atomic_ptr some ABA free stuff. While I don't think there's any benefit to having the ABA stuff in atomic_ptr, I don't want to implement atomic_ptr in a way that makes it impossible for him to do that if that is what he wants to do.

imagem de Clay Breshears (Intel)

> cmp8xchg16 is documented in the Itanium architecture
> software developers manual. It's basically a compare
> 8 bytes, swap 16 bytes.

Thanks, Joe. This is what I thought it was from the name. I guess this corresponds to what the french paper defines as CAS2. I still can't see how this would work, though, as intended. The idea about avoiding the ABA problem is to compare (and exchange) both the pointer involved and the count of operations. If the pointer has ended up with the same value even after other threads have updated the structure, the count of operations will have changed to inform the original thread that it must recheck the pointer and counter values.

Maged Michael's paper from PODC 2002 points out that DCAS (his name for CAS2) needs to make the comparison of both values before instigating the exchange. This was as I thought, but did not seem to be laid out in the french paper. Unless, I thought, the memory layout was working with the CAS2 operation. That is, given a single memory address that was actually going to refer to a double word location. So, you would need to place the pointer and the reference count into adjoining memory locations, but could use separate (8-byte?) values for comparison and exchange. That made some sense, but I would think it would be better designed to have the same three arguments as the simple CAS, but double the size of the arguments which would require the user to place values in the hi and lo parts of the double area. Still, since this was only an idea in the french paper to solve an algorithmic problem, anything would be feasible.

Unfortunately, the cmp8xchg16b looks to only compare the first 8 bytes, but exchange 16 bytes based on that comparison. This seems to be the same as my first interpretation on the CAS2 primitive. If so, does this not suffer from the same drawbacks and not give the direct solution to the ABA problem that was given in the french paper? Since Michael states that the DCAS is not a primitive implemented on any architecture (and I don't think the cmp8xchg16b does a proper job of it), does he have the more realizable ABA avoiding algorithm?

I guess I too wonder why there was no cmpxchg16b implemented on Itanium. This would seem to have solved all sorts of extra problems that the 8 byte compare can't handle easily. I suppose we'll have to poll someone involved with Itanium development to find the answer.

-- clay


imagem de jseigh2

DCAS is not the same as CAS2. DCAS works on 2 discontiguous words in memory. DCAS is the name this instruction has on the Motorola 68020 and 68030. CAS2 usually refers to a contiguous doubleword (not to be confused with DWORD) compare and swap. If the word size is 64 bits, then CAS would work on 64 bit operands and CAS2 would work on 128 bit operands. The mnemonics are not standard in this area, so it is easy to get confused when reading any papers on this topic.

The Itanium cmp8xchg16 will work for lock-free stacks if you increment the change count on both the push and pop operations, not on just one like you could do with CAS2. But it does break some algorithms that do require a compare on both words.

I just noticed Microsoft's latest documentation on their Interlocked Singly Lists. It says

Singly linked lists (SLists) are straightforward to implement and use in
32-bit code. However, it is challenging to implement them in 64-bit code
because the amount of data exchangeable by the native interlocked
exchange primitives is not double the address size, as it is in 32-bit code.


There's no problem with this on the Itanium, but I don't think Microsoft is too happy with a certain other 64 bit architecture that they've agreed to port to. :) It's not so much a case of lock-free being lock free, although that's nice. It's a question of async safety, meaning you can use it in interrupt handlers/callbacks. If you use locks to do an SLIST implementation, it's no longer async safe and code that was previously depending on this is now broken. If Microsoft had used non invasive list links (SLIST_ENTRY) they could work around it, but they didn't.

Good luck on getting answers from the architects. They usually don't like publishing the architecture rationales. All the ones I've participated in, they've always stripped that stuff out. I'd end up thinking that if I hadn't been at those meetings then I would have no idea how to use that particular feature.

Joe Seigh

imagem de Community Admin

> The Itanium cmp8xchg16 will work for lock-free stacks
> if you increment the change count on both the push
> and pop operations, not on just one like you could do
> with CAS2. But it does break some algorithms that do
> require a compare on both words.

My new lock free system will work with a normal cmpxchg or ll/sc on a 32 or 64 bit system. It also will work great with cmp8xchg16.

The system provides nodes that can be used in most lock-free algos. I have a lifo & fifo on my site, and I plan to add some more.

Also?

If you use my new algo with a QWORD CAS, you can compare and swap two nodes at once. That will make it easier to create lock-free double-linked lists!

;)

What do ya think of that?

> However, it is challenging to
> o implement them in 64-bit code
> because the amount of data exchangeable by the
> e native interlocked

We must disassemble the SList API on a 64-bit AMD, then on an Itanium II. That would be interesting indeed!


=)

imagem de bronx

> objects would have low contention to justify the few
> spin-waits that would be needed while not paying the
> heavy cost of using some synchronization object and
> attendant API. With most everything in life there

ok, but how about a general purpose lightweight lock, like one based on inlined code without call to the OS ?
a "cmpxchg8b" based loop will do the trick, for the stacks and queues examples it will require one CAS loop on entry and another CAS loop on exit (+ ultra simple code in between, *easily maintainable*), instead of a single convoluted CAS2 loop, I will be very interested to compare actual timings (btw how can we implement CAS2 for IA-32 targets ?)

imagem de Community Admin

cmpxch8b implements the french papers CAS2.

Here is an older FIFO queue algo that I have working on my site:

Lock-Free FIFO Algo

It?s the well-known fifo mike and scott algo.



The lock-free version will beat a spin-lock protected version.

You don?t want to have threads wait in a line to access the queue.

You want all the threads in at the same time, each trying to push or pop nodes on each CAS2.



I have implemented my new lock-free system that only requires a normal CAS ( InterlockedCompareExchange ) to work, which makes it portable to any processor that has a normal CAS, or a normal LL/SC.

LockFree.c

FreeQueue.c

FreeList.c

Its proving to be perfect for balanced consumer / producer communication.

A quick description for my new algo is at the root of this thread here:

New Lock-Free Algo



Also,

You don?t need that mfence opcode after the cas?s I have in Atomic.h.

One needs a release barrier on success.

The other needs a release on success, and an acquire after failure.

imagem de ravenous

Hello lockfree, i tried accessing your codes to test out your lockfree algorithms but apparently the links are down. Is it possible to update the links. Tks.

imagem de Community Admin

> Hello lockfree, i tried accessing your codes to test
> out your lockfree algorithms but apparently the links
> are down. Is it possible to update the links. Tks.

Here is the home page:

AppCore

All of it is in plain Win32 C.

The algo's currently work, but I have updated some of them. I will post the updates soon...


Also, I am going to start coding some lock-free pthread compatible objects...

Here is my lock-free semaphore, that uses pthreads:

lfsema.cpp

lfsema.h

atomic.h


Here is a pthread test program for the semaphore:

main.cpp

thread.cpp

thread.h

queue.h

mutex.h


Enjoy!

;)


Please comment on them if you want...

imagem de ymchew

Dear Lockfree,

Hi, I am relatively new to this research area. As i am involving in a lock-free techniques related research project recently, I thus start to learn regarding this.

I would like to ask if anyone of the users here, especially Mr. Lockfree could share the idea on how to implement a lock-free linked-list using CAS. I am facing difficulty in finding related reference material as this area is still quite new. I would like to enquire the independency of CAS on different processor platform as well.

I would be appreciate if any one of you could provide some materials to enhance my understanding regarding this lock-free techniques.

Thanks in advance.

Best Regards,

imagem de Community Admin



> I would like to ask if anyone of the users here,
> , especially Mr. Lockfree could share the idea on how
> to implement a lock-free linked-list using CAS.

Welcome to state-of-the-art contention handling!

;)



Here are some links that I posted to c.p.t, where I am known as SenderX:

2 Lock-Free Linked Lists


> I am
> facing difficulty in finding related reference
> material as this area is still quite new. I would
> like to enquire the independency of CAS on different
> processor platform as well.

Any processor that has a CAS or LL/SC that operates on memory sizes >= sizeof( void* ) can work for lock-free linked-lists. The IBM SMR algo makes this statement true. I am currently working on a Win32/PThread SMR algo.

I will probably post it to c.p.t.



You should think about using a memory reclamation schema with your lock-free lists:

Here is a link to my personal lock-free proxy garbage collector and stack, that can be iterated in a 100% lock-free manor!:

Lock-Free Garbage Collector & Stack


Here is a link to another VERY useful lock-free garbage collector:

IBM SMR GC Algo

This algo. will elimate the ABA problem, and allow for normal CAS or LL/SC.

It also clearly explains why a lock-free garbage collector is needed for "Dyanamic/Non-Static" lock-free algos.



I am working an a very portable, 100% lock-free, FAST, linked-list that allows for a Push, Pop, and Iteration to execute in parallel!

I might post it here...

=P


P.S.

Lock-free algos, are being used more and more. Java and the linux kernel are a few. Win32's SList API is also lock-free.

Read here how a lock-free algo. vastly increased the Java Dictonary peformance and the linux kernel:

Java Lock-Free

Linux Lock-Free


Any questions on lock-free algos are on topic here, and are encouraged...

Enjoy!


Message Edited by intel.software.network.support on 12-09-2005 02:21 PM

imagem de ymchew

Dear Lockfree,

I am appreciate your reply as it further enchance my understanding in regarding area. However, in my very first stage of my research project, I am asked to implement a lockfree linked list on a 6-CPU based server, powered at Pentium Pro 200MHz, in Linux OS environment, preferably.

In the CAS code that you posted, I realized that it might be suitable for 64-bit processor eg. Itanium or etc. I am wondering if the CAS32 code availale for reference.

The other question is once I have the CAS32 code, how could I link it in my C coding environment of the linked-list.

Please enlighten me. Thank you very much.

Regards,

imagem de danmay


Your links below to source code are dead. Do you have updated links or


can you e-mail me the code ?




imagem de Chris M. Thomasson

Faça login para deixar um comentário.