An optimization problem on atomic<T>

The following is a producer/consumer example.

atomic<size_t> in, out;
in = 0;
out = 0;

P1:
while ( (in + 1) % capacity == out ) ;
/* do something */
in = (in + 1) % capacity;

P2:
while ( in == out ) ;
/* do something */
out = (out + 1) % capacity;

As you can see, 'in' ('out') is only modified by P1 (P2).

I am wondering if there is any possible benefit to replacing in (out) with an ordinary variable, so that the above code becomes:

P1:
const size_t local_in = (in + 1) % capacity;
while ( local_in == out ) ;
/* do something */
in = local_in;

P2:
const size_t local_out = out;
while ( in == local_out ) ;
/* do something */
out = (local_out + 1) % capacity;

Any comment will be appreciated.


Quoting azru0512

I am wondering if there is any possible benefit to replacing in (out) with an ordinary variable.

Well, for example, if you want to perform an act of sabotage, then there is the benefit of producing a subtly broken program that people may spend months fixing. Anything else?.. I do not see any.

If you care about performance, your atomics library must be flexible enough not to penalize you for no reason, i.e. it should produce machine code so close to optimal that any more "optimal" code would be broken.

All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
http://www.1024cores.net

Quoting Dmitriy Vyukov
If you care about performance, your atomics library must be flexible enough not to penalize you for no reason, i.e. it should produce machine code so close to optimal that any more "optimal" code would be broken.

What you are asking is basically: I need to write some data to a file, but it is slow - is there any benefit in not writing it to the file? Sorry, if you need to write some data to a file, you need to write it to the file; there is no way to get around it.

The same goes for atomics. They are not there to penalize you; they are there to provide the functionality your program *requires*.


So you mean I should just stick to atomic and not think about what I am doing in the example above?

Quoting azru0512
So you mean just stick to atomic

If you are doing multi-threaded programming you need to stick to synchronization primitives that will provide the synchronization your program requires. Atomics are one of them. Plain C/C++ variables are NOT one of them.


So the first one is OK, but the second one could be dangerous? I mean the above example.

Quoting azru0512
So the first one is OK, but the second one could be dangerous? I mean the above example.

Yes. The most innocuous (from the point of view of run-time behavior and time spent localizing the problem) outcome for the second program is that it will instantly hang.


Could you explain how it could hang? I cannot figure out how this could happen. Thanks.

Quoting azru0512

Could you explain how it could hang? I cannot figure out how this could happen. Thanks.

It will hang if the threads do not reload 'in' and 'out' while spinning.


while ( local_in == out ) ;

In the statement above, endless spinning only occurs if the thread does not reload "out". But how could this happen? Doesn't declaring the variable "out" as atomic help?

> In the statement above, endless spinning only occurs if the thread does not reload "out".

No, in the statement above spinning occurs while local_in != out.

For it to be correct, the thread must reload 'out' from memory on every iteration. Otherwise (if 'out' is cached in a register) it can hang.

> But how could this happen?

This happens by default. You should ask: what must I do to prevent it?

Declaring 'out' as atomic prevents this. All sane atomics implementations explicitly or implicitly guarantee eventual visibility of changes.



I repost the example code here.

atomic<size_t> in, out;

in = 0;
out = 0;

P1:
const size_t local_in = (in + 1) % capacity;
while ( local_in == out ) ;
/* do something */
in = local_in;

P2:
const size_t local_out = out;
while ( in == local_out ) ;
/* do something */
out = (local_out + 1) % capacity;

The statement "while ( local_in == out ) ;" should be spinning when "local_in == out", if I don't misunderstand what the "while" statement means. Besides, maybe you forgot that I already declared "out" as an atomic variable.

So is the above program still problematic? Thanks.

Quoting azru0512

So is the above program still problematic?

As far as I see, it is Ok.


Thanks.

And I still want to ask whether this kind of usage (i.e., const size_t local_in = in) has any impact on performance.

Or, just as you said, sticking to atomic is just fine. In other words, forget about "const size_t local_in = in" and just use "in" all the way.

Quoting azru0512
Thanks.

And I still want to ask whether this kind of usage (i.e., const size_t local_in = in) has any impact on performance.

Or, just as you said, sticking to atomic is just fine. In other words, forget about "const size_t local_in = in" and just use "in" all the way.

Ah, I see. I guess we had some misunderstanding about the meaning of your original statement, "I am wondering if there is any possible benefit to replacing in (out) with an ordinary variable".

I interpret it as you want to replace:

atomic<size_t> in, out;

with

size_t in, out;

It is NOT Ok.

And it seems that you want to replace:

while ( in == out );

with:

const size_t local_out = out;

while ( in == local_out );

(right?)

It is indeed Ok.

Regarding the performance benefit of such a replacement, I think there can be some benefit. A read of an atomic variable will most likely result in a load from the memory subsystem, while a local variable will most likely be cached in a register.


Yes. Finally we understand each other. : )

And I still have a question about atomic. Consider the following producer/consumer example:

atomic<size_t> in, out;
in = 0;
out = 0;

Producer:
void produce(char *data) {
    while ( (in + 1) % capacity == out ) ;
    buffer[in] = data;
    in = (in + 1) % capacity;
}

Consumer:
void consume(char *data) {
    while ( in == out ) ;
    data = buffer[out];
    out = (out + 1) % capacity;
}

As far as I know, most code patterns use a read-acquire/store-release pair like a lock/unlock pair. For example:

read-acquire

memory operations

store-release

The above pattern works like a critical section: it prevents the memory operations between the read-acquire and the store-release from moving from the inside to the outside.

But in the above producer/consumer example, we see there are some read-acquires in between (e.g., buffer[in] = data;).

So my question is:

Is the above example OK? And what constraints or rules should we apply to those memory operations between a read-acquire/store-release pair?

How is 'data' declared?
Do you mean TBB atomics here? Frankly, I do not remember their exact semantics, and I have always had problems understanding code that uses them - laconism and implicitness are the last things I want to see in an atomics library. I prefer to stick to C1x atomics lately - they are more flexible and explicit.
Is value evaluation of tbb::atomic a load-acquire? Is a store to tbb::atomic a store-release?


Yes, I mean TBB atomics here. TBB atomics associate reads/writes (loads/stores) with acquire/release fences. I have checked the TBB online reference.

What are the main differences between TBB atomics and C1x atomics? As far as I can tell, C1x atomics have a default full fence, which guarantees sequential consistency.

And does how "data" is declared have any influence on the correctness?

> What are the main differences between TBB atomics and C1x atomics?

They are more flexible and explicit.

> As far as I can tell, C1x atomics have a default full fence, which guarantees sequential consistency.

I prefer to use the _explicit form, which requires explicit indication of the memory ordering.

> And does how "data" is declared have any influence on the correctness?

But of course. It is 'data' that synchronizes data transfer between threads.


"But of course. It is 'data' that synchronizes data transfer between threads."
Would you like to "rephrase" that? :-)

But of course. It is 'data' that synchronizes data transfer between threads.

Producer and consumer should operate on different buffer slots, right? Producer inserts data into buffer[1], and consumer retrieves data from buffer[2], for example.

Then, what exactly do you mean by "how 'data' is declared"?

Quoting azru0512
But of course. It is 'data' that synchronizes data transfer between threads.

Producer and consumer should operate on different buffer slots, right? Producer inserts data into buffer[1], and consumer retrieves data from buffer[2], for example.

Then, what exactly do you mean by "how 'data' is declared"?

Provide the complete program code, then we will be able to discuss further. If it were C, then a piece of code would possibly be enough. But what does "data[in] = d" mean in C++ only God can possibly know.


Dmitriy, I think that you didn't make the link between "data" in the third-last line of #15 and the example code above, where it is a function parameter declared as "char*"; otherwise you would probably have responded differently.

IMHO, the read-acquire in "buffer[in] = data;" is irrelevant, just as the read-acquire in "in = (in + 1) % capacity;" (right-hand side); it's the release-store in "in = (in + 1) % capacity;" (assignment) that matters. Throwing in extra releases or acquires only affects performance (adversely, if at all), not semantics (as long as the program is otherwise correct).

(Rephrased in the second person, and I apparently lost the race with #21.)

So the example code just works fine, right? I mean works correctly, of course.

"So the example code just works fine, right? I mean works correctly, of course."
The code in #15? No, "data = buffer[out];" does nothing useful. Maybe you mean "*data = *buffer[out];", or "strcpy(data, buffer[out]);"? Then yes, probably, unless I missed something.

(Added) Although... let me meditate on it a bit longer. In the meantime, Dmitriy can put it through Relacy. :-)

Quoting Raf Schietekat
Dmitriy, Ithink that youdidn't make the link between "data" in the third-last line of #15 and the example code above, where it is a function parameter declared as "char*", otherwise you would probably have responded differently.

Ah, you're right. I meant 'buffer', not 'data'.


Well, buffer will be declared as follows:

char *buffer[32];

And producer/consumer inserts/retrieves a "pointer" into/from the buffer.

But what does "buffer[in] = data" mean in C++ only God can possibly know.
(I changed the sentence a little bit.)

Just curious, why did you say so?

#26 "And producer/consumer inserts/retrieves a "pointer" into/from the buffer. "
As I wrote in #24, the consumer doesn't do anything useful: you must instead return "data" as a return value or as an out parameter (by reference or by pointer), or use the referent inside the consumer.

#27 "Just curious, why did you say so?"
C++ has objects...

#24 "(Added) Although... let me meditate on it a bit longer. In the meantime, Dmitriy can put it through Relacy. :-)"
Hmm, so P2 has to read a buffer location (and its referent if that may be reused) before P1 overwrites it (and possibly the referent) on the next round. If we assume a processor that can reorder loads, how is the load forced to occur on time? Is there a load-store associated with "out=(out+1)%capacity;"? I see load-load for reading "out", and store-store for writing its new value, but no load-store, and once "out" is written P1 is free to overwrite the data, right? So does that mean rel_acq (or acq_rel) for the atomic operations, or even sequential consistency? It would also mean that "data" cannot be just returned if its referent can be reused, it must all be treated before "out" is updated. Dmitriy, am I seeing ghosts, or maybe not?

Quoting azru0512

But what does "buffer[in] = data" mean in C++ only God can possibly know.
(I changed the sentence a little bit.)

Just curious, why did you say so?

operator[] can be overloaded

operator= can be overloaded

some implicit conversion functions can take place

some fancy temporary helper objects can be created

Without knowing the exact types and their definitions, the expression can mean basically anything in this world.


Quoting Raf Schietekat
Dmitriy can put it through Relacy. :-)

I do not have time for that right now, but everybody is free to do that manually.

http://groups.google.com/group/relacy


Quoting Raf Schietekat
Hmm, so P2 has to read a buffer location (and its referent if that may be reused) before P1 overwrites it (and possibly the referent) on the next round. If we assume a processor that can reorder loads, how is the load forced to occur on time? Is there a load-store associated with "out=(out+1)%capacity;"? I see load-load for reading "out", and store-store for writing its new value, but no load-store, and once "out" is written P1 is free to overwrite the data, right? So does that mean rel_acq (or acq_rel) for the atomic operations, or even sequential consistency? It would also mean that "data" cannot be just returned if its referent can be reused, it must all be treated before "out" is updated. Dmitriy, am I seeing ghosts, or maybe not?

As far as I see everything is Ok here.

If you are thinking in terms of bidirectional fences, then acquire=#LoadLoad|#LoadStore, and release=#LoadStore|#StoreStore.

> Hmm, so P2 has to read a buffer location (and its referent if that may be reused) before P1 overwrites it (and possibly the referent) on the next round. If we assume a processor that can reorder loads, how is the load forced to occur on time?

The load happens-before the store-release to 'out'. And P1 can overwrite the location only after a load-acquire of 'out'.

> Is there a load-store associated with "out=(out+1)%capacity;"?

It should be there. It's the store-release.



Quoting azru0512
But in the above producer/consumer example, we see there are some read-acquires in between (e.g., buffer[in] = data;).

So my question is:

Is the above example OK? And what constraints or rules should we apply to those memory operations between a read-acquire/store-release pair?

As far as I see, it's Ok.

The load-acquires in between are superfluous; you may cache 'in' in a local variable and avoid reloading 'in' several times.

As for the rules, that is too general a question; the general answer is that you can use any operations as long as the code stays correct (no data races, intended behavior, etc.).


"If you are thinking in terms of bidirectional fences, then acquire=#LoadLoad|#LoadStore, and release=#LoadStore|#StoreStore."
Ah, right, thanks... Just ghosts then. :-)

(Added) Well, except for the very real need to also process the referent before updating "out", if the referent's location can be reused.

Files cannot be downloaded from http://groups.google.com/group/relacy/files right now.

