Race-condition or 'false cache sharing'?

Hi,

I have accidentally stumbled on this article

https://software.intel.com/en-us/articles/multithreaded-parallel-selecti...

which has better formatted copy here:

http://www.codeproject.com/Articles/1062000/Multithreaded-Parallel-Selec...

There are several claims in the article that directly contradict everything I know about programming, so I would like to check here whether I have been wrong all these years...

In the section "Eliminating Threads Cache False Sharing issues", the author claims that "false cache sharing issues" somehow cause his algorithm to become unstable.

All I see there is a trivial race condition:

#pragma omp parallel for default(none) private(nindex) shared(array,min_val)
// Iterating through the subset of the succeeding data items starting with
// the index+1's data item 
for (nindex = index + 1; nindex < N; nindex++)
     // Evaluating the condition statement by checking if the current data item's
     // value is less than the shared value of the smallest data item min_val
     if (array[nindex] < min_val)
     {
         // Entering the critical section
         omp_set_lock(&lock);
         // If the condition statement is evaluate to true, then the variable
         // min_val is assigned to the value of the smallest data item so far         
         min_val = array[nindex];
         // Leaving the critical section
         omp_unset_lock(&lock);
     }

The author locks the critical section for writes, but not for reads. Thus two threads can evaluate the 'if' at the same time, causing a race condition: the thread with the larger array[nindex] may acquire the lock later and overwrite min_val with a value that is no longer the smallest.
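
For completeness, the minimal fix I'd expect (ignoring performance) is to keep the comparison and the assignment under the same lock. A sketch, reusing the names from the snippet above:

#pragma omp parallel for shared(array, min_val, lock)
for (nindex = index + 1; nindex < N; nindex++)
{
     // Both the read and the write of min_val happen while holding the
     // lock, so no thread can test against a stale value. This of course
     // serializes every comparison, so it is slow; it only shows where
     // the guard has to be.
     omp_set_lock(&lock);
     if (array[nindex] < min_val)
         min_val = array[nindex];
     omp_unset_lock(&lock);
}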

The author clearly encountered this bug, but explained it by 'false cacheline sharing'. I believe that 'false cacheline sharing' is a performance issue that occurs when two (or more) non-shared variables happen to share the same cache line, and while it can significantly degrade performance, it cannot by itself cause incorrect behaviour. Am I wrong about this?

Thanks for the explanation,

M. Fidler


I'm sorry, you are asking this question in the wrong place. This forum is for the discussion of the implementation of the open-source OpenMP runtime. Your question is about a general issue in parallel programming.

(Which said, I agree with you. False-sharing is about performance, not correctness, and that code has a bug!)

James, I've got a couple of questions for you:

1.) Why, if this code has a bug, does it nevertheless perform the correct sorting of an array of N data items regardless of the bug, which suggests that it functions correctly?

2.) Why do other authors like Tim Mattson and other experts from Intel explain in their OpenMP reference manuals that false sharing is not only a performance issue but also leads to code not performing its tasks correctly? I basically refer to http://openmp.org/mp-documents/Intro_To_OpenMP_Mattson.pdf and https://docs.oracle.com/cd/E19205-01/819-5270/aewcy/index.html. And I can't understand why M. Fidler does not agree with the fact that 'false sharing' actually causes multiple cores to load the most recently used copy of a cache line from memory and obtain the outdated previous values of a variable, since it has been modified by a thread and immediately marked as invalid and offloaded back to memory according to the MESI protocol. When the entire cache line is marked as invalid, the other threads are not able to access variables sitting on that cache line. Therefore, when a variable sitting on the invalidated cache line is accessed by the other threads for reading, each core loads the most recently used (MRU) copy of the cache line, which contains the outdated value of that variable. This typically prevents each thread in a team from obtaining the correct values of the variable and thus leads to incorrect data processing. I'm sorry, but this is evident and stated in most of the documentation from Intel and Oracle. Obviously, in most cases we have to implement padded arrays to store the values of variables obtained at the end of each thread's execution.

3.) M. Fidler also refers to the bug of calling omp_set_lock only when the value of the min_val variable is modified. If you analyze this code with Intel Inspector XE, you'll see that you have the option to either correct the issue or simply omit it, assuming that it's not actually an issue. As a matter of fact, if you correct this issue by invoking omp_set_lock(...) when accessing the min_val variable for reading in the condition statement being evaluated, you'll experience a significant performance degradation. Whereas not adding another lock will *NOT* cause the code to perform the sorting incorrectly. That's actually why I omitted this issue when it was detected by Intel Inspector XE.

Thanks for reading. I'm waiting for your reply; I'd like to know your opinion about all the points provided above.

Cheers, Arthur V. Ratz.


1) Why do other authors like Tim Mattson and other experts from Intel explain in their OpenMP reference manuals that false sharing is not only a performance issue but also leads to code not performing its tasks correctly?

I have looked at both your references. Neither says that false sharing causes correctness problems. (Which is good, because it doesn't). Your code is not suffering from false sharing, but true sharing. All of your threads are trying to update the same location (min_val) at the same time, so you have a genuine race condition. (The Wikipedia article at https://en.wikipedia.org/wiki/False_sharing describes false sharing reasonably). 

The best way to handle your case would be to use an OpenMP user-defined reduction, since what your inner loop is doing is computing a min reduction while also remembering the index of (one of the) minimum value(s). If you don't want to do that, Tim's Pi example shows you alternative ways to roll your own reductions (and discusses the false sharing issues they may run into); code like that would work here too.
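
A rough sketch of what such a user-defined reduction could look like (using the OpenMP 4.0 declare reduction directive, assuming int data; the minloc_t type and the names here are purely illustrative, not taken from the article):

#include <limits.h>

// Pair of (value, index) carried through the reduction.
typedef struct { int val; int idx; } minloc_t;

// User-defined "minloc" reduction: keep whichever partial result holds
// the smaller value, together with its index.
#pragma omp declare reduction(minloc : minloc_t : \
        omp_out = (omp_in.val < omp_out.val ? omp_in : omp_out)) \
    initializer(omp_priv = (minloc_t){ INT_MAX, -1 })

minloc_t m = { INT_MAX, -1 };
#pragma omp parallel for reduction(minloc : m)
for (int i = index + 1; i < N; i++)
    if (array[i] < m.val) { m.val = array[i]; m.idx = i; }
// m.val now holds the minimum of array[index+1 .. N-1], and m.idx the
// position of one occurrence of it; no locks are required.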

Your use of locks does not resolve any race issues (since it does not guard the read and associated write), so is futile.

Your description of the MESI protocol is also incorrect. The whole point of the protocol is to ensure that no thread ever sees an old value, whereas you assert that it causes that to happen.

Sorry, but this article is full of misinformation.

At the risk of prolonging a discussion on a forum where it's not topical, it seems possible to combine reduction(min:) (now that it has been part of the OpenMP standard for over two years) with firstprivate and lastprivate clauses to take care of the indices.  You could continue the discussion on a programming-language forum for your chosen OpenMP compiler.

As it's important to use both simd parallelism and threaded parallelism, it's probably worthwhile to follow examples which use a simd reduction in an inner loop and a parallel reduction in an outer loop.  Cilk(tm) Plus includes (slow) minimum-index simd reducers which you could use as a starting point (get your example working correctly before launching off into speculation about how a compiler should guess what you mean to do).
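
In its simplest combined form, for the minimum value alone (index bookkeeping left out), the shape would be something like this sketch, assuming array, index and N are in scope:

#include <limits.h>

// Built-in reduction(min:) combined with simd (OpenMP 4.0+): each thread
// keeps a private, vectorizable running minimum, and the partial results
// are combined automatically at the end of the loop.
int min_val = INT_MAX;
#pragma omp parallel for simd reduction(min : min_val)
for (int i = index + 1; i < N; i++)
    if (array[i] < min_val)
        min_val = array[i];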

James, I've read your post. Thanks a lot for your reply, but I've got a couple more questions for you, since all this still seems quite confusing to me.

Your use of locks does not resolve any race issues (since it does not guard the read and associated write), so is futile.

1. If I remove the omp_set_lock(...) and omp_unset_lock(...) calls from the code, the code will *never* sort an array of data items properly. I've tried it multiple times and I'm convinced that those calls really are needed where they are now. I recall that it doesn't work correctly without them, and I can't figure out why.

The best way to handle your case would be to use an OpenMP user-defined reduction, since what your inner loop is doing is computing a min reduction while also remembering the index of (one of the) minimum value(s). If you don't want to do that, Tim's Pi example shows you alternative ways to roll your own reductions (and discusses the false sharing issues they may run into); code like that would work here too.

2. When I faced this problem with the user-defined reduction, back when my code didn't work correctly, I tried to get around the issue with multiple techniques. I read lots of documentation about OpenMP reduction and tried to implement it, but it did not work for me. Then I explored Tim Mattson's Pi example, which demonstrates the method of using padded arrays. I tried to implement it and it worked in this case.
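
Roughly, the padded-array shape I mean looks like the sketch below (this is not my actual code; the 64-byte line size, the 256-thread limit and the names are just illustrative):

#include <limits.h>
#include <omp.h>

#define CACHE_LINE 64   /* assumed cache-line size in bytes */

/* One padded slot per thread, so that no two threads' partial results
   ever sit on the same cache line. The padding only addresses the
   performance problem (false sharing); correctness comes from each
   thread writing exclusively to its own slot. */
typedef struct { int value; char pad[CACHE_LINE - sizeof(int)]; } padded_int;

int padded_min(const int *array, int index, int N)
{
    padded_int partial[256];   /* assume at most 256 threads */
    int nthreads = 1;

    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        int local = INT_MAX;

        #pragma omp single
        nthreads = omp_get_num_threads();

        #pragma omp for
        for (int i = index + 1; i < N; i++)
            if (array[i] < local)
                local = array[i];

        partial[tid].value = local;   /* each thread writes only its own slot */
    }

    /* Combine the per-thread partial minima serially. */
    int min_val = INT_MAX;
    for (int t = 0; t < nthreads; t++)
        if (partial[t].value < min_val)
            min_val = partial[t].value;
    return min_val;
}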

Your description of the MESI protocol is also incorrect. The whole point of the protocol is to ensure that no thread ever sees an old value, whereas you assert that it causes that to happen.

3. In exploring the MESI protocol I basically referred to:

"If the processor stores a cache line marked as ‘S’, the cache line is marked as ‘Modified’ and all other processors are sent an ‘Invalid’ cache line message. If the processor sees the same cache line which is now marked ‘M’ being accessed by another processor, the processor stores the cache line back to memory and marks its cache line as ‘Shared’. The other processor that is accessing the same cache line incurs a cache miss." - https://software.intel.com/en-us/articles/avoiding-and-identifying-false-sharing-among-threads

"However, simultaneous updates of individual elements in the same cache line coming from different processors invalidates entire cache lines, even though these updates are logically independent of each other. Each update of an individual element of a cache line marks the line as invalid. Other processors accessing a different element in the same line see the line marked as invalid. They are forced to fetch a more recent copy of the line from memory or elsewhere, even though the element accessed has not been modified. This is because cache coherency is maintained on a cache-line basis, and not for individual elements. As a result there will be an increase in interconnect traffic and overhead. Also, while the cache-line update is in progress, access to the elements in the line is inhibited." - https://docs.oracle.com/cd/E19205-01/819-5270/aewcy/index.html.

Really, I still don't understand why my article is supposed to contain a lot of misinformation.

I'm very sorry. Waiting for your reply.

Thanks, Arthur V. Ratz.

Today I've made changes to my article, and I'd like to know whether it will be published, and if so, when.

As this has already turned into a vastly misplaced topic... My first reaction was to use the double-checked pattern:

#pragma omp parallel for default(none) private(nindex) shared(array,min_val)
for (nindex = index + 1; nindex < N; nindex++)
     if (array[nindex] < min_val) // chance check
     {
         omp_set_lock(&lock);
         if (array[nindex] < min_val) // confirmation check
            min_val = array[nindex];
         omp_unset_lock(&lock);
     }

I know that double-checking has problems of its own, so I wonder if this is safe in the context of OpenMP?

(In fact, I believe this places this discussion sort of back on topic... :)

Additionally, you must incorporate

   #pragma omp flush(min_val)

#pragma omp parallel for default(none) private(nindex) shared(array,min_val)
for (nindex = index + 1; nindex < N; nindex++)
{
     #pragma omp flush(min_val)
     if (array[nindex] < min_val) // chance check
     {
         omp_set_lock(&lock);
         #pragma omp flush(min_val)
         if (array[nindex] < min_val) // confirmation check
         {
            min_val = array[nindex];
            #pragma omp flush(min_val)
         }
         omp_unset_lock(&lock);
     }
}

As Tim mentioned, it would be better to use the OpenMP reduction(min:min_val) clause.

The equivalent to the reduction(min:...) clause is:

min_val = INT_MAX; // your max value for the type of min_val
#pragma omp parallel default(none) private(nindex) shared(array, min_val, index, N)
{
    int local_min_val = INT_MAX; // your max value for the type of min_val
    #pragma omp for
    for (nindex = index + 1; nindex < N; nindex++)
    {
        if (array[nindex] < local_min_val) // chance check
            local_min_val = array[nindex];
    }
    for(;;)
    {
        #pragma omp flush(min_val)
        int temp_min_val = min_val;
        if(temp_min_val <= local_min_val)
            break;
        if(__sync_bool_compare_and_swap(&min_val, temp_min_val, local_min_val))
            break;
    }
}

Jim Dempsey

Hi, Jim. I've tried to correct my code in the way you recommended in your post, by either implementing a confirmation check or using reduction(min:min_val). In practice, both of these modifications cause a *serious* performance degradation compared to the sequential single-threaded sorting algorithm. I don't know the reason why, but it is a fact. I also can't figure out why so many people keep claiming that I should change my parallel code by following their recommendations, while using the things they recommend practically leads to significant performance degradation. If you don't believe that it's true, you can check it for yourself. I've attached two executables to this post: one is the binary built from the first, unchanged version of the parallel code that I enclosed with my published article, and the other is a build of the parallel code after it was changed following the recommendations of the people who discussed this subject.

Attachments:

SortMP_v0.zip (8.72 KB)
SortMP_v1.zip (8.94 KB)

Quote:

jimdempseyatthecove wrote:

Additionally, you must incorporate

   #pragma omp flush(min_val)

#pragma omp parallel for default(none) private(nindex) shared(array,min_val)
for (nindex = index + 1; nindex < N; nindex++)
{
     #pragma omp flush(min_val)
     if (array[nindex] < min_val) // chance check
     {
         omp_set_lock(&lock);
         #pragma omp flush(min_val)
         if (array[nindex] < min_val) // confirmation check
         {
            min_val = array[nindex];
            #pragma omp flush(min_val)
         }
         omp_unset_lock(&lock);
     }
}

That is a lot of 'flush'es :) Thanks.

Just to be sure I understand things correctly: flush is supposed to make the current value of min_val visible across all threads. So the implementation will:

(a) make sure that if min_val is held in a per-thread local copy (e.g. a register), it is written back to memory

(b) prevent the compiler from reordering code around the flush

(c) insert a memory barrier

Is that correct?

M. Fidler

Folks, this discussion remains completely off topic for this group, since absolutely none of it is about the implementation of the OpenMP runtime library.

I am going to work offline with Arthur to produce a coherent and correct version of the article. In the meantime I'm going to ask that the current version be removed.

p.s. @Jim message 9. No flushes are required, since they are implicit at omp_set_lock and omp_unset_lock (but that's a detail compared with the rest of this).

