Thread-safety of writing a std::vector

Does std::vector support concurrent read/write access from Cilk workers? And which vector operation should I use? Right now I'm using assignments like "data[i] = x;".

 


If your code performs read/write operations within a parallel loop (e.g., cilk_for), it really depends on what the subscripts (or indices) used to access the vector elements are. In the following example, every iteration accesses a different element of a[], but two consecutive iterations write and read the same element of b[] (a loop-carried data dependence), which means two different threads may write/read the same memory location in an arbitrary order.

std::vector<int> a(N), b(N);
...
cilk_for (int i = 1; i < N; i++) {
    a[i] = 2*i;           // each iteration writes a distinct element: no race
    b[i] = b[i-1] + a[i]; // reads b[i-1], written by another iteration: race
}

So, in summary, you can use the vector operation in your example, "data[i] = x", as long as "i" is the loop control variable, but you need to make sure there is no loop-carried data dependence (e.g., through b[]) in your loop to parallelize it correctly.
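For illustration, here is a minimal sketch of one way to restructure the example above so that the parallel loop is dependence-free (N is a placeholder): fill a[] in parallel, then compute the dependent prefix sum in b[] serially.

#include <cilk/cilk.h>
#include <vector>

void fill_and_scan(int N)
{
    std::vector<int> a(N), b(N);

    // Parallel part: every iteration writes a distinct element of a[].
    cilk_for (int i = 0; i < N; i++)
        a[i] = 2 * i;

    // Serial part: b[i] depends on b[i-1], so keep this loop sequential.
    b[0] = a[0];
    for (int i = 1; i < N; i++)
        b[i] = b[i - 1] + a[i];
}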

Google is your friend. In a nutshell, STL does nothing to ensure thread safety. You have two options:

  1. Use a lock to arbitrate access to your STL classes if there's any chance of modifications in parallel with other writers or readers. This will fix the crashes you're undoubtedly seeing. However it won't do anything to enforce ordering so you'll probably get different results every run. Another problem is that locks can cause contention - you may lose all of your performance gain waiting for the lock. In really bad cases, performance can be worse than serial. Finally, if you use more than one lock (and sometimes calls you make to the OS or 3rd party libraries can use locks on your behalf) you risk deadlocks.
  2. Use a reducer_list or reducer_vector. Like all reducers, these have the advantage of maintaining serial semantics. This means that you'll end up with the results in the same order as if you ran the code serially. See the alphabet example in the Cilk Plus tutorial; a rough sketch of it follows below.
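Here is a minimal sketch of that alphabet example, assuming the reducer_list_append interface from <cilk/reducer_list.h> (treat the exact class and member names as approximations of the Cilk Plus API):

#include <cilk/cilk.h>
#include <cilk/reducer_list.h>
#include <list>

int main()
{
    // Each worker appends to its own private view of the list; the runtime
    // merges the views in serial order, so the result matches a serial run.
    cilk::reducer_list_append<char> letters;

    cilk_for (int i = 0; i < 26; ++i)
        letters.push_back((char)('a' + i));

    const std::list<char> &result = letters.get_value(); // 'a' through 'z', in order
    return 0;
}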

   - Barry

C++11 did go to some effort to specify the thread-safety (or lack thereof) of containers.  If you look at the C++11 standard (or a fairly close draft such as the later N3337), you'll find a section titled "Container data races" that describes which container operations are "const" (i.e., do only reads internally).  For example, operator[] is "const" for std::vector, but not for std::map.  Of course, if you have workers reading/writing through the reference returned by vector<T>::operator[], the usual race rules apply.  But you don't have to worry about std::vector introducing a race inside its internal mechanisms.
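To illustrate the distinction (a minimal sketch; the function and vector names are placeholders):

#include <cilk/cilk.h>
#include <vector>

void square_all(const std::vector<int> &in, std::vector<int> &out)
{
    cilk_for (int i = 0; i < (int)in.size(); ++i) {
        // vector's operator[] only reads the container's internal state, so
        // concurrent calls from different workers do not race inside it.
        int x = in[i];
        // Races can still occur through the returned references; writing
        // out[i] is safe here only because every iteration uses a distinct i.
        out[i] = x * x;
    }
}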

Here is part of my code...

    vector<char> primary(14348907 * 15);
    vector<char> secondary(14348907);
    ...

    cilk_for (int i = 0; i < 14348907; ++i)
    {
        int value = 0;

        // Interpret 15 consecutive entries of primary[] as base-3 digits.
        for (int j = 14; j >= 0; j--)
        {
            value += primary[(14 - j) + (i * 15)] * pow(3, j);
        }

        secondary[value] = 1;
    }
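As an aside, pow() returns a double; the same base-3 value can be accumulated with integer arithmetic via Horner's rule. A minimal sketch of that variant, reusing primary and secondary from above:

    cilk_for (int i = 0; i < 14348907; ++i)
    {
        int value = 0;

        // Horner's rule: primary[i*15]*3^14 + ... + primary[i*15 + 14]*3^0
        for (int j = 0; j < 15; ++j)
            value = value * 3 + primary[i * 15 + j];

        secondary[value] = 1;
    }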

And the test result...

Now I want to know whether I should use a lock, or whether I should use another container like reducer_list or reducer_vector. Which way is better for performance?

It is true that there are data races coming from the writes to "secondary", but you can ignore them in your code since they do not affect its semantics: every write stores the same loop-invariant value (the constant 1), so the contents of "secondary" do not depend on the ordering of the write operations.

While Hansang is correct that your code will give the correct answer, you may still suffer performance problems due to cache effects.

In modern CPUs, memory is broken into "cache lines" of 64 bytes. When you attempt to access a memory location, the CPU reads the 64 bytes containing that location into its cache, modifies it, and (eventually) writes it back out to main memory. If you're accessing multiple locations in the cache line, or if you update the same location repeatedly, this is a major performance boost. There are various levels of cache, referred to as L1, L2, etc. L1 is closest to the CPU, and fastest for it to access.

The problem comes when another CPU wants to access a variable in that cache line. You're OK as long as they're all just reading the location. But if one CPU updates a variable in the cache line, then it invalidates the cache line in all other CPUs. The other CPUs must refetch the cache line before it's accessed again.

This can lead to the cache line "ping-ponging" between CPUs, which will hurt performance. A particularly pernicious version of this is called "false sharing": two variables, each modified by only a single CPU, end up in the same cache line. This is discussed in depth in the article "Avoiding and Identifying False Sharing Among Threads".
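A minimal sketch of the usual mitigation, padding/alignment so that each per-worker variable owns its own cache line (the counter struct is a made-up example; alignas requires C++11):

// Packed together, adjacent counters would share a 64-byte cache line, and
// writes from different CPUs would keep invalidating each other's copies.
struct alignas(64) PaddedCounter
{
    long value;
};

PaddedCounter counters[8]; // one per worker, each on its own cache line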

   - Barry
