concurrent_vector parallel push_back operations

concurrent_vector parallel push_back operations

Hi all,

I am seeking some clarification on the documentation for concurrent_vector. I would like to be able to call push_back in parallel, and I know that this is supported already by concurrent_vector. However the documentation says this happens atomically, and I'd like to ask for clarification on what is meant by that. I'm familiar with the TBB atomic operations, but I just want to ensure I understand what it means in this context.

Basically I have a large set of data that I want added to concurrent_vector, well, concurrently. I don't care about the order, just about jamming it all in there as quickly as possible. If push_back operations have any sort of mutex, my code will choke.

Another slightly related question, if I wanted to append one concurrent vector to another, it could harness parallelism by calling push_back() in parallel on everything in the vector to be appended. I see no such operation in concurrent_vector, and I will likely write my own external function to help. However, it might be a good addition.

Thanks,

AJ

7 posts / 0 new
Last post
For more complete information about compiler optimizations, see our Optimization Notice.

Answering directly, push_back() uses fetch_and_increment() atomic operation to getindex of the new item and increment logical size of the container. However, if physical size is not enough and new segment must be allocated, the first thread which reached this boundaryallocates memory and any other thread which is asking for following items at the same timegoes in spin loop until allocation is really done. It actually happens rarely and can be prevented by callingreserve() method (look for "Fragmentation" in the documentation).

The real problem of intensive pushing data from different threads could be false-sharing. The good idea is to cache new items in a thread and to push them later.

You could find useful our Convex Hull example (examplesparallel_reduceconvex_hullconvex_hull_bench.cpp)to see how it could be done and to try different modes controlled by preprocessor. For instance, here is append function you asked for:

void appendVector(const point_t* src, size_t srcSize, pointVec_t& dest) {

std::copy(src, src + srcSize, dest.begin() + dest.grow_by(srcSize));

}

Thanks, that does make sense... I didn't think of the fact that the index could be atomically incremented...

You've mentioned that false-sharing could be an issue with pushing data from different threads. I'm not sure what the best strategy for building the containers is then.

I'm in a situation where I don't need to reference the data again in a particular parallel execution... I just need to collect it, and later parallel iterations will use it.

So I have been using a parallel_reduce operation where I merge vector elements together piecemeal, and then merge the whole thing into a global vector. Alternatively with how you've described concurrent_vector to me, I could just pop the data into the global concurrent_vector rather than doing this merge operation.

If you are facing a false-sharing issue, themost obvious way is to reduce the frequency of contention. That can happen by one simple expedient: ensure theblocks being appended tothe vector are bigger than a cache line (64 bytes on current processors or 128 bytes on some older architectures with sectored L2 cache and adjacent-sector-prefetch turned on). The false sharing that Anton mentionscould occur whenever the current end of the concurrentvector falls short of the next cache line boundaryand another thread appends more data in the latter half of that cache line. At worst itwould just be a flip: cache line pulled to one PE, then flushed and pulled to a second PE to do the second append. As you say, you won't touch it again for a while, so the cost of any false sharingis not likely to be large.

If you are dealing with a lot of small(with respect to cache line size) vector components, one method to ensure larger blocks on the ultimate concatentation is to have a penultimate concatenation: collect the accumulating data in a thread private data structure (each thread has its own) so there's no contention involved for the small block accumulation. Then periodically and/or at the end, each of the collection threads would do one append to the concurrent vector, virtually eliminating the possibility of false sharing and also reducing contention on atomic access of the vector management pointers.

But with TBB and private data structures, therein lies the rub. TBB does not currently support Thread Local Storage. Arch has written a blog talking about the issues associated with thread local storage and his article might provide you with some ideas to deal with this, if you actually have a problem. But as I've suggested above, your code may or may not have a problem, and if it does it may be easy to sidestep just by changing the minimum size of the blocks to be appended.

The documentation may be misleading: push_back() does not seem to be
atomical, only the element allocation part of it. But why does size()
not say that it "may include elements that are under construction by
concurrent calls to the [push_back or] grow_by or grow_to_at_least
method" (my [addition])?

Thank you for the catch!
I'll fix this part of the reference.

I've posted the blog which sheds the light on memory organization of concurrent_vector. It relates to fragmentation and push_back() topics.

Leave a Comment

Please sign in to add a comment. Not a member? Join today