concurrent_vector remove / erase elements

concurrent_vector remove / erase elements

Hey,

I know from the documentation that I cannot really remove / erase elements from a concurrent_vector in a direct manner. This is rather prohibitive. I mean, I could create a new vector without that element... and swap... but that's going to be overkill.

Is there a trick that could be used, or ideas for supporting this in the future?

AJ

13 posts / 0 new
Last post
For more complete information about compiler optimizations, see our Optimization Notice.

You could use the same technique as std::vector::erase() uses - shifting items ina tail to newly free space. But it is not thread-safe in terms of parallel access to the same items. And it doesn't deallocate the memory.

In order to free a memory, you need copy anyway.

Couldn't the concurrent_vector do something like mark which indices are ready to be "removed" and just do the operation in an operation like compact() or something... instead maybe apply_erase().

The iterator could just "skip over" these items as if they are gone, and in reality a serial or mutex-protected region of code could call such an apply_erase() operation. Leave it to the user to force the actual erase to be applied, I'd be happy with that.

Sure, you could add lots of hooks to concurrent_vector, but those additions would probably make it a whole lot slower. Marking vector entries as invalid requires a flag per element. You could embed it with each entry (has implications for the user type used to define the vector element) or collect them all together somewhere else (forcing more cache lines to be available for simple array access and possibly introduce false sharing issues). Either way, the accessor methods would be complicated by extra checking to avoid entries that have been marked invalid. Currently the accessor does a simple two-level lookup to find the segment and the index within the segment:

template T& concurrent_vector::internal_subscript( size_type index ) const { __TBB_ASSERT( index(my_segment[k].array)[j]; }

Moreover, the thing that makes TBB concurrent vector philosophically different than its STL equivalent is that the allocations are never relocated as would have to be done withthe suggested compact() operation. It's more important to preserve vector locations and any cache residency that may be based on that memory layout than to provide operations that might disrupt the current cache context. Doing a compact to remove invalid elements is directly counter to that philosophy.

There is nothing to prevent you from defining a vector element type that contains a validity flag. You could easily perform some experiments on this to convince yourself of the extra overhead that would be required to add such features.

Okay, so what do I do to remove elements then ? As you say maintaining location for the cache is critical, and I agree with you. In a sense you would have to move "deleted" memory to the beginning or end, and then adjust appropriately. But how can I even do this adjustment? I have some ideas I'll play with tonight...

It's proving to be rather difficult to find a balance between the computer science theoretic stuff and what's going to happen with the machine...

Based on how this will be used a better solution can almost certainly
be found than basically trying to push a square peg into a round hole:
how many reads relative to mutations, how must concurrent mutations be
handled (both concurrent with an iterator's lifetime and specifically at an iterator's current position), how relevant is random access, what are lifetime and size
characteristics, how many deletions relative to additions and size,
will there be any other mutations than deletions, what are the
concurrency performance requirements, what is the ultimate purpose
(that priority queue mentioned elsewhere?), ...

I don't understand your metaphor that erasing elements is pushing a square peg into a round hole. Erasing elements from containers is a common requirement of algorithms... unless a merge operation is used which accumulates results, which I have done before... but I'm not sure this approach is any better than straight erase with a mutex.

Your post indicates that you have a good understanding on many issues with parallel data structures, please share with us some of the possible avenues that you are considering could be explored, rather than erasing elements. Perhaps there are common patterns to parallel data structures that you can share with us?

Ultimately, yes, a priority queue is something that I could use a concurrent_vector forever. However parallelizing a priority queue would be non-trivial. I have read a few papers on the subject, and have not yet found a satisfactory algorithm.

Currently I am using a modified version of concurrent_vector with pop_back() to support a concurrent stack implementation. The push and pop operations still require a mutex for real concurrency.

The square peg/round hole point is about using specifically tbb::concurrent_vector as a data structure when deletions may occur: even for serial code, std::list or even a std::set may be a better fit than std::vector if random access is not required (a priority queue organised as a "heap" would make use of random access, but also would not delete random elements from the std::vector), and concurrent access/cache behaviour make that probably even more true for tbb::concurrent_vector.

I'm far from being an expert on the matter, but in case this is about that priority queue, I replicate here my suggestion to have a tbb::concurrent_queue drain into a std::priority_queue (the encapsulating object would provide direct access to the tbb::concurrent_queue for pushing new values, and synchronised access for popping the top of the tbb::priority_queue after first emptying the tbb::concurrent_queue into the tbb::priority_queue). Do you have requirements that might make this an unsuitable solution? Even then tbb::concurrent_vector seems unlikely to be a suitable foundation.

Adrien, I suggested an approach that may solve your current problem. Since you have control over the type that defines the structure of the vector elements, can you not just add a validity flag? That way you don't have to remove elements, just invalidate them. If you're doing a lot of invalidation, it may prove useful at some point to lock the old vector and essentially do a garbage collection run by copying the still-valid elements to a new vector, then replace old with new and unlock. It will move all the vector elements and invalidate any caches that might save some memory access, but at the point you decide the vector needs to be compacted, all bets are off anyway.

First, I appreciate your patience with me as I'm still learning the internals of TBB, and about the low-level processor details... I appreciate the help I've been receiving on this forum.

I've decided to approach my problem in a different manner, so I've worked around not having erase for now. At this point, I hope that this thread might help others who come up against the same question. It is my hope that talking about the issue could help us come up with a way to provide an erase in an efficient manner. Perhaps come up with an idiom to be able to tell others 'when you need to erase from a concurrent_vector, here's what you do...'

I thought about your example of marking elements invalid, and I've also thought on how this can be done using bit masks on the vector internally so that the user wouldn't have to supply the flag themselves. The problem is, as I think you pointed out, moving elements around internally will hurt cache performance. Perhaps it's not possible to avoid this... after all we are moving around data that is contiguous.

I suspect (not sure) that it is possible to use template programming tricks to ensure that two concurrent_vectors are effectively generated, one that supports erase() with the overhead and one that generates the code we have now... the generation of the erase() supporting vector would be triggered by an actual call to erase() somewhere along the line.

The code supporting erase could have an iterator that skips over invalidated data, and a "compact_erased()" routine that will actually move things around and hence cause the performance hit. The documentation can warn the user about the performance hits involved with erase() and compact_erased().

What are your thoughts on this approach?

Perhaps Arch or Alexeycould come up with appropriate "template programming tricks" to trick out the expanded concurrent_vector you suggest, but I doubt it. Everything from constructors down to accessors would have to be duplicated in order to have versions that check for validity and take the performance hit. And for what? a version of concurrent_vector that doesn't scale but at least is type-safe? I think there are more important things for the development team to worry about, but that's just me.

While I'm here, let me make one little adjustment to your statementon the problem of carrying validity bits in a bit mask. It's not moving elements around internally that will hurt performance. It's the fact that the bit mask is probably going to be in cached memory, which means, if you're really talking bits, that each flag shares a cache line with up to 511 of its neighbors. Any thread writing a bit needs to pull that part of the bit mask into its local cache/store buffers, which means that threads could serialize on the validity test. At least if the validity bits are distributed with the elements, collisions are less likely.

I did not intend for Arch or Alexey to make the adjustments or perform the tricks. This is an open source project, and I would contribute these changes myself for review by the team, once appropriate methods were found to for efficient erasures.

I am not by training an engineer, I'm a computer science and mathematics major. The TBB team has a wealth of experience with processor and compiler issues. I don't have this background or training yet, but I am examining the TBB code on a regular basis, reading a lot of books, and attempting to contribute to the code base as much as I can with the skills I currently have.

Each time I do encounter something I don't know about, I buy a book and read it. Are there books I can read to help me understand the cache issues?

What do you think of these ones: http://shop.intel.com/shop/product.aspx?pid=sibk3603

Engineers? Where? I was a physics and math major when I was in school. And I love having a job that makes me deal with questions like those you raise, because it makes me strive to become more knowledgable, in order to help you (that's the collective "you").

Books from Intel Press? They're all great! Honestly, I must say that I don't own any of the books in the quartet referenced in your last post. I did do some review work on the first edition of Rich Gerber's book, and I looked over the table of contents of Shameem's book and found it to have reasonable coverage, but Ihaven't gotten a copy yet. Stewart Taylor's book is about Intel Integrated Performance Primitives and James Reinders' is about VTuneAnalyzer; neither is likely to explain much about multi-core programming issues asa primary topic, but either of the first two could be useful.

Leave a Comment

Please sign in to add a comment. Not a member? Join today