Traversing concurrent_hash_map concurrently

People keep asking how to traverse tbb::concurrent_hash_map concurrently with other operations. But it wouldn't be worth a blog post if there were no problem. The Reference clearly states:

Concurrent operations (count, find, insert, and erase) invalidate any existing iterators that point into the table. An exception to this rule is that count and find do not invalidate iterators if no insertions or erasures have occurred after the most recent call to method rehash.

So, officially, you cannot traverse the hash map concurrently with deletion, insertion, and, in some cases, even lookup operations. In other words, we don't guarantee that it will work. But what will actually happen if you try? Basically, an iterator doesn't synchronize with any concurrent operation. It keeps moving from pointer to pointer, returning unprotected addresses of items. So, the bad news is that a concurrent erase operation can deallocate the item you are traversing through, and you'll get garbage or an access violation. However, the good news is that in TBB 2.2 and later, no concurrent operation besides erase() can deallocate an already published address. This suggests that concurrent traversal is nevertheless possible if it is synchronized with concurrent erasure. Let's see what can happen when traversal runs while insert and lookup operations are also occurring.

The current concurrent_hash_map implementation uses the classic "separate chaining" data structure for the hash table: each bucket holds a pointer to a linked list of stored items. Concurrent resizing can happen whenever the bucket count is less than or close to the number of inserted items. Unlike serial algorithms, it works without reallocating the whole table; instead it relies on a specific concurrent rehashing scheme that moves some items from an old bucket with a lower index into a new bucket with a higher index. The following pseudo-code shows the sequence of operations used to move an item; it can be executed by any concurrent operation:

*prev_pointer = node->next;          // exclude item's node from the old bucket
node->next = new_bucket->node_list;  // link the node ahead of the new bucket's chain
new_bucket->node_list = node;        // publish item's node in the new bucket
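
For reference, here is a simplified sketch of the separate-chaining layout that this pseudo-code manipulates. It is only an illustration of the names used above, not TBB's actual internal types:

struct node_t {
    node_t *next;       // link to the next item in the same bucket's chain
    // the stored key/value pair lives here
};
struct bucket_t {
    node_t *node_list;  // head of this bucket's chain of items
    // the per-bucket lock lives here
};

In these terms, prev_pointer is the address of either a bucket's node_list field or the next field of the preceding node in the chain.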

The whole movement operation is executed under the locks of the corresponding buckets. However, as noted before, iterators don't respect the locks, and the locks are not visible to iterators except perhaps through some memory ordering. Fortunately, modern processors read or write pointer locations atomically (assuming the location is properly aligned) because they are machine-word size. It means that a thread doing concurrent traversal will not read garbage, but it can read an old, not-yet-updated value. So, despite the fact that the whole data structure is in a valid state at any given time, an iterator can miss some actually stored items or meet the same item twice. For example, consider an iterator pointing to an item that is being moved to a new bucket; the iterator holds pointers to a bucket and to the item's node:

// Traversal thread:
my_bucket = old_bucket;
my_node = my_bucket->node_list;

// Concurrent rehashing thread then moves the very node the iterator points to:
my_node->next = new_bucket->node_list;

// Traversal thread resumes and follows the now-updated link into new_bucket:
my_node = my_node->next;
if( !my_node ) my_node = (++my_bucket)->node_list;


Here, the remaining items stored in old_bucket are skipped, and items already stored in new_bucket can come up twice. This significantly limits the traversal guarantees we can give for the current concurrent_hash_map implementation, and it is why we don't support concurrent traversal officially. However, you might still want to traverse the map if you are fine with the chance of not visiting all your items in one pass, or if you apply the serial rehash() operation before traversal and so reserve enough buckets to prevent concurrent rehashing. For example, the use case described here just needs to erase outdated items from the table. It is not sensitive to duplicates, and if one pass misses some items, the next pass will see them.
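
As a minimal sketch of the bucket-reservation workaround, one could call the serial rehash() before starting any concurrent work. The function name reserve_buckets and the bound max_items are my own illustration, not part of TBB:

#include "tbb/concurrent_hash_map.h"

typedef tbb::concurrent_hash_map<int,int> table_t;

// Must be called serially, before concurrent work starts. Reserving
// noticeably more buckets than the expected number of items keeps the
// load factor low, so concurrent rehashing is not triggered later.
void reserve_buckets( table_t &table, size_t max_items ) {
    table.rehash( 2*max_items );
}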

Code for such a traversal that removes outdated items may look like the following pseudo-code:

for( table_t::iterator iterator = table.begin(); iterator != table.end(); ++iterator ) {
    // Prevent local reordering, as discussed in the comments to this blog
    _ReadWriteBarrier(); // or an acquire fence on weakly-ordered processors
    // Safe without concurrent erase, but different words may be read inconsistently;
    // only a shallow copy is permitted -- don't follow pointers
    if( is_old( iterator->second.timestamp ) ) { // not protected from concurrent updates
        table_t::accessor acc;              // however, the key cannot be changed,
        table.find( acc, iterator->first ); // so get the lock
        if( is_old( acc->second.timestamp ) ) // and double-check under the lock
            table.erase( acc );             // erase by accessor
    }
}

Please note that the code takes the snapshot without a lock or accessor, and so it may read some intermediate data; but if timestamp is a tbb::atomic<> of word size, it will work fine anyway. The big advantage of this "lockless" approach is that it doesn't block concurrent operations.
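
For illustration, a hypothetical value type for the loop above might look like the following sketch; the names record and payload are assumptions, not part of the original code:

#include "tbb/atomic.h"
#include "tbb/concurrent_hash_map.h"

struct record {
    tbb::atomic<unsigned> timestamp; // word-sized: an unprotected read returns
                                     // either the old or the new value, never
                                     // a torn mix of both
    int payload;                     // only shallow-copied during traversal
};
typedef tbb::concurrent_hash_map<int,record> timed_table_t;

If you want to test how it works on your real hardware, please find the test example attached: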

// Example that checks and demonstrates parallel traversal over tbb::concurrent_hash_map.
// Code is by Anton.Malakhov@intel.com
#include <utility>
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include "tbb/concurrent_hash_map.h"
#include "tbb/compat/thread"

typedef tbb::concurrent_hash_map<int,int> table_t;
table_t table;
const int max_keys = 1000000;

void collector() {
    while( !table.empty() ) for( table_t::iterator it = table.begin(); it != table.end(); ) {
        // safe w/o concurrent erase BUT
        // note: not protected from concurrent modification by the consumer
        std::pair<int,int> snapshot = *it++;
        if( snapshot.second < 0 ) { // already visited by the consumer
            table.erase( snapshot.first );
            assert( snapshot.first == -snapshot.second );
            #if VERBOSE
            printf("%d\r", snapshot.first );
            #endif
        } else assert( snapshot.first == snapshot.second );
    }
}

void consumer() {
    while( !table.empty() ) for( int i = 1; i <= max_keys; i++ ) {
        #if VERBOSE
        if( !(i%10000) ) printf("\non %d\n", i );
        #endif
        table_t::accessor acc;
        if( table.find( acc, i ) ) {
            if( acc->second < 0 )
                printf("Warning: recursion on <%d %d>: bad concurrency?\n", i, acc->second);
            else
                assert( i == acc->second );
            acc->second = -acc->second;
        }
    }
}

int main() {
    table.insert( std::make_pair(max_keys, max_keys) );
    std::thread consumer_thread( consumer );
    std::thread collector_thread( collector );
    for( size_t c = 0; c < 10000000UL; c++ ) {
        int key = 1 + rand() % max_keys;
        table.insert( std::make_pair(key, key) );
    }
    puts("awaiting cleanup");
    consumer_thread.join();
    collector_thread.join();
    puts("done");
    return 0;
}
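
Assuming you save the code as, say, example.cpp and have TBB installed, it typically builds with something like the line below; the exact flags and the availability of the tbb/compat/thread header depend on your TBB version and environment:

g++ -O2 example.cpp -ltbb -o example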

Bottom line

Although it is not documented, you can traverse tbb::concurrent_hash_map concurrently if you accept the following restrictions and apply workarounds where needed:

  • You cannot traverse concurrently with an erase() operation. Use a mutex around the erase and traversal code if you still need both running in different threads, as shown in the sketch after this list.
  • Some items can be missed due to concurrent rehashing. Reserve enough buckets before traversal using rehash() to prevent this, or repeat the pass.
  • Some items can be visited two or more times. Reserve enough buckets before traversal using rehash() to prevent this, or collect results into a local serial map first to remove duplicates.
  • Access through iterators is not protected. Know what you are doing: e.g. use additional synchronization primitives like tbb::atomic to protect a single member, or acquire an accessor to protect the rest.
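
A minimal sketch of the first workaround, assuming a process-wide tbb::spin_rw_mutex named erase_mutex (my invention, not part of concurrent_hash_map). Erasing threads take it as readers, so erasures still run concurrently with each other, while the traversing thread takes it as a writer to exclude all erasures for the duration of a pass:

#include <utility>
#include "tbb/concurrent_hash_map.h"
#include "tbb/spin_rw_mutex.h"

typedef tbb::concurrent_hash_map<int,int> table_t;
table_t table;
tbb::spin_rw_mutex erase_mutex; // guards erase() against traversal only

void erase_item( int key ) {
    // "Reader" lock: erasures may proceed concurrently with each other,
    // but never while a traversal holds the writer lock.
    tbb::spin_rw_mutex::scoped_lock lock( erase_mutex, /*write=*/false );
    table.erase( key );
}

void traverse() {
    // "Writer" lock: excludes all erasures; inserts and lookups may still
    // run concurrently, subject to the caveats listed above.
    tbb::spin_rw_mutex::scoped_lock lock( erase_mutex, /*write=*/true );
    for( table_t::iterator it = table.begin(); it != table.end(); ++it ) {
        std::pair<int,int> snapshot = *it; // shallow copy only; don't follow pointers
        (void)snapshot; // ...process the snapshot here...
    }
}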

Comments

Anton Malakhov (Intel):

No, especially if you call rehash() in between.
Another hint: for serial lookup, please use equal_range(). It is not a thread-safe method and thus carries less synchronization burden.

rhl_:

Hi, if your use of the object is as follows:

lots of concurrent insertions into the hash map.

then
lots of concurrent lookup operations (such as find()), but absolutely no inserts or deletes ever...

will you run into any trouble with concurrent traversal?

Anton Malakhov (Intel):

I've updated the blog with the compiler barrier and a comment about the acquire fence.

Anton Malakhov (Intel):

>*next* pointers must be stored with store-release; not store-release in the lock - the iterating thread does not access the lock, so it does not matter what happens in the lock.
Sorry, I should have said that the atomic increment before the pointer publication serves as the store-release fence, not the lock release afterward - but we discussed this before already.

>Then, I do not see any mention of the user's responsibility to provide any synchronization on each iteration.
Any synchronization on user data is enough to prevent compiler reordering between the loops. And I'm still not convinced that the synthetic example you provided applies to traversal in practice, because there is a data dependency that the compiler should respect.

>Then, I do not see any mention that everything you said relates to x86 only.
Well, I'm basically targeting x86, but my test works fine on IA64. So, I'm not sure that [all the] weakly-ordered architectures will not work. But my test is not deep enough. At least, I agree that there should be some acquire fence in the iteration to prevent local reordering in the traversal loop.

And sure, I'd love to provide well-grounded and proven guarantees. But that actually means official support.

Dmitry Vyukov:

To make my point clear:
It's perfectly OK with me to provide some limited specialized operations (which buy some performance). It must not be done in a "seems to work on my machine" way, though.
First, one needs to clearly define what is supposed to work and what is not. Then, provide some well-grounded foundations for the former to actually work. I.e. all required fences must still be in place, atomicity preserved, etc. Perhaps one may also provide some helper primitives for the user - a SeqLock, atomics with relaxed ordering (which are basically costless but ensure correctness), etc.

Dmitry Vyukov:

>> During mutations store 'next' pointers with store-release.
> It is already done in concurrent_hash_map (store-release in the lock)

*next* pointers must be stored with store-release; not store-release in the lock - the iterating thread does not access the lock, so it does not matter what happens in the lock.

>>During iteration load 'next' pointers with load-acquire.
>And since, on x86, load-acquire is just a plain load with a (full) compiler fence, preventing compiler reordering right in the traversal loop is enough. Which aligns well with the user's responsibility for synchronization.

Yes, a plain load is required, but a load at the machine level, not at the C++ level. A C++-level load can result in 0..inf machine-level accesses.
Then, I do not see any mention of the user's responsibility to provide any synchronization on each iteration.
Then, I do not see any mention that everything you said relates to x86 only.

>So, do you still disagree that it will work? ;-)

Definitely.

Anton Malakhov (Intel):

>I don't.
Don't you? (Looking at the following description)

> During mutations store 'next' pointers with store-release.
It is already done in concurrent_hash_map (store-release in the lock)

>During iteration load 'next' pointers with load-acquire.
And since, on x86, load-acquire is just a plain load with a (full) compiler fence, preventing compiler reordering right in the traversal loop is enough. Which aligns well with the user's responsibility for synchronization.
So, do you still disagree that it will work? ;-)

>Following your line: "Fortunately, modern processors read or write pointer locations...
Please pay attention to "pointer locations". I meant (actually it was reworded during the review) only machine words.

Dmitry Vyukov:

> 64-bit is not a word size on 32-bit

Following your line: "Fortunately, modern processors read or write pointer locations atomically (assuming the location is properly aligned) because they are machine-word size."

x86-32 reads and writes 8-, 16-, 32-, 64-, 128-bit aligned locations atomically. But that's irrelevant if one doesn't write in assembly.

Dmitry Vyukov:

> And you seem to agree it will work in practice.

I don't.
IMHO what can do the trick for limited iteration is the following.
During mutations store 'next' pointers with store-release.
During iteration load 'next' pointers with load-acquire.
Protect the read side with a SeqLock.

That will provide quite strong and well-grounded guarantees for limited iteration without incurring significant overheads. Still no shared_ptrs.
