Traversing concurrent_hash_map concurrently

People keep asking how to traverse tbb::concurrent_hash_map concurrently with other operations. But it wouldn't be worth a blog post if there weren't a problem. The Reference clearly states:

Concurrent operations (count, find, insert, and erase) invalidate any existing iterators that point into the table. An exception to this rule is that count and find do not invalidate iterators if no insertions or erasures have occurred after the most recent call to method rehash.

So, officially, you cannot traverse the hash map concurrently with deletion, insertion, and, in some cases, even lookup operations. In other words, we don't guarantee that it will work. But what will actually happen if you try? Basically, an iterator doesn't synchronize its operations with any concurrent operation. It keeps moving from pointer to pointer, returning unprotected addresses of items. So, the bad news is that a concurrent erase operation can deallocate the item you are traversing through and you'll get garbage or an access violation. However, the good news is that in TBB >= 2.2 no concurrent operation besides erase() can deallocate an already published address. This hints that concurrent traversal is nevertheless possible if it is synchronized with concurrent erasure. Let's see what can happen when traversal runs concurrently with insert and lookup operations.
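
For example, a minimal sketch of such synchronization is shown below, assuming an extra reader-writer lock dedicated to erasure (erase_guard, erase_key, and traverse are made-up names, not part of TBB). Erasing threads take the lock as readers, so they still run concurrently with each other and with find/insert, while a traversal takes it as a writer so that no node can be deallocated under its iterator.

#include "tbb/concurrent_hash_map.h"
#include "tbb/spin_rw_mutex.h"

typedef tbb::concurrent_hash_map<int,int> table_t;
table_t table;
tbb::spin_rw_mutex erase_guard; // guards erase() against traversal only

// Erasers share the lock: they can run concurrently with each other
// and with find/insert, but not with a traversal holding the lock.
void erase_key( int key ) {
    tbb::spin_rw_mutex::scoped_lock lock( erase_guard, /*write=*/false );
    table.erase( key );
}

// The traversing thread takes the lock exclusively, so no erase() can
// deallocate a node while the iterator is walking the bucket chains.
void traverse() {
    tbb::spin_rw_mutex::scoped_lock lock( erase_guard, /*write=*/true );
    for( table_t::iterator it = table.begin(); it != table.end(); ++it ) {
        // shallow, read-only access to *it
    }
}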

The current concurrent_hash_map implementation uses the classic "separate chaining" data structure for the hash table: each bucket holds a pointer to a linked list of stored items. Concurrent resizing can happen when the bucket count is less than or close to the number of inserted items. Unlike serial algorithms, it works without reallocating the table; instead it relies on a specific concurrent rehashing that moves some items from an old bucket with a lower index into a new bucket with a higher index. The following pseudo-code shows the sequence of operations used to move an item, and it can be executed by any concurrent operation:

*prev_pointer = node->next; // exclude item’s node from old bucket
node->next = new_bucket->node_list;
new_bucket->node_list = node; // add item’s node to new bucket
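
To make the pseudo-code above easier to read, here is a purely illustrative sketch of the shapes it implies; the type and field names are hypothetical and do not match the actual TBB internals:

struct node_t {             // one stored item
    node_t* next;           // singly-linked chain within a bucket
    // ... key and mapped value live here ...
};

struct bucket_t {           // one hash bucket
    node_t* node_list;      // head of the chain, guarded by a per-bucket lock
};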

The whole move is executed under the locks of the corresponding buckets. However, as noted before, iterators don't respect those locks, so the move is not visible to them except perhaps through some memory ordering. Fortunately, modern processors read and write pointer locations atomically (assuming the location is properly aligned) because they are machine-word sized. It means that a thread doing concurrent traversal will not read garbage, but it can read an old, not-yet-updated value. So, despite the fact that the whole data structure is in a valid state at any given time, an iterator can miss some actually stored items or meet the same item twice. For example, consider an iterator pointing to an item that is being moved to a new bucket; the iterator holds pointers to a bucket and to the item's node:

Traversal thread                                     | Concurrent rehashing
-----------------------------------------------------|------------------------------------------
my_bucket = old_bucket;                              | ..
my_node = my_bucket->node_list;                      | ..
                                                     | my_node->next = new_bucket->node_list;
my_node = my_node->next;                             | ..
if( !my_node ) my_node = (++my_bucket)->node_list;   | ..


Here, the other items stored in old_bucket are skipped, and items stored in new_bucket can come up twice. This significantly limits the traversal guarantees we can give for the current concurrent_hash_map implementation, and it is why we don't support concurrent traversal officially. However, you might still want to traverse the map if you are fine with a chance of not visiting all your items in one pass, or you can apply the serial rehash() operation before traversal and so reserve enough buckets to prevent concurrent rehashing. For example, a use case described here just needs to erase outdated items from the table. So, it is not sensitive to duplicates, and if it doesn't meet some items in one pass, it will see them the next time.
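
For instance, a minimal sketch of the rehash() workaround could look like the following; expected_items is an assumption you make about your own workload, not something the library can figure out for you:

typedef tbb::concurrent_hash_map<int,int> table_t;

void prepare_for_traversal( table_t& table, size_t expected_items ) {
    // Serial call made before the traversal and the concurrent inserts start:
    // reserving buckets for roughly twice the expected number of items keeps
    // the load factor low enough that inserts do not trigger concurrent rehashing.
    table.rehash( 2*expected_items );
}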

The code for such a traversal that removes outdated items may look like the following pseudo-code:

for(iterator = table.begin(); iterator != table.end(); iterator++ ) {
    // To prevent local reordering as discussed in the comments to the blog
    _ReadWriteBarrier(); // or acquire fence for weak-ordering processors
    // safe w/o concurrent erase but different words may be read inconsistently
    // only shallow copy permitted, don't follow pointers
    if( is_old( iterator->second.timestamp ) ) { // not protected from concurrent updates
        accessor acc; // however key cannot be changed
        table.find( acc, iterator->first ); // so get the lock
        if( is_old( acc->second.timestamp ) ) // and double-check
            table.erase( acc ); // erase by accessor
    }
}
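
One hedged refinement, echoing the note below and the discussion in the comments: if the timestamp field is a word-sized tbb::atomic<>, its reads come with a compiler fence and acquire semantics, which keeps the unprotected read in the loop above well-behaved on the platforms TBB supports. A purely hypothetical value type for such a table might look like this:

#include "tbb/atomic.h"
#include "tbb/concurrent_hash_map.h"

struct record_t {                   // hypothetical mapped type
    tbb::atomic<long> timestamp;    // word-sized; unprotected reads stay whole
    int payload;                    // shallow-copy only, or read under an accessor
};
typedef tbb::concurrent_hash_map<int, record_t> table_t;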

Please note that the code takes a snapshot without the lock or accessor and so may read some intermediate data, but if timestamp is a tbb::atomic<> of the word size, it will work fine anyway. The big advantage of this "lockless" approach is that it doesn't block concurrent operations. If you want to test how it works on your real hardware, please find the test example attached:

// Example that checks and demonstrates parallel traversal over tbb::concurrent_hash_map.
// Code is by Anton.Malakhov@intel.com
#include <utility>
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include "tbb/concurrent_hash_map.h"
#include "tbb/compat/thread"

typedef tbb::concurrent_hash_map<int,int> table_t;
table_t table;
const int max_keys = 1000000;

void collector() {
    while(!table.empty()) for(table_t::iterator it = table.begin(); it != table.end();) {
        // safe w/o concurrent erase BUT
        // note, it is not protected from concurrent modification by "consumer"
        std::pair<int,int> snapshot = *it++;
        if( snapshot.second < 0 ) { // visited
            table.erase( snapshot.first );
            assert(snapshot.first == -snapshot.second);
            #if VERBOSE
            printf(\"%d\r\", snapshot.first );
            #endif
        } else assert(snapshot.first == snapshot.second);
    }
}

void consumer() {
    while(!table.empty()) for(int i = 1; i <= max_keys; i++) {
        #if VERBOSE
        if(!(i%10000)) printf("\non %d\n", i );
        #endif
        table_t::accessor acc;
        if( table.find( acc, i ) ) {
            if( acc->second < 0 )
                printf(\"Warning: recursion on <%d %d>: bad concurrency?\n\", i, acc->second);
            else
                assert( i == acc->second );
            acc->second = -acc->second;
        }
    }
}

int main() {
    table.insert( std::make_pair(max_keys, max_keys) );
    std::thread consumer_thread( consumer );
    std::thread collector_thread( collector );
    for(size_t c = 0; c < 10000000UL; c++) {
        int key = 1 + int(rand()) % max_keys;
        table.insert( std::make_pair(key, key) );
    }
    puts(\"awaiting cleanup\");
    consumer_thread.join();
    collector_thread.join();
    puts(\"done\");
    return 0;
}

Bottom-line

Although it is not documented, you can traverse tbb::concurrent_hash_map concurrently if you are fine with the following restrictions and possible workarounds for them:

  • You cannot traverse concurrently with an erase() operation. Use a mutex around the erase and traversal code if you still need both running in different threads (see the sketch near the top of this post).
  • Some items can be missed due to concurrent rehashing. Try to reserve enough buckets before traversal using rehash() to prevent this, or repeat the pass.
  • Some items can be visited twice or more. Try to reserve enough buckets before traversal using rehash() to prevent this, or put everything into a local serial map first to remove duplicates (see the sketch after this list).
  • Access through iterators is not protected. Know what you are doing, e.g. use additional synchronization primitives like tbb::atomic to protect one member, or acquire an accessor to protect the rest.
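
For the duplicate issue, here is a hedged sketch of the "local serial map" workaround mentioned above; snapshot_without_duplicates and the int/int key-value types are made up for illustration:

#include <map>
#include "tbb/concurrent_hash_map.h"

typedef tbb::concurrent_hash_map<int,int> table_t;

void snapshot_without_duplicates( table_t& table ) {
    std::map<int,int> local;               // serial map, private to this thread
    for( table_t::iterator it = table.begin(); it != table.end(); ++it )
        local[it->first] = it->second;     // duplicates collapse onto the same key
    // process 'local' here: every key appears at most once, though some
    // concurrently inserted items may still be missing from the snapshot
}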

Comments

Dmitry Vyukov:

> Did you mean x64? I didn't manage to get such an output with VC9 /O2. Anyway, it is not a problem, just make your compiler do what you actually mean.

I meant 32-bit. Whatever, no problem on my side. Really.

Dmitry Vyukov:

> Sure. However, it is virtually not so essential to have it inside iterator implementation, you can put it in the iteration loop right after "iterator++".

Perhaps. But (1) one must educate the end user about this, (2) that slightly contradicts optimal code generation on some platforms (in particular, I want ld.acq on IA-64), (3) anyway, that's not the way you initially suggested - tbb::atomic<> for timestamp.

Dmitry Vyukov:

>> Oops, load of foo_t::count hoist above load of the next pointer. That's how you can get partially constructed object even on x86/SPARC TSO.
> To "mimic" concurrent_hash_map, you forgot to put compiler fence before "bar(&f2);" at least.

That's a model of the *iterating* thread. There are no fences on the *iteration* side in TBB.

Anton Malakhov (Intel):

Dmitry, I see you still misunderstand the intention of this blog. Sorry if it breaks your belief. :) But first, let me answer your comments:

>Oops, load of foo_t::count hoist above load of the next pointer. That's how you can get partially constructed object even on x86/SPARC TSO.
To "mimic" concurrent_hash_map, you forgot to put compiler fence before "bar(&f2);" at least.

>And, yes, I have evidence, just read the previous comment and investigate the assembly.
Evidence of what? It looks like a general discussion of compiler reordering, off-topic.

>>TBB_release_consistency_helper which is also compiler fence on most platforms except PPC
>... except PPC, Alpha, IA-64, SPARC RMO. Whatever, one platform is already enough.
Now you are mixing up compiler and hardware fences yourself. Basically, what I'm saying is that any operation with tbb::atomic has a full compiler fence preventing any reordering across that point, even on PPC and IA64 (and on those platforms they also carry a hardware fence; load_with_acquire on x86 is just a plain load with a compiler fence). And no compiler switch can change it; otherwise the whole of TBB is in danger.

>On the iteration side, but in the wrong place. You need an acquire fence when you load 'next'.
Sure. However, it is virtually not so essential to have it inside iterator implementation, you can put it in the iteration loop right after "iterator++".

>So the program can just silently hangup provided concurrent mutations of the timestamp. Your argument that 64-bit aligned loads are atomic on x86 makes no sense. It's not x86, it's C++.
Did you mean x64? I didn't manage to get such an output with VC9 /O2. Anyway, it is not a problem, just make your compiler do what you actually mean.

>Well, go and prove the correctness for every combination of compiler/hardware out there ;)
And that is exactly why it is not officially supported. Again, the intention of the blog is to show how hard and limited traversal is in the current concurrent_hash_map, but still give a chance to those who really need it right now - for a particular combination of platform, compiler, and usage model. Still, it is not guaranteed to work; since it is not official, it is the user's responsibility to put in all the necessary synchronization and prove the correctness. And it is definitely not a long-term solution, nor a general one. Sorry if it is not what you need or believe it should be.
Thanks to Robert for understanding it right.

>Fence embed into atomic increment in insert_new_node() does not help according to C++0x memory ordering rules (do not know what TBB memory ordering rules), however will most likely do in practice.
Please take me right: it is not to say I disagree with your arguments about general correctness and application. Having it work for a limited number of cases on some of the platforms/compilers currently supported by TBB is perfectly enough for me to justify the blog. So, the only solid guarantee I'd like to have is no partially constructed objects observed by traversal on TSO processors (x86, x64). And you seem to agree it will work in practice.

Anyway, thank you very much for so many comments; it is the first time my blog has had this many ;-) If you think some wording or pseudo-code needs more details, please suggest it. But please, no more discussion here about correctness issues in general. :) Let's see what we can do for supported compilers and processors.

anonymous:

A while back I had an application iterating on a hash, just like the example, but as Dmitriy mentioned I was using shared pointers, and one day I got a bad access. Should have known better though :)

Anyway, now instead of erasing, I have the eraser threads turn on a dead flag on the objects and then collect them in the iteration loop. This works fine. It's definitely a useful, and dangerous, technique, but it pays off knowing it.

Dmitry Vyukov:

Btw, here is another interesting example on the "atomicity of timestamps":

[code=cpp]
void baz(time_t* t)
{
    time_t tt = *t;
    if (tt == 0)
        return;
    time_t now = time(0);
    for (time_t i = tt; i != now; i += 1)
        printf(".");
}
[/code]

The loop is intended to make only a handful of iterations. However, on MSVC2008/x86 I get the following machine code:

[code]
baz:
01021030 sub esp,8
01021033 push ebx
01021034 mov ebx,dword ptr [eax+4]
01021037 push ebp
01021038 push esi
01021039 mov esi,dword ptr [eax]
0102103B mov eax,esi
0102103D or eax,ebx
0102103F push edi
01021040 je baz+4Bh (102107Bh)
01021042 push 0
01021044 call dword ptr [__imp___time64 (102209Ch)]
0102104A mov edi,eax
0102104C add esp,4
0102104F mov dword ptr [esp+14h],edx
01021053 cmp esi,edi
01021055 jne baz+2Bh (102105Bh)
01021057 cmp ebx,edx
01021059 je baz+4Bh (102107Bh)
0102105B mov ebp,dword ptr [__imp__printf (10220A0h)]
01021061 push offset string "." (1022108h)
01021066 call ebp
01021068 add esp,4
0102106B add esi,1
0102106E adc ebx,0
01021071 cmp esi,edi
01021073 jne baz+31h (1021061h)
01021075 cmp ebx,dword ptr [esp+14h]
01021079 jne baz+31h (1021061h)
0102107B pop edi
0102107C pop esi
0102107D pop ebp
0102107E pop ebx
0102107F add esp,8
01021082 ret
[/code]

As you may see, the load of the timestamp is not atomic:
01021034 mov ebx,dword ptr [eax+4]
01021039 mov esi,dword ptr [eax]

So the program can just silently hangup provided concurrent mutations of the timestamp.
Your argument that 64-bit aligned loads are atomic on x86 makes no sense. It's not x86, it's C++.

Dmitry Vyukov:

> Anyway, using tbb::atomic<> for timestamp will add necessary fences also on the iteration side.

On the iteration side, but in the wrong place. You need an acquire fence when you load 'next'.

Dmitry Vyukov:

> Who cares about too abstract C++0x ordering? Only real hardware and implementation matters.

Well, go and prove the correctness for every combination of compiler/hardware out there ;)
An abstract model allows you to bring the complexity of proving down from O(N*M) to O(N + M). In particular, when atomics and fences are implemented by someone else and you only need to prove the correctness of a single algorithm, the complexity becomes O(1), which is nice to say the least.

Dmitry Vyukov:

> And how compiler could reorder the loads to get something being constructed in a different thread?

Anton, read the previous comment with the assembly output.
Another example: the compiler speculates on the contents of the object and uses non-faulting loads to load some data, then verifies the speculation.
Anyway, it's just guesswork. The C/C++ guarantees for racy programs are pretty simple - no guarantees at all. Anything can happen. Period.
Anything can happen for reasons we have no idea about as of now. And when we do have an idea... well, it will already be too late.

Dmitry Vyukov:

> TBB_load_with_acquire uses basically the same fence as store with release, i.e. either compiler fence for ia32 or TBB_release_consistency_helper which is also compiler fence on most platforms except PPC

... except PPC, Alpha, IA-64, SPARC RMO. Whatever, one platform is already enough.
