People keep asking how to traverse tbb::concurrent_hash_map concurrently with other operations. But it wouldn’t be worth the blog if there was not any problem. The Reference clearly states that:
So, officially, you cannot traverse the hash map concurrently with deletion, insertion, and even to lookup operations in some cases. Or in other words, we don’t guarantee that it will work. But what will actually happen if you try? Basically, an iterator doesn’t synchronize its operations with any concurrent operation. It keeps moving from pointer to pointer returning unprotected addresses of items. So, the bad news is that a concurrent erase operation can deallocate the item you are traversing through and you’ll get garbage or an access violation exception. However, the good news is that beside erase() no other concurrent operation can deallocate already published address in TBB >=2.2. It hints that concurrent traversal is nevertheless possible if synchronized with concurrent erasure. Let’s see what can happen while traversing while insert and lookup operations are also occurring.
Current concurrent_hash_map implementation uses classic “separate chaining” data structure for the hash table. It means that each bucket has a pointer to a linked list of stored items. And concurrent resizing can happen if the bucket count is less or close to the number of inserted items. But it works without reallocation of the table as in serial algorithms and leads to specific concurrent rehashing which moves some items from an old bucket with lower index into a new bucket with higher index. The following pseudo code shows the sequence of operations used to move an item and can be executed by any concurrent operation:
[cpp:nocontrols]*prev_pointer = node->next; // exclude item’s node from old bucket
node->next = new_bucket->node_list;
new_bucket->node_list = node; // add item’s node to new bucket[/cpp:nocontrols]
The whole movement operation is executed under locks of corresponding buckets. However, as noted before, iterators don’t respect the locks and they are not visible for iterators except perhaps by some memory ordering. Fortunately, modern processors read or write pointer locations atomically (assuming the location is properly aligned) because of they are machine-word size. It means that a thread doing concurrent traversal will not read garbage but on the other hand can read an old, not-yet-updated value. So despite the fact that the whole data-structure will be in a valid state at any given time, an iterator can miss some actually stored items or meet the same item twice. For example, please consider an iterator pointing to an item moved to a new bucket, it has pointers to a bucket and a node of an item:
Here, other items stored in old_bucket are skipped and items stored in new_bucket can come up twice. It significantly limits traversal guarantees we can give for current concurrent_hash_map implementation and it is why we don’t support concurrent traversal officially. However, you might still want to traverse the map if you are fine with a chance of not visiting all your items in one pass or you can apply serial rehash() operation before traversal and so reserve enough buckets to prevent concurrent rehashing. For example, a use-case described here needs just to erase outdated items from the table. So, it is not sensitive to duplications and if in one pass it doesn’t meet some items, it will see them next time.
The code for such a traversal for removing outdated items may be like the following pseudo-code:
[cpp:nocontrols] for(iterator = table.begin(); iterator != table.end(); iterator++ ) {
// To prevent local reordering as discussed in the comments to the blog
_ReadWriteBarrier(); // or acquire fence for weak-ordering processors
// safe w/o concurrent erase but different words may be read inconsistently
// only shallow copy permitted, don't follow pointers
if( is_old( iterator->second.timestamp ) ) { // not protected from concurrent updates
accessor acc; // however key cannot be changed
table.find( acc, iterator->first ); // so get the lock
if( is_old( acc->second.timestamp ) ) // and double-check
table.erase( acc ); // erase by accessor
}
}[/cpp:nocontrols]
Please note that the code takes snapshot without the lock or accessor and so may read some intermediate data but if timestamp is tbb::atomic<> of the word size, it will work fine anyway. The big advantage of this “lockless” approach is that it doesn’t block concurrent operations. If you want to test how does it work in your real hardware, please find the test example attached:
[cpp:collapse]// Example that checks and demonstrates parallel traversal over tbb::concurrent_hash_map.
// Code is by Anton.Malakhov@intel.com
#include <utility>
#include <stdlib.h>
#include <assert.h>
#include \"tbb/concurrent_hash_map.h\"
#include \"tbb/compat/thread\"
typedef tbb::concurrent_hash_map<int,int> table_t;
table_t table;
const int max_keys = 1000000;
void collector() {
while(!table.empty()) for(table_t::iterator it = table.begin(); it != table.end();) {
// safe w/o concurrent erase BUT
// note, it is not protected from concurrent modification by \"consumer\"
std::pair<int,int> snapshot = *it++;
if( snapshot.second < 0 ) { // visited
table.erase( snapshot.first );
assert(snapshot.first == -snapshot.second);
#if VERBOSE
printf(\"%d\r\", snapshot.first );
#endif
} else assert(snapshot.first == snapshot.second);
}
}
void consumer() {
while(!table.empty()) for(int i = 1; i <= max_keys; i++) {
#if VERBOSE
if(!(i%10000)) printf(\"\non %d\n\", i );
#endif
table_t::accessor acc;
if( table.find( acc, i ) ) {
if( acc->second < 0 )
printf(\"Warning: recursion on <%d %d>: bad concurrency?\n\", i, acc->second);
else assert( i == acc->second );
acc->second = -acc->second;
}
}
}
int main() {
table.insert( std::make_pair(max_keys, max_keys) );
std::thread consumer_thread( consumer );
std::thread collector_thread( collector );
for(size_t c = 0; c < 10000000UL; c++) {
int key = 1 + int(rand()) % max_keys;
table.insert( std::make_pair(key, key) );
}
puts(\"awaiting cleanup\");
consumer_thread.join();
collector_thread.join();
puts(\"done\");
return 0;
}[/cpp:collapse]
Although it is not documented, you can traverse tbb::concurrent_hash_map concurrently if you are fine with the following restrictions and possible workarounds for them:
Next time I’ll write about new concurrent hash table with much stronger concurrent traversal guarantees.
Concurrent operations (count, find, insert, and erase) invalidate any existing iterators that point into the table, An exception to this rule is that count and find do not invalidate iterators if no insertions or erasures have occurred after the most recent call to method rehash.
So, officially, you cannot traverse the hash map concurrently with deletion, insertion, and even to lookup operations in some cases. Or in other words, we don’t guarantee that it will work. But what will actually happen if you try? Basically, an iterator doesn’t synchronize its operations with any concurrent operation. It keeps moving from pointer to pointer returning unprotected addresses of items. So, the bad news is that a concurrent erase operation can deallocate the item you are traversing through and you’ll get garbage or an access violation exception. However, the good news is that beside erase() no other concurrent operation can deallocate already published address in TBB >=2.2. It hints that concurrent traversal is nevertheless possible if synchronized with concurrent erasure. Let’s see what can happen while traversing while insert and lookup operations are also occurring.
Current concurrent_hash_map implementation uses classic “separate chaining” data structure for the hash table. It means that each bucket has a pointer to a linked list of stored items. And concurrent resizing can happen if the bucket count is less or close to the number of inserted items. But it works without reallocation of the table as in serial algorithms and leads to specific concurrent rehashing which moves some items from an old bucket with lower index into a new bucket with higher index. The following pseudo code shows the sequence of operations used to move an item and can be executed by any concurrent operation:
[cpp:nocontrols]*prev_pointer = node->next; // exclude item’s node from old bucket
node->next = new_bucket->node_list;
new_bucket->node_list = node; // add item’s node to new bucket[/cpp:nocontrols]
The whole movement operation is executed under locks of corresponding buckets. However, as noted before, iterators don’t respect the locks and they are not visible for iterators except perhaps by some memory ordering. Fortunately, modern processors read or write pointer locations atomically (assuming the location is properly aligned) because of they are machine-word size. It means that a thread doing concurrent traversal will not read garbage but on the other hand can read an old, not-yet-updated value. So despite the fact that the whole data-structure will be in a valid state at any given time, an iterator can miss some actually stored items or meet the same item twice. For example, please consider an iterator pointing to an item moved to a new bucket, it has pointers to a bucket and a node of an item:
| Traversal thread | Concurrent rehashing |
[cpp:nocontrols]my_bucket = old_bucket; my_node = my_bucket->node_list; my_node = my_node->next; if( !my_node ) my_node = (++my_bucket)->node_list;[/cpp:nocontrols] | [cpp:nogutter:nocontrols].. .. my_node->next = new_bucket->node_list; .. ..[/cpp:nogutter:nocontrols] |
Here, other items stored in old_bucket are skipped and items stored in new_bucket can come up twice. It significantly limits traversal guarantees we can give for current concurrent_hash_map implementation and it is why we don’t support concurrent traversal officially. However, you might still want to traverse the map if you are fine with a chance of not visiting all your items in one pass or you can apply serial rehash() operation before traversal and so reserve enough buckets to prevent concurrent rehashing. For example, a use-case described here needs just to erase outdated items from the table. So, it is not sensitive to duplications and if in one pass it doesn’t meet some items, it will see them next time.
The code for such a traversal for removing outdated items may be like the following pseudo-code:
[cpp:nocontrols] for(iterator = table.begin(); iterator != table.end(); iterator++ ) {
// To prevent local reordering as discussed in the comments to the blog
_ReadWriteBarrier(); // or acquire fence for weak-ordering processors
// safe w/o concurrent erase but different words may be read inconsistently
// only shallow copy permitted, don't follow pointers
if( is_old( iterator->second.timestamp ) ) { // not protected from concurrent updates
accessor acc; // however key cannot be changed
table.find( acc, iterator->first ); // so get the lock
if( is_old( acc->second.timestamp ) ) // and double-check
table.erase( acc ); // erase by accessor
}
}[/cpp:nocontrols]
Please note that the code takes snapshot without the lock or accessor and so may read some intermediate data but if timestamp is tbb::atomic<> of the word size, it will work fine anyway. The big advantage of this “lockless” approach is that it doesn’t block concurrent operations. If you want to test how does it work in your real hardware, please find the test example attached:
[cpp:collapse]// Example that checks and demonstrates parallel traversal over tbb::concurrent_hash_map.
// Code is by Anton.Malakhov@intel.com
#include <utility>
#include <stdlib.h>
#include <assert.h>
#include \"tbb/concurrent_hash_map.h\"
#include \"tbb/compat/thread\"
typedef tbb::concurrent_hash_map<int,int> table_t;
table_t table;
const int max_keys = 1000000;
void collector() {
while(!table.empty()) for(table_t::iterator it = table.begin(); it != table.end();) {
// safe w/o concurrent erase BUT
// note, it is not protected from concurrent modification by \"consumer\"
std::pair<int,int> snapshot = *it++;
if( snapshot.second < 0 ) { // visited
table.erase( snapshot.first );
assert(snapshot.first == -snapshot.second);
#if VERBOSE
printf(\"%d\r\", snapshot.first );
#endif
} else assert(snapshot.first == snapshot.second);
}
}
void consumer() {
while(!table.empty()) for(int i = 1; i <= max_keys; i++) {
#if VERBOSE
if(!(i%10000)) printf(\"\non %d\n\", i );
#endif
table_t::accessor acc;
if( table.find( acc, i ) ) {
if( acc->second < 0 )
printf(\"Warning: recursion on <%d %d>: bad concurrency?\n\", i, acc->second);
else assert( i == acc->second );
acc->second = -acc->second;
}
}
}
int main() {
table.insert( std::make_pair(max_keys, max_keys) );
std::thread consumer_thread( consumer );
std::thread collector_thread( collector );
for(size_t c = 0; c < 10000000UL; c++) {
int key = 1 + int(rand()) % max_keys;
table.insert( std::make_pair(key, key) );
}
puts(\"awaiting cleanup\");
consumer_thread.join();
collector_thread.join();
puts(\"done\");
return 0;
}[/cpp:collapse]
Bottom-line
Although it is not documented, you can traverse tbb::concurrent_hash_map concurrently if you are fine with the following restrictions and possible workarounds for them:
- You cannot traverse concurrently with an erase() operation. Use a mutex around erase and traversal code if you still need both running in different threads.
- Some items can be missed due to concurrent rehashing. Try to reserve enough buckets before traversal using rehash() to prevent this or repeat the pass.
- Some items can be visited twice or more times. Try to reserve enough buckets before traversal using rehash() to prevent this or put everything to a local serial map first to remove duplicates.
- Access through iterators is not protected. Know what are you doing, e.g. use additional synchronization primitives like tbb::atomic to protect one member or acquire accessor to protect the rest.
Next time I’ll write about new concurrent hash table with much stronger concurrent traversal guarantees.

Comments
Sorry, it does not work that way.
> Fortunately, modern processors read or write pointer locations atomically (assuming the location is properly aligned) because of they are machine-word size
First of all you mix up hardware provided guarantees (that are accessible only via assembly) and abstract C/C++ level operations. Nothing atomic in a racy C/C++ program.
Then, atomicity is not the only concern, mutual ordering is not less important (http://software.intel.com/ru-ru/blogs/2008/09/25/215/). With current implementation it's perfectly possible that iterating thread will observe partially constructed objects (even on x86/SPARC TSO due to compiler reordering).
> Please note that the code takes snapshot without the lock or accessor and so may read some intermediate data but if timestamp is of the word size, it will work fine anyway.
Not quite.
Moreover, if the object contains something along the line of boost/std::shared_ptr<>, one will get paging fault (good case) or memory corruption (bad case) eventually.
Moreover, if one needs to do a check like "if (is_in_range(x, snapshot.second.x1, snapshot.second.x2))", or even worse "for (size_t x = snapshot.second.x1; x != snapshot.second.x2; x += 1)", any kind of bad things can happen (including complete hangup).
In general it's not a kind of problems you want to burden an end user with.
> Moreover, if the object contains something along the line of boost/std::shared_ptr<>
Thanks, it looks as too generic type while I actually meant something simple like std::pair<int,int> for type of snapshot (see the test). So, it should be actually shallow copy. I'll update the blog.
> Moreover, if one needs to do a check like...
Didn't I warn against that kind of issues?
> With current implementation it's perfectly possible that iterating thread will observe partially constructed objects (even on x86/SPARC TSO due to compiler reordering).
..and that? However, I don't agree with "even on TSO" part. There are even two fences between construction of an item and its publication (and one after). Please see check_mask_race (compiler fence for "load with acquire") and insert_new_node (atomic increment). Even weak ordering hardware might work thanks to the last fence (but I'm not sure).
> In general it's not a kind of problems you want to burden an end user with.
So, it is why it is not documented and not official ;-) And it's why blogs are so great place to say something unofficial and dangerous but still cool :)
Iteration works but with described limitations and with completely unprotected access. And it is your decision whether you still want it - it was my message.
> Thanks, it looks as too generic type while I actually meant something simple like std::pair<int,int> for type of snapshot (see the test). So, it should be actually shallow copy. I'll update the blog.
User does not choose the type of the snapshot, it's equal to the hash map element type. I doubt that hash map is used to map int to int in a lot of cases, IMVHO more realistic use case would be map<request_id_t, tuple<timestamp_t, shared_ptr<request_t>>>.
I see your concern. I'll update the pseudo-code to avoid further misconception on snapshot. It is better to read solely key and timestamp.
>> Moreover, if one needs to do a check like...
> Didn't I warn against that kind of issues?
I am not sure. The things are a bit more involved. If copy races with update, update operation must be a carefully crafted synchronization operation too (in particular, implicitly generated functions should not be used). If the item contains any satellite data (pointers), mutex must be used for both update and copy.
> However, I don't agree with "even on TSO" part. There are even two fences between construction of an item and its publication (and one after). Please see check_mask_race (compiler fence for "load with acquire") and insert_new_node (atomic increment). Even weak ordering hardware might work thanks to the last fence (but I'm not sure).
Fence in check_mask_race() does not help in any way, it's an acquire fence.
Fence embed into atomic increment in insert_new_node() does not help according to C++0x memory ordering rules (do not know what TBB memory ordering rules), however will most likely do in practice.
The problem is on iteration side. hash_map_node_base::next is a plain pointer, and iterating thread does a plain load of the next to advance to a next node. That breaks any guarantees. Even x86/SPARC TSO may observe partially constructed object. Memory ordering is always a game of two.
> Iteration works but with described limitations and with completely unprotected access. And it is your decision whether you still want it - it was my message.
It just does happen to work in some particular context. The problem is that you can't define at least limited cases when it guaranteedly works. It may look like it works, but break after compiler switch change.
Consider the following program, which mimics iteration over hash map:
[code=cpp]
struct foo_t
{
foo_t(foo_t* next, int data)
: next(next)
, data(data)
{
count += 1;
}
foo_t* next;
int data;
static int count;
};
int foo_t::count;
__declspec(noinline)
void bar(foo_t* foo)
{
foo = foo->next;
if (foo->count == 0)
exit(-1);
printf("%dn", foo->data);
}
int main()
{
foo_t f1 (0, rand());
foo_t f2 (&f1, rand());
bar(&f2);
}
[/code]
Here is disassembled code for bar() compiled with MSVC2008 Win64/Release:
[code]
__declspec(noinline)
void bar(foo_t* foo)
{
000000013FE11000 sub rsp,28h
foo = foo->next;
if (foo->count == 0)
000000013FE11004 cmp dword ptr [foo_t::count (13FE135E8h)],0
000000013FE1100B mov rax,qword ptr [rcx]
000000013FE1100E jne bar+1Ah (13FE1101Ah)
exit(-1);
000000013FE11010 or ecx,0FFFFFFFFh
000000013FE11013 call qword ptr [__imp_exit (13FE12138h)]
000000013FE11019 int 3
printf("%dn", foo->data);
000000013FE1101A mov edx,dword ptr [rax+8]
000000013FE1101D lea rcx,[string "%dn" (13FE121C0h)]
}
000000013FE11024 add rsp,28h
000000013FE11028 jmp qword ptr [__imp_printf (13FE12128h)]
[/code]
Oops, load of foo_t::count hoist above load of the next pointer. That's how you can get partially constructed object even on x86/SPARC TSO. It's basically impossible to write programs with any reasonable behavior in such a context, not saying about the case when the item contains virtual functions.
Dmitry, I know how compiler reordering works in general. Do you have any evidence that traverse may observe partially constructed object in practice on ia32 or intel64 in real concurrent_hash_map which has compiler and hardware fences between construction of an object and its publication?
TBB_load_with_acquire uses basically the same fence as store with release, i.e. either compiler fence for ia32 or TBB_release_consistency_helper which is also compiler fence on most platforms except PPC. And how compiler could reorder the loads to get something being constructed in a different thread? Even in the same thread, concurrent operations have all the necessary fences to prevent it.
Anyway, using tbb::atomic<> for timestamp will add necessary fences also on the iteration side.
Who cares about too abstract C++0x ordering? Only real hardware and implementation matters.
Pages