I am tuning my application with VTune, and I have found that in a region of my code that performs a parallel traversal of a tbb::concurrent_unordered_map, I get zero parallelism. Once that part of the code completes, concurrency is high again.
In particular, I use a parallel_for over tbb::concurrent_unordered_map::range() to do some work for each (key, value) entry in the map. I suspect the lack of concurrency is caused by internal waits or some other form of synchronization inside the implementation. Alternatively, range() may not be splitting into enough work for the worker threads to participate.
Does anyone have suggestions for (or experience with) slow traversal via range()?