TBB containers vs. STL. Performance in the multi-core age.

In my previous post I tried to explain why TBB containers are so different from the STL ones, and claimed "efficiency" and "performance" considerations to be important reasons. For starters, let me clarify what exactly I meant when I talked about them.

The meaning of performance is intuitively obvious in general: the faster your code processes a reference workload on a given platform, the better its performance is deemed to be. So as not to bring the anger (and scorn) of mavens upon myself, I'll note that there is another widespread variant of the performance metric: the amount of data processed in some fixed interval of time. These are essentially turnaround vs. throughput considerations, selected depending on the nature of the application. But as we are not here to discuss the theoretical depths of performance tuning, let's get straight to what is important in the context of our topic.

In the "free lunch" era (as Herb Sutter put it in his famous article) the uber-purpose of performance-concerned programmers was to minimize the number of clock ticks taken by any operation in the program. Once a program was written, its performance kept growing along with the ever-faster chips churned out by the silicon producers. This is about the only scalability a serial program can have, and programmers play a rather passive role here (I do not take into account arcane techniques of dynamically adjusting to memory-to-cache-to-CPU latencies or to available instruction sets; they are the domain of a precious few experts). This is why the term scalability is rarely used in connection with serial code at all.

In the new multicore world things have changed. The definition of scalability acquired a new dimension: how much a program's performance grows when it is run on hardware supporting higher concurrency. Again, to appease academicians and HPC stalwarts: this notion has existed, and its importance has been acknowledged, for ages; it is just that nobody on the broader software market cared about it. The multicore future, brought upon us by the inevitability of the laws of physics, instantly made parallel scalability (simply scalability from now on) the most important performance characteristic of threaded applications.

The very first (if not the only) reason why programmers nowadays have to sacrifice their Quake hours and go to any lengths to thread their applications is the requirement for their code to scale right now, with the same CPUs and memory: their products need not only to work well on their brand-new dual-core laptops, but also to be twice as fast on a quad-core desktop, and (ideally) proportionally faster on the God-only-knows-how-many-core chips that will be released over the next few years. Essentially, if one manages to write a program that scales really well, the free lunch is served again (and one can make up for the missed game hours to one's soul's content, or keep on threading other apps just for the fun of it).

All right, now that I've set the emphasis properly (sorry if you had to read yet another portion of multicore hype for the umpteenth time, but let's count it as a courtesy to the couple of readers for whom it was a revelation), it should not come as a surprise that STL and TBB have very different priorities in terms of efficiency. STL (a.k.a. Single Threaded Library or Serial Template Library☺) expectedly goes for classic serial performance: every operation on a container is implemented to be as fast as possible under the assumption that while the operation proceeds, nothing else in the world can change the container. Unfortunately, this assumption, while very helpful for writing fast code, leads to quite nasty consequences in a multithreaded environment.

If an STL container is to be used concurrently, and at least one of the threads adds or removes items while others read them (or when the insertions/removals themselves are concurrent), all operations on the container (including simple reads of its items) have to be manually protected by locks. Unfortunately, locks are ruinous for parallel scalability. If you managed to write your program in a massively parallel manner, but each thread has to access an STL container frequently, this may very likely kill all the potential parallelism in the application. While one thread holds the lock, all the other threads that have arrived at the point where they need to access that container stop and wait. This is called the lock convoying problem. Another issue (though a less frequent one) is priority inversion. Other undesirable side effects (e.g. cache cooling) are possible besides, or may follow from, the two mentioned.
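To make this concrete, here is a minimal sketch of what concurrent use of an STL map forces you to write. The word-count table and the function names are hypothetical, chosen just for illustration; note that even the pure lookup has to take the lock:

```cpp
#include <map>
#include <mutex>
#include <string>

std::map<std::string, int> table;   // plain STL container
std::mutex table_mutex;             // one coarse lock guards *everything*

void insert_word(const std::string& word) {
    std::lock_guard<std::mutex> guard(table_mutex); // writers must lock...
    ++table[word];                                  // creates {word, 0} if absent
}

int lookup_word(const std::string& word) {
    std::lock_guard<std::mutex> guard(table_mutex); // ...and so must readers
    std::map<std::string, int>::const_iterator it = table.find(word);
    return it == table.end() ? 0 : it->second;
}
```

While one thread is inside either function, every other thread calling them queues up on table_mutex, which is exactly the convoy described above.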

In contrast, TBB concurrent containers are specifically designed to minimize their impact on scalability. To this end they use different means, the first of which is special data layouts. These layouts sometimes allow avoiding locking altogether, sometimes make the locks more fine-grained, and sometimes allow locking only a small part of the container's data instead of the whole structure. In some cases TBB containers use locks more sophisticated than simple mutexes, e.g. reader-writer locks. Dynamic memory allocation is used as sparingly as possible. And some of the usage models offered by TBB containers (such as concurrent_vector growing) promote higher data locality. All these techniques significantly decrease the degree of lock contention (and thus convoying) and fight the other adverse effects. For those interested in more details, here is the link to a blog explaining the concurrent_vector internal organization.
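For comparison, here is roughly the same hypothetical word-count table sketched with tbb::concurrent_hash_map. The accessor and const_accessor objects act as per-element write and read locks, so only the entry being touched is locked, not the whole container (a sketch, not a tuned implementation):

```cpp
#include <string>
#include "tbb/concurrent_hash_map.h"

typedef tbb::concurrent_hash_map<std::string, int> WordTable;
WordTable table;

void insert_word(const std::string& word) {
    WordTable::accessor a;      // write lock on just this element
    table.insert(a, word);      // creates {word, 0} if the key is absent
    ++a->second;
}                               // lock released when `a` goes out of scope

int lookup_word(const std::string& word) {
    WordTable::const_accessor a;    // read lock; readers don't block each other
    return table.find(a, word) ? a->second : 0;
}
```

Notice that no mutex appears in the user code at all: the fine-grained locking is hidden inside the container's interface.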

But everything in life comes at a price. In the case of TBB concurrent containers the price is the higher complexity of both the data layouts and the implementation code in comparison with their serial STL counterparts. The immediate consequence of this is the lower serial performance of TBB containers. In some cases the serial overhead of a TBB container is so high that its STL counterpart works faster even when all operations on it are surrounded by locks (especially if those locks are TBB's spin-mutexes).

Unfortunately, it is plain impossible to predict beforehand which kind of container, TBB or STL, will be more efficient for a particular application, because several factors affect who is going to be the winner. Here are the most important ones:

    1. Concurrency (or parallelism) level. The higher it is, the stronger TBB's position, and the greater the impact of the remaining factors.

    2. The frequency of container operations (how large a share of its time the code spends modifying or accessing the container). The higher it is, the more gain TBB gives you.

    3. The specific set of operations the application performs on the container (const vs. non-const, insert vs. erase vs. find, etc.). As I mentioned above, TBB containers implement some operations in a lock-free manner, and some operations benefit from using reader-writer locks.
      With the STL you normally use a standard critical section or spin-mutex, though of course you may experiment with TBB's reader-writer mutex yourself (see the sketch after this list).

    4. The amount of data being stored to and/or read from the container.
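To illustrate the third factor, here is a hedged sketch of the middle-ground option mentioned there: keeping the STL container but guarding it with TBB's reader-writer spin-mutex, so that concurrent lookups at least do not serialize each other (again using the hypothetical word-count table):

```cpp
#include <map>
#include <string>
#include "tbb/spin_rw_mutex.h"

std::map<std::string, int> table;
tbb::spin_rw_mutex table_rw;

int lookup_word(const std::string& word) {
    // false = reader (shared) lock: many lookups may run in parallel
    tbb::spin_rw_mutex::scoped_lock lock(table_rw, /*write=*/false);
    std::map<std::string, int>::const_iterator it = table.find(word);
    return it == table.end() ? 0 : it->second;
}

void insert_word(const std::string& word) {
    // true = writer (exclusive) lock: modifications still serialize
    tbb::spin_rw_mutex::scoped_lock lock(table_rw, /*write=*/true);
    ++table[word];
}
```

Whether this beats a plain mutex or a TBB container depends entirely on the read/write mix, which is the point of the list above.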

Thus the most reliable way to estimate which kind of container is more suitable for your application is to compare them experimentally. Implement both versions as soon as a working prototype of the app is up and running, or build a test that models the workflow of the main application in a simplified (in terms of business logic) but still close enough (in terms of container usage) form. If, for example, you've found out that with a reasonable workload the TBB version works about as fast as the STL one on 2 cores, then you can be pretty confident that on 4 cores the TBB version will have the upper hand (though you'll still be better off if you check☺).
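A minimal harness for such an experiment might look like the sketch below. The workload body is a stand-in you would replace with your application's real container usage, and the sketch assumes a classic-TBB toolchain where tbb::task_scheduler_init is still the way to cap the number of worker threads (recent oneTBB releases replace it with global_control/task_arena):

```cpp
#include <cstdio>
#include "tbb/concurrent_hash_map.h"
#include "tbb/parallel_for.h"
#include "tbb/task_scheduler_init.h"
#include "tbb/tick_count.h"

typedef tbb::concurrent_hash_map<int, int> Table;

// Stand-in workload: hammer one container from all worker threads.
// Replace this with a kernel that mimics your app's real access pattern.
void run_workload() {
    Table table;
    tbb::parallel_for(0, 1000000, [&](int i) {
        Table::accessor a;
        table.insert(a, i % 4096);  // few keys => deliberate contention
        ++a->second;
    });
}

int main() {
    for (int threads = 1; threads <= 8; threads *= 2) {
        tbb::task_scheduler_init init(threads);   // cap the worker pool
        tbb::tick_count t0 = tbb::tick_count::now();
        run_workload();
        tbb::tick_count t1 = tbb::tick_count::now();
        std::printf("%d thread(s): %.3f s\n", threads, (t1 - t0).seconds());
    }
    return 0;
}
```

Running the same harness over an STL+mutex version of run_workload gives you the head-to-head numbers the paragraph above recommends collecting.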

In conclusion I'd like to note that TBB containers are not set in stone; their implementation undergoes constant and sometimes significant changes. For example, earlier this year concurrent_vector and concurrent_hash_map were substantially overhauled, and concurrent_queue behavior was improved. And despite my inability to provide you with crystal-clear answers to your burning questions, one thing I can promise with confidence: the TBB team will continue working hard on improving the concurrent containers in order to ensure the most painless and productive parallel programming experience for our users☺.



anonymous:

Great Blog!!

Dmitry Vyukov:

The comment above must go to your blog "TBB containers vs. STL. Functionality rift".

And wrt performance I would put it this way.
In the single-threaded world, performance is *a single scalar value*. For example, 100 cycles/operation vs. 200 cycles/operation. Clearly, the former is two times better than the latter. Simple!
In the multi-threaded world, performance is *a function of one variable*, P(N), where N is the number of software/hardware threads. For example, P1(1) = 100, P1(2) = 150, P1(4) = 400; P2(1) = 200, P2(2) = 200, P2(4) = 200. Which one is better?

Dmitry Vyukov:

I think it's not only OK for concurrent containers to have different interfaces, it's also the right way: they are subject to distinct usage patterns. The early approaches to concurrent containers were single_threaded_container + mutex, so people got used to concurrent containers having the same interface only by accident.

michal_hr:

I've been playing with TBB for a few days and came to a similar conclusion. If I were developing software dedicated to multi/many-core architectures, I would use TBB without hesitation. The problem is with software that has to be efficient on both single-core and many-core architectures. Currently there is no universal solution. I wish TBB would be as efficient as STL on a single core/single thread. :)
