TBB 3.0: New (today) Version of Intel Threading Building Blocks

We are happy to introduce Intel® Threading Building Blocks 3.0 (TBB). TBB 3.0 builds on the past four years of TBB by adding to the breadth and depth of features in the library, improving performance, and, in the Windows version, utilizing the latest developments from Microsoft to support parallelism.

First, I will recap with a brief history of TBB; skip ahead if you want to rush to the new features. And be sure to read about Adobe at the end of my blog.

History: Outfitting C++ for parallelism

Intel TBB was first available as a commercial library from Intel in the summer of 2006, not long after Intel shipped its first dual-core processors. It provided a much-needed comprehensive answer to the question “what must be fixed or added to C++ for parallel programming?” Intel TBB’s key programming abstraction for parallelism focused on programming in tasks, not threads. By also including a task-stealing scheduler, a thread-aware memory allocator, portable locks, global timestamps and concurrent variants of key STL types, Intel TBB truly provided what we needed to program for parallelism in C++. The first release was primarily focused on strict fork/join or loop-type data parallelism.

We knew that the success of Intel TBB would be limited if it remained a proprietary solution. Even while we released v1.0, we were in discussions with early customers on the future direction of Intel TBB in both features and licensing.

Watching and listening to early adopters, such as Autodesk Maya, highlighted that much of the value of TBB was not only for data parallelism but also for more general parallelism using tasks, pipelines, scalable memory allocation and lower-level constructs like synchronization primitives. And, we got encouragement to make TBB portable by creating a viable open source project.

This customer feedback and encouragement led, only a year later, to version 2.0, which made both source and binaries available under GPL v2 with the runtime exception, while maintaining the availability of non-GPL binaries. Our customers had told us that this would maximize adoption, and the results have definitely shown they were right.

Intel increased the staffing on Intel TBB, and we worked proactively to build a community to support the project and continued to innovate with new usage models and features over the next few years.  We have been amazed and humbled by the response of customers like Adobe Systems, Avid, Epic Games, DreamWorks, and many others, along with that of our community members.  We have a very large user community, and have had contributions that have led to Intel TBB being ported to many operating systems, platforms and processors. I appreciate Intel’s willingness to let us prove that an open source project initiated by us, and supporting non-x86 processors, not only made sense – but would be very popular with developers. We’ve definitely proven that!

Through the involvement of customers and the community, Intel TBB has grown to be the most feature-rich and comprehensive solution for parallel application development available today. It has also become the most popular!

What’s New in TBB 3.0

Taking advantage of C++0x (optionally)

Lambdas are a part of the new C++0x (draft standard). Intel’s compiler has supported them for a while, Microsoft supports them now (Visual Studio 2010), and gcc will support them soon. They really make TBB easier to teach, learn and use, because the code for the algorithm is right there where you want it. TBB supports both pre-lambda and lambda syntaxes. (For a little more on lambdas, see my blog post on lambda basics, and a follow-on blog post on parallel_for with lambdas.) For TBB 3.0, the very popular “pipeline” now has a lambda-friendly and strongly typed version called parallel_pipeline. In the original pipeline, the type of input and output data was “void*”, forcing users to cast the pointer to the right type. With the new parallel_pipeline, input and output types are template parameters, and are not limited to pointer types. The older pipeline interface is still supported and offers a slightly reduced overhead compared to the new implementation.
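
To give a feel for the new interface, here is a minimal sketch of a three-stage pipeline (serial input, parallel transform, serial output) written against the published parallel_pipeline interface; the function and variable names are mine, invented for illustration:

    #include "tbb/pipeline.h"

    // Sum of squares of [first, last) using three strongly typed stages.
    float sum_of_squares( float* first, float* last ) {
        float sum = 0;
        tbb::parallel_pipeline(
            /*max_number_of_live_tokens=*/16,
            // Input stage: emits one float* per token, in order.
            tbb::make_filter<void,float*>( tbb::filter::serial_in_order,
                [&]( tbb::flow_control& fc ) -> float* {
                    if( first < last ) return first++;
                    fc.stop();            // no more input
                    return NULL;
                } ) &
            // Middle stage: runs on multiple tokens in parallel.
            tbb::make_filter<float*,float>( tbb::filter::parallel,
                []( float* p ) { return (*p) * (*p); } ) &
            // Output stage: serial accumulation; no void* casts needed.
            tbb::make_filter<float,void>( tbb::filter::serial_in_order,
                [&]( float x ) { sum += x; } ) );
        return sum;
    }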

TBB anticipated C++0x, working around the lack of it with fixes (some would call them hacks) like movable_exception and captured_exception. These got around the missing C++0x support that allows one thread to capture an exception and another thread to rethrow it. Visual Studio 2010 (and soon gcc) have the necessary C++0x support (class std::exception_ptr and related machinery). To provide a smooth migration path, TBB 3.0 introduces TBB_USE_CAPTURED_EXCEPTION, which controls whether the C++0x mechanism is used (C++0x if available, or stick with our “hack”, which is less precise). With TBB_USE_CAPTURED_EXCEPTION the user sets their preference when they compile their code, but the actual behavioral change is in the run-time library and therefore adapts to the system it runs on. Peek at include/tbb/task.h for type_traits if you really want to see how we transmit the information from the header to the run-time.
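
As a rough sketch of what this buys you (the exception type and message here are mine, for illustration): built with TBB_USE_CAPTURED_EXCEPTION set to 0 on a compiler that has std::exception_ptr, the original exception type crosses the thread boundary intact; with it set to 1 you would catch a tbb::captured_exception approximation instead:

    #include <stdexcept>
    #include "tbb/parallel_for.h"
    #include "tbb/blocked_range.h"

    // Build with -DTBB_USE_CAPTURED_EXCEPTION=0 on a C++0x-capable
    // compiler to get exact exception propagation across threads.
    void risky() {
        try {
            tbb::parallel_for( tbb::blocked_range<int>( 0, 1000 ),
                []( const tbb::blocked_range<int>& r ) {
                    for( int i = r.begin(); i != r.end(); ++i )
                        if( i == 500 )                    // hypothetical failure
                            throw std::out_of_range( "bad index" );
                } );
        } catch( const std::out_of_range& e ) {
            // Caught with its original type; under the captured_exception
            // fallback this handler would not match.
        }
    }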

New Synchronization Capability: Condition Variables

There is a popular POSIX synchronization primitive called a “condition variable”, which is now in C++0x. It is used when threads need to wait for some other thread to update some state protected by a mutex. A waiting thread acquires the mutex, checks the state, and decides whether to wait on the associated condition variable; the wait atomically releases the mutex and blocks. Another thread acquires the mutex, modifies the state it protects, releases the mutex, and then signals one or all of the waiters. The signaled waiters reacquire the mutex and recheck the state to see if they can proceed or need to continue waiting. Adding this to TBB helps portability and adds to the collection of TBB sync primitives.

The TBB 3.0 implementation approximates a portion of the C++0x interfaces for condition variables and the related scoped locking, for threads (not tasks). The approximation is based on the C++0x working draft but implemented in a way that will not conflict with platform vendors’ later implementations, and does not require compiler support for C++0x. This gives all TBB users condition variables with existing compilers, and a clean migration path to later C++0x implementations. Condition variables should be the method of choice to have a thread wait until a condition changes.
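
Here is a minimal sketch of the classic wait/notify pattern described above. The compat header name and its pairing with tbb::mutex reflect my reading of the 3.0 release and should be checked against your distribution; the variable and function names are mine:

    #include "tbb/mutex.h"
    #include "tbb/compat/condition_variable"  // assumed header location

    tbb::mutex state_mutex;
    std::condition_variable state_changed;
    bool ready = false;

    // Waiting side: check the state under the mutex; wait() atomically
    // releases the mutex and blocks, then reacquires it before returning.
    void consumer() {
        std::unique_lock<tbb::mutex> lock( state_mutex );
        while( !ready )                 // always recheck after a wakeup
            state_changed.wait( lock );
        // ready is true and the mutex is held again here
    }

    // Signaling side: modify the protected state, then notify.
    void producer() {
        {
            std::unique_lock<tbb::mutex> lock( state_mutex );
            ready = true;
        }
        state_changed.notify_one();     // or notify_all() for all waiters
    }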

Design Patterns Manual

TBB 3.0 introduces the TBB Design Patterns manual, a complement to my TBB book or the TBB manuals. It is not an introduction to TBB; it explains how to implement common parallel programming idioms using TBB. The author, Arch Robison, the architect of TBB, describes it as a “cookbook” of some common parallel programming patterns and how to implement them using TBB. He offers to teach you to cook and give you recipes; you’ll have to figure out how to be a “great chef” on your own later. Recipes include a look at Odd-Even Communication, Wavefront, Reduction, Divide and Conquer, GUI Thread, Non-Preemptive Priorities, Lazy Initialization and Reference Counting.

New concurrent_unordered_map, better to use this one!

TBB 3.0 adds a new container that mimics the C++0x unordered_map, based on a joint specification implemented by both Intel (TBB 3.0) and Microsoft (Visual Studio 2010). It has three advantages over the previous concurrent_hash_map:

    • An interface closely resembling the C++0x unordered_map.

    • It permits concurrent insertion and traversal.

    • No locking is exposed by the interface. The implementation may use locks internally, but it never holds a lock while calling user-defined code, so its locking can never contribute to deadlock.



TBB retains concurrent_hash_map too, which permits concurrent erasure. Use it when concurrent erasure is needed; it may also offer performance advantages depending on usage.
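
As a small sketch of concurrent insertion and traversal with the new container (the table, its key and value types, and the function name are invented for illustration):

    #include <string>
    #include <utility>
    #include "tbb/concurrent_unordered_map.h"
    #include "tbb/parallel_for.h"

    typedef tbb::concurrent_unordered_map<int,std::string> Table;
    Table table;   // hypothetical shared table

    size_t build_and_scan( int n ) {
        // Concurrent insertion from a parallel loop is safe.
        tbb::parallel_for( 0, n, []( int i ) {
            table.insert( std::make_pair( i, std::string( "value" ) ) );
        } );
        // Traversal may overlap with insertion; note that, unlike
        // concurrent_hash_map, concurrent erasure is not supported.
        size_t count = 0;
        for( Table::const_iterator it = table.begin(); it != table.end(); ++it )
            ++count;
        return count;
    }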

Windows support for the latest Microsoft Visual Studio 2010

Microsoft just recently released Visual Studio 2010 (on April 12), which brings several new things to Windows for parallelism: specifically ConcRT, PPL and TPL. ConcRT is Microsoft’s Concurrency Runtime, which offers standard Microsoft interfaces so that Windows parallel programming models from all vendors can be coordinated to avoid catastrophic over-subscription through excessive duplication of thread pools. Intel TBB 3.0 includes support to use ConcRT (you remain in control: you control usage based on the model you link with), and our OpenMP implementation will do the same in the next release of our compiler (which is in beta test now). This means programs can continue to mix and match programming models in a single application, and it will continue to work in the future. PPL and TPL are Microsoft’s parallelism solutions for native code and .NET (C#, Visual Basic, etc.). PPL and Intel TBB are coordinated to offer semantically compatible interfaces, and identical concurrent STL container solutions. PPL does not offer all the algorithms that Intel TBB does. This will give developers a choice on Windows, and allow for porting to other operating systems (with Intel TBB). I do not expect anyone using Intel TBB to consider PPL, but Microsoft will be introducing the power of task-stealing to many native and .NET developers, and that is very good.

We also introduced a few lock interfaces specifically to match reader_writer_lock and critical_section in Microsoft’s Visual Studio 2010. These are portable and do not require Windows. These classes exist for compatibility with the Microsoft classes, and as such they cannot follow all of the conventions of other mutexes in TBB.
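
For a flavor of the reader_writer_lock interface, here is a small sketch; the shared data and function names are invented, and the nested scoped-lock class names reflect my reading of the 3.0 headers:

    #include "tbb/reader_writer_lock.h"

    tbb::reader_writer_lock rw_lock;
    int shared_value = 0;   // hypothetical protected state

    void read_value( int& out ) {
        // Multiple readers may hold the lock at the same time.
        tbb::reader_writer_lock::scoped_lock_read reader( rw_lock );
        out = shared_value;
    }

    void write_value( int v ) {
        // A writer gets exclusive access.
        tbb::reader_writer_lock::scoped_lock writer( rw_lock );
        shared_value = v;
    }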

More debuggable TBB using scheduler changes and hints to analysis tools

TBB has improved task scheduling predictability and responsiveness via independent task scheduling for different master threads. This makes TBB behavior more responsive and, in some sense, more predictable, as there is now a guarantee that a master thread will return from a parallel algorithm as soon as all tasks related to that algorithm are complete. This is particularly useful for apps that run parallel algorithms from tbb::thread.

By keeping master threads isolated, TBB 3.0 gets rid of some conditions where you could unwittingly create deadlocks. Now, the scheduler will not create dependency chains where one of your threads waits on one of your other threads. (The scheduler really does prevent moonlighting: a worker working on behalf of one master thread will never steal work from another master thread.) In older versions, this could happen when a TBB worker thread working for one of your threads had to wait on other tasks, and in the meantime “moonlighted” by working on a task from your other thread. This feature avoids deadlock in some important use cases, such as two pipelines running in two threads that you created.

Intel TBB 3.0 includes additional information for tools so that false positives can be reduced through sharing of what TBB “knows.” TBB generates and provides more detailed information about what the user wrote, such as specifics about which synchronization primitives were used. Intel tools support using this information. You’ll want to try out Intel TBB 3.0 with Intel® Parallel Amplifier and Intel® Parallel Inspector.

Trying to avoid starvation (of tasks) for queue-like work

The TBB 3.0 scheduler supports task::enqueue, which is effectively a “run me after the other things already pending” request. Although similar to spawning a task, an enqueued task is scheduled in a different manner. Enqueued tasks are valuable when approximately first-in first-out behavior is important, such as in situations where latency of response is more important than efficient throughput.

Think of these as “Fairness-oriented task scheduling for queue-like work.” They are useful for emulating task priorities and interacting with GUI threads (see the new TBB Design Patterns Manual).
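
A minimal sketch of the enqueue idiom follows; the task class name and its work are hypothetical, while task::enqueue and allocate_root come from the actual interface:

    #include "tbb/task.h"

    // A fire-and-forget background task; enqueued tasks are processed in
    // approximately first-in first-out order rather than depth-first.
    class UpdateStatusTask : public tbb::task {
        /*override*/ tbb::task* execute() {
            // ... do the latency-sensitive work here ...
            return NULL;
        }
    };

    void post_update() {
        // Runs eventually, after already-pending enqueued work;
        // the caller does not wait for it.
        tbb::task::enqueue( *new( tbb::task::allocate_root() ) UpdateStatusTask );
    }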

Performance

Two notable performance improvements in the TBB project for 3.0 include:

    • Classes enumerable_thread_specific and combinable are now significantly faster in typical use cases (a small combinable sketch follows this list).

    • The memory allocator has refinements, notably in the allocation of large blocks to improve efficiency. The memory allocator remains one of the very best, and it just got better. I’ve pored over some very cool graphs the team created showing measurements on various systems (different core counts, with and without Hyper-Threading, etc.) under test. I’ll see if they can clean them up and post them on the project page for those who are interested.
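
As a sketch of typical combinable usage (the function name and data are mine, for illustration): each thread accumulates into its own thread-local copy, and combine() merges the per-thread results at the end, with no locking in the hot loop:

    #include <functional>
    #include "tbb/combinable.h"
    #include "tbb/parallel_for.h"

    int parallel_sum( const int* data, int n ) {
        tbb::combinable<int> partial( []{ return 0; } );  // init each local to 0
        tbb::parallel_for( 0, n, [&]( int i ) {
            partial.local() += data[i];      // this thread's private copy
        } );
        return partial.combine( std::plus<int>() );
    }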



Usability

A task group context can now be created and destroyed in different threads, which makes task cancellation and exception handling more usable. The change just removed a limitation of the prior implementation.
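
A sketch of the kind of cross-thread cancellation this supports, assuming the parallel_for overload that takes an explicit task_group_context; the function names are mine:

    #include "tbb/task.h"
    #include "tbb/parallel_for.h"
    #include "tbb/blocked_range.h"

    tbb::task_group_context ctx;   // hypothetical shared context

    // One thread runs an algorithm under the explicit context...
    void worker( int n ) {
        tbb::parallel_for( tbb::blocked_range<int>( 0, n ),
            []( const tbb::blocked_range<int>& r ) {
                for( int i = r.begin(); i != r.end(); ++i )
                    ; /* ... long-running work ... */
            },
            tbb::auto_partitioner(), ctx );
    }

    // ...while another thread can cancel everything in that group.
    void watchdog() {
        ctx.cancel_group_execution();
    }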

Adobe

We have so many great users; it might not be fair to single one out. But since Adobe just started shipping their new Creative Suite 5 products, I can’t write a blog and not mention that. Adobe is now using TBB in shipping products. Since they went into their release cycle before we released Intel TBB 3.0, they are on 2.x, but they see the benefits we’ve all come to expect from Intel TBB in harnessing multicore parallelism. Given how widely the products of Adobe, and other customers, are used, the power of multicore parallelism is being realized in the real world by many users. Users of their products will enjoy the newfound performance. And we are grateful for the support and extraordinarily useful feedback our customers give to the Intel TBB project. That real-world usage has made Intel TBB what it is today.

Download the new version now!

Get a copy and learn more at http://threadingbuildingblocks.org (open source) or http://threadingbuildingblocks.com (commercial). The functionality is the same, only the license is different.
