Version 2.2, Intel Threading Building Blocks, worth a look

If you write C or C++ code and you haven’t given Intel Threading Building Blocks (TBB) a try, you really should. Intel Threading Building Blocks has emerged as the most popular high-level programming method for writing parallel programs (see Evans Data Corp: http://www.evansdata.com/research/market_alerts.php). The low-level methods that were popular before high-level methods existed (using pthreads or Windows threads directly) should be avoided by those writing new parallel programs because of their substantial learning curve and their high costs to create and maintain.

C programmers will want to take another look at Intel Threading Building Blocks (TBB), which has been popularized primarily by C++ programmers. Because C++ didn't have lambda functions, too much of the C++ template machinery showed through when coding common operations. It was intimidating unless you knew and liked C++ templates. With version 2.2 and the latest compilers, lambda functions make coding with Intel TBB reasonable for C programmers too (using a C++ compiler, of course!).

Whether you are new to Intel TBB, or a current user, you’ll want to know about the latest version – 2.2. Intel TBB 2.2 can help you improve the scalability and portability of your code while being productive writing parallel programs.

Version 2.2 of Intel TBB is now available, in both the commercial and open source releases. These are built from identical sources – the only real difference is the license and support offerings. Get a copy and learn more at threadingbuildingblocks.org (open source) or threadingbuildingblocks.com (commercial).

Small version number change, but lots to offer

Intel TBB 2.2 maintains the functionality and platform support of previous versions and adds numerous feature and performance improvements, including full support for the lambda capabilities of the new C++ draft standard (C++0x) and more flexibility for developers to redistribute with their applications. Autodesk Maya and Epic Games Unreal Engine are among the applications that will be reshipping some or all of Intel TBB 2.2 to support their developers.

I’m not completely used to the small version increments common with open source projects. I’d have no trouble considering this version 3.0 or 4.0 of TBB if it were a commercial-only product. Yet 2.2 seems fitting from the standpoint of being modest – a bit understated.

This release is packed with a bunch of additions, which continue to show the maturity you’d expect from a package as popular as Threading Building Blocks has proven itself to be. Users give great feedback, and that leads to improvements.

Automatic memory allocator replacement available

The memory allocator is one of the most popular features of Intel TBB. However, it can be time consuming to replace your own memory allocator calls. Version 2.2 uses a dynamic instrumentation method on Windows and the LD_PRELOAD function on Linux to offer automatic memory allocator replacement throughout your application.

Ron Henderson at DreamWorks Animation summed it up: "The Intel® TBB malloc was an important tool in achieving good parallel speedups for our threaded applications, and a drop-in replacement for the memory allocator in the C standard library."

Memory allocator faster than ever

Version 2.2 extends the performance lead of Intel TBB’s memory allocator over the competition by delivering even better allocation performance for large blocks (over 8K in size).

Scaling of scheduler enhanced significantly

Version 2.2 features a reworked task scheduler that behaves more like an ideal Cilk-style scheduler, yielding even more scalable behavior. True to the promise of using Intel TBB, the benefits of this work come to programs written using Intel TBB without requiring any code changes. Version 2.2 also has improvements to the affinity partitioner, and changes the default for loop templates from the simple_partitioner to the easier to use and adaptive auto_partitioner.

Automatic initialization available

Version 2.2 no longer requires an explicit initialization. Users of prior versions have told us that in a large application it is not easy to initialize in the right place. Version 2.2 takes care of automatically initializing the scheduler when it is first needed.

Parallel algorithms enhancements

  • Version 2.2 has a new parallel_invoke for running a group of functors simultaneously in parallel.
  • Version 2.2 has a new parallel_for_each and a simplified parallel_for interface to make writing some common for loops easier.
    • parallel_for_each(first, last, f) is like parallel_do(first, last, body) but without the feeder functionality that allows adding more work items. In other words, tbb::parallel_for_each is the parallel equivalent of std::for_each.
    • The new overload parallel_for(first, last, step, f) lets you pass an integer first (auto i=first), last (i<last), and step (i+=step) for a given function f(i). It handles simple cases easily, especially with the use of lambdas. The original interface parallel_for(range, body, partitioner) has been retained. It's more general but also more complicated to write, even with the use of lambdas.
  • Intel TBB's pipeline can now perform DirectX, OpenGL, and I/O parallelization by using the new thread_bound_filter feature. Certain types of operations must be issued from the same thread every time; by using a filter bound to a thread, you can guarantee that the final stage of the pipeline will always use the same thread.
  • Exception safety support has been expanded significantly. Prior versions had support for exception propagation only in parallel_for, parallel_reduce and parallel_sort. Support is expanded to include parallel_do, the new parallel_invoke and parallel_for_each, as well as the new forms of parallel_for and parallel_reduce.
  • Lambda support has been extended to cover not only parallel_for, but also parallel_reduce, parallel_sort, and the new parallel_for_each and parallel_invoke algorithms. In addition, the new combinable and enumerable_thread_specific classes for thread local storage can accept lambdas. The documentation and code examples are expanded to show lambdas in action. The Intel® Compiler 11.0 and Intel® Parallel Studio offer lambda support today, and Microsoft will support it in Visual Studio 2010 (it is in the beta currently). Based on feedback, I expect lambdas to be easily one of the most used features of the new C++ standard. It certainly makes code using Intel TBB easier to read – hence our long desire to see them a part of C++ (there is a section in my 2007 book about the desire for lambdas – we are very happy to have them now!). See "Hello Lambdas" C++ 0x, a quick guide to Lambdas in C++ for more background on lambdas, and see parallel_for is easier with lambdas, Intel Threading Building Blocks for more on parallel_for and lambdas.

Concurrent container enhancements

  • Thread local storage, which is portable across platforms, is now possible with the new enumerable_thread_specific and combinable classes. This can be useful for algorithms that reduce shared memory contention by creating local copies and then combining results later through something like a reduce operation.
  • Unbounded non-blocking interface for concurrent_queue and new blocking concurrent_bounded_queue. Some operations require synchronization and may or may not block depending on whether or not the queue is bounded. To get the best behavior, use the unbounded form if you need only basic non-blocking push/try_pop operations to modify the queue. Otherwise use the bounded form which supports both blocking and non-blocking push/pop operations.
  • Simplified interfaces for concurrent_hash_map that make it easier to utilize for common data types using the new tbb_hasher.
  • Improved interfaces for concurrent_vector that remove a common extra step needed to use the vector output.

Redistribution is easier

The licensing of the commercial version has been modified to allow redistribution of required DLLs and header files. This means you can redistribute DLLs and header files from version 2.2 with your application, to enable your customers to write Intel TBB code that will use the master application's DLLs and therefore the same infrastructure.

Also, Intel is offering additional redistribution rights for commercial customers who need more than just the DLLs and header files. If that is of interest, drop us a line and we’ll talk.

Of course, none of this really matters for the open source version – but if the nuisances of using the commercial version have you wanting more, you should ask, as Intel is trying to help out.

This effectively makes Intel TBB freely available for the strong community of developers that support some of the world's best software. Gordon Bradley, Maya Performance Team Lead at Autodesk, summed it up: "The Maya team has successfully used Intel's TBB technology to internally parallelize Maya for several releases. Now thanks to Intel, TBB 2.2 lets Maya plug-in developers access the same advanced parallelism features that we've used at no additional charge."

Current users have a little work to do to upgrade

There are some changes you may need to make to move from prior versions of Intel TBB to the new 2.2 version. Personally, I don’t like doing anything to upgrade from one version to another – but sometimes it is necessary. You can simply add "#define TBB_DEPRECATED 1" to your code, and the old interfaces remain available to you (at least for now) – or adjust to the following changes:

  • auto_partitioner() is now the default instead of simple_partitioner(). To this I say: it’s about time! When I wrote my book on Intel TBB, I included auto_partitioner despite some concerns from the TBB team that it was new and somewhat experimental! Well – the writing was on the wall… this was the way to go! Now it’s the default. Of course, if you specified a perfect grain size, you might see a slow-down. In such a case, you should specify simple_partitioner() explicitly and drop us a note telling us about it – we’d like to know if the auto_partitioner() is not good enough. Or, you can use TBB_DEPRECATED to force the old default.
  • Concurrent queue API changes: four interfaces have been renamed. You can use TBB_DEPRECATED to keep the old names, or you can change pop_if_present to try_pop, push_if_not_full to try_push, begin to unsafe_begin and end to unsafe_end, and thereby be consistent with the latest API.
  • Concurrent vector API changes: renamed compact to shrink_to_fit, and changed three interfaces so that they all consistently return an iterator. Previously grow_by returned size_type, grow_to_at_least returned nothing, and push_back returned size_type.
  • The notion of task depth has been eliminated, so the following four members of class task have no effect: depth_type, depth, set_depth and add_to_depth. These have no effect in 2.2 even if you use TBB_DEPRECATED, but are nonetheless defined to permit their use without error messages.

Try it today!

Get a copy and learn more at threadingbuildingblocks.org (open source) or threadingbuildingblocks.com (commercial).


Comments

I also purchased the 'Intel Threading Building Blocks' book. Unfortunately the book is from 2007, so here and there one might encounter differences.



I purchased the Intel Parallel Studio product within the last month. Will this upgrade from TBB 2.1 to TBB 2.2 be made available to Parallel Studio customers?





Epic has integrated Intel TBB into the August 2009 version of the engine, which is the first version of the engine to fully support 64-bits including their developer tools. Read more at http://www.epicgames.com/press_releases/intel_tbb2_2_ipp.html