Intel® TBB parallel_for and its Sandbox of Options

In the last 6 blog posts, I've explained the

  • cilk_for and getting vectorization inside

  • reductions to prevent data races on the shared data

  • creating your own custom vectorized functions

What's great about those is that they are "quick and dirty" - you can get a whole lot of parallelism where you previously had none. But what if you need much more control over your threading, particularly over the for loop? That's when Intel® TBB comes into play. Sure, you can use Array Notations for added vectorization, but if you need a more skillfully threaded application, Intel® TBB is the way to create a truly task-parallel application. There's already a lot of material available on it, so I will restrict my discussion to just the more interesting aspects of the parallel_for loop for now.

There are lots of "for loop style" task-parallel operations that are templatized. I'm just going to talk about the most basic concepts for a while. The most common ways I've seen customers work with parallel_for are:

  • "building" a parallelized for loop out of templatized components

  • parallel_for with C++0x lambda expression support

You need a very new compiler version for the latter, and it's not for beginners, so I'm sticking with the first one for today.

The first thing you need to do is include the headers for blocked_range and parallel_for. The first header helps you define your "iteration space" – the range you want to cycle through while relying on threads behind the scenes to break it up and do the desired computation. The second header defines the parallel_for template itself: how it works in C++, and how the threading will take place to parallelize a plain for loop. Obviously the tbb namespace is necessary too.

#include "tbb/blocked_range.h"

#include "tbb/parallel_for.h"

using namespace tbb;

Next, you define a class that basically explains how the regular for loop you want parallelized will work. The function foo is just an arbitrary function. When you are trying things out, first use a plain old serial function. Later, you could drop in a highly vectorized function like the one you learned about in the previous blog using Array Notations. Always rule out "compounding problems" that could make it hard to figure out what exactly to debug first.

class ChangeArray{

    int* array;

public:

    ChangeArray (int* a): array(a) {}

    void operator()( const blocked_range<int>& r ) const{

        for (int i=r.begin(); i!=r.end(); i++ ){

            foo (array[i]);

        }

    }

};


That syntax looks a bit hairy, so let me explain it in terms of templatized components to "build" a parallel for loop.

  • The ChangeArray class defines the for-loop body for the parallel_for. So we are essentially defining the for loop that will be threaded across multiple cores.

  • The TBB template blocked_range<int> represents the 1D iteration space over integers.

  • As usual with C++ function objects, the main work is done inside operator().

Now, let's create the function from which the TBB parallel_for can be invoked:

void ChangeArrayParallel (int* a, int n ){

    parallel_for (blocked_range<int>(0, n), ChangeArray(a));

}


So parallel_for is called based on the inputs to ChangeArrayParallel. It's basically saying, "I have a function that will be called inside of a basic for loop. Please parallelize the for loop that iterates over the space I am specifying here."

You can then call ChangeArrayParallel from main:

int main (){

    const int N = 1000;  // array size; pick any value

    int A[N];

    // initialize array here...

    ChangeArrayParallel (A, N);

    return 0;

}


For more on what goes on under the hood with the threading, check out the book.

For more complete information about compiler optimizations, see our Optimization Notice.