The Work Isolation Functionality in Intel® Threading Building Blocks (Intel® TBB)

Nowadays, almost all software developers work with multi-core CPUs and so have to face and solve parallel programming problems. The majority of these problems, such as work decomposition, synchronization, and data races, are well known. The paradigm of task parallelism abstracts low-level thread management behind high-level parallel constructs, reducing the number of problems to manage. However, it cannot solve all of them automatically, and some peculiarities still need to be considered.

Intel® Threading Building Blocks (Intel® TBB) [1] is an open source [2] C++ template library that utilizes task parallelism and high level patterns to simplify parallel programming. In this article, we consider work isolation, a feature introduced in Intel TBB 2017 [3] to overcome hidden and annoying issues that show up in some use cases.

Let us consider a small example from the Intel TBB documentation [7] that illustrates a possible problem:

tbb::enumerable_thread_specific<int> ets;
tbb::parallel_for( 0, N1, [&ets]( int i ) {
    // Set a thread specific value
    ets.local() = i;
    tbb::parallel_for( 0, N2, []( int j ) { /* Some work */ } );
    // While executing the above parallel_for, the thread might have
    // processed a task of the outer parallel_for, and so might have
    // changed the thread specific value.
    assert( ets.local()==i ); // The assertion can fail!
} );

Even though the example is simple, it is not clear why the assertion can fail. To understand this problem, we must also understand some aspects of the TBB task scheduler [4].

The scheduler implements the work stealing paradigm [5]. Each worker thread has its own deque (double-ended queue) of tasks ready for execution. When looking for a task to execute, a worker thread first tries to process tasks in its local deque. When the local deque becomes empty, the thread tries to obtain (steal) a task from other threads’ deques. The idea looks simple; however, it should be noted that worker threads are not aware of what they steal, and can steal tasks generated by outer parallel loops as well as tasks generated by inner/nested parallel loops. Usually, this behavior is desirable because it exposes additional parallelism for better CPU utilization.

However, consider this situation: in the above code, a thread starts processing a task generated by the outer parallel loop. It sets a thread-local value and enters the nested parallel loop. It may happen that other worker threads steal tasks generated by the nested loop, and then the first thread tries to find more work. The stealing policies are not aware of task nesting; therefore, the first thread can steal a task generated by the outer parallel loop and start processing it. It will again assign a value to the thread-local variable, overwriting the initial value. Hence, when the thread finishes the second task generated by the outer level loop and returns from the nested loop, it will read the updated value of the thread-local variable instead of the initial one. 

The following animated picture, "Task stealing example", shows this situation step by step:

  1. Thread 2 starts the outer parallel loop (executes the task P1);
  2. Thread 2 splits the iteration range and spawns an additional task to be stolen by other threads (P1 in the task pool of thread 2);
  3. Thread 1 steals the task P1 and starts its execution. At the same time thread 2 sets the initial value to the thread-local variable and enters the nested parallel loop;
  4. Thread 1 continues splitting of the iteration range and spawns an additional task (P1 in the task pool of thread 1). Thread 2 splits the iteration range of the nested parallel loop (P2 in the task pool of thread 2);
  5. Thread 3 steals the task generated by the nested parallel loop;
  6. Thread 2 steals the task generated by the outer level parallel loop and overwrites the previous value in the thread-local variable;
  7. Thread 2 finishes the task generated by the outer parallel loop and returns to the task generated by the nested parallel loop;
  8. Thread 2 finishes the task generated by the nested parallel loop and returns to the initial task generated by the outer parallel loop;
  9. Thread 2 fails on the assertion because the value was overwritten in step 6.
The TBB task scheduler cannot detect that it is unsafe to execute an outer-level task on a nested level. We do not want to always limit stealing for the sake of this potential issue, as doing so can hurt performance in other cases. Theoretically, we could use coroutines [6] to create a new stack for each outer-level task, preventing its execution on a nested level. However, coroutines are not part of the C++ language yet; moreover, they might introduce additional overhead that would break the pay-as-you-go principle for cases that do not require work isolation. Therefore, we decided to implement a special interface that can be used explicitly to prevent stealing from the outer level:

namespace tbb {
    namespace this_task_arena {
        template<typename F> void isolate( const F& f );
    }
}

This isolation interface ensures that the user-provided functor runs without being affected by unrelated tasks inside the scheduler. Let us call such functors “isolated regions”. Within an isolated region, any blocking wait (e.g. tbb::parallel_for) can only process tasks spawned in this region. If a thread blocked inside an isolated region tries to steal a task from another thread, it first checks whether the candidate task satisfies the isolation constraint. If it does, the task is executed; otherwise, the thread skips it and looks for another task that does satisfy the constraint.

We can rework the example to prevent undesired behavior by guarding the nested parallel loop with this_task_arena::isolate:

tbb::enumerable_thread_specific<int> ets;
tbb::parallel_for( 0, N1, [&ets]( int i ) {
    // Set a thread specific value
    ets.local() = i;
    // Run the second parallel loop in an isolated region to prevent
    // the current thread from taking tasks generated by the outer
    // parallel loop.
    tbb::this_task_arena::isolate( []{
        tbb::parallel_for( 0, N2, []( int j ) { /* Some work */ } );
    } );
    assert( ets.local()==i ); // Valid assertion
} );

Let us consider the main properties of the work isolation feature:
  1. The isolation only constrains threads that entered or joined an isolated region. Worker threads outside of an isolated region can take any tasks, including the tasks spawned in an isolated region. 
  2. When a thread without isolation executes a task spawned in an isolated region, it joins the region of that task and becomes isolated until the task is completed.
  3. Threads waiting inside an isolated region cannot process tasks spawned in other isolated regions (i.e. all regions are mutually isolated). Moreover, if a thread within an isolated region enters a nested isolated region, it cannot process tasks from the outer region.
These properties are quite simple; nevertheless, the third one carries the hazard of starvation or even a deadlock. Interfaces that spawn tasks separately from waiting, such as tbb::task_group and the TBB flow graph, must be used with caution when mixed with work isolation. A task spawned in one isolated region cannot be waited for or processed inside a different isolated region. The essential rule is to create tasks and wait for their completion inside the same isolated region. Let us consider the following example:
void foo() {
    tbb::task_group tg;
    for ( int i=0; i<N; ++i ) {
        // We want to isolate each task of the task group to
        // process only particular parallel loop and prevent
        // processing parallel loops run inside other tasks.
        tbb::this_task_arena::isolate( [&tg]{
            tg.run( []{
                tbb::parallel_for( ... );
            } );
        } );
    }
    // Wait for work completion.
    tg.wait();
}

int main() {
    foo();
    return 0;
}

Each task in the task group is spawned in its own isolated region. By the second property of work isolation, the parallel loop inherits the isolation of its task, and a thread running one parallel loop cannot process tasks generated by the other parallel loops (the third property). The main thread waits outside of any isolated region and, by the first property, can process any task generated by the task group or the parallel loops.

Strictly speaking, this example is correct and there is no issue. However, it uses an error-prone approach of running tasks in an isolated region but waiting for their completion elsewhere. Let us suppose that we want to call the function foo in parallel, and also isolate each call to foo to prevent it from executing tasks generated by the outer parallel loop:

int main() {
    tbb::parallel_for( 0, N, []( int ) {
        // Isolate foo() to prevent it processing tasks
        // generated by the outer parallel loop.
        tbb::this_task_arena::isolate( [] {
            foo();
        } );
    } );
    return 0;
}

We have not changed the function foo, just called it in parallel and in isolation; however, with these changes the code becomes incorrect and will lead to a deadlock with high probability. The isolation changes the behavior of the thread that waits for the task group to complete. The tasks of the task group were spawned inside nested isolated regions, so, according to the third property of work isolation, the waiting thread cannot process them: there are no tasks in its own isolated region. As a result, the thread is simply blocked. If there are enough iterations in the outer-level parallel loop, all worker threads can end up blocked this way.

To solve the deadlock, we need to strictly follow the rule that running work and waiting for its completion should be done inside the same isolated region. Therefore, the function foo should be reworked in such a way that adding new work to the task group is not isolated away from the wait method:

void foo() {
    tbb::task_group tg;
    for ( int i=0; i<N; ++i ) {
        // The run method should be called in the same isolated 
        // region as the wait method, i.e. outside the isolated region.
        tg.run( []{
            // We want to isolate each task of the task group to
            // process only particular parallel loop and prevent 
            // processing parallel loops run inside other tasks.
            tbb::this_task_arena::isolate( []{
                tbb::parallel_for( ... );
            } );
        } );
    }
    // Wait for work completion.
    tg.wait();
}

int main() {
    tbb::parallel_for( 0, N, []( int ) {
        // Isolate foo() to prevent it processing tasks
        // generated by the outer parallel loop.
        tbb::this_task_arena::isolate( [] {
            foo();
        } );
    } );
    return 0;
}

With the fix, the thread waiting for the task group to complete can process tasks from the task group while the nested parallel loop remains isolated.

Despite its simple interface, the work isolation feature is not trivial to use because it is often not obvious whether isolation should be applied. To decide whether isolation is required, follow this three-step logic:

  1. Is nested parallelism used (even indirectly, through third party library calls)? If not, isolation is not needed; otherwise, go to the next step;
  2. Is it safe for a thread to reenter outer-level parallel tasks (as if there were recursion)? Unsafe examples include storing to a thread-local value that an outer task might overwrite, re-acquiring a non-recursive mutex already held by the thread, and using any other resource that breaks when touched by the same thread again. If reentrance is safe, isolation is not needed; otherwise, go to the next step;
  3. Isolation is needed. Nested parallelism has to be called inside the isolated region. Special attention should be paid to algorithms with separated running and waiting interfaces to guarantee that both are called inside the same region.
Interestingly, the first version of Intel TBB was introduced more than 10 years ago [8][9], but the work isolation feature was implemented only quite recently. Several reasons influenced our decision to implement it now. The scheduler behavior that allows executing outer-level tasks at a nested level has rarely led to issues; however, because of the popularity of Intel TBB, it is now used in many libraries and solutions. As a result, complex applications with many components can use Intel TBB indirectly or even unintentionally. In addition, new interfaces such as the TBB flow graph, which can serve as a coordination layer across an application and its modules, increase the demand for the work isolation functionality.
 
References:
[1] Intel® Threading Building Blocks. https://software.intel.com/en-us/intel-tbb.
[2] Intel® Threading Building Blocks. Open source site. https://www.threadingbuildingblocks.org.
[3] What's New? Intel® Threading Building Blocks 2017. https://software.intel.com/en-us/articles/whats-new-intel-threading-building-blocks-2017.
[4] Intel® Threading Building Blocks documentation: Task Scheduler. https://software.intel.com/en-us/node/506294.
[5] Wikipedia: Work stealing. https://en.wikipedia.org/wiki/Work_stealing.
[6] Wikipedia: Coroutine. https://en.wikipedia.org/wiki/Coroutine.
[7] Intel® Threading Building Blocks documentation: Work Isolation. https://www.threadingbuildingblocks.org/docs/help/index.htm#tbb_userguide/work_isolation.html.
[8] Celebrating a Decade of Parallel Programming with Intel® Threading Building Blocks (Intel®TBB). https://software.intel.com/en-us/blogs/2016/06/10/celebrating-a-decade-of-parallel-programming-with-intel-threading-building-blocks.
[9] Intel® Threading Building Blocks Celebrates 10 Years!. https://software.intel.com/en-us/articles/intel-threading-building-blocks-celebrates-10-years.
 