Controlling floating-point modes when using Intel® Threading Building Blocks

Intel® Threading Building Blocks (Intel® TBB) 4.2 Update 4 introduced enhanced support for managing floating-point settings. Floating-point settings can now be specified at the invocation of most parallel algorithms (including flow::graph). In this blog I want to highlight some peculiarities and details of the new feature and of the overall floating-point settings support in Intel TBB. This blog is not about general floating-point support in the CPU. If you are not familiar with floating-point calculation support in the CPU, I suggest starting with the Understanding Floating-point Operations section in the Intel® C++ Compiler User and Reference Guide, or, for more on the complexities of floating-point arithmetic, the classic “What Every Computer Scientist Should Know About Floating-Point Arithmetic”.

Intel TBB provides two approaches to allow you to specify the desired floating-point settings for tasks executed by the Intel TBB task scheduler:

  1. When the task scheduler is initialized for a given application thread, it captures the current floating-point settings of the thread;
  2. The class task_group_context has a method to capture the current floating-point settings.

Consider the first approach. This approach is implicit: the task scheduler always and unconditionally captures the floating-point settings at the moment of its initialization. The saved settings are then used for all tasks related to this task scheduler; in other words, the settings can be viewed as a property of the task scheduler. This property gives us two ways to apply and manage floating-point settings in an application:

  1. A task scheduler is created for each thread, so we can launch a new thread, specify the desired settings, and then initialize a new task scheduler (explicitly or implicitly) on this thread, which will capture the floating-point settings;
  2. If a thread destroys its task scheduler and initializes a new one, the new settings will be captured. Thus you may specify new floating-point settings before recreating the task scheduler; when the new task scheduler is created, the new settings will be applied to all its tasks.

I’ll try to demonstrate some peculiarities with the following set of examples.

Notation conventions:

  • “fp0”, “fp1” and “fpx” – some states describing floating-point settings;
  • “set_fp_settings( fp0 )” and “set_fp_settings( fp1 )” – set floating-point settings on the current thread;
  • “get_fp_settings( fpx )” – get floating-point settings from the current thread and store them in “fpx”.
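These helpers are notation only. As a rough illustration of what they could look like, here is a sketch built on the standard `<cfenv>` facilities; the names, and the choice to treat the whole C floating-point environment (rounding mode plus status flags) as the “settings”, are my own assumptions, not part of Intel TBB:

```cpp
#include <cfenv>

// Hypothetical realization of the notational helpers used in this blog.
using fp_settings_t = std::fenv_t;

// Store the calling thread's current floating-point environment into fpx.
void get_fp_settings( fp_settings_t &fpx ) {
    std::fegetenv( &fpx );
}

// Apply a previously captured floating-point environment on the calling thread.
void set_fp_settings( const fp_settings_t &fpx ) {
    std::fesetenv( &fpx );
}
```

In the examples below, “fp0” and “fp1” can be read as two distinct values of such a captured environment.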

Example #1. A default task scheduler.

// Suppose fp0 is used here.
// Every Intel TBB algorithm creates a default task scheduler which also captures floating-point
// settings when initialized.
tbb::parallel_for( tbb::blocked_range<int>( 1, 10 ), []( const tbb::blocked_range<int> & ) {
    // fp0 will be used for all iterations on any Intel TBB worker thread.
} );
// There is no longer any way to destroy the task scheduler on this thread.

Example #2. A custom task scheduler.

// Suppose fp0 is used here.
tbb::task_scheduler_init tbb_scope;
tbb::parallel_for( tbb::blocked_range<int>( 1, 10 ), []( const tbb::blocked_range<int> & ) {
    // fp0 will be used for all iterations on any Intel TBB worker thread.
} );

Overall, example #2 has the same effect as example #1, but it opens a way to terminate the task scheduler manually.

Example #3. Re-initialization of the task scheduler.

// Suppose fp0 is used here.
{
    tbb::task_scheduler_init tbb_scope;
    tbb::parallel_for( tbb::blocked_range<int>( 1, 10 ), []( const tbb::blocked_range<int> & ) {
        // fp0 will be used for all iterations on any Intel TBB worker thread.
    } );
} // the destructor calls task_scheduler_init::terminate() to destroy the task scheduler
set_fp_settings( fp1 );
{
    // A new task scheduler will capture fp1.
    tbb::task_scheduler_init tbb_scope;
    tbb::parallel_for( tbb::blocked_range<int>( 1, 10 ), []( const tbb::blocked_range<int> & ) {
        // fp1 will be used for all iterations on any Intel TBB worker thread.
    } );
}

Example #4. Another thread.

void thread_func();
int main() {
    // Suppose fp0 is used here.
    std::thread thr( thread_func );
    // A default task scheduler will capture fp0
    tbb::parallel_for( tbb::blocked_range<int>( 1, 10 ), []( const tbb::blocked_range<int> & ) {
        // fp0 will be used for all iterations on any Intel TBB worker thread.
    } );
    thr.join();
}
void thread_func() {
    set_fp_settings( fp1 );
    // Since it is another thread, Intel TBB will create another default task scheduler which will
    // capture fp1 here. The new task scheduler will not affect floating-point settings captured by
    // the task scheduler created on the main thread.
    tbb::parallel_for( tbb::blocked_range<int>( 1, 10 ), []( const tbb::blocked_range<int> & ) {
        // fp1 will be used for all iterations on any Intel TBB worker thread.
    } );
}

Please notice that Intel TBB can reuse the same worker threads for both “parallel_for”s despite the fact that they are invoked from different threads. But it is guaranteed that all iterations of parallel_for on the main thread will use fp0, and all iterations of the second parallel_for will use fp1.

Example #5. Changing floating-point settings on a user thread.

// Suppose fp0 is used here.
// A default task scheduler will capture fp0.
tbb::parallel_for( tbb::blocked_range<int>( 1, 10 ), []( const tbb::blocked_range<int> & ) {
    // fp0 will be used for all iterations on any Intel TBB worker thread.
} );
set_fp_settings( fp1 );
tbb::parallel_for( tbb::blocked_range<int>( 1, 10 ), []( const tbb::blocked_range<int> & ) {
    // fp0 will be used even though the floating-point settings were changed before the Intel TBB
    // parallel algorithm invocation, since the task scheduler has already captured fp0 and those
    // settings are applied to all Intel TBB tasks.
} );
// fp1 is guaranteed here.

The second parallel_for will leave fp1 unchanged on the user thread (despite the fact that it uses fp0 for all its iterations), since Intel TBB guarantees that an invocation of any Intel TBB parallel algorithm does not visibly modify the floating-point settings of the calling thread, even if the algorithm is executed with different settings.

Example #6. Changing floating-point settings inside an Intel TBB task.

// Suppose fp0 is used here.
// A default task scheduler will capture fp0
tbb::parallel_for( tbb::blocked_range<int>( 1, 10 ), []( const tbb::blocked_range<int> & ) {
    set_fp_settings( fp1 );
    // Setting fp1 inside the task leads to undefined behavior: there are no guarantees about the
    // floating-point settings for any following tasks of this parallel_for or for other algorithms.
} );
// No guarantees about the floating-point settings here or in the following algorithms.

If you really need to use other floating-point settings inside a task you should capture the previous settings and restore them before the end of the task:

// Suppose fp0 is used here.
// A default task scheduler will capture fp0
tbb::parallel_for( tbb::blocked_range<int>( 1, 10 ), []( const tbb::blocked_range<int> & ) {
    get_fp_settings( fpx );
    set_fp_settings( fp1 );
    // ... some calculations.
    // Restore captured floating-point settings before the end of the task.
    set_fp_settings( fpx );
} );
// fp0 is guaranteed here.
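The capture-and-restore idiom maps naturally onto RAII, which also protects against early returns and exceptions. A minimal sketch, using the standard `<cfenv>` environment as a stand-in for the notational “settings” (the guard class is my own illustration, not part of the Intel TBB API):

```cpp
#include <cfenv>

// RAII guard: captures the floating-point environment on construction and
// restores it on destruction, so a task body cannot leak modified settings
// even if it exits early.
class fp_settings_guard {
    std::fenv_t saved_;
public:
    fp_settings_guard() { std::fegetenv( &saved_ ); }
    ~fp_settings_guard() { std::fesetenv( &saved_ ); }
    fp_settings_guard( const fp_settings_guard & ) = delete;
    fp_settings_guard &operator=( const fp_settings_guard & ) = delete;
};
```

Declaring an `fp_settings_guard` at the top of the task body, before changing the settings, guarantees they are restored when the body exits by any path.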

The task scheduler based approach to managing floating-point settings is suitable for the majority of problems. But imagine a situation where you have two parts of a calculation that require different floating-point settings. You may, of course, use the approaches demonstrated in examples #3 and #4, but you may face some issues:

  1. Implementation difficulty: e.g. in example #3 you have to manage the lifetime of the task scheduler object, and in example #4 you may need some synchronization between the two threads;
  2. Performance impact: e.g. in example #3 you must reinitialize the task scheduler, which would not otherwise be necessary, and in example #4 you may face oversubscription issues.

And what about nested calculations with different floating-point settings? With the task scheduler based approach, managing them is not trivial, since it forces you to write a lot of boilerplate code.

Thus, Intel TBB 4.2 Update 4 introduced a new task_group_context based approach: task_group_context was extended to manage the floating-point settings for the tasks associated with it through the new method

void task_group_context::capture_fp_settings();

which captures the floating-point settings from the calling thread and propagates them to its tasks. This allows you to easily specify the required floating-point settings for a particular parallel algorithm:

Example #7. Specifying floating-point settings for a specific algorithm.

// Suppose fp0 is used here.
// The task scheduler will capture fp0.
tbb::task_scheduler_init tbb_scope;
tbb::task_group_context ctx;
set_fp_settings( fp1 );
ctx.capture_fp_settings();
set_fp_settings( fp0 );
tbb::parallel_for( tbb::blocked_range<int>( 1, 10 ), []( const tbb::blocked_range<int> & ) {
    // Although the task scheduler captured fp0 when initialized and the parallel algorithm is
    // called from a thread with fp0, fp1 will be used here for all iterations on any Intel TBB
    // worker thread, since the task group context (with captured fp1) is specified for this
    // parallel algorithm.
}, ctx );

Example #7 is not very interesting, since you can achieve the same effect by setting fp1 before the task scheduler initialization. Let’s consider our imaginary problem with two parts of a calculation that require different floating-point settings. It can be solved like this:

Example #8. Specifying floating-point settings for different parts of a calculation.

// Suppose fp0 is used here.
// The task scheduler will capture fp0.
tbb::task_scheduler_init tbb_scope;
tbb::task_group_context ctx;
set_fp_settings( fp1 );
ctx.capture_fp_settings();
tbb::parallel_for( tbb::blocked_range<int>( 1, 10 ), []( const tbb::blocked_range<int> & ) {
    // In spite of the fact that the floating-point settings are fp1 on the main thread, fp0 will
    // be used here for all iterations on any Intel TBB worker thread, since the task scheduler
    // captured fp0 when initialized.
} );
// fp1 will be used here since Intel TBB algorithms do not change the floating-point settings that
// were set before the call.
tbb::parallel_for( tbb::blocked_range<int>( 1, 10 ), []( const tbb::blocked_range<int> & ) {
    // fp1 will be here since the task group context with captured fp1 is specified for this
    // parallel algorithm.
}, ctx );
// fp1 will be used here.

I have already demonstrated one property of the task group context based approach in examples #7 and #8: it prevails over the task scheduler floating-point settings when the context is specified for an Intel TBB parallel algorithm. Another property is inherent in this approach: nested parallel algorithms inherit floating-point settings from a task group context specified for an outer parallel algorithm.

Example #9. Nested parallel algorithms.

// Suppose fp0 is used.
// The task scheduler will capture fp0.
tbb::task_scheduler_init tbb_scope;
tbb::task_group_context ctx;
set_fp_settings( fp1 );
ctx.capture_fp_settings();
tbb::parallel_for( tbb::blocked_range<int>( 1, 10 ), []( const tbb::blocked_range<int> & ) {
    // fp1 will be used here.
    tbb::parallel_for( tbb::blocked_range<int>( 1, 10 ), []( const tbb::blocked_range<int> & ) {
        // Although the task group context is not specified for the nested parallel algorithm and
        // the task scheduler has captured fp0, fp1 will be here.
    }, ctx );
} );
// fp1 will be used here.

If you need to use the task scheduler floating-point settings inside a nested algorithm you may use an isolated task group context:

Example #10. A nested parallel algorithm with an isolated task group context.

// Suppose fp0 is used.
// The task scheduler will capture fp0.
tbb::task_scheduler_init tbb_scope;
tbb::task_group_context ctx;
set_fp_settings( fp1 );
ctx.capture_fp_settings();
tbb::parallel_for( tbb::blocked_range<int>( 1, 10 ), []( const tbb::blocked_range<int> & ) {
    // fp1 will be used here.
    tbb::task_group_context ctx2( tbb::task_group_context::isolated );
    tbb::parallel_for( tbb::blocked_range<int>( 1, 10 ), []( const tbb::blocked_range<int> & ) {
        // ctx2 is an isolated task group context so it will have fp0 inherited from the task
        // scheduler. That’s why fp0 will be used here.
    }, ctx2 );
}, ctx );
// fp1 will be used here.

There is no doubt that one blog cannot demonstrate all the possibilities of the floating-point support functionality in Intel TBB. But these simple examples demonstrate the basic ideas of floating-point settings management with Intel TBB and can be applied in real-world applications.

The main concepts of floating-point settings can be gathered into the following list:

  • Floating-point settings can be specified either for all Intel TBB parallel algorithms via a task scheduler or for separate Intel TBB parallel algorithms via a task group context;
  • Floating-point settings captured by a task group context prevail over the settings captured during task scheduler initialization;
  • By default all nested algorithms inherit floating-point settings from an outer level if neither a task group context with captured floating-point settings nor an isolated task group context is specified;
  • An invocation of an Intel TBB parallel algorithm does not visibly modify the floating-point settings of the calling thread, even if the algorithm is executed with different settings;
  • Floating-point settings that are set after task scheduler initialization are not visible to Intel TBB parallel algorithms unless the task group context approach is used or the task scheduler is reinitialized;
  • User code inside a task should either not change the floating-point settings, or restore the previous settings before the end of the task.

P.S. A deferred task scheduler captures floating-point settings when the initialize method is called.

Example #11. An explicit task scheduler initialization.

set_fp_settings( fp0 );
tbb::task_scheduler_init tbb_scope( tbb::task_scheduler_init::deferred );
set_fp_settings( fp1 );
tbb_scope.initialize();
tbb::parallel_for( tbb::blocked_range<int>( 1, 10 ), []( const tbb::blocked_range<int> & ) {
    // The task scheduler is declared when fp0 is set but it will capture fp1 since it is
    // initialized when fp1 is set.
} );
// fp1 will be used here.

P.P.S. Be careful if you rely on the auto capture property of a task scheduler. It will fail if your functionality is called inside another Intel TBB parallel algorithm.

Example #12. One more warning: beware of library functions.

Code snippet 1. Slightly modified Example #1. It is valid code and there are no issues.

set_fp_settings( fp0 );
// Run with the hope that the Intel TBB parallel algorithm will create a default task scheduler
// which will also capture the floating-point settings when initialized.
tbb::parallel_for( tbb::blocked_range<int>( 1, 10 ), []( const tbb::blocked_range<int> & ) {...} );

Code snippet 2. Just call “code snippet 1” as a library function.

set_fp_settings( fp1 );
tbb::parallel_for( tbb::blocked_range<int>( 1, 10 ), []( const tbb::blocked_range<int> & ) {
    call “code snippet 1”;
} );
// Possibly, you will want to have fp1 here but see the second bullet below.

This looks like an innocuous example since “code snippet 1” will set the required floating-point settings and perform its calculation with fp0. But it turns out that this example has two issues:

  1. By the time “code snippet 1” is called, the task scheduler will already be initialized and will have captured fp1. Thus “code snippet 1” will perform its calculations with fp1 and ignore the fp0 setting;
  2. Isolation of user floating-point settings is broken since “code snippet 1” changes floating-point settings inside the Intel TBB task and does not restore the initial ones. That’s why there are no guarantees about floating-point settings after execution of Intel TBB parallel algorithm in “code snippet 2”.

Code snippet 3. Corrected solution.

Let’s fix “code snippet 1”:

// Capture the incoming fp settings.
get_fp_settings( fpx );
set_fp_settings( fp0 );
tbb::task_group_context ctx;
ctx.capture_fp_settings();
tbb::parallel_for( tbb::blocked_range<int>( 1, 10 ), []( const tbb::blocked_range<int> &r ) {
    // Here fp0 will be used for all iterations on any Intel TBB worker thread.
}, ctx );
// Restore fp settings captured before setting fp0.
set_fp_settings( fpx );

Code snippet 2 remains unchanged.

set_fp_settings( fp1 );
tbb::parallel_for( tbb::blocked_range<int>( 1, 10 ), []( const tbb::blocked_range<int> &r ) {
    call “fixed code snippet 1”;
} );
// fp1 will be used here since the “fixed code snippet 1” does not change the floating-point
// settings visible to “code snippet 2”.

 

For complete information about compiler optimizations, see the Optimization Notice.