Floating-point Settings in Worker Threads May Differ from Master Thread for OpenMP, TBB and Intel Cilk Plus

Reference Number : dpd20087206, bz-1755, dpd20088154

Version : 2011 (Compiler 12.0)

Product : Intel Parallel Composer and Intel Composer XE

Operating System : Windows, Linux, Mac OS X

Problem Description : 
On most operating systems, thread-creation routines do not propagate the floating-point state from the master thread to worker threads. Consequently, settings such as the rounding mode or abrupt underflow (also known as flush-to-zero) may differ in the worker threads. A computation performed in a worker thread may therefore not produce a result identical to that of the same computation carried out in the master thread. This may occur for applications built with the Intel® Compiler version 12 that use either OpenMP or Intel® Cilk™ Plus, or for applications that use Intel® Threading Building Blocks (TBB), whether built with an Intel compiler or with another compiler such as gcc.

For example, if an application is compiled on Windows with /O2 /Qopenmp, abrupt underflow would be enabled for the master thread but not for the workers, and a floating-point operation that resulted in a denormalized number for a worker thread would result in zero for the master thread. If a Fortran application was compiled with /Qopenmp /fpe:0, or a C application with /Qopenmp /Qfp-trap:common, to unmask floating-point exceptions, a division by zero occurring in the master thread would raise an exception, but one occurring in the worker thread would not. See the Intel Compiler User and Reference Guide for other switches that might modify the floating-point control word.
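The mismatch described above can be made visible, and corrected by hand, with the standard `<cfenv>` interface. The following is a minimal sketch, not taken from the article: the names `WorkerModes` and `run_worker_with_master_mode` are hypothetical, a plain `std::thread` stands in for a runtime-managed worker, and only the rounding mode is handled. Flush-to-zero and exception masks are controlled through platform-specific means that this sketch does not cover.

```cpp
#include <cassert>
#include <cfenv>
#include <thread>

// Rounding modes observed inside a worker thread (hypothetical helper type).
struct WorkerModes {
    int on_entry;     // mode the worker started with (platform-dependent)
    int after_apply;  // mode after re-applying the master's setting
};

// Starts a worker, records its initial rounding mode, then explicitly
// applies the mode passed in from the master thread and records it again.
WorkerModes run_worker_with_master_mode(int master_mode) {
    WorkerModes modes{};
    std::thread worker([&modes, master_mode] {
        modes.on_entry = std::fegetround();
        const int rc = std::fesetround(master_mode);  // returns 0 on success
        assert(rc == 0);
        (void)rc;  // avoid an unused-variable warning when NDEBUG is defined
        modes.after_apply = std::fegetround();
    });
    worker.join();
    return modes;
}
```

Whether `on_entry` already equals the master's mode depends on the operating system and threading library, so the sketch records it without assuming a value. Only `after_apply` is guaranteed to match the mode that was passed in.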

Resolution Status : 
For OpenMP, this issue may be worked around by setting the environment variable KMP_INHERIT_FP_CONTROL=1. This will cause the worker threads to inherit the floating-point settings of the master thread at the time of thread creation. This setting has been made the default in the 12.0 compiler update 2 contained in Intel Composer XE 2011 update 2.
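Where the environment variable cannot be relied upon, the same effect can be approximated in source code for the rounding mode. This is a hedged sketch rather than a documented Intel workaround: the function name `count_rounding_mismatches` is hypothetical, and it assumes the program is built with OpenMP enabled (without it the pragma is ignored and the body runs on a single thread).

```cpp
#include <cassert>
#include <cfenv>

// Re-applies the master's rounding mode at the top of a parallel region and
// returns how many threads still report a different mode afterwards.
int count_rounding_mismatches() {
    const int master_mode = std::fegetround();  // read on the master thread
    int mismatches = 0;
    #pragma omp parallel reduction(+ : mismatches)
    {
        const int rc = std::fesetround(master_mode);  // applied per thread
        assert(rc == 0);
        (void)rc;  // avoid an unused-variable warning when NDEBUG is defined
        if (std::fegetround() != master_mode) {
            mismatches += 1;
        }
    }
    return mismatches;
}
```

The cost of this approach is that every parallel region that depends on the setting has to repeat the call, which is why inheriting the state at thread creation is the more convenient fix.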

For TBB, the issue has been fixed in TBB version 3.0 update 5 and in Intel C++ Composer XE 2011 update 2.
For Intel Cilk Plus, the issue is fixed in the 12.1 compiler contained in Intel Composer XE 2011 SP1 and in subsequent compilers.

anonymous:

In Windows there is also a per-thread exception translation function hook, needed to turn hardware exceptions into C++ exceptions. And using Microsoft's C++ runtime on x64, once an exception has been caught there does not seem to be any way to get the FPU state on the worker thread back to "normal". It looks like I'm going to have to resort to embedding my own exception processing into a wrapper around the TBB parallel_* functions.

There needs to be an API hook for injecting thread state management into the arena objects. MFC and other frameworks have per-thread requirements too.
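A wrapper of the kind the comment describes could be sketched with a scope guard that saves the floating-point environment and restores it when the scope exits, including after an exception. The names `FpEnvGuard` and `run_guarded` are hypothetical and not part of TBB. The sketch handles ordinary C++ exceptions only, and whether it is sufficient for translated hardware exceptions under Microsoft's x64 runtime is not something it can establish.

```cpp
#include <cassert>
#include <cfenv>
#include <exception>

// Saves the calling thread's floating-point environment on construction
// and restores it on destruction (hypothetical helper, not a TBB class).
class FpEnvGuard {
public:
    FpEnvGuard() {
        const int rc = std::fegetenv(&saved_);  // returns 0 on success
        assert(rc == 0);
        (void)rc;  // avoid an unused-variable warning when NDEBUG is defined
    }
    ~FpEnvGuard() { std::fesetenv(&saved_); }
    FpEnvGuard(const FpEnvGuard&) = delete;
    FpEnvGuard& operator=(const FpEnvGuard&) = delete;

private:
    std::fenv_t saved_;
};

// Runs body() under the guard. Returns true if body completed, false if it
// threw a std::exception. Either way the environment is restored on return.
template <typename Body>
bool run_guarded(Body body) {
    FpEnvGuard guard;
    try {
        body();
        return true;
    } catch (const std::exception&) {
        return false;
    }
}
```

Placing the guard inside the task body, rather than around the call that launches the parallel algorithm, would make the restore happen on whichever worker thread executed the task.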

Barry Tannenbaum (Intel):

This issue was resolved for Intel Cilk Plus as of the release of Composer V12.1. The floating point state is saved on each spawn and set on a steal.
