Introduction to Conditional Numerical Reproducibility (CNR)

Starting with the 11.0 release, Intel® MKL introduces a feature called Conditional Numerical Reproducibility (CNR), which provides functions for obtaining reproducible floating-point results when an application calls library functions. When these features are used, Intel MKL functions are designed to return the same floating-point results from run to run, subject to the following limitations:

  • calls to Intel® MKL occur in a single executable
  • the number of computational threads used by the library remains constant throughout the run

In MKL versions before 11.1, there was one more limitation: input and output arrays in function calls had to be aligned on 16-, 32-, or 64-byte boundaries on systems with SSE, AVX1, or AVX2 instruction support, respectively. MKL 11.1 dropped this requirement: CNR can be obtained on unaligned input arrays, although aligning data will typically lead to better performance.

It is well known that for general single and double precision IEEE floating-point numbers, the associative property does not always hold, meaning (a+b)+c may not equal a+(b+c). Let's consider a specific example. In infinite-precision arithmetic, 2^-63 + 1 + (-1) = 2^-63. If instead we do this same computation on a computer using double precision floating-point numbers, rounding error is introduced and we clearly see why order of operations becomes important:

(2^-63 + 1) + (-1) ≈ 1 + (-1) = 0

versus

2^-63 + (1 + (-1)) ≈ 2^-63 + 0 = 2^-63

This inconsistency in results due to order-of-operations is precisely what the new functions are designed to address.

The application-related factors that affect the order of floating-point operations within a single executable program include code-path selection based on run-time processor dispatching, data array alignment, variation in the number of threads, threaded algorithms, and internal floating-point control settings. Most of these factors can be controlled by the user by fixing the number of threads, controlling the floating-point settings, and taking steps to align memory when it is allocated (see this previous article on getting reproducible results). Run-time dispatching and certain threaded algorithms, on the other hand, have not allowed users to make changes that can ensure the same order of operations from run to run.

Intel MKL does run-time processor dispatching in order to identify the appropriate internal code paths to traverse for the Intel MKL functions called by the application. The code paths chosen may differ across a wide range of Intel processors and IA compatible processors and may provide differing levels of performance. For example, an Intel MKL function running on an Intel® Pentium® 4 processor may run an SSE2-based code path, while on a more recent Intel® Xeon® processor supporting Intel® Advanced Vector Extensions (AVX), that same library function may dispatch to a different code-path that uses these AVX instructions. This is because each unique code path has been optimized to match the features available on the underlying processor. The feature-based approach introduces a challenge: if any of the internal floating-point operations are done in a different order, or are re-associated, then the computed results may differ.

Dispatching optimized code-paths based on the capabilities of the processor on which the library is running is central to the optimization approach used by Intel MKL, so it is natural that there should be some performance trade-offs when requiring consistent results. If limited to a particular code-path, Intel MKL performance can in some circumstances degrade by more than half. This can be easily understood by noting that matrix-multiply performance nearly doubled with the introduction of new processors supporting AVX instructions. In other cases, performance may degrade by 10-20% if algorithms are restricted to maintain the order of operations.

Intel® MKL 11.0 includes new functions and environment variables, shown in figures 1, 2, and 3, designed to help users get bitwise reproducible results from the Intel MKL functions used (hence "conditional bitwise reproducibility" and the CBWR abbreviation in their names). To better understand how to use these features, some usage examples are provided below. Only the MKL_CBWR_COMPATIBLE option is supported on non-Intel CPUs.

To ensure MKL calls return the same results on all Intel or Intel-compatible CPUs supporting SSE2 instructions or later, make sure your application uses a fixed number of threads and that input/output arrays in Intel MKL function calls are aligned properly, then call

mkl_cbwr_set(MKL_CBWR_COMPATIBLE)

or set the environment variable

MKL_CBWR = COMPATIBLE

Note: the special MKL_CBWR_COMPATIBLE option is provided because Intel and Intel compatible CPUs have approximation instructions (e.g., rcpps/rsqrtps) that may return different results. This option ensures that Intel MKL uses an SSE2 only code-path which does not use these instructions.

 

To ensure MKL calls return the same results on every Intel CPU that supports SSE2 instructions or later, make sure your application uses a fixed number of threads and that input/output arrays are aligned properly, then call

mkl_cbwr_set(MKL_CBWR_SSE2)

or set the environment variable

MKL_CBWR = SSE2

Note: on non-Intel CPUs the results may differ because the MKL_CBWR_COMPATIBLE code-path is run instead.

To ensure MKL calls return the same results on every Intel CPU that supports SSE4.1 instructions or later, make sure your application uses a fixed number of threads and that input/output arrays are aligned properly, then call

mkl_cbwr_set(MKL_CBWR_SSE4_1)

or set the environment variable

MKL_CBWR = SSE4_1

Note: on non-Intel CPUs the results may differ because the MKL_CBWR_COMPATIBLE code-path is run instead.

To ensure MKL calls return the same results on every Intel CPU that supports AVX instructions or later, make sure your application uses a fixed number of threads and that input/output arrays are aligned properly, then call

mkl_cbwr_set(MKL_CBWR_AVX)

or set the environment variable

MKL_CBWR = AVX

Note: on non-Intel CPUs the results may differ because the MKL_CBWR_COMPATIBLE code-path is run instead. On an Intel CPU without AVX support, the MKL_CBWR_DEFAULT path is run instead.

Please consult the user guide for additional details.
