Introduction to Conditional Numerical Reproducibility (CNR)

Starting with the 11.0 release, Intel® MKL provides a feature called Conditional Numerical Reproducibility (CNR): a set of functions that let applications obtain reproducible floating-point results when calling library functions. When these features are used, Intel MKL functions are designed to return the same floating-point results from run to run, subject to the following limitations:

• calls to Intel® MKL occur in a single executable
• the number of computational threads used by the library remains constant throughout the run

Intel® MKL 2019 Update 3 introduces a new strict CNR mode for a small set of functions, which is discussed in more detail at the end of this article. When enabled, functions supporting strict CNR mode will return exactly identical results even if the number of computational threads varies.

It is well known that for general single and double precision IEEE floating-point numbers, the associative property does not always hold, meaning (a+b)+c may not equal a+(b+c). Let's consider a specific example. In infinite-precision arithmetic, 2^-63 + 1 + (-1) = 2^-63. If instead we do this same computation on a computer using double precision floating-point numbers, rounding error is introduced and we clearly see why the order of operations becomes important:

(2^-63 + 1) + (-1) ≈ 1 + (-1) = 0

versus

2^-63 + (1 + (-1)) ≈ 2^-63 + 0 = 2^-63

This inconsistency in results due to order-of-operations is precisely what the new functions are designed to address.

The application-related factors that affect the order of floating-point operations within a single executable program include code-path selection based on run-time processor dispatching, data array alignment, variation in the number of threads, threaded algorithms, and internal floating-point control settings. Most of these factors can be controlled by the user by fixing the number of threads, setting the floating-point controls consistently, and taking steps to align memory when it is allocated (see this previous article on getting reproducible results). Run-time dispatching and certain threaded algorithms, on the other hand, have not given users any way to ensure the same order of operations from run to run.

Intel MKL does run-time processor dispatching in order to identify the appropriate internal code paths to traverse for the Intel MKL functions called by the application. The code paths chosen may differ across a wide range of Intel processors and IA compatible processors and may provide differing levels of performance. For example, an Intel MKL function running on an Intel® Pentium® 4 processor may run an SSE2-based code path, while on a more recent Intel® Xeon® processor supporting Intel® Advanced Vector Extensions (AVX), that same library function may dispatch to a different code-path that uses these AVX instructions. This is because each unique code path has been optimized to match the features available on the underlying processor. The feature-based approach introduces a challenge: if any of the internal floating-point operations are done in a different order, or are re-associated, then the computed results may differ.

Dispatching optimized code-paths based on the capabilities of the processor on which it is running is central to the optimization approach used by Intel MKL so it is natural that there should be some performance trade-offs when requiring consistent results. If limited to a particular code-path, Intel MKL performance can in some circumstances degrade by more than half. This can be easily understood by noting that matrix-multiply performance nearly doubled with the introduction of new processors supporting AVX instructions. In other cases, performance may degrade by 10-20% if algorithms are restricted to maintain the order of operations.

Intel® MKL 11.0 includes new functions and environment variables, shown in figures 1, 2, and 3, designed to help users get bitwise reproducible results from the Intel MKL functions used (hence "conditional bitwise reproducibility" and the CBWR abbreviation in their names). To better understand how to use these features, some usage examples are provided below. Only the MKL_CBWR_COMPATIBLE option is supported on non-Intel CPUs.

To ensure MKL calls return the same results on all Intel or Intel compatible CPUs supporting SSE2 instructions or later:

Make sure your application uses a fixed number of threads and call

`mkl_cbwr_set(MKL_CBWR_COMPATIBLE)`

or set the environment variable

`MKL_CBWR=COMPATIBLE`

Note: the special MKL_CBWR_COMPATIBLE option is provided because Intel and Intel compatible CPUs have approximation instructions (e.g., rcpps/rsqrtps) that may return different results. This option ensures that Intel MKL uses an SSE2 only code-path which does not use these instructions.

To ensure MKL calls return the same results on every Intel CPU that supports SSE2 instructions or later:

Make sure your application uses a fixed number of threads and call

`mkl_cbwr_set(MKL_CBWR_SSE2)`

or set the environment variable

`MKL_CBWR=SSE2`

Note: on non-Intel CPUs results may differ because the MKL_CBWR_COMPATIBLE code path is run instead.

To ensure MKL calls return the same results on every Intel CPU that supports SSE4.1 instructions or later:

Make sure your application uses a fixed number of threads and call

`mkl_cbwr_set(MKL_CBWR_SSE4_1)`

or set the environment variable

`MKL_CBWR=SSE4_1`

Note: on non-Intel CPUs the results may differ because the MKL_CBWR_COMPATIBLE code path is run instead.

To ensure MKL calls return the same results on every Intel CPU that supports AVX instructions or later:

Make sure your application uses a fixed number of threads and call

`mkl_cbwr_set(MKL_CBWR_AVX)`

or set the environment variable

`MKL_CBWR=AVX`

Note: on non-Intel CPUs the results may differ because the MKL_CBWR_COMPATIBLE code path is run instead. On an Intel CPU without AVX support, the MKL_CBWR_DEFAULT path is run instead.

Note

In MKL versions before 11.1, there was one more limitation: input and output arrays in function calls had to be aligned on 16-, 32-, or 64-byte boundaries on systems with SSE, AVX1, or AVX2 instruction support, respectively. MKL 11.1 dropped this requirement. CNR can be obtained with unaligned input arrays, but aligning data will typically lead to better performance.

Strict CNR Mode

Intel® MKL 2019 Update 3 introduces a new strict CNR mode which provides stronger reproducibility guarantees for a limited set of routines and conditions. When strict CNR is enabled, MKL will produce bitwise identical results, regardless of the number of threads, for the following functions:

• ?gemm, ?trsm, ?symm, and their CBLAS equivalents.

Additionally, strict CNR is only supported under the following conditions:

• the CNR code-path must be set to AVX2 or later, or the AUTO code-path is used and the processor is an Intel processor supporting AVX2 or later instructions;
• the 64-bit Intel MKL libraries are used.

To enable strict CNR, add the new MKL_CBWR_STRICT flag to the CNR code-path you would like to use:

`mkl_cbwr_set(MKL_CBWR_AVX512 | MKL_CBWR_STRICT)`

or append “,STRICT” to the MKL_CBWR environment variable:

`MKL_CBWR=AVX2,STRICT`

Besides guaranteeing run-to-run reproducibility for varying numbers of threads, strict CNR also allows the expert user to partition their input matrices, call ?gemm, ?symm, or ?trsm on the resulting sub-problems, then reassemble the results. When strict CNR is enabled, the results will be bitwise identical, no matter how the inputs are partitioned, as long as partitioning is performed only along the dimensions listed in the table below:
