Getting reproducible results with Intel® MKL

Introduction

The Intel® Math Kernel Library (Intel® MKL) contains highly optimized, extensively threaded math routines for science, engineering, and financial applications that require maximum performance. While performance is the chief reason for the existence of Intel MKL, users count on Intel MKL to employ the best practices available to provide accurate results. In recent years, a growing number of Intel MKL users have noted a number of factors that can affect the numerical results of the library. This article discusses the underlying reason for these variations in output, the mechanisms that cause the variations in Intel MKL, and some ways to improve chances for consistency in certain cases.


The sources of indeterminism

Most users of floating point math libraries expect some rounding error when doing multiple calculations. A simple experiment shows that the rounding error can be different if the order of operations changes. If we assume for a moment double precision floating point numbers, then:

260 + (-260 + 1) = 260 + -260= 0

since -260 + 1 rounds to -260. Yet

(260 + -260) + 1 = 0 + 1= 1

With this simple example in mind we’ll discuss three different cases where methods used in Intel MKL that are crucial to performance cause indeterminate results.

Memory alignment: One of the ways Intel MKL gets good performance is through use of new instructions made available with successive generations of Intel® processors. Some of these instructions make computation more efficient by performing the same floating point operation on multiple floating point numbers at once. The way some of these instructions are loaded however depends on how the data is situated in memory. If in one run of the program, the data happens to be aligned along a 16-byte boundary, then the first 2 double precision numbers in the array would be grouped together, while in the next run, if the array is offset from that memory boundary, then the 2nd and 3rd double precision numbers are grouped together.  This difference in order can cause different results when running the same program two times consecutively with all settings remaining identical. Update: The latest processors supporting Intel Advanced Vector Extensions (AVX) have larger registers and thus change the requirement to alignment along 32-byte boundaries.

Internal Intel MKL code paths: To get the best performance wherever a program is run, Intel MKL will check on the processor type at runtime and can dispatch processor-specific code accordingly. If a particular instruction set or cache of a certain size is available a specialized code path may exploit it. These code paths are different enough that they will again cause a different order of operations and will cause slightly different results on different processors. 

Threading: Finally, Intel MKL has been threaded for over 10 years, but it is only more recently that access to systems with many processors has become common place. Threading is an important way that Intel MKL provides top performance on today’s systems. With different numbers of threads, a problem will be divided in different ways and order of operations change, causing the results to change slightly. If the number of threads is not set, Intel MKL will by default use the number of cores that the OS reports are available on that system. But even in the case when a program is run more than once on the same systems with same number of threads some function in Intel MKL may return differing results. The reason has to do with how these functions use threading to asynchronously accumulate the results from the multiple threads. Since very small changes in the system can cause the threads to return in a different order, the results can be accumulated in a different order, causing slight variations in output.


What can be done?

In the case of memory alignment, the answer is relatively easy. Simply aligning the memory allocated for input data along 16-byte boundaries (or 32-byte boundaries on processors supporting Intel AVX) will ensure that the results will be consistent. To this end we provide the mkl_malloc() functions to allocate memory along these boundaries.

In the case of threading, many functions will return the same results if run with the same number of threads on the same number of cores. For those functions that are exceptions to this, one should set the number of threads to 1, or link the sequential version of the library at significant cost to the performance on systems where Intel MKL would otherwise use a large number of cores.

Finally, because disabling the optimized code paths or threading can have such a large impact on performance, the best option may be to consider making some allowance for small variations in results wherever possible.
For more complete information about compiler optimizations, see our Optimization Notice.