MKL - different result in each processor?

holysword's picture

Hello,

I am having problems with an MPI code using Intel MKL and ifort (Composer version: 13.1.0.146). Each processor has exactly the same matrix, and each should be able to perform some sequential operations on it. Each processor is expected to obtain exactly the same values, since they all use the same binaries and the same libraries, and the nodes are identical (2 Sandy Bridge EP E5-2670 processors in each node). However, routines such as CGEMM and CGESVD produce slightly different values on each processor, a variation of the order of 1e-6 to 1e-8. This does not always happen, and it seems to depend on the number of processors being used.

Is this behaviour expected at all? The difference is below the machine precision (considering single precision), but aren't the individual cores supposed to perform the roundoffs in the same manner? If this behaviour is not expected, I could provide some example matrices.

Thanks in advance

Identical results would require ensuring the same data alignment (mod 32 bytes) on every processor, or using the slower consistency option.

Zhang Z (Intel)'s picture

Quote:

holysword wrote:

Is this behaviour expected at all? The difference is below the machine precision (considering single precision), but aren't the individual cores supposed to perform the roundoffs in the same manner?

Thank you for asking this question! MKL does have a way to guarantee identical results as long as some preconditions are met. We call this feature "Conditional Numerical Reproducibility". See here for a complete discussion on how to use this feature: http://software.intel.com/en-us/articles/conditional-numerical-reproduci...

>>...However, routines such as CGEMM and CGESVD produce slightly different values on each processor, a variation of the
>>order of 1e-6 to 1e-8...

Please verify which MKL DLLs are used on both computers. For example, it is possible that mkl_def.dll is used on Computer A and mkl_avx.dll is used on Computer B. To get identical results, you should always verify which set of CPU-dependent DLLs (also known as Waterfall DLLs) is used on each computer.

holysword's picture

Thank you very much TimP, Zhang Z and Sergey Kostrov.

Setting KMP_DETERMINISTIC_REDUCTION=yes and MKL_CBWR=SSE4_2 solves the issue with no noticeable slowdown. I still compile with the same optimization flags (including -xAVX). I tried MKL_CBWR=AVX but that didn't work, and I wonder why: all the processors are the same EP E5-2670 model, and all the DLLs and libraries are the same as well.

You didn't say whether you took care to set all local data passed to MKL on 32-byte boundaries (16-byte may be sufficient if you avoid AVX, but 32 may improve performance, even with SSE).

The variations you quote are consistent with single precision vector sum reduction on arrays of differing alignment.  You could check each address passed to MKL % 16 for consistency.  If you succeed in using the non-deterministic AVX it may not be the identical result as the "deterministic" one.
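To see why alignment can change a single-precision sum, here is a minimal C sketch. It is not MKL's actual kernel, and `sum_with_peel` and its `peel` parameter are hypothetical names introduced only for illustration. It mimics a vectorized reduction: a scalar loop handles `peel` leading elements, as a vector loop might do until it reaches an aligned address, and four lane accumulators handle the rest. C is used here purely for demonstration; the same effect applies to Fortran code.

```c
#include <stddef.h>

/* Hypothetical illustration, not MKL code: sum n floats the way a 4-lane
   vector loop might, after first handling `peel` leading elements in
   scalar mode (as happens when the array start is not aligned). */
float sum_with_peel(const float *x, size_t n, size_t peel)
{
    float head = 0.0f, tail = 0.0f;
    float lane[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    size_t i = 0;

    if (peel > n)
        peel = n;
    /* Scalar "peel" loop over the unaligned leading elements. */
    for (; i < peel; ++i)
        head += x[i];
    /* Main loop: four independent partial sums, one per vector lane. */
    for (; i + 4 <= n; i += 4)
        for (size_t k = 0; k < 4; ++k)
            lane[k] += x[i + k];
    /* Scalar remainder loop. */
    for (; i < n; ++i)
        tail += x[i];
    /* Combine the partial sums in a fixed order. */
    return head + ((lane[0] + lane[1]) + (lane[2] + lane[3])) + tail;
}
```

Assuming IEEE single precision with no extended intermediate precision, calling this on `{1e8f, 1, 1, 1, -1e8f, 1, 1, 1}` returns 6 with `peel = 0` and 3 with `peel = 1`, because the large values cancel at different points in the two groupings. The input is deliberately extreme so the effect is easy to see; with ordinary data the results would more typically differ only in the last digits, as reported above.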

DETERMINISTIC_REDUCTION may not permit use of AVX-256 as that could require different blocking, incompatible with consistent results.

>>...You didn't say whether you took care to set all local data passed to MKL on 32-byte boundaries (16-byte may be sufficient if you
>>avoid AVX, but 32 may improve performance, even with SSE)...

I recently ran a set of tests with the CRT malloc (default alignment) and MKL mkl_malloc (which allows setting different alignments) functions, and I didn't see any performance gain when calculating a product of two matrices using the MKL sgemm and dgemm functions.

In my tests, 32-byte alignment is of more benefit on early Core i7, so I agree the gain may not appear on the latest CPUs.

holysword's picture

Quote:

TimP (Intel) wrote:
You didn't say whether you took care to set all local data passed to MKL on 32-byte boundaries (16-byte may be sufficient if you avoid AVX, but 32 may improve performance, even with SSE).

I am sorry, what do you mean by 32-byte boundaries? All variables are declared with the default kind (that is, just REAL, COMPLEX and INTEGER; no DOUBLE PRECISION, KIND declaration or anything of that sort).

For example, in the case of arrays you could try a command-line option as follows: ifort /align:array32byte...
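A "32-byte boundary" here means that the starting address of the array is a multiple of 32, independent of the kind of its elements. As a rough sketch of the idea in C (the thread's code is Fortran, so this is only an illustration), the helpers below check an address and allocate aligned storage. `is_aligned` and `alloc_aligned32` are hypothetical names, and the sketch assumes a C11 runtime that provides `aligned_alloc`; the `mkl_malloc` function mentioned earlier in the thread serves a similar purpose inside MKL.

```c
#include <stdint.h>
#include <stdlib.h>

/* Returns 1 if p lies on an `alignment`-byte boundary, 0 otherwise. */
int is_aligned(const void *p, size_t alignment)
{
    return ((uintptr_t)p % alignment) == 0;
}

/* Hypothetical helper: allocate at least nbytes on a 32-byte boundary.
   C11 aligned_alloc expects the size to be a multiple of the alignment,
   so the request is rounded up first. Release the block with free(). */
void *alloc_aligned32(size_t nbytes)
{
    size_t rounded = (nbytes + 31u) / 32u * 32u;
    if (rounded == 0)
        rounded = 32;
    return aligned_alloc(32, rounded);
}
```

Comparing `(uintptr_t)p % 32` for the same array on every MPI rank is one way to confirm whether the alignment really is consistent across processes.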
