MKL vs Microsoft exp() function


I have a client that is migrating a large C++ software base from 32- to 64-bit code in MS Visual Studio. One of the problems they are having is that the 32- and 64-bit versions of the C library exp() function produce results that differ by 1ulp for some operands, and this is causing regression tests to fail. One potential solution I am considering is to use Intel MKL instead of the Microsoft library. So I have a few questions:

1. Do the 32-bit and 64-bit builds of MKL produce identical results for exp() and other transcendental functions, for all operands, assuming that SSE2 is enabled for our 32-bit code?

2. Although the client has mostly Intel hardware, I believe they have a few AMD Opteron-based server farms. Does MKL work on Opterons? If so, are there any performance penalties if MKL is used in place of the Microsoft library?

3. Is there any way of getting the Microsoft .NET framework to use MKL? I assume it may have the same 32/64-bit differences, although I haven't tested that yet.

4. What other benefits might my client gain by switching to MKL?

Thanks in advance - dc42


You would have to call an MKL exp function explicitly (possibly by macro substitution) to use MKL as a solution, if I understand the question you have posed. MKL doesn't provide scalar math functions, although you could treat a scalar argument as a vector of length 1. You would choose the highest-accuracy version; fast vector exponentials normally permit more than 1 ULP of variation.

You might consider whether a trial of Intel C++ (which includes MKL) would determine whether there is a better solution. Intel C++ has its own library of math functions and supports an option, /Qimf-arch-consistency:true, with which you can give a list of functions for which you want a version that gives the same result on AMD and Intel platforms. You would also want to set a /arch option such as /arch:SSE3 that works on all your platforms. You may also need to consider the option /fp:source, which disables vectorization optimizations where data alignment could produce differences of 1 ULP.

MKL libraries (like C++ compiled code) work as unmanaged code under .net.  See this:

http://software.intel.com/sites/default/files/m/4/b/0/6/5/33774-Using_In...

>>...I have a client that is migrating a large C++ software base from 32- to 64-bit code in MS Visual Studio. One of the problems
>>they are having is that the 32- and 64-bit versions of the C library exp() function produce results that differ by 1ulp...

I will do a verification and post results. What Windows platform are you using?

>>1. Do the 32-bit and 64-bit builds of MKL produce identical results for exp() and other transcendental functions, for all
>>operands, assuming that SSE2 is enabled for our 32-bit code?..

Did you try downloading an evaluation version of MKL? I wouldn't expect somebody else to have already published test results for such simple, and at the same time critical, tests.

Here are two questions:

- Did you use _set_SSE2_enable CRT function? ( Take a look at MSDN for a description of the function and what it does )

- Could you provide more technical details on '...results that differ by 1ulp...'? ( a test case? outputs? etc. )

Tim, thanks for your reply, that was very helpful. Compatibility between results from AMD/Intel processors is not a big issue for us, we've found that differences are rare and we have worked around them. Compatibility between 32- and 64-bit results matters much more. Based on your reply, the Intel compiler might help us, but not MKL.

Sergey, we haven't tried the evaluation version of MKL yet. It would take at least two days to rebuild the code base using MKL and run the tests, and the purpose of my question was to determine whether it would be a worthwhile exercise. I didn't call _set_SSE2_enable because, based on Microsoft's documentation, I don't need to: it is the default on SSE2-enabled processors. I did check with the debugger that it was following the SSE2 code path inside the exp() function. Disassembling the exp() functions revealed that the Microsoft 64-bit and 32-bit SSE2 exp() functions use different algorithms.

There are actually two versions of the exp() function: exp() itself and _CIexp; both of them call into the same SSE2 code implementation.

Hi Sergey, in case you still want to do verification, I am using Visual Studio 2010 under Windows 7 64-bit. Some of the operand values for which the Microsoft implementations of exp() return different values in 32-bit SSE2 and 64-bit builds are:

-0.083296271058701077  0.0066998698389698335  0.0066998698389698344  0.0066998698389698352 

Hi iliyapolak, is _CIexp the one that gets called if you have SSE2 code generation and intrinsic functions enabled?

Sergey touched on a good point. With the Microsoft compiler, you need to set /arch:SSE2 for the 32-bit compiler to generate SSE2 code, as the x64 compiler does by default. In VS2012 you have the option to generate AVX code via /arch:AVX. I don't know the details of how these options affect the Microsoft math libraries, but you could expect numerical differences, particularly with float data types, between default 32-bit non-SSE code and 64-bit SSE2 code, even if the math functions were identical.

The Intel compilers changed the IA32 default to /arch:SSE2 a few years ago, so that IA32 and Intel64 defaults match.  Only the 32-bit compilers present the choice of x87 code, which is default for Microsoft CL and invoked by /arch:IA32 for Intel ICL.   The latter option may give you the same math functions which CL uses by default.

Hi David,

I do not know when the compiler chooses to call the _CIexp version of the exp function. By looking at the msvcrt DLL I can see that the two versions have different entry points and different ordinals. The two implementations have different prologs, but the main calculation block is the same. Looking at the code, I can see that some kind of Horner scheme is used.

>>>0.0066998698389698335  0.0066998698389698344  0.0066998698389698352 >>>

These values differ from each other only in the last two significant decimal digits.

Hi Tim, we're using VS2010 and we're already setting /arch:sse2 in an attempt to get the 32- and 64-bit results to match as closely as possible. I have verified with the debugger that the SSE2 code for 32-bit exp() is executed.

Hi iliyapolak, yes, those 3 operands differ from the next in the series by only 1ulp. All of them produce the same value for exp(), with the 32-bit SSE2 and 64-bit results differing by 1ulp. The problem we have is not that some of the results are more than 0.5ulp out; it's the lack of reproducibility between 32- and 64-bit builds of the applications, which means that test results have to be inspected manually.

>>>I have verified with the debugger that the SSE2 code for 32-bit exp() is executed>>>

Can you post the code mentioned above?

>>>I assume it may have the same 32/64-bit differences, although I haven't tested that yet.>>>

I think that you can expect the same differences, because the .NET runtime will probably call the C runtime functions.

>>>The problem we have is not that at some of the results are more than 0.5ulp out, it's lack of reproducibility between 32- and 64-bit builds of the applications, which means that test results have to be inspected manually>>>

I think this is a silly question, but can you identify the input arguments for which your regression tests fail and provide precalculated results to the ported application?

>>...It would take at least two days to rebuild the code base using MKL and run the tests, and the purpose of my
>>question was to determine whether it would be a worthwhile exercise...

On some projects it wouldn't be allowed to proceed with such a change to the main code base, that is, integration of a very big library like MKL, without having a clear picture ( a complete test case! ) of how the MKL exp() function could change matters. You could easily spend even more days on integration of MKL and still get a negative result. Even though a negative result is also a result, a couple of days would be lost.

>>Some of the operand values for which the Microsoft implementations of exp() return different values in 32-bit
>>SSE2 and 64-bit builds are:
>>
>>-0.083296271058701077 0.0066998698389698335 0.0066998698389698344 0.0066998698389698352

Here are binary representations of these numbers:

0.0066998698389698335 -> Binary = 0x3BDB8A95 = 00111011 11011011 10001010 10010101
0.0066998698389698344 -> Binary = 0x3BDB8A95 = 00111011 11011011 10001010 10010101
0.0066998698389698352 -> Binary = 0x3BDB8A95 = 00111011 11011011 10001010 10010101

Any questions? Please provide a simple test case and I'll take a look at it on a couple of platforms. Thanks in advance.

Hi iliyapolak,

>>Can you post mentioned above code?<<

You mean the disassembly? That will have to wait until I am back in the office.

>>I think that this is silly question, but can you indentify those input arguments when your regression test fail and provide precalculated results to ported application?<<

That would mean using different regression test data for the 32- and 64-bit builds. We want to use the same data for both builds, partly because it saves time, partly as a check that we haven't done anything wrong in porting the app to 64 bits.

Hi Sergey,

>>Here are binary representations of these numbers:

0.0066998698389698335 -> Binary = 0x3BDB8A95 = 00111011 11011011 10001010 10010101
0.0066998698389698344 -> Binary = 0x3BDB8A95 = 00111011 11011011 10001010 10010101
0.0066998698389698352 -> Binary = 0x3BDB8A95 = 00111011 11011011 10001010 10010101<<

That may be true in single-precision (float) maths, but we're using double precision. Expressed as doubles, those 3 numbers are different. More precisely, they are decimal approximations of 3 adjacent binary double-precision (64-bit) IEEE floating-point numbers in a series generated using the _nextafter function (see http://msdn.microsoft.com/en-us/library/h0dff77w(v=vs.100).aspx). I'll post the test code when I am back in the office.

>>Can you post mentioned above code?<<

You mean the disassembly? That will have to wait until I am back in the office.>>>

Yes. Thank you.

>>>0.0066998698389698335 -> Binary = 0x3BDB8A95 = 00111011 11011011 10001010 10010101>>>

The decimal value has a precision of 17 decimal digits, so it should be represented by a double-precision floating-point value.

>>...I'll post the test code when I am back in the office...

I will need as many details as possible about your software development environment, that is, OS version, CPUs of the computers, VS versions and editions, etc.

>>...That may be true in single-precision (float) maths, but we're using double precision. Expressed
>>as doubles, those 3 numbers are different...

David, That really makes a difference ( please don't skip such details next time... )!

[ Float ]

0.0066998698389698335 -> Binary = 0x3BDB8A95 = 00111011 11011011 10001010 10010101
0.0066998698389698344 -> Binary = 0x3BDB8A95 = 00111011 11011011 10001010 10010101
0.0066998698389698352 -> Binary = 0x3BDB8A95 = 00111011 11011011 10001010 10010101

[ Double ]

0.0066998698389698335 -> Binary = 0x3F7B71529D88875E = Binary A
0.0066998698389698344 -> Binary = 0x3F7B71529D88875F = Binary B
0.0066998698389698352 -> Binary = 0x3F7B71529D888760 = Binary C

Binary A = 00111111 01111011 01110001 01010010 10011101 10001000 10000111 01011110
Binary B = 00111111 01111011 01110001 01010010 10011101 10001000 10000111 01011111
Binary C = 00111111 01111011 01110001 01010010 10011101 10001000 10000111 01100000

David,

Which binary, that is, Binary A, Binary B, or Binary C, gives the right result(s) in your regression tests?

Note: Data for Long Double will be posted later.

[ Long Double ]

0.0066998698389698335 -> Binary = 0x... = not ready yet
0.0066998698389698344 -> Binary = 0x... = not ready yet
0.0066998698389698352 -> Binary = 0x... = not ready yet

>>...That would mean using different regression test data for the 32- and 64-bit builds.

This is the wrong direction.

>>We want to use the same data for both builds, partly because it saves time, partly as a check that we haven't done anything
>>wrong in porting the app to 64 bits...

This is the right direction: input data sets and source codes ( 32-bit and 64-bit ) should be identical. Only in that case can consistency of all calculations be achieved.

By the way, do you use the CRT function _control87, which controls the precision ( of calculations ) and other parameters of the FP units?

Hi Sergey,

Long double is the same as double on the Microsoft platform (i.e. 64 bits), so it is not of interest to us. Regarding Binary A/B/C: they are not results, they are operand values for which 32-bit SSE2 and 64-bit exp() return different results. We don't use _control87; we accept the Microsoft default, which is that calculations are always rounded to 64 bits - that should give us better agreement between x87 calculations and SSE2 calculations.

>>>Hi iliyapolak, is _CIexp the one that gets called if you have SSE2 code generation and intrinsic functions enabled?>>>

If you could post the disassembly of the called exp() implementation, then I could compare it with the msvcrt exp() exports.

>>Long double is the same as double on the Microsoft platform (i.e. 64 bits), so is not of interest to us.

I see, and I won't stress it any longer.

>>Regarding binary A/B/C, they are not results, they are operand values for which 32-bit SSE2 and
>>64-bit exp() return different results...

I understand that, and at the same time I did some analysis of these numbers. Please take a look at my notes, because you're dealing with differences in these values that are very close to Epsilon for the double floating-point data type.

[ Note 1 ]

...
#define IPP_EPS52 ( 2.2204460492503131e-016 )
...

Epsilon52 = 2.2204460492503131e-016

A = 0.0066998698389698335 -> BinaryA = 0x3F7B71529D88875E
B = 0.0066998698389698344 -> BinaryB = 0x3F7B71529D88875F
C = 0.0066998698389698352 -> BinaryC = 0x3F7B71529D888760
D = 0.0066998698389698300 -> BinaryD = 0x3F7B71529D88875A - Note: Last 2-digits of (D) are set to 0

BinaryA = 00111111 01111011 01110001 01010010 10011101 10001000 10000111 01011110
BinaryB = 00111111 01111011 01110001 01010010 10011101 10001000 10000111 01011111
BinaryC = 00111111 01111011 01110001 01010010 10011101 10001000 10000111 01100000
BinaryD = 00111111 01111011 01110001 01010010 10011101 10001000 10000111 01011010

Numbers (A-D), (B-D), and (C-D) are less than Epsilon (!) for a Double data type:

(A-D) -> ( 0.0000000000000000035 - EPS52 ) = -0.00000000000000021854460492503131
(B-D) -> ( 0.0000000000000000044 - EPS52 ) = -0.00000000000000021764460492503131
(C-D) -> ( 0.0000000000000000052 - EPS52 ) = -0.00000000000000021684460492503131

0.0000000000000000035 -> BinaryA2 = 0x3C5024121797FB36
0.0000000000000000044 -> BinaryB2 = 0x3C544A9A66CDB0D6
0.0000000000000000052 -> BinaryC2 = 0x3C57FB1390C48B2C

[ Note 2 ]

(A-D) = -0.00000000000000021854460492503131 -> BinaryA3
(B-D) = -0.00000000000000021764460492503131 -> BinaryB3
(C-D) = -0.00000000000000021684460492503131 -> BinaryC3

BinaryA3 = 0xBCAF7EDF6F434026
BinaryB3 = 0xBCAF5DAB2CC99279
BinaryC3 = 0xBCAF40276379DBA7

[ Note 3 ]

abs(A-D) = 0.00000000000000021854460492503131 -> BinaryA4
abs(B-D) = 0.00000000000000021764460492503131 -> BinaryB4
abs(C-D) = 0.00000000000000021684460492503131 -> BinaryC4
Epsilon52 = 2.2204460492503131e-016 -> BinaryE4

BinaryA4 = 0x3CAF7EDF6F434026
BinaryB4 = 0x3CAF5DAB2CC99279
BinaryC4 = 0x3CAF40276379DBA7

Epsilon52= 0x3CB0000000000000

abs(A-D) = 0.00000000000000021854460492503131 -> BinaryA4
abs(B-D) = 0.00000000000000021764460492503131 -> BinaryB4
abs(C-D) = 0.00000000000000021684460492503131 -> BinaryC4

[ Note 4 ]

0.00000000000000021854460492503131 -> BinaryA4
0.00000000000000021764460492503131 -> BinaryB4
0.00000000000000021684460492503131 -> BinaryC4
2.2204460492503131e-016 -> BinaryE4

BinaryA4 = 00111100 10101111 01111110 11011111 01101111 01000011 01000000 00100110
BinaryB4 = 00111100 10101111 01011101 10101011 00101100 11001001 10010010 01111001
BinaryC4 = 00111100 10101111 01000000 00100111 01100011 01111001 11011011 10100111
BinaryE4 = 00111100 10110000 00000000 00000000 00000000 00000000 00000000 00000000

[ Note 5 ]

0.0000000000000001 = More accurate representation = 9.99999999999999979097786724035E-17
Epsilon52......... = More accurate representation = 2.22044604925031308084726333618E-16 ( Epsilon52 )
0.0000000000000002 = More accurate representation = 1.99999999999999995819557344807E-16 ( Epsilon52 Truncated )
0.0000000000000010 = More accurate representation = 1.00000000000000007770539987666E-15

Please take a look at these threads:

Forum topic: Mixing of Floating-Point Types ( MFPT ) when performing calculations. Does it improve accuracy?
Web-link: software.intel.com/en-us/forums/topic/361134

Forum topic: Tips for Porting software to a 64-bit platform
Web-link: software.intel.com/en-us/forums/topic/277738

@Sergey

I am just curious: do you use the Win7 calculator for all those decimal->hex->binary conversions?

>>...I am just curious do you use Win7 calculator for all those decimal->hex->binary conversions?

No.

David, please take a look at the float.h header file:

...
#define DBL_DIG 15 /* # of decimal digits of precision */
...

and it means that only 15 decimal digits in a variable of double type are significant ( I know that it is really hard to follow that rule in practical applications... ). Next, I verified that rounding to 17 digits ( instead of using all 19 digits of the input test values ) provides consistent results for tests with 32-bit applications compiled with different C++ compilers and the MKL / IPP libraries.

Here are results of my tests on a 32-bit platform and you need to complete similar tests on a 64-bit platform before making a final decision on how to solve the problem.

32-bit test application ( No Rounding for Test Values )

...
RTdouble dV11 = CrtExp( ( long double )0.0066998698389698335L );
RTdouble dV21 = CrtExp( ( long double )0.0066998698389698344L );
RTdouble dV31 = CrtExp( ( long double )0.0066998698389698352L );
CrtPrintf( RTU("V11 = %.18f\nV21 = %.18f\nV31 = %.18f\n"), dV11, dV21, dV31 );
CrtPrintf( RTU("\n") );
...

[ Intel C++ compiler ]

V11 = 1.006722364175213900
V21 = 1.006722364175213900
V31 = 1.006722364175213900

[ Microsoft C++ compiler ]

V11 = 1.006722364175213700
V21 = 1.006722364175213700
V31 = 1.006722364175213700

[ Borland C++ compiler ]

V11 = 1.006722364175213658 Note: Provides better accuracy and doesn't round last three digits ( 658 ~= 700 )
V21 = 1.006722364175213658
V31 = 1.006722364175213658

[ MinGW C++ compiler ]

V11 = 1.006722364175213700
V21 = 1.006722364175213700
V31 = 1.006722364175213700

32-bit test application ( Rounding for Test Values / Last two digits are set to 0 )

...
RTdouble dV11 = CrtExp( ( long double )0.0066998698389698300L );
RTdouble dV21 = CrtExp( ( long double )0.0066998698389698300L );
RTdouble dV31 = CrtExp( ( long double )0.0066998698389698300L );
CrtPrintf( RTU("V11 = %.18f\nV21 = %.18f\nV31 = %.18f\n"), dV11, dV21, dV31 );
...

[ Intel C++ compiler ]

V11 = 1.006722364175213700
V21 = 1.006722364175213700
V31 = 1.006722364175213700

[ Microsoft C++ compiler ]

V11 = 1.006722364175213700
V21 = 1.006722364175213700
V31 = 1.006722364175213700

[ Borland C++ compiler ]

V11 = 1.006722364175213658 Note: Provides better accuracy and doesn't round last three digits ( 658 ~= 700 )
V21 = 1.006722364175213658
V31 = 1.006722364175213658

[ MinGW C++ compiler ]

V11 = 1.006722364175213700
V21 = 1.006722364175213700
V31 = 1.006722364175213700

32-bit test application with MKL library ( No Rounding for Test Values )

[ Intel C++ compiler ]

vdExp:

dR[0] = 1.006722364175213700
dR[1] = 1.006722364175213700
dR[2] = 1.006722364175213900

vmdExp:

dR[0] = 1.006722364175213700
dR[1] = 1.006722364175213700
dR[2] = 1.006722364175213900

[ Microsoft C++ compiler ]

vdExp:

dR[0] = 1.006722364175213700
dR[1] = 1.006722364175213700
dR[2] = 1.006722364175213900

vmdExp:

dR[0] = 1.006722364175213700
dR[1] = 1.006722364175213700
dR[2] = 1.006722364175213900

[ Borland C++ compiler ]

vdExp:

dR[0] = 1.006722364175213658 Note: Provides better accuracy and doesn't round last three digits ( 658 ~= 700 )
dR[1] = 1.006722364175213658
dR[2] = 1.006722364175213880 Note: Provides better accuracy and doesn't round last three digits ( 880 ~= 900 )

vmdExp:

dR[0] = 1.006722364175213658
dR[1] = 1.006722364175213658
dR[2] = 1.006722364175213880

[ MinGW C++ compiler ]

vdExp:

dR[0] = 1.006722364175213700
dR[1] = 1.006722364175213700
dR[2] = 1.006722364175213900

vmdExp:

dR[0] = 1.006722364175213700
dR[1] = 1.006722364175213700
dR[2] = 1.006722364175213900

32-bit test application with MKL library ( Rounding for Test Values )

[ Intel C++ compiler ]

vdExp:

dR[0] = 1.006722364175213700
dR[1] = 1.006722364175213700
dR[2] = 1.006722364175213700

vmdExp:

dR[0] = 1.006722364175213700
dR[1] = 1.006722364175213700
dR[2] = 1.006722364175213700

[ Microsoft C++ compiler ]

vdExp:

dR[0] = 1.006722364175213700
dR[1] = 1.006722364175213700
dR[2] = 1.006722364175213700

vmdExp:

dR[0] = 1.006722364175213700
dR[1] = 1.006722364175213700
dR[2] = 1.006722364175213700

[ Borland C++ compiler ]

vdExp:

dR[0] = 1.006722364175213658 Note: Provides better accuracy and doesn't round last three digits ( 658 ~= 700 )
dR[1] = 1.006722364175213658
dR[2] = 1.006722364175213658 Note: Provides better accuracy and doesn't round last three digits ( 658 ~= 700 )

vmdExp:

dR[0] = 1.006722364175213658
dR[1] = 1.006722364175213658
dR[2] = 1.006722364175213658

[ MinGW C++ compiler ]

vdExp:

dR[0] = 1.006722364175213700
dR[1] = 1.006722364175213700
dR[2] = 1.006722364175213700

vmdExp:

dR[0] = 1.006722364175213700
dR[1] = 1.006722364175213700
dR[2] = 1.006722364175213700

The expression '...Provides better accuracy and doesn't round last three digits ( 658 ~= 700 )...' means that the CRT function printf of the Borland C++ compiler provides better accuracy because it doesn't use any of Microsoft's CRT libraries and functions.

32-bit test application with IPP library ( No Rounding for Test Values )

...
// IPPAPI( IppStatus, ippsExp_64f_A50, ( const Ipp64f a[], Ipp64f r[], Ipp32s n ) )
// IPPAPI( IppStatus, ippsExp_64f_A53, ( const Ipp64f a[], Ipp64f r[], Ipp32s n ) )

CrtPrintf( RTU("\n") );

Ipp64f a50[4] = { 0.0066998698389698335L, 0.0066998698389698344L, 0.0066998698389698352L };
Ipp64f r50[4] = { 0.0L, 0.0L, 0.0L, 0.0L };

Ipp64f a53[4] = { 0.0066998698389698335L, 0.0066998698389698344L, 0.0066998698389698352L };
Ipp64f r53[4] = { 0.0L, 0.0L, 0.0L, 0.0L };

IppStatus st = ippStsNoErr;

st = ippsExp_64f_A50( &a50[0], &r50[0], 4 );
CrtPrintf( RTU("r50[0] = %.22f\nr50[1] = %.22f\nr50[2] = %.22f\n"), r50[0], r50[1], r50[2] );

st = ippsExp_64f_A53( &a53[0], &r53[0], 4 );
CrtPrintf( RTU("r53[0] = %.22f\nr53[1] = %.22f\nr53[2] = %.22f\n"), r53[0], r53[1], r53[2] );
...

[ Microsoft C++ compiler with IPP library ]

r50[0] = 1.006722364175213700
r50[1] = 1.006722364175213700
r50[2] = 1.006722364175213700

r53[0] = 1.006722364175213700
r53[1] = 1.006722364175213700
r53[2] = 1.006722364175213900

[ Intel C++ compiler with IPP library ]

r50[0] = 1.006722364175213700
r50[1] = 1.006722364175213700
r50[2] = 1.006722364175213700

r53[0] = 1.006722364175213700
r53[1] = 1.006722364175213700
r53[2] = 1.006722364175213700

32-bit test application with IPP library ( Rounding for Test Values )

...
// IPPAPI( IppStatus, ippsExp_64f_A50, ( const Ipp64f a[], Ipp64f r[], Ipp32s n ) )
// IPPAPI( IppStatus, ippsExp_64f_A53, ( const Ipp64f a[], Ipp64f r[], Ipp32s n ) )

CrtPrintf( RTU("\n") );

Ipp64f a50[4] = { 0.0066998698389698300L, 0.0066998698389698300L, 0.0066998698389698300L };
Ipp64f r50[4] = { 0.0L, 0.0L, 0.0L, 0.0L };

Ipp64f a53[4] = { 0.0066998698389698300L, 0.0066998698389698300L, 0.0066998698389698300L };
Ipp64f r53[4] = { 0.0L, 0.0L, 0.0L, 0.0L };

IppStatus st = ippStsNoErr;

st = ippsExp_64f_A50( &a50[0], &r50[0], 4 );
CrtPrintf( RTU("r50[0] = %.22f\nr50[1] = %.22f\nr50[2] = %.22f\n"), r50[0], r50[1], r50[2] );

st = ippsExp_64f_A53( &a53[0], &r53[0], 4 );
CrtPrintf( RTU("r53[0] = %.22f\nr53[1] = %.22f\nr53[2] = %.22f\n"), r53[0], r53[1], r53[2] );
...

[ Microsoft C++ compiler with IPP library ]

r50[0] = 1.006722364175213700
r50[1] = 1.006722364175213700
r50[2] = 1.006722364175213700

r53[0] = 1.006722364175213700
r53[1] = 1.006722364175213700
r53[2] = 1.006722364175213700

[ Intel C++ compiler with IPP library ]

r50[0] = 1.006722364175213700
r50[1] = 1.006722364175213700
r50[2] = 1.006722364175213700

r53[0] = 1.006722364175213700
r53[1] = 1.006722364175213700
r53[2] = 1.006722364175213700

David,

Here are results on Windows 7 Professional 64-bit with a 64-bit test application:

[ Microsoft C++ compiler - 64-bit / VS 2008 Professional Edition / No Rounding of Test Values ]

exp

V11 = 1.006722364175213700
V21 = 1.006722364175213700
V31 = 1.006722364175213700

IPP

r50[0] = 1.006722364175213700
r50[1] = 1.006722364175213700
r50[2] = 1.006722364175213700

r53[0] = 1.006722364175213700
r53[1] = 1.006722364175213700
r53[2] = 1.006722364175213900

MKL - vdExp

dR[0] = 1.006722364175213700
dR[1] = 1.006722364175213700
dR[2] = 1.006722364175213900

MKL - vmdExp

dR[0] = 1.006722364175213700
dR[1] = 1.006722364175213700
dR[2] = 1.006722364175213900

Here are results on a Windows 7 Professional 64-bit with a 64-bit test application:

[ Microsoft C++ compiler - 64-bit / VS 2008 Professional Edition / Rounding of Test Values ]

exp

V11 = 1.006722364175213700
V21 = 1.006722364175213700
V31 = 1.006722364175213700

IPP

r50[0] = 1.006722364175213700
r50[1] = 1.006722364175213700
r50[2] = 1.006722364175213700

r53[0] = 1.006722364175213700
r53[1] = 1.006722364175213700
r53[2] = 1.006722364175213700

MKL - vdExp

dR[0] = 1.006722364175213700
dR[1] = 1.006722364175213700
dR[2] = 1.006722364175213700

MKL - vmdExp

dR[0] = 1.006722364175213700
dR[1] = 1.006722364175213700
dR[2] = 1.006722364175213700

>>>Here are results on Windows 7 Professional 64-bit with a 64-bit test application:>>>

It seems that the MS and Intel implementations use the same algorithm, probably a Horner scheme with precalculated coefficients.

David,

Here is a test case with MKL functions vdExp and vmdExp:

...
// _Mkl_Api( void, vdExp, ( const MKL_INT n, const double a[], double r[] ) )
// _Mkl_Api( void, vmdExp, ( const MKL_INT n, const double a[], double r[], MKL_INT64 mode ) )

double dA[4] = { 0.0066998698389698335L, 0.0066998698389698344L, 0.0066998698389698352L };
// double dA[4] = { 0.0066998698389698300L, 0.0066998698389698300L, 0.0066998698389698300L };
double dR[4] = { 0.0L, 0.0L, 0.0L, 0.0L };

::vdExp( 4, &dA[0], &dR[0] );
printf( "dR[0] = %.18f\ndR[1] = %.18f\ndR[2] = %.18f\n", dR[0], dR[1], dR[2] );

printf( "\n" );

::vmdExp( 4, &dA[0], &dR[0], 0 );
printf( "dR[0] = %.18f\ndR[1] = %.18f\ndR[2] = %.18f\n", dR[0], dR[1], dR[2] );
...

Note: The commented-out line has the rounded values.

Here's the code I used to print the values returned by exp(x) for some ranges of interest:

#include <SDKDDKVer.h>
#include <stdio.h>
#include <tchar.h>
#include "float.h"
#include "math.h"

int _tmain(int argc, _TCHAR* argv[])
{
      // Print out a sequence of adjacent double values and the corresponding values returned by exp()
      //double d = -0.083296271058700000;
      //const double dir = -1.0;
      double d = 6.6998698389698000e-3;
      const double dir = 1.0;

      for (int i = 0; i < 100; ++i)
      {
          double e = exp(d);
          printf("%20.19f   %20.17f\n", d, e);
          d = _nextafter(d, dir);
      }
      return 0;
}

When this program is run, the results for 32- and 64-bit runs differ for a few values in the range.

Regards - David

Can you post disassembly of exp()?

Quote:

iliyapolak wrote:

Can you post disassembly of exp()?

I'm not sure I should do that; it might incur the wrath of Microsoft, as it's their copyright. You can disassemble it yourself quite easily if you run the test program I gave under the debugger and step into exp().

>>...When this program is run, the results for 32- and 64-bit runs differ for a few values in the range...

As I've already mentioned, if these differences are less than Epsilon for the double-precision floating-point ( DP FP ) type, then all these values "could be considered" the same. If you need consistent results across all platforms, try masking the last 6 ( or better, 7 ) bits of the binary value which represents the DP FP value, and you will get the same results.

Of course, you could proceed in your own way, but I see that a change from the CRT function exp() to the MKL vector exp functions doesn't make sense.

>>...
>>d = _nextafter(d, dir);
>>...

Why do you need _nextafter?

Hi David, this is definitely becoming a lengthy thread …

You are correct that we do NOT want disassembled sources from other libraries posted here. Doing so may violate a licensing agreement.

Now onto your original questions:

  • Do the 32-bit and 64-bit builds of MKL produce identical results for exp() and other transcendental functions, for all operands, assuming that SSE2 is enabled for our 32-bit code?

Intel MKL does not currently ensure that the results of our 32-bit and 64-bit libraries match across all MKL domains. I’ll check with the team to confirm this statement for the vector math domain, specifically. We do have a new conditional numerical reproducibility feature, but it does not apply across 32-bit and 64-bit OSs; see http://software.intel.com/en-us/articles/conditional-numerical-reproducibility-cnr-in-intel-mkl-110

  • Although the client has mostly Intel hardware, I believe they have a few AMD Opteron-based server farms. Does MKL work on Opterons?

Yes.

  •  If so, are there any performance penalties if MKL is used in place of the Microsoft library?

It is difficult to say; it depends on the application and the functions called. Intel MKL contains optimized vector implementations, while you appear to be interested in scalar or simd-style math functions, so there may be additional call overhead associated with the switch. You may want to create a timing program to compare the differences between the functions of interest and the latest version of Intel MKL. Alternatively, if possible, try your application with the Intel compiler – its scalar math library and simd-style (short vector math library) functions are also highly optimized. This information may also be of use: http://software.intel.com/sites/products/documentation/hpc/mkl/vml/vmldata.htm

  • Is there any way of getting the Microsoft .NET framework to use MKL?

Here is a write-up on a related subject – maybe it will help: http://software.intel.com/en-us/articles/some-more-additional-tips-how-to-call-mkl-from-your-c-code

  • I assume it may have the same 32/64-bit differences, although I haven't tested that yet.

The .NET aspect should not change how Intel MKL behaves internally. So yes, expect differences.

  • What other benefits might my client gain by switching to MKL?

The key benefit of Intel MKL is performance, while overall application performance is very much application specific. I recommend that you try the Intel compiler first and see how that works for you. If you are able to identify where the code calls elementary functions on lengthy double/single-precision vectors, Intel MKL’s vector math functions may help boost performance. The conditional numerical reproducibility feature in Intel MKL (and the Intel compiler) should also allow you to get consistent results across a variety of hardware platforms, assuming the application executes on the same OS.

Thanks, Shane

This is a short follow up:

[ David wote ]

>>...this is causing regression tests to fail...

David,

Could you explain how these differences ( once again, less than Epsilon for the DP FP type ) affect some real processing, please?

Note: Sorry, just generic and Not related to MKL...

There is a possibility that your client will always have some differences in numbers calculated on different platforms, that is, 32-bit and 64-bit. In mission-critical software systems, for example in healthcare, finance, aerospace, geomatics or defense, a concept of tolerances is used, and there are many simple ways it can be implemented. On a very complex X-Ray image processing system, our mathematician with an MS degree in Mathematics simply informed us that 6-digit precision would satisfy all the requirements the company was forced to follow, related to the ISO 8001 standard, and everything is good if the relative error is not greater than 2% from some reference data set.
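The tolerance idea described above can be sketched as a simple comparison helper. The 2% default mirrors the figure in this post; the absolute floor is my addition so that values near zero remain comparable:

```cpp
#include <cmath>
#include <algorithm>
#include <cassert>

// Returns true when 'value' matches 'reference' to within a relative
// tolerance, with an absolute floor so results near zero compare sanely.
// Real projects would pick domain-specific thresholds per result type.
inline bool within_tolerance(double value, double reference,
                             double rel_tol = 0.02, double abs_tol = 1e-12) {
    double diff = std::fabs(value - reference);
    return diff <= std::max(abs_tol, rel_tol * std::fabs(reference));
}
```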

>>...Intel MKL does not currently ensure that the results between our 32-bit and 64-bit libraries match across all MKL domains...

I provided lots of test cases in the thread and results, like:

>>Here are results on a Windows 7 Professional 64-bit with a 64-bit test application:
>>
>>[ Microsoft C++ compiler - 64-bit / VS 2008 Professional Edition / Rounding of Test Values ]
>>
>>exp
>>
>>V11 = 1.006722364175213700
>>V21 = 1.006722364175213700
>>V31 = 1.006722364175213700
>>
>>IPP
>>
>>r50[0] = 1.006722364175213700
>>r50[1] = 1.006722364175213700
>>r50[2] = 1.006722364175213700
>>
>>r53[0] = 1.006722364175213700
>>r53[1] = 1.006722364175213700
>>r53[2] = 1.006722364175213700
>>
>>MKL - vdExp
>>
>>dR[0] = 1.006722364175213700
>>dR[1] = 1.006722364175213700
>>dR[2] = 1.006722364175213700
>>
>>MKL - vmdExp
>>
>>dR[0] = 1.006722364175213700
>>dR[1] = 1.006722364175213700
>>dR[2] = 1.006722364175213700

Please let me know if you have different numbers ( with all five of these exp functions ) and I'll be glad to take another look at what is wrong.

Hi David,

I represent the Intel team doing math libraries for Intel software products such as the compilers, MKL and IPP. A couple of notes on this email thread.

There is no guarantee that any of the math libraries considered in this thread (the MSFT or Intel C++ compiler math library, MKL or IPP) will pass the client's regression tests after the migration. Formally, the 32- and 64-bit compilers are different, and they may generate slightly different code sequences, which may affect the numeric reproducibility of results and cause the regression tests to fail. It may or may not be the case in your situation - it's just a general note.

As indicated above, typically there must be some tolerance threshold set which strikes the right balance between particular application requirements and the reality that numerical results may slightly diverge between slightly different binaries. As I mentioned before, a slightly different binary may be due to the use of different compiler versions (e.g. a newer compiler adds more optimizations and as a result changes the generated instruction sequences, even on the same OS and in the same environment), different compilers (32- vs. 64-bit, or MSFT vs. Intel), or different math libraries (assuming you transition from the standard libm shipped with the compiler to another math library such as MKL or IPP).

So my first recommendation is to reconsider whether the existing regression test matches your needs - e.g., is it really a hard requirement that results be identical in different environments such as a 32- vs. 64-bit OS?

Please remember that the math library (the exp() function in your case) is potentially only one source of numeric deviations during migration.

Second, as you indicated, 1-ulp differences in exp() results are rare, and these rare events cause the test results to diverge. This raises another question about the numerical stability of the algorithm used in the test. In other words, is the tolerance threshold too small, or is the numerical algorithm too sensitive to such small perturbations? Maybe the right path is to reconsider the numerical algorithm to make it more robust?

Assuming that you've done such an evaluation and come to the conclusion that the thresholds are sized correctly, the numerical algorithm is robust, and the only source of your pain is exp(): well, the only guaranteed recipe in this case is to use a correctly rounded math library such as the open source CRLIBM. In that case you'll have guaranteed identical results from the math library in all environments. All other "quick-and-dirty" fixes provide no guarantee that in some rare case you won't get a slightly different result.
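To make the CRLIBM suggestion concrete, a sketch of swapping in its round-to-nearest exponential – exp_rn and crlibm_init are CRLIBM's API, while the USE_CRLIBM guard, the wrapper name, and the std::exp fallback are my assumptions so the sketch compiles when CRLIBM is not installed:

```cpp
#include <cmath>
#include <cassert>

#ifdef USE_CRLIBM
extern "C" {
#include <crlibm.h>   // exp_rn: exp correctly rounded to nearest
}
#endif

// Correctly rounded exp: every platform and bitness returns the
// identical double, because there is exactly one correctly rounded
// answer. (With CRLIBM, crlibm_init() should be called once at
// startup to set the required FPU state on x86.)
inline double portable_exp(double x) {
#ifdef USE_CRLIBM
    return exp_rn(x);
#else
    return std::exp(x);   // fallback; may differ by 1 ulp across builds
#endif
}
```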

Hope it was useful,

Sergey

Quite a lot of things to reply to since my last post.

Sergey K, yes the differences in 32- and 64-bit results are very small, only 1ulp. However, under some conditions they get multiplied, and in one case this actually caused the result of a computation to change sign. I have established that in the cases we looked at, the differences in results are not significant in the context of the application, so either value returned by exp() is acceptable. The problem is lack of reproducibility between 32- and 64-bit builds. Every time we get a difference, someone is going to have to take a look at the differences and decided whether they are insignificant differences attributable to 32/64-bit maths differences, or might be significant differences caused by bugs introduced in migrating the code to 64-bits.

We already have a tolerance factor in our regression test results to allow for minor differences. This has been adequate in the past. Unfortunately, a single tolerance factor is not suitable for all the results we compute. We're looking at allowing different tolerances for different types of result, but I suspect that this won't prove sufficient in all cases, because the allowable tolerance may depend on the input data in some cases.

Sergey M, it's not a hard requirement that 32- and 64-bit results match to within the tolerance factor we already allow; however, it's going to make releasing software more expensive if they don't, due to the need to take a close look at test results that differ and sign them off. We can accept a one-off change in test results if we migrate to a different implementation of exp(), because we would know that our code hasn't changed and any differences in results are caused by the change of library; then we can save new regression data for use in future tests. I appreciate that there may be other reasons why 32- and 64-bit results might be different; however, we have established that in the sample we looked at, exp() was responsible for at least 90% of the test failures. So changing to a library that uses the same algorithm for computing exp() in both builds will greatly improve the situation even if it doesn't totally fix it.

>>...Sergey K, yes the differences in 32- and 64-bit results are very small, only 1ulp. However, under some conditions they get
>>multiplied, and in one case this actually caused the result of a computation to change sign...

1. We've been discussing this matter for more than 6 days, and a problem with sign is a really new development. Please try to explain what these conditions are.

2. In 2012, two problems with the Intel C++ compiler were detected related to an incorrect sign for values returned by the 'sin', 'cos' and 'mod' functions. Even if you're using the Microsoft C++ compiler, it makes sense to verify all math CRT functions used in the project.

3. Based on my real experience of getting very consistent results on many platforms with many C++ compilers, it is possible that something else is wrong with the regression tests, or with processing in the real application.

4. I could do a verification on Windows XP and Windows 7 with the following set of C++ compilers:

Visual C++ 4.2, Visual C++ 6.x, VS 2005, VS 2008, VS 2010, VS 2012, Intel versions 8, 11 and 13, MinGW, Borland C++ and Turbo C++

if you provide a test case that simulates that change of sign.
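For what it's worth, here is a contrived illustration (hypothetical values, not from the client's application) of the kind of sign flip being discussed: when a result sits within 1 ulp of a threshold, the 32- and 64-bit exp() results can land on opposite sides of it:

```cpp
#include <cmath>
#include <cassert>

// Returns -1, 0, or +1 for the sign of (expResult - threshold).
int sign_after_subtraction(double expResult, double threshold) {
    double d = expResult - threshold;
    return (d > 0) - (d < 0);
}

// With threshold t = nextafter(1.0, 2.0):
//   an exp() returning 1.0 (one ulp below t) gives sign -1, while
//   an exp() returning nextafter(t, 2.0) (one ulp above) gives +1,
// even though the two exp() results differ by only 2 ulps.
```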
