My floating point rounding program behaves differently on X86 and X64 targets



I don’t know where to ask this question, so I am posting it here:


I have a very simple Windows C program, compiled with Visual Studio 2015, that does not behave the same when compiled for X86 and X64 targets.

The program converts the unsigned __int64 number “0x8000000000000410” to a long double and then converts it back to unsigned __int64. Rounding is involved because the value does not fit in the 53-bit significand (52 stored bits plus the implicit leading bit) of an IEEE 754 64-bit floating-point number. I tested it on two different machines. On the X86 target, it returns 0x8000000000000800, whereas it returns 0x8000000000000000 on the X64 target (the last significand bit differs after rounding).

The code is the following:


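#include <conio.h>
#include <stdio.h>
#include <float.h>
#include <fenv.h>

int main()
{
    long double d;
    unsigned __int64 w1, w2;

    /* select the "round to nearest" mode */
    _controlfp_s(NULL, _RC_NEAR, _MCW_RC);

    w1 = 0x8000000000000410;
    d = (long double)w1;          /* integer -> floating point: rounding happens here */
    w2 = (unsigned __int64)d;     /* floating point -> integer: exact for this value */

    printf("Result: 0x%016I64x\n", w2);
    printf("\n<press a key to exit...>");
    _getch();                     /* wait for a key press */

    return 0;
}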


Additional details:

The testing is done with the “round to nearest” mode (value of MXCSR set to 0x1F80 thanks to the call to _controlfp_s(NULL, _RC_NEAR, _MCW_RC) on the X64 target).

For the “round down” and “round up” modes, the values are the same on both targets (the result for “round down” is 0x8000000000000000, whereas the result for “round up” is 0x8000000000000800).

However, for the “round toward zero” (chop) mode, the result differs once again: it is 0x8000000000000000 on the X86 target and 0x8000000000000800 on the X64 target.

It seems to me that the correct results are those given on the X86 target, because they are closer to the original number (in the “nearest” case) and closer to zero (in the “toward zero” case).

I would like to know if this result can be reproduced, and if so, if there are settings somewhere that I missed and that could be changed so that the two targets behave the same.

(I have attached the .cpp file and a compile.bat file that compiles an X86 version and an X64 version of the exe.)


Thanks in advance for your kind answer.


Attachment: TestProg.zip (application/zip, 965 bytes)

I think you're saying that your comparison is between a 32-bit Windows OS and X64 Windows.  If so, the OS x87 precision-mode settings for launching a .exe are different: 64-bit mode for x86 and 53-bit mode for x64 (although there have been occasional lapses in Windows releases).  If you build an application with ICL /QxIA32, the compiler should insert code to initialize to 53-bit mode, unless you have set -Qpc80, which should set 64-bit mode.  I'm not sure about CL, except that 32-bit CL used to default to x87 code if you didn't set /arch, while 64-bit CL and ICL have no such setting. I haven't had a 32-bit Windows since X64 became available, but CL doesn't pretend to support long double except as a synonym for double. Evidently, you can override the initial x87 mode settings, as you have done for the SIMD modes.

I don't think there is anything approaching satisfactory support for 64-bit mode long double in X64.  Even in 32-bit OS with ICL, library function calls don't handle 64-bit mode correctly.  There may be such differences between a 32-bit program running on 32- and 64-bit Windows, if you don't take care of the mode settings.

Round toward 0 in 32-bit x87 code when casting float to integer is done by inserting instructions to change the rounding mode and then change it back.  ICL used to have a setting to compile everything with round to 0, so as to speed up (int) casts; I don't know if it is still supported. 64-bit compilers will be using simd scalar or parallel instructions, again temporarily setting round to 0.

32-bit compilation treats __int64 as a struct of presumably signed and unsigned ints.  Your cast may involve a library call, possibly the same one for CL and ICL.  You could examine the compiler generated asm code and trace with debugger (if the library doesn't have guards against debug tracing).

The original versions of Windows X64 didn't permit X87 code at all.  X87 was enabled primarily to permit legacy 32-bit programs to run, without taking care of bit-for-bit consistency.

Hello Tim P.,


Thanks a lot for your answer.


Maybe I was not clear in describing what I call a “target” here, but if you read the compile.bat file attached to this message, you will see that the compilation I do to get my EXEs is quite simple:

I compile the .cpp given in my original message (also included in the zip file) with Visual Studio, using “C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\bin\cl.exe” to get a 32-bit console exe named test_X86.exe, and using “C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\bin\amd64\cl.exe” to generate a 64-bit version named test_X64.exe. Then I launch the two programs. In each of them the rounding mode is set to “nearest” by the call to _controlfp_s(NULL, _RC_NEAR, _MCW_RC); at the start of the program. All of this is done on two different machines, each running 64-bit Windows 10 Pro.

test_X86.exe returns 0x8000000000000800, which is the correct answer for rounding in the nearest mode, whereas test_X64.exe returns 0x8000000000000000, which is a wrong answer, as if it had been rounded toward zero instead of to nearest.


The 64 bits assembly code is the following on the X64 target:

       w1 = 0x8000000000000410;
00007FF60891180E  mov         rax,8000000000000410h
00007FF608911818  mov         qword ptr [w1],rax
       d = (long double)w1;
00007FF7A40B181C  mov         rax,qword ptr [w1]
00007FF7A40B1820  cvtsi2sd    xmm0,rax
00007FF7A40B1825  test        rax,rax
00007FF7A40B1828  jge         main+62h (07FF7A40B1832h)
00007FF7A40B182A  addsd       xmm0,mmword ptr [__real@43f0000000000000 (07FF7A40B9C80h)]


And I get {0x43e0000000000000, 0x0000000000000000} as a result in XMM0 (rounded toward zero)


The 32-bit X86 code is the following on the X86 target:

       w1 = 0x8000000000000410;
00C71889  mov         dword ptr [w1],410h
00C71890  mov         dword ptr [ebp-18h],80000000h
       d = (long double)w1;
00C71897  mov         edx,dword ptr [ebp-18h]
00C7189A  mov         ecx,dword ptr [w1]
00C7189D  call        __ultod3 (0C71005h)
00C718A2  movsd       mmword ptr [d],xmm0


And I get {0x43e0000000000001, 0x0000000000000000} as a result in XMM0 (rounded nearest, as expected)


I also added a call to the _mm_getcsr() intrinsic right before that (not in the program I posted) in order to check that the value of MXCSR was set correctly by the CRT function _controlfp_s (MXCSR is set to 0x00001F80, so the rounding bits 14 and 13 are set to their correct “nearest” value of 00).


I followed your advice and debugged, and I think I found out why the problem is happening. It is probably due to the code generated by MS Visual Studio 2015, which first converts the value as a negative signed number and then adds 2^64 to make it positive; this loses precision on the last bit. In 32-bit mode, there are two executions of cvtsi2sd, one for the lower 32-bit dword and one for the higher 32-bit dword. Each half fits in the significand, and the two are then added correctly. On the X64 target, the last bit is lost during the conversion from negative to positive. I don’t know whether efficient X64 code as short as the one compiled here could be written that avoids the precision loss, for instance by using cvtsi2sd twice instead of once, as the X86 code does.



On Intel X86, there are two possible ways to do floating-point calculations:

The old way, using the old floating-point co-processor engine (x87), which uses an extended 80-bit internal representation,

Or using the new SSE floating-point hardware, which uses either 64-bit or 32-bit representations. 'double' enforces the 64-bit representation.

The old, extended-precision engine is driven by mnemonics like FMUL, FADD, FSUB and FDIV, and works on a special floating-point stack. Everything (short integers, integers, long, __int64, float, doubles, long doubles) is converted to the extended-precision format. The calculations are done using this extra accuracy. Results are converted back (and rounded) to the storage class of the result variable.

The new engine uses the XMM registers and SSE mnemonics like MOVSD, MULSD and ADDSD for doubles. There are different mnemonics for every storage class.

Conversion from doubles to integers can be done using either engine, using different mnemonics, with different results ...

The name 'long double', in many C compilers, refers to the 80-bit representation.  BUT NOT ALL OF THEM. Citing Wikipedia,

On the x86 architecture, most C compilers implement long double as the 80-bit extended precision type supported by x86 hardware (sometimes stored as 12 or 16 bytes to maintain data structure alignment), as specified in the C99 / C11 standards (IEC 60559 floating-point arithmetic (Annex F)). An exception is Microsoft Visual C++ for x86, which makes long double a synonym for double.[2] The Intel C++ compiler on Microsoft Windows supports extended precision, but requires the /Qlong‑double switch for long double to correspond to the hardware's extended precision format.[3]

