32- versus 64-bit code performance

I have recently migrated a Fortran/C++ application from the 32-bit to the 64-bit compiler. I finally got a chance to compare performance between the two versions, and I haven't seen any performance gains from the migration. The test was performed on two identical machines (Intel 64-bit processors), one running Windows XP 64-bit and the other Windows XP 32-bit. Compiler settings are the same for both versions. I expected the machine running 64-bit native code to outperform the one running 32-bit code, but apparently that's not the case.

Are there any compiler settings that are unique to the 64-bit processor that might improve performance of the 64-bit code? Other suggestions are also welcomed.

Thanks

Quoting - ocean

I expected the machine running 64-bit native code to outperform the one running 32-bit code, but apparently that's not the case.

Why? I mean, is your app going to benefit from the larger address space, or from any other feature of a 64-bit configuration? Maybe the test case is "small". What type of application is it?

A.

In general, you should not expect to notice a performance increase going to 64-bit. The primary advantage is the removal of the 2GB virtual address space limitation. There are some more registers that can improve code performance, but in most cases it won't be significant. There can be a downside as well if your application stores lots of addresses in memory (more for C than Fortran) because the data size doubles.

Retired 12/31/2016

ocean,

When using MS Visual Studio, the default project platform setting is Win32. You must click on the platform pull-down and select New, then select x64 from the next pull-down, and say Yes to copy settings from Win32.

If you build your 32-bit app as a 32-bit app on the x64 system, it will run in 32-bit mode.

My general experience with x64 Fortran applications is 15%-30% faster with no other changes to the code.
The extra memory is nice too.

Jim Dempsey

Guru Steve says no, Guru Jim says yes. Funny. Who is right? We need statistical evidence on how the combinations of 32- and 64-bit Windows and 32- and 64-bit Fortran influence the run time of programs (extended address space agreed).
There's another point in the 32- vs. 64-bit discussion: the performance of floating-point operations. I once learned that computing with real(8) variables is no slower than with real(4) on IA-32 processors, because the processors compute floats with 80-bit precision anyway, which is more than the 64 bits of real(8). I have the impression from remarks in this forum that this assumption is wrong. Any comments on that topic?

Jim,

The 15% to 30% gain in performance that you report might be realistic considering that both applications, 32- and 64-bit, were running on the same 64-bit operating system; it is understandable that the 32-bit application will always be slower because of OS emulation.

My test case is different. I have two identical 64-bit machines, one running Windows XP (32-bit) and the other Windows XP (64-bit). The 32-bit application runs on the 32-bit OS, and the 64-bit application runs on the 64-bit OS. Each application runs in its own native environment; in this case OS emulation is not necessary, and performance appears to be very similar between the two.

There is no doubt that migrating to a 64-bit operating system is beneficial, even if only for the benefit of addressing memory above the 2GB barrier. However, it appears that performance gains over the 32-bit OS might not always be attained.

Thanks

Steve pointed out that 64-bit is potentially faster only when the application takes advantage of one of the higher limits, such as address space or number of registers. It is likely to require more RAM, more than 1GB anyway, to be a worthwhile shift. ifort occasionally performs additional useful loop unrolling in the 64-bit compiler, as the compilers are aware of the code-size vs. performance trade-off.
You must be a proponent of lying statistics if you misapply them to a mostly deterministic situation.
If you have a mixed single- and double-precision application which runs best with x87 code, the lack of support for x87 in 64-bit Windows compilers will hurt you. If you insist on using default options only, with older Intel 32-bit compilers which didn't have SSE2 vectorization on by default, you could see a tremendous advantage in 64-bit. The deprecated ifort option /Op is very slow in the 64-bit version, as it doesn't use x87.
Windows, and compilers like Intel's which comply with Microsoft usage, set x87 53-bit precision mode, so you don't use all 80 bits unless you reset the precision mode, and you don't spend the extra time required for full-length divide and sqrt. On AMD CPUs, most single-precision scalar SSE operations are faster than x87 or SSE2 double precision. On Intel CPUs, only operations like sqrt and divide take longer at higher precision.

Steve,

Regarding the extra registers that you mention, is this something that is only available on the 64-bit OS and not the 32-bit one? The two machines in my test both have 64-bit processors, and if I understand your comment, the one running the 32-bit OS will not be able to take advantage of the extra registers.

Thanks

Quoting - ocean
Regarding the extra registers that you mention, is this something that is only available on the 64-bit OS and not the 32-bit one? The two machines in my test both have 64-bit processors, and if I understand your comment, the one running the 32-bit OS will not be able to take advantage of the extra registers.

That's right, the additional registers are active only when the CPU runs with the 64-bit mode bit set.
Likewise, general-register support for native 64-bit integers (rather than pairs of 32-bit integers) is present only in 64-bit mode.
A 32-bit compiled program running on a 64-bit OS doesn't see the extra registers, but it does potentially get access to the full 4GB address space.

Ocean,

In my case my "test application" is a finite element simulation program designed for simulating tethers. Think of it as flexible 2D objects in 3D space (as opposed to rigid wire-frame FEA). A tether has length and width (e.g. aerodynamic cross section or solar absorption cross section). The simulator (Fortran application) uses REAL(8) throughout, as REAL(4) does not provide sufficient precision. The program is relatively complex, consisting of 13 projects and ~750 source files. IOW, it is not a simple test application.

The compilation options are set for use of SSE3, and it is multi-threaded using OpenMP.

When

Compiled as 32-bit application and run on 32-bit Windows on 4-core Q6600

as compared to

Compiled as 64-bit application and run on 64-bit Windows on same 4-core Q6600

The application ran 15% faster in the 64-bit environment.

Granted, this is but one test case and your mileage may vary.

On examination of the code, the principal reasons for speedup were:

a) The extra registers available meant the optimizer could keep more of the locally declared variables in registers, as opposed to pulling them from RAM (or L1 or L2 cache by way of a memory address). In 64-bit mode you have 2x the number of SSE registers and more than 2x the number of integer registers (6 of the lower 8 integer registers are available, plus all 8 upper registers). So register pressure is lower.

b) The x64 calling convention is a __fastcall-style convention. Therefore on calls that used .LE. 4 args, the args were passed in registers. On calls with more than 4 args, 4 fewer args needed to be pushed onto the stack. Fewer memory writes.

These observations lead to: if you have serious computational problems, even when run on systems with 2GB or less RAM, use a 64-bit operating system and compile as a 64-bit application.

Jim Dempsey


Jim,

Thanks for taking the time to look into this issue.

Our application is a C++/Fortran simulation that performs a great deal of scientific calculation. This simulation is used as an aid in the development of a product, and it is constantly being modified as the product goes through its development phases. The people that use it are both users and developers, and they constantly switch between the Debug build (when they need to debug new code) and the Release build (when they need to run scenarios). The Release version has all optimization disabled. Some time ago we tried to enable optimization (only the Maximize Speed option); this was on a 32-bit OS with IVF 9.0, and my recollection is that results from the optimized Release version didn't exactly match those from the Debug version. It appeared to be due to loss of precision during floating-point calculations (this is my guess). When the application crashes in Release mode, tracing the problem becomes much harder if you cannot compare Release and Debug behavior because the results are not exactly the same.

So it looks, from reading your reply, like we might not see any measurable performance gains in 64-bit mode unless we enable optimization.

I would appreciate it if you could address the loss of floating-point precision that I mention above, just to confirm that optimization carries a penalty (loss of precision), and whether there is a way of avoiding the loss without changing the source code while still taking advantage of optimization.

Thanks

ocean,

On both platforms (32-bit and 64-bit) and for both configurations (Debug and Release) you can select expressly whether to use the FPU or SSE3. So you can maintain the precision of expressions if desired.

Also, for a complicated application such as yours, you can mix the FPU/SSE3 options between source files. You will find that many subroutines in your application are not sensitive to whether calculations are performed in 64-bit or 80-bit precision. For those routines, consider using SSE3, as vectorization is a great benefit.

I generally create several configurations: Debug, DebugFast, DebugOpenMP, DebugOpenMPFast, Release, ReleaseFast.

Typically I run in DebugOpenMPFast where optimizations are set to fastest for all but a few routines under examination.

Jim Dempsey

If your application depends on promotion of expressions to double precision, you really ought to write it in the source, rather than depending on options such as /Op. ifort doesn't support options like /fp:double, as icl does, presumably on account of performance, the lack of a tradition of depending on it, the lack of control over which expressions are altered by it, and the extent to which it depends on optimization level.


Jim,

Thanks for your advice. I have reconfigured my compiler settings for Release both with and without optimization, and now the results produced by both versions don't show any discrepancies due to loss of floating-point precision.

I have also compared results from my application running on the 32-bit and 64-bit OS (on identical machines), and I can now confirm that in most of the scenarios I ran, the 64-bit application appears to outperform its 32-bit counterpart. I had one particular scenario where the 64-bit application just barely came out ahead (4%). In general the average performance gain appears to be about 20%. I also agree with you that the actual gains depend on the character of the application (source code) itself.

Thanks
