ia32 and intel64 different results

Problem summary:
When I compile and run the same code on an ia32 machine and an intel64 machine, I get different results, and the latter takes about 20x as long.

- F2003-compliant code
- latest Intel compiler
- local machine: Intel Dual Core, Ubuntu 11.04 32-bit, Intel Fortran Composer 32-bit
- remote server: Intel Xeon 6-core, Ubuntu 10.04 64-bit, Intel Fortran Composer 64-bit (intel64)
- I use (and link) the MKL
- I use the online link advisor to set the flags, and I set the environment variables by sourcing the compilervars.sh script with the respective architecture argument. The programs compile, link, and run without problems. The -warn all flag does not give any messages of interest; on the local machine I also tried -check all, -traceback, etc., but found no errors.
- I use the same optional performance/debugging flags on both machines, in particular -fast
- What happens is somewhat strange: I have a Newton-style rootfinder for a large nonlinear system, and on the server (64-bit) it takes about 20x as many evaluations of the nonlinear system as on the local machine to converge to a solution. Each evaluation seems to take approximately the same amount of time. The first evaluation returns roughly the same 'distance' measure, but later on the rootfinder seems to make very slow progress on the server, while it converges fast on the local machine.

Unfortunately, the server does not have a 32-bit version installed to cross-check.
Could the intrinsics epsilon(x) or tiny(x) be the cause?
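For what it's worth, both intrinsics are determined solely by the kind of their argument, so a quick way to rule them out is to print them on both machines; a minimal sketch:

```fortran
! Minimal sketch: epsilon() and tiny() depend only on the kind of
! the argument, not on the target architecture, so both builds
! should print identical values for double precision.
program model_params
  implicit none
  double precision :: x
  print *, 'epsilon(x) = ', epsilon(x)   ! ~2.22e-16 for IEEE double
  print *, 'tiny(x)    = ', tiny(x)      ! ~2.23e-308 for IEEE double
end program model_params
```

If both machines print the same values, these intrinsics are not the culprit.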
Any ideas?

Thanks in advance!




Check your run-time diagnostic options. Subscript out-of-bounds checking (with or without uninitialized-variable checking) can yield this kind of performance difference (even with full optimization).

Second, if your application has contention for a non-OpenMP software lock (mutex, etc...) then more threads may mean higher contention for the lock. Check your software locks (if any) to see if this is a cause.

If your app is using OpenMP software locks, in particular critical sections, then be sure to name your critical sections. In other words, do not use a single unnamed critical section; instead use one per shared resource.
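A minimal sketch of that naming advice (the resource names here are made up): two independent shared counters each get their own named critical section, so updates to one do not serialize against updates to the other.

```fortran
program named_criticals
  implicit none
  integer :: n_eval = 0, n_log = 0
  !$omp parallel
    ! Hypothetical resource 1: evaluation counter
    !$omp critical (eval_counter)
      n_eval = n_eval + 1
    !$omp end critical (eval_counter)
    ! Hypothetical resource 2: log counter; a different name means
    ! threads in this section do not block threads in the one above
    !$omp critical (log_counter)
      n_log = n_log + 1
    !$omp end critical (log_counter)
  !$omp end parallel
  print *, n_eval, n_log
end program named_criticals
```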

If your app is performing a large number of allocate/deallocate, either explicitly or implicitly by way of heap arrays, consider recoding to make these temporary arrays persistent in thread local storage. This will reduce or eliminate contention for the critical section of the heap.
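One way to sketch that persistence idea (module and routine names are hypothetical): keep the work array as a threadprivate module variable and reallocate only when the required size grows, instead of allocating and freeing on every call.

```fortran
module work_storage
  implicit none
  double precision, allocatable :: work(:)
  !$omp threadprivate(work)
contains
  subroutine ensure_work(n)
    integer, intent(in) :: n
    ! (Re)allocate only when the required size grows
    if (.not. allocated(work)) then
      allocate (work(n))
    else if (size(work) < n) then
      deallocate (work)
      allocate (work(n))
    end if
  end subroutine ensure_work
end module work_storage

program use_work
  use work_storage
  implicit none
  !$omp parallel
  call ensure_work(1000)   ! at most one allocation per thread, then reused
  !$omp end parallel
  print *, size(work)
end program use_work
```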

Jim Dempsey

>>about 20x as many evaluations of the nonlinear system as on the local machine to converge to a solution

Sorry I missed this point.

Convergence issues generally fall into two categories:

1) Error in the consolidation of partial results from multiple threads (e.g. lack of use of reduction variable or missing atomic directive, or similar issue).

2) Your convergence criteria are at the extreme limits of the precision of the floating-point system. Verify that the option switches do not result in one platform using the x87 FPU (with 80-bit temporaries) while the other strictly uses SSE/AVX.

Jim Dempsey

Add to the list of possible culprits:

1) Incorrect user-supplied Jacobian. Check the Jacobian against an FD approximation

2) Use of uninitialized variables
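The check in item 1 can be sketched as follows; the two-equation system `fvec` here is entirely hypothetical, and the analytic Jacobian is compared entrywise against a forward-difference column-by-column approximation.

```fortran
program check_jacobian
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  real(dp) :: x(2), f0(2), f1(2), jfd(2,2), jan(2,2)
  real(dp), parameter :: h = 1.0e-6_dp   ! FD step (assumed reasonable)
  integer :: j
  x = [1.0_dp, 2.0_dp]
  call fvec(x, f0)
  ! Forward-difference Jacobian, one column per perturbed variable
  do j = 1, 2
    x(j) = x(j) + h
    call fvec(x, f1)
    jfd(:, j) = (f1 - f0) / h
    x(j) = x(j) - h
  end do
  ! Analytic Jacobian of the hypothetical system (column-major order)
  jan = reshape([2.0_dp*x(1), x(2), 1.0_dp, x(1)], [2, 2])
  print *, maxval(abs(jfd - jan))   ! should be of order h
contains
  subroutine fvec(x, f)   ! hypothetical nonlinear system
    real(dp), intent(in)  :: x(2)
    real(dp), intent(out) :: f(2)
    f(1) = x(1)**2 + x(2)
    f(2) = x(1) * x(2)
  end subroutine fvec
end program check_jacobian
```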


thanks for your response.

All run-time diagnostic options are turned off for the benchmark run.

I don't know what software locks are, but I don't think my application is using any. While I do link an OpenMP library, I do not use any OpenMP features at the moment (though I am planning to).

Also, I wouldn't say that the app is performing a 'large' number of allocate/deallocate.


ad 1.): I don't think the application is using multiple threads. At least I haven't programmed anything in that way.

ad 2.): The convergence criteria are not very small, in the range of 1e-8. I don't understand what the FPU and SSE/AVX do or how I can influence that, but given that the convergence criteria are not so small, it might not matter.

Thanks again.


thanks for your response.

I don't understand why the culprits you propose should make any difference between the 32-bit and 64-bit runs (see original post). In any case, I am using an FD approximation, not a user-supplied Jacobian, and when I turn on all run-time checks, I don't get any error messages about uninitialized variables. Again, I think it then shouldn't work on the 32-bit system either.


If your application requires implicit double-precision evaluation of single-precision expressions in order to meet a convergence criterion < EPSILON(1.), and you used x87 code in one or more of your 32-bit builds, it should not be surprising that it makes a difference. You never said which compilers you used; gnu compilers, and Intel compilers prior to 11.0, defaulted to 387/ia32 code generation in their 32-bit versions. In order to make such cases portable to architectures without implicit extra precision, you should specify all required promotions to double precision in the source code.
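A minimal sketch of such an explicit promotion: the single-precision product below may be carried in an 80-bit x87 register on one build but rounded immediately to 24 bits under SSE on the other, whereas promoting the operands first gives the same double-precision product on any architecture.

```fortran
program promote
  implicit none
  real :: a
  double precision :: implicit_prod, explicit_prod
  a = 1.0 / 3.0
  implicit_prod = a * a              ! single-precision expression; result
                                     ! depends on where it is evaluated
  explicit_prod = dble(a) * dble(a)  ! operands promoted before multiplying:
                                     ! exact double product of the single values
  print *, implicit_prod, explicit_prod
end program promote
```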


I am not 100% sure I understand what you say. The whole code is in double precision. As I said in the description, the compiler is the latest Intel (Intel Fortran Composer 32-bit, and 64-bit respectively).


If everything is declared double precision, the auto-double compile option could confirm that none have been missed which would affect your numerical results.
You've been short on details which might help focus the suggestions you've been getting.

Can you identify the convergence subroutine that is experiencing excessive compute times?
If so, can you post it here for review?

Some convergence issues are not always obvious to the unskilled programmer.
One such example is using single-precision literal values in a double-precision expression.
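A minimal sketch of that pitfall: 0.1 is a single-precision literal, so the first assignment stores the single-precision value of 0.1 widened to double, which differs from the true double value by roughly 1.5e-9 — and several such constants can accumulate enough error to disturb a 1e-8 convergence test.

```fortran
program literal_precision
  implicit none
  double precision :: bad, good
  bad  = 0.1       ! single-precision literal, widened on assignment
  good = 0.1d0     ! true double-precision value of 0.1
  print *, abs(bad - good)   ! roughly 1.5e-9
end program literal_precision
```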

Jim Dempsey

>>when I turn on all run-time checks, I don't get any error messages about uninitialized variables used

The compiler is not always successful in detecting the use of uninitialized variables, particularly with variables of derived type and with arrays.

>>Again, I think it then shouldn't work on a 32-bit system either

Incorrect reasoning. In the presence of 'undefined' variables, program behavior is 'undefined', according to the standard. Programs with this kind of bug have unpredictable behavior. They can even seem to work correctly sometimes, but at other times may fail with a different compiler, OS, bit-size, ...
