different results from run to run

Steve Cousins:

I have some ocean model code (Princeton Ocean Model) that I run on a quad 4-core Xeon E7330 system. When I run this code I get slightly different results each time. I have tried different compiler switches but the basic ones are "-parallel -O2". Are there switches that are recommended to assure that results will be the same from run to run, hopefully without a huge performance penalty? Thanks a lot.

Ronald W Green (Intel):

Quoting - Steve Cousins
I have some ocean model code (Princeton Ocean Model) that I run on a quad 4-core Xeon E7330 system. When I run this code I get slightly different results each time. I have tried different compiler switches but the basic ones are "-parallel -O2". Are there switches that are recommended to assure that results will be the same from run to run, hopefully without a huge performance penalty? Thanks a lot.

First, you will need the 11.0 compiler. The 11.0 compiler includes a fix for the address of the global stack, which Linux allows to vary. This variance in the global stack address caused some variability in data starting addresses, which in rare instances could cause variability in results from vectorized sections of the code.

You could try -fp-model settings (start with -fp-model precise) with some performance impact. I am assuming that you have set OMP_NUM_THREADS env var to the number of real cores on the system (16, right?) and are not varying the number of threads, as this could affect the order of calculations.
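For example, something along these lines (illustrative command lines only; the executable name is a placeholder):

   ifort -parallel -O2 -fp-model precise *.f -o pom
   export OMP_NUM_THREADS=16    # keep the thread count fixed from run to run
   ./pom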

I'd need more information on your compiler version to have anything more useful to add.

ron

Steve Cousins:

Quoting - Ronald Green (Intel)
First, you will need the 11.0 compiler. The 11.0 compiler includes a fix for the address of the global stack, which Linux allows to vary. This variance in the global stack address caused some variability in data starting addresses, which in rare instances could cause variability in results from vectorized sections of the code.

You could try -fp-model settings (start with -fp-model precise) with some performance impact. I am assuming that you have set OMP_NUM_THREADS env var to the number of real cores on the system (16, right?) and are not varying the number of threads, as this could affect the order of calculations.

Hi Ron,

Sorry. It is version 10.1. I'll try the -fp-model precise switch to start with and if there are still differences I'll upgrade the compiler. Yes. I've been setting OMP_NUM_THREADS=4 actually. It doesn't scale up very well past this. I run a bunch of them at the same time though.

Thanks for your help. I'll let you know what happens.

Steve

jimdempseyatthecove:

Steve,

A multi-threaded program will not necessarily produce the same results from run to run when the parallel sections produce partial results that are sensitive to round-off error, which depends on the sequence of calculation. With careful coding you can strive to produce consistent results from run to run, assuming that the number of threads remains the same from run to run.

A simple example is the summation of a REAL(4) or REAL(8) array whose stored values are already rounded. The summation of this array may produce different results depending on the sequence in which you perform the summation; e.g. a single thread summing from front to back of the array may produce a different result from a single thread summing from back to front. The same applies when multiple threads each produce a partial sum for a strip of the array and a total is then formed by summing the partial sums. Note that even the order in which you sum the partial sums into the total may vary the result.

In the partial-sums-to-total-sum example you can attain consistency by having the same number of threads perform the summation from run to run, storing the partial sums separately into a results array, and then, outside the parallel region, performing the total summation in the same order (e.g. from front to back of the partial-sums array). In this manner the partial sums are produced in the same way from run to run, and the total sum is performed in the same sequence from run to run.
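To make that concrete, here is a minimal sketch (illustrative only; the program, array name and size are made up, not taken from POM) of a reproducible parallel summation: each thread accumulates its partial sum into its own slot of a results array, and the total is then formed serially in a fixed order.

   program deterministic_sum
      use omp_lib
      implicit none
      integer, parameter :: n = 100000
      real(8) :: a(n), total
      real(8), allocatable :: partial(:)
      integer :: i, tid, nthreads

      call random_number(a)               ! stand-in for real model data

      nthreads = omp_get_max_threads()
      allocate(partial(0:nthreads-1))
      partial = 0.0d0

      ! Each thread accumulates into its own slot of the results array.
      ! With a static schedule and a fixed thread count, every thread sums
      ! the same strip of the array, front to back, on every run.
   !$omp parallel private(tid)
      tid = omp_get_thread_num()
   !$omp do schedule(static)
      do i = 1, n
         partial(tid) = partial(tid) + a(i)
      end do
   !$omp end do
   !$omp end parallel

      ! Serial reduction of the partial sums in a fixed order.
      total = 0.0d0
      do i = 0, nthreads - 1
         total = total + partial(i)
      end do

      print *, 'total = ', total
   end program deterministic_sum

Compiled with something like ifort -openmp -fp-model precise (options assumed here, free-form source), this should print the same total on every run as long as OMP_NUM_THREADS is not changed between runs.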

The above is a simple situation. It becomes harder to reconcile the results the more complicated the process becomes.

Jim Dempsey

www.quickthreadprogramming.com

tim18:

Quoting - Steve Cousins

I've been setting OMP_NUM_THREADS=4 actually. It doesn't scale up very well past this. I run a bunch of them at the same time

It's difficult to get good OpenMP or -parallel performance on a 4 socket machine, if you meant that. Setting environment variables with KMP_AFFINITY or GOMP_CPU_AFFINITY would be an important step to reduce variations in performance. It looks like you would need to give each simultaneous job the list of core numbers for a different socket. Running a single job with various numbers of threads and KMP_AFFINITY=compact,0,verbose and the like would be a start to your experiment.
If the attempt to improve alignment consistency in 11.0 doesn't help, together with affinity, you should also check for data races, e.g. with Intel Thread Checker.
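As a rough illustration only (core numbering is system dependent, so check it on your machine; the executable names here are placeholders), two simultaneous 4-thread jobs could each be pinned to their own socket like this:

   export OMP_NUM_THREADS=4
   # job 1 on the cores of the first socket
   KMP_AFFINITY="verbose,granularity=fine,proclist=[0,1,2,3],explicit" ./model_run1 &
   # job 2 on the cores of the second socket
   KMP_AFFINITY="verbose,granularity=fine,proclist=[4,5,6,7],explicit" ./model_run2 &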

Steve Cousins:

Quoting - tim18

It's difficult to get good OpenMP or -parallel performance on a 4 socket machine, if you meant that. Setting environment variables with KMP_AFFINITY or GOMP_CPU_AFFINITY would be an important step to reduce variations in performance. It looks like you would need to give each simultaneous job the list of core numbers for a different socket. Running a single job with various numbers of threads and KMP_AFFINITY=compact,0,verbose and the like would be a start to your experiment.
If the attempt to improve alignment consistency in 11.0 doesn't help, together with affinity, you should also check for data races, e.g. with Intel Thread Checker.

Thanks for the information about CPU affinity. Currently my goal is just to get run-to-run consistency, which is proving difficult. After that I'll take a look at the env. vars you mention. It would be nice to get better performance with more cores being used.

Steve Cousins:

I have made various runs, starting with the version I already had (10.1) and then changing to 11 after no luck. With 11: still no luck. Even running without -parallel, so I can focus on the calculations as opposed to anything to do with threads (correct?), there are differences. The following is output of temperature and salinity values after compiling with:

ifort -O0 -fp-model precise *.f

run 1:
iteration temperature salinity
0 6.36662 31.87842
100 6.52780 31.83795

run 2:
iteration temperature salinity
0 6.36662 31.87842
100 6.52682 31.83747

After 100 iterations the values vary in the third to fourth decimal places.

Using -fp-model strict yielded similar results:

run 1:
iteration temperature salinity
0 6.36662 31.87842
100 6.52835 31.83663

run 2:
iteration temperature salinity
0 6.36662 31.87842
100 6.52647 31.83863

In fact, these show a wider difference than with -fp-model precise, although the sample size is only two in both cases, so -fp-model strict could very well be the same or better overall.

Any other ideas of things to try?

Thanks,

Steve

tim18:

I'm somewhat amazed that the OS hasn't been identified yet. Differences due to varying alignment of vectorized sum reduction normally were more prevalent on a 32-bit OS, due to the lack of consistent alignments of allocation, but those should have been suppressed by either -fp-model precise or source. I'm not familiar with the fix mentioned earlier by Ron, as to whether it applied to 32- or 64-bit Linux.

Steve Cousins:

Quoting - tim18
I'm somewhat amazed that the OS hasn't been identified yet. Differences due to varying alignment of vectorized sum reduction normally were more prevalent on a 32-bit OS, due to the lack of consistent alignments of allocation, but those should have been suppressed by either -fp-model precise or source. I'm not familiar with the fix mentioned earlier by Ron, as to whether it applied to 32- or 64-bit Linux.

It is Linux with an x86_64 kernel:

Linux merc 2.6.27.5-88.asl.2.fc7 #1 SMP Mon Dec 8 11:11:39 PST 2008 x86_64 x86_64 x86_64 GNU/Linux

The ifort compiler came from l_cprof_p_11.0.074_intel64.tgz

Does that help?

Steve

Martyn Corden (Intel):

Hi Steve,

There are of course ways, both obvious and subtle, in which one can make floating-point output vary from run to run. For example, allocating memory to store a trimmed date or time may change subsequent data alignment, which can result in a different path through optimized code. Or certain ways of choosing random number seeds. Nevertheless, Jim and Ron have called out the most common causes that I am aware of.

The possible dependence on initial stack alignment was fixed in the 11.0 compiler, for both IA-32 and Intel 64, by aligning the stack pointer at the start of the main program to the next cache line boundary. In any case, the optimizations that could result in variations depending on this alignment are suppressed by fp-model precise.

More pervasive is the case of parallelized reductions described by Jim. When you compile with -fp-model precise, vectorized reductions are suppressed. Threaded reductions resulting from -parallel should also be suppressed, though I haven't yet verified that personally. On the other hand, OpenMP reductions are specified explicitly by the programmer, so -fp-model precise has no effect on these; likewise for MPI reductions.
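For example, in a sketch like the following (illustrative routine, not from the ocean model), the reduction is carried out by the OpenMP runtime, and -fp-model precise does not constrain the order in which the per-thread partial sums are combined:

   subroutine sum_reduce(a, n, total)
      implicit none
      integer, intent(in) :: n
      real(8), intent(in) :: a(n)
      real(8), intent(out) :: total
      integer :: i

      total = 0.0d0
      ! Explicit OpenMP reduction: the runtime decides how the per-thread
      ! partial sums are combined, so the last bits of the result can differ
      ! when the thread count or timing changes.
   !$omp parallel do reduction(+:total)
      do i = 1, n
         total = total + a(i)
      end do
   !$omp end parallel do
   end subroutine sum_reduce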

We should focus first on getting consistent run-to-run results for serial code.
I presume you are not using the MPI version, and that you are running on the same Xeon system every time?
It's very surprising that you should still see a difference, even with -fp-model precise or -fp-model strict. I'm not concerned about the size of any differences between runs; I am concerned that there are any differences at all.
What library calls does your program make? I suggest exporting OMP_NUM_THREADS=1 as a precaution, in case you are linking in code threaded with OpenMP. I suggest also setting KMP_VERSION=yes; this should print out a message if your application is initializing the OpenMP runtime library. Please continue using the 11.0 compiler. Has the main program been rebuilt with 11.0?
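Something along these lines, where the executable name is just a placeholder:

   export OMP_NUM_THREADS=1
   export KMP_VERSION=yes    # prints a banner if the OpenMP runtime is initialized
   ./pom_serial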

Next, do you continue to get different results when you keep rerunning, or do the same values start to repeat themselves? Alignment issues typically give a very limited number of different results. If you see many results, I would go and check carefully how your random number seeds are chosen.

Please let us all know what you find.

Martyn

jimdempseyatthecove:

Steve,

One other thing for you to check.

Make sure that all parallel regions of your code that update and/or use shared values do so in a thread-safe manner.

Make sure !$OMP ATOMIC and/or !$OMP CRITICAL(name) is used correctly.

Even with proper use of atomic and critical sections you may experience variations in results (exclusive of round-off error propagation discussed earlier).

An example might be where each thread is working on convergence to a solution. When convergence is detected, a value is set (e.g. a residual falls below an EPSILON) and is tested by all threads for use as a bail-out condition. Depending on how you set and reference this value, the bail-out may occur anywhere between 0 and (number of threads - 1) additional convergence iterations later. You can, to some extent, program around this, but it is hard and you may need to make compromises.
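A minimal sketch of that kind of shared bail-out flag (the routine, names, and placeholder error estimate are made up, not taken from POM); even though the flag is updated safely in a named critical section, different threads may notice it at slightly different times on different runs:

   subroutine iterate_until_converged(maxiter, eps)
      implicit none
      integer, intent(in) :: maxiter
      real(8), intent(in) :: eps
      logical :: converged
      integer :: iter
      real(8) :: local_err

      converged = .false.
   !$omp parallel private(iter, local_err) shared(converged)
      do iter = 1, maxiter
         ! ... this thread's share of the work, producing local_err ...
         local_err = 0.0d0                 ! placeholder for a real estimate
         if (local_err < eps) then
   !$omp critical (set_converged)
            converged = .true.             ! thread-safe update of shared flag
   !$omp end critical (set_converged)
         end if
   !$omp flush (converged)
         ! Other threads may not see the flag until 0 to nthreads-1 (or more)
         ! additional iterations have passed, depending on timing.
         if (converged) exit
      end do
   !$omp end parallel
   end subroutine iterate_until_converged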

Jim Dempsey

www.quickthreadprogramming.com

Steve Cousins:

Quoting - jimdempseyatthecove
Make sure that all parallel regions of your code that update and/or use shared values do so in a thread-safe manner.

Make sure !$OMP ATOMIC and/or !$OMP CRITICAL(name) is used correctly.

Hi Jim,

At this point my tests are running the code completely serially. Still getting the differences.

Steve

Steve Cousins:

Quoting - Martyn Corden (Intel)

We should focus first on getting consistent run-to-run results for serial code.
I presume you are not using the MPI version, and that you are running on the same Xeon system every time?
It's very surprising that you should still see a difference, even with -fp-model precise or -fp-model strict. I'm not concerned about the size of any differences between runs; I am concerned that there are any differences at all.
What library calls does your program make? I suggest exporting OMP_NUM_THREADS=1 as a precaution, in case you are linking in code threaded with OpenMP. I suggest also setting KMP_VERSION=yes; this should print out a message if your application is initializing the OpenMP runtime library. Please continue using the 11.0 compiler. Has the main program been rebuilt with 11.0?

Next, do you continue to get different results when you keep rerunning, or do the same values start to repeat themselves? Alignment issues typically give a very limited number of different results. If you see many results, I would go and check carefully how your random number seeds are chosen.

Hi Martyn,

I agree with trying to get it to run correctly with serial runs. This is not MPI code and there are no explicit OpenMP calls. Any parallelism is done through the -parallel switch, which I have not been using in my later runs. I'll set OMP_NUM_THREADS=1 and KMP_VERSION=yes.

There are a couple of parts to this code. First is the physical model which calculates velocities, salinity, and temperature. The other part of this model uses the physical model for input to the development of biological tracers. For instance the currents influence where the tracer goes. The temperature influences how fast the tracer develops. I mention this because this second part does use a random number generator in the tracer trajectory calculations. However, the physical model does not use the random generator. All of the numbers that I am watching are part of the physical model and these should be the same each time. There is no feedback such that the trajectories would influence the physical model.

In any case, I'll run it a few times to see if I get distinct sets of results or if they are always different.

Thanks for your help.

Steve

jimdempseyatthecove:

When your serial code is getting different results from run to run there are a few likely suspects:

1) Use of uninitialized variables
2) different input data
3) time-dependent input data (a variant of 2) (e.g. reading an instrument)
4) thermal problems (e.g. from overclocking)
5) flaky RAM or I/O cards

Use of uninitialized variables (potentially from a typographical error) is often the main source of problems.
(There is also the false assumption that variables which have not been assigned before use are initialized to 0.)
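For what it's worth, the classic pattern looks something like this hypothetical routine (not from POM): the accumulator is never zeroed, so it starts from whatever happens to be on the stack, which can differ from run to run:

   subroutine accumulate(a, n, total)
      implicit none
      integer, intent(in) :: n
      real(8), intent(in) :: a(n)
      real(8), intent(out) :: total
      real(8) :: s            ! bug: s is never initialized
      integer :: i

      do i = 1, n
         s = s + a(i)         ! starts from stack garbage, not from 0
      end do
      total = s               ! fix: set s = 0.0d0 before the loop
   end subroutine accumulate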

Jim Dempsey

www.quickthreadprogramming.com

Izaak Beekman:

This may not be very helpful, but, on the subject of initializing variables, you can try adding the -zero switch to initialize all unspecified variables to 0. Also it may be worthwhile to enable as many compiler warnings as possible, i.e. -g -trace -gen-interfaces -warn-interface -warn -check (-zero). Lastly, a bug I found hard to find was that if you initialize local variables in procedures when they are declared (i.e. real :: foo=0.0) they automatically obtain the SAVE attribute. Therefore if you have some driving program written in Fortran (or C?) which calls some procedure multiple times where a local variable is initialized at its declaration, it will retain its value from the previous call rather than being reset -- i.e. foo keeps its last value, not the 0.0 one might have intended.
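To illustrate that pitfall with a hypothetical routine (names made up):

   subroutine step(dt, t)
      implicit none
      real, intent(in)  :: dt
      real, intent(out) :: t
      real :: elapsed = 0.0   ! initialized at declaration => implicit SAVE:
                              ! elapsed keeps its value across calls
      ! If a reset on every call was intended, write instead:
      !    real :: elapsed
      !    elapsed = 0.0
      elapsed = elapsed + dt
      t = elapsed
   end subroutine step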

Also, has the physical model portion been validated? Does this non-repeatability occur on other systems, or using other compilers (gfortran, for example)? Have other people experienced this non-repeatability, or does the integration of this code with your new code cause this problem? I have had some really annoying hardware problems caused by dying nodes, etc., and these can be very challenging to troubleshoot. Testing the code on a different machine may be worthwhile to eliminate this possibility.

Also, if the code is large and you get really stuck trying to fix this problem, a unit testing approach may be tedious and time consuming at first, but it can help you go through each routine and ensure it is behaving correctly, and as you would expect. Then, once the system is in place, it can help you quickly diagnose and fix future problems. fUnit seems to me to be the best of the available unit testing frameworks, although it seems that one could possibly do better, perhaps by writing the testing harness and/or the tests in Python and calling the various Fortran routines from there. fUnit is written in Ruby, but the test cases are coded in Fortran with some custom macro-type features.

Hope some of this helps.
-Z

-Zaak

tim18:

-zero may work in many of the situations where certain compilers of a bygone age could initialize to zero. It's not guaranteed, and it won't by itself give SAVE status to a variable.
If you mistakenly used that initial value setting when you intended a plain assignment at run time, Intel thread checker might be able to catch it if it's in a parallel region.
These are 2 conflicting situations; one where the programmer might be depending on implicit SAVE where the standard doesn't supply it, and the other where implicit SAVE, specified by Fortran standard, is not what is intended.

Izaak Beekman:

Quoting - tim18
-zero may work in many of the situations where certain compilers of a bygone age could initialize to zero. It's not guaranteed, and it won't by itself give SAVE status to a variable.
If you mistakenly used that initial value setting when you intended a plain assignment at run time, Intel thread checker might be able to catch it if it's in a parallel region.
These are 2 conflicting situations; one where the programmer might be depending on implicit SAVE where the standard doesn't supply it, and the other where implicit SAVE, specified by Fortran standard, is not what is intended.

Indeed, I wasn't as clear as I could have been. I was not trying to imply that the -zero flag was in anyway associated with the SAVE attribute, just merely point out a potential pitfall of accidentally initializing a local variable with the (implicit) SAVE attribute when one intended to set it to some value at the beginning of each call to the procedure.

-Zaak

Martyn Corden (Intel):

These are sensible suggestions, though I think you'd be pretty unlucky to repeatedly pick up different values for uninitialized variables that made a significant difference without breaking anything.

I note that there is a compiler option -ftrapuv to initialize scalar variables on the stack to an "unusual" value that may provoke an exception if one is used before being initialized. This is complementary to -zero, which only applies to static, scalar variables. Use in conjunction with -fpe0 to unmask some floating-point exceptions and -traceback to get a lightweight stack trace. None of these options applies to arrays. I believe -check uninit is similar.
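In command-line form that could look something like this (the output name is a placeholder, and the option set is just one reasonable combination):

   ifort -O0 -g -traceback -fpe0 -ftrapuv -fp-model precise *.f -o pom_debug
   ./pom_debug    # may abort with a traceback if an uninitialized scalar
                  # feeds a floating-point operation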

Steve Cousins:

Thanks all for the dialog and suggestions. Unfortunately I have been taken away from this problem temporarily to install a new (to us) 64-CPU SGI (Rackable?) system at our supercomputer center. It is relevant only in that we'll be running a newer version of our model on that system in the near future. I'll get back to you on your points. I'll just say that we were able to get consistent results using the Portland Group compiler on Opteron systems. I've obviously got a lot of testing to do. It has been very helpful to have you point out the directions that you think are the best avenues to take. Thanks again. I'll update this when I can.
