I'm running into a very puzzling problem with our code, and it is quite difficult to explain.
Our code is parallelized with MPI. Until now we always compiled it with -O3 optimization and never saw any issues with it. When I compile with -O3 -xAVX, I start to see buggy behaviour, which manifests itself as negative numbers in a density matrix. These values appear in the layers where the parallelization happens: the domain covers a certain height in Z, and when run in parallel each rank gets a chunk of this domain, divided only along Z. The negative values show up in the edge layers of each rank, where communication happens to send/receive the neighbouring values.
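For reference, the exchange follows the usual ghost-layer pattern. Below is a minimal sketch of what I mean, not our actual code; all the names (NX, NY, nz_local, rho, exchange_halos) are placeholders:

    /* Minimal sketch of the Z-slab halo exchange described above. Each rank
     * owns nz_local interior Z-layers plus one ghost layer on each side;
     * the negative values appear in exactly these edge/ghost layers. */
    #include <mpi.h>

    #define NX 64                       /* placeholder grid dimensions */
    #define NY 64

    /* rho is laid out layer by layer: indices z = 0 and z = nz_local + 1
     * are the ghost layers, z = 1 .. nz_local the interior. */
    static void exchange_halos(double *rho, int nz_local, MPI_Comm comm)
    {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        int up    = (rank + 1 < size) ? rank + 1 : MPI_PROC_NULL;
        int down  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
        int layer = NX * NY;            /* doubles per Z-layer */

        /* send top interior layer up, receive bottom ghost from below */
        MPI_Sendrecv(rho + nz_local * layer, layer, MPI_DOUBLE, up, 0,
                     rho,                    layer, MPI_DOUBLE, down, 0,
                     comm, MPI_STATUS_IGNORE);
        /* send bottom interior layer down, receive top ghost from above */
        MPI_Sendrecv(rho + layer,                  layer, MPI_DOUBLE, down, 1,
                     rho + (nz_local + 1) * layer, layer, MPI_DOUBLE, up, 1,
                     comm, MPI_STATUS_IGNORE);
    }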
When compiled without -xAVX we never see any issue, but when compiled with -xAVX I always see the problem arise after just a few iterations, and the values are not the same from run to run. So there seems to be a race condition or some uninitialized value somewhere, but one which only kicks in when compiling with -xAVX.
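A check along the following lines, run on every rank after each halo exchange, is how the symptom can be pinned down to a specific rank, iteration and layer (again using the placeholder names from the sketch above):

    #include <stdio.h>

    /* Debug check: scan the two ghost layers of rho for negative entries
     * and report the first offender. NX, NY and mpi.h as in the sketch
     * above. */
    static void check_ghost_layers(const double *rho, int nz_local,
                                   int iter, MPI_Comm comm)
    {
        int rank;
        MPI_Comm_rank(comm, &rank);
        int layer = NX * NY;
        int ghosts[2] = { 0, nz_local + 1 };   /* ghost layer indices */

        for (int g = 0; g < 2; ++g) {
            const double *p = rho + ghosts[g] * layer;
            for (int i = 0; i < layer; ++i) {
                if (p[i] < 0.0) {
                    fprintf(stderr,
                            "rank %d, iter %d: rho = %g < 0 in ghost layer "
                            "z = %d, offset %d\n",
                            rank, iter, p[i], ghosts[g], i);
                    return;          /* report only the first hit per call */
                }
            }
        }
    }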
Actually, to make things a bit weirder, our code is composed of three parts: the main program, a core library and a module library. I can compile everything with -xAVX as long as I build two things without it:
- one particular function in the core library
- the linking stage of the core library
As long as I do it like that, the buggy behaviour doesn't show up. The funny thing is that this function doesn't contain any MPI code, and it actually runs only once, at the beginning of the execution, to store some data that is used later on throughout the run.
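At the moment this is done with per-file flags in the build system. A per-function alternative that I know of with gcc is the optimize attribute, sketched below (with icc I only know the per-file route, e.g. compiling just that translation unit with -no-vec); the function name and body here are hypothetical stand-ins for the real initialization routine:

    /* Workaround sketch: keep the whole build on AVX but force this one
     * function to scalar code (gcc-only; the attribute maps to
     * -fno-tree-vectorize for this function). Names and body are
     * placeholders. */
    __attribute__((optimize("no-tree-vectorize")))
    void build_lookup_table(double *table, int n)
    {
        /* runs once at startup; fills data used throughout the run */
        for (int i = 0; i < n; ++i)
            table[i] = 1.0 / (double)(i + 1);   /* placeholder computation */
    }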
I have also compiled the whole code with gcc and -mavx, and in that case I don't get any faulty behaviour, but I don't know what gcc or icc do internally when -mavx or -xAVX is set.
Any ideas or suggestions on how to go about debugging something like this? It certainly looks like a parallelization problem, but the fact that it only shows up when -xAVX is used is weird to me.