AVX thread safety problem

I've got a multi-threaded program that calls sequential dgemm() from multiple threads. If I run this program on a Sandy Bridge processor, using the latest MKL (from C++ Composer 2011.2.137), I get subtly different numerical results each time I run the program. Not wrong answers - just small differences in the low-order bits. If I run the same program on an earlier processor (e.g., i7-920), I get the exact same numerical result each time I run it. If I run my program using only one thread on a Sandy Bridge processor, I get the exact same numerical result each time. If I use an older MKL (e.g., 10.2.6.038) on Sandy Bridge (no change to my program, other than linking with a different MKL version), I get the exact same numerical result each time (but slower, of course, since it doesn't use AVX).

Seems like there's some sort of thread safety problem in the AVX code inside MKL. Any known issues here?

Hi Ed, your name sounds familiar - seems we may have worked together many years ago. MKL may run a different code path based on the processor type, array alignment, or number of threads. Perhaps that is what you're running into; see this article: http://software.intel.com/en-us/articles/getting-reproducible-results-with-intel-mkl/ If these tips don't address your issues, we would certainly be interested in a test case.
Thanks,
Shane Story

Hi, Shane. Good to hear from you. You sat in the cube next to me at Cornell Oaks back in 1993.

Thanks for the pointer to the article. None of the issues mentioned there seem to explain my symptoms, though:
- I'm already using the sequential MKL (on each thread), so threading shouldn't cause an issue.
- I'm getting different results from one run to the next on the exact same machine, so processor type shouldn't be an issue.
- I've verified that the 16-byte alignment doesn't change from one run to the next.

Another piece of data: I see the problem when I compile for x86-64 Linux, but the x86 Linux version (same source code, linked against the same version of MKL, but 32-bit of course) doesn't have the problem.

I'll try to come up with a simple example that illustrates the issue.

Ed

As you indicate that the numerical differences with threading occur only with AVX, I would suspect that you don't get consistent 32-byte alignments. This could be an issue with the OS (or binutils) not offering MKL the capability to control 32-byte alignment. There is a build parameter for this in binutils, but of course that's not the entire story.
I've been told the OpenMP run-time library allocates with the maximum available alignments, but don't know the details.
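For readers who want to check the alignment question concretely, here is a minimal sketch of what "32-byte aligned" means for a buffer address and of the usual over-allocate-and-offset trick for getting an aligned array. It is written in Python with ctypes purely for illustration; the helper names `misalignment` and `aligned_doubles` are made up for this note and are not part of MKL or any compiler runtime. In a C, C++ or Fortran program the same check is simply the array's address modulo 32.

```python
import ctypes

ALIGNMENT = 32  # bytes; the width of one 256-bit AVX register


def misalignment(address, alignment=ALIGNMENT):
    """Return how many bytes `address` lies past the previous aligned boundary."""
    return address % alignment


def aligned_doubles(count, alignment=ALIGNMENT):
    """Return (array, backing): a ctypes array of `count` doubles whose first
    element starts on an `alignment`-byte boundary, plus the raw buffer it
    lives in (hypothetical helper, for illustration only)."""
    nbytes = count * ctypes.sizeof(ctypes.c_double)
    # Over-allocate so that an aligned start address is guaranteed to exist.
    backing = ctypes.create_string_buffer(nbytes + alignment - 1)
    base = ctypes.addressof(backing)
    offset = (-base) % alignment  # bytes to skip to reach the next boundary
    array = (ctypes.c_double * count).from_buffer(backing, offset)
    return array, backing


if __name__ == "__main__":
    plain = (ctypes.c_double * 1000)()
    addr = ctypes.addressof(plain)
    print("plain array:   address mod 16 =", misalignment(addr, 16),
          " mod 32 =", misalignment(addr, 32))

    arr, _backing = aligned_doubles(1000)
    print("aligned array: address mod 32 =",
          misalignment(ctypes.addressof(arr)))
```

Note that an address can be 16-byte aligned on every run and still land on a 32-byte boundary only some of the time, so a check against 16 alone does not settle the question for AVX.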

Thanks for the reply. Given the symptoms (subtle changes in numerical results), it does seem that memory alignment is most likely the cause.

Do the AVX kernels in MKL require 32-byte alignment to obtain consistent results? Was this in the MKL documentation?

I'll need to verify this, but I believe that my code only allocates 3 relevant arrays. If the alignment of my data were the only variable, I'd expect to see only 8 possible numerical results (each of the 3 arrays either 32-byte aligned or not, giving 2^3 possible outcomes). Instead, I get a different outcome every time I run my program (I just verified that I got 100 different results in 100 runs).

Is there any chance that the MKL kernels are dynamically allocating their own internal buffers, and those are not 32-byte aligned? Here's what I'm thinking. Suppose that MKL dynamically allocates an internal buffer for each thread, and those buffers aren't 32-byte aligned. That would mean that each thread would give a subtly different result for a dgemm() call. When my program dynamically assigns a task to a thread, the result would vary, depending on the alignment of the internal buffer on the thread that happened to process that task. That could certainly explain the wide variety of solutions I'm seeing.

Ed
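As background on why alignment could plausibly change the low-order bits at all: floating-point addition is not associative, and a vectorized loop typically handles a few leading elements one at a time until it reaches an aligned address, then accumulates the rest in several parallel lanes. A different starting alignment therefore means a different grouping of the same additions. The sketch below models only that general mechanism in Python; `vector_style_sum` is a hypothetical function written for this note and is not a description of what MKL's dgemm kernels actually do.

```python
import math


def vector_style_sum(values, peel, lanes=4):
    """Sum `values` in the order a vectorized reduction might use.

    The first `peel` elements are added one at a time (modelling the scalar
    loop that runs until the data reaches an aligned address). The remaining
    elements are accumulated round-robin into `lanes` partial sums, which
    are combined at the end.
    """
    head = 0.0
    for x in values[:peel]:
        head += x

    partial = [0.0] * lanes
    for i, x in enumerate(values[peel:]):
        partial[i % lanes] += x

    total = head
    for p in partial:
        total += p
    return total


if __name__ == "__main__":
    # The classic three-term case: same numbers, different grouping.
    print(repr((0.1 + 0.2) + 0.3), repr(0.1 + (0.2 + 0.3)))

    # Same data, four different "alignments" (peel counts).
    data = [0.1 * (i + 1) for i in range(1000)]
    for peel in range(4):
        print(peel, repr(vector_style_sum(data, peel)))
    print(repr(math.fsum(data)), "(correctly rounded reference)")
```

Every variant is a legitimate sum of the same numbers, and all of them agree to within rounding error. Any difference between them is confined to the last few bits, which matches the symptom described at the top of the thread.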

I figured out the problem. Shane and Tim, thanks for your very helpful responses.

Hi Edward,
I'm experiencing a very similar issue, can you provide more details about how you fixed this problem?

Thank you!
Regards,

Yanick Ct

I am having the same issue (or at least I think it is) with ifort (IFORT) 12.1.0 20110811 in an OpenMP program, not calling any of the MKL functions.

Could you please share the solution that you found? Is it just about proper alignment?

Thank you!
Best regards
Andreas

This doesn't look to me much like the subject of the earlier part of the thread.  The only way I know of where ifort has an option to call or not call MKL is the -opt-matmul option, where Fortran MATMUL is converted automatically to call MKL.  For those versions of ifort which implement this, it is set implicitly by -O3 and may be set outside -O3 by -opt-matmul.  If you have MATMUL in a parallel region, even if MKL is called, it will not use additional threads unless OMP_NESTED is set.

I don't see how alignment issues could cause an MKL function call to be skipped.

The reason I think it is related is that AVX seems to be the culprit and the symptoms are similar:
Threaded code compiled with -O3 -axAVX,SSE4.2 produces different results from one run to the next.
The same code compiled with -O3 -axSSE4.2 produces identical results.

But I do not think that MKL functions are used anywhere or are related to the issue I am seeing. I only change the compiler flags of one Fortran source file; the rest remains the same. And this one source file has no links to MKL.

It would be great, though, to see how this thread's issue was resolved in the end.

Should I open another thread?

It would be easy to check (with ldd) whether Intel MKL is used due to an implicit option, as mentioned by Tim. Anyhow, let's assume it's not about Intel MKL. You're then back to memory alignment, different code paths, and multi-versioned code. The latter picks a version of the generated code at run time based on a heuristic (e.g., a small trip count asks for scalar code, etc.).

Anyhow, with the Intel Fortran Compiler V13 you can supply "-align array32byte" on the compiler's command line. This aligns the start of arrays accordingly when possible. Beyond this, you may want to look at http://software.intel.com/en-us/articles/consistency-of-floating-point-results-using-the-intel-compiler, and in particular at the PDF/paper attached to that article (IMF precision/consistency, etc.).

It was an alignment issue.

Thank you very much for the reply!

Best regards
Andreas
