Bug related to auto-parallel and SSE3

I'm seeing a bug that produces random, non-repeatable output when I use "-parallel" in conjunction with "-xSSE3" or any higher SSE instruction set. Omitting the auto-parallel flag works fine with any SSE level (I first noticed the problem with -xHost), and using -parallel with -xSSE2 also produces correct output. I see this with ifort 13.1.3 as well as the x3_2013.1.117 version. The CPU is an Intel i5-750 (which has SSE4.2), running on x64 Ubuntu.

It'll likely take me considerably more time to isolate the offending source well enough to give a reproducible error, so I thought I'd ask here first in case someone has seen this quirk before. Any ideas as to the cause of something like this, and/or where to start looking?

Sorry, this description is far too vague to offer anything useful. If there is a compiler bug here, it would be due to some specific usage in the code that triggers it, not just the combination of features. If you can provide a test case it will help a lot. It is also possible that there's an error in the source code that gets exposed by certain operations - an uninitialized variable would be my first guess based on the symptoms.

Steve - Intel Developer Support

Apologies, I know it's extremely vague; I just asked on the off chance that the SSE aspect sounded familiar to someone. An uninitialized variable is my first guess as well; it just struck me as odd that it only shows up when loops are being auto-parallelized. I'll try a manual binary search on the build system with the compiler flag and see if I can find the cause.

I traced the culprit to the auto-parallelization of this innocuous-looking loop. The two arguments in the function call are slices of array pointers. Any thoughts?

subroutine InviscidWallUpdate(this)

  class(BCInviscidWall),intent(inout) :: this
  integer(kind=p_int) :: i,j

  do concurrent(j=1:this%jl-1,i=1:this%il-1)
    this%FloVarExt(:,i,j) = InviscidWallBC( this%FloVarInt(:,i,j), this%unitFaceNormal(:,i,j) )
  end do
end subroutine InviscidWallUpdate

pure function InviscidWallBC(W_in,faceNormal) result(W_ext)

  real(kind=p_real),intent(in) :: W_in(5),faceNormal(3)
  real(kind=p_real) :: W_ext(5),unorm(1:3)

  W_ext(1) = W_in(1)
  unorm = dot_product(W_in(2:4),faceNormal(1:3)) * faceNormal(1:3)
  W_ext(2:4) = W_in(2:4) - 2.d0*unorm
  W_ext(5) = W_in(5)
end function InviscidWallBC

Do these slices possibly overlap storage? Can you construct a self-contained test program that demonstrates the problem? I am a bit concerned about the use of explicit bounds in the function - how certain are you that these are correct for the arguments passed?

Steve - Intel Developer Support

It took some finagling, but I managed to create a self-contained demo (I'm sure it could be shrunk further). As near as I can tell, this example should be fully deterministic, as all the variables are initialized with constants determined at compile time. Compile with auto-parallelization:

mlohry@spitfire:~/development/ftest$ ifort -v
ifort version 13.1.3

mlohry@spitfire:~/development/ftest$ ifort -O2 -r8 -parallel -par-report -xHost parallelbug.f90
parallelbug.f90(64): (col. 8) remark: LOOP WAS AUTO-PARALLELIZED.
parallelbug.f90(23): (col. 5) remark: LOOP WAS AUTO-PARALLELIZED.

Non-deterministic output:

mlohry@spitfire:~/development/ftest$ ./a.out
1099.37618675320
mlohry@spitfire:~/development/ftest$ ./a.out
1100.63617967065
mlohry@spitfire:~/development/ftest$ ./a.out
1100.63617967065
mlohry@spitfire:~/development/ftest$ ./a.out
1102.73296858306

Non-parallel compilation, deterministic output:

mlohry@spitfire:~/development/ftest$ ifort -O2 -r8 -parallel -par-report parallelbug.f90

mlohry@spitfire:~/development/ftest$ ./a.out
1100.63617967065
mlohry@spitfire:~/development/ftest$ ./a.out
1100.63617967065
mlohry@spitfire:~/development/ftest$ ./a.out
1100.63617967065
mlohry@spitfire:~/development/ftest$ ./a.out
1100.63617967065

That's on an i5 with v13.1.3, I get the same behavior on an AMD with 12.1.5. Most of the time this reduced example gives the same results, but maybe 1 in 3 does not. Hopefully you can reproduce this!

Attachments:

parallelbug.f90 (1.81 KB)

Thanks - I can reproduce the issue, though I'm not yet convinced it's a bug. Parallelization and vectorization can cause operations to be grouped differently, and this can cause small run-to-run differences in floating-point results. I'll take a closer look.

Steve - Intel Developer Support

Great, thanks. I wasn't claiming it to be a compiler-side bug; I just noted that it was enough to cause catastrophic failure in my application, where I didn't expect a small floating-point error to do any damage.

Outputting the norm2 of the whole vector washes over the error and makes it look like a small floating-point error; sorry about that. Attached is the same thing, but outputting the first elements of the arrays, where the loop simply assigns the first element of the input to the first element of the output. The first and second rows should always be identical, but in some runs you get a 2.0 instead of a 4.0, for example. That's not the small floating-point error one would get from rearranging floating-point operations, since there are none there to rearrange.

edit: Changed the code to do nothing but assign the whole array from one to the other. The left-hand side should always be exactly equal to the right-hand side because it's just doing assignment, but 1 in 10 times it's not -- it looks like the function results for a given index are going to the wrong place, i.e. the parallelized DO CONCURRENT is not consistently using the correct indices.

Correct:

mlohry@spitfire:~/development/ftest$ ./a.out
LHS norm: 42.4264068711929 RHS norm: 42.4264068711929
2.00 2.00 2.00 2.00 2.00 === 2.00 2.00 2.00 2.00 2.00 1 1
2.00 2.00 2.00 2.00 2.00 === 2.00 2.00 2.00 2.00 2.00 2 1
2.00 2.00 2.00 2.00 2.00 === 2.00 2.00 2.00 2.00 2.00 3 1
4.00 4.00 4.00 4.00 4.00 === 4.00 4.00 4.00 4.00 4.00 1 2
4.00 4.00 4.00 4.00 4.00 === 4.00 4.00 4.00 4.00 4.00 2 2
4.00 4.00 4.00 4.00 4.00 === 4.00 4.00 4.00 4.00 4.00 3 2
6.00 6.00 6.00 6.00 6.00 === 6.00 6.00 6.00 6.00 6.00 1 3
6.00 6.00 6.00 6.00 6.00 === 6.00 6.00 6.00 6.00 6.00 2 3
6.00 6.00 6.00 6.00 6.00 === 6.00 6.00 6.00 6.00 6.00 3 3
8.00 8.00 8.00 8.00 8.00 === 8.00 8.00 8.00 8.00 8.00 1 4
8.00 8.00 8.00 8.00 8.00 === 8.00 8.00 8.00 8.00 8.00 2 4
8.00 8.00 8.00 8.00 8.00 === 8.00 8.00 8.00 8.00 8.00 3 4

Not correct:

mlohry@spitfire:~/development/ftest$ ./a.out
LHS norm: 43.1277173056956 RHS norm: 42.4264068711929
4.00 4.00 4.00 4.00 4.00 === 2.00 2.00 2.00 2.00 2.00 1 1
2.00 2.00 2.00 2.00 2.00 === 2.00 2.00 2.00 2.00 2.00 2 1
2.00 2.00 2.00 2.00 2.00 === 2.00 2.00 2.00 2.00 2.00 3 1
4.00 4.00 4.00 4.00 4.00 === 4.00 4.00 4.00 4.00 4.00 1 2
4.00 4.00 4.00 4.00 4.00 === 4.00 4.00 4.00 4.00 4.00 2 2
4.00 4.00 4.00 4.00 4.00 === 4.00 4.00 4.00 4.00 4.00 3 2
6.00 6.00 6.00 6.00 6.00 === 6.00 6.00 6.00 6.00 6.00 1 3
6.00 6.00 6.00 6.00 6.00 === 6.00 6.00 6.00 6.00 6.00 2 3
6.00 6.00 6.00 6.00 6.00 === 6.00 6.00 6.00 6.00 6.00 3 3
8.00 8.00 8.00 8.00 8.00 === 8.00 8.00 8.00 8.00 8.00 1 4
8.00 8.00 8.00 8.00 8.00 === 8.00 8.00 8.00 8.00 8.00 2 4
8.00 8.00 8.00 8.00 8.00 === 8.00 8.00 8.00 8.00 8.00 3 4
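The attachment itself isn't inlined in the thread. As a hedged sketch only, a minimal assignment-only reproducer in the same spirit might look like the following; the shapes (il=4, jl=5, five components per cell) and the fill values (2.0*j) are guesses inferred from the printed output above, and the `Identity` function stands in for the attachment's function call:

```fortran
! Hypothetical reconstruction -- NOT the actual attachment, whose source
! is not shown in the thread.  Shapes and fill values are guessed so that
! the serial norm matches the 42.4264... printed above.
program parallelbug_sketch
  implicit none
  integer, parameter :: il = 4, jl = 5
  real(kind(1.d0)) :: FloVarInt(5,il-1,jl-1), FloVarExt(5,il-1,jl-1)
  integer :: i, j

  ! Fill the input with values that depend only on j, as in the output.
  do j = 1, jl-1
    do i = 1, il-1
      FloVarInt(:,i,j) = 2.d0 * j
    end do
  end do

  ! Pure assignment through a function call; the left-hand side should
  ! always equal the right-hand side exactly.
  do concurrent (j = 1:jl-1, i = 1:il-1)
    FloVarExt(:,i,j) = Identity(FloVarInt(:,i,j))
  end do

  print *, 'LHS norm:', sqrt(sum(FloVarExt**2)), &
           'RHS norm:', sqrt(sum(FloVarInt**2))

contains

  pure function Identity(W_in) result(W_ext)
    real(kind(1.d0)), intent(in) :: W_in(5)
    real(kind(1.d0)) :: W_ext(5)
    W_ext = W_in
  end function Identity

end program parallelbug_sketch
```

With these guessed values, sum((2j)^2) over j=1..4 times 15 cells is 1800, so both norms should print sqrt(1800) = 42.4264...; under the faulty auto-parallelization described above, the LHS norm would occasionally differ.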

Attachments:

parallelbug.f90 (2.25 KB)

Thanks - I have escalated this to the developers as issue DPD200246662 and will let you know of any progress. So far, I have been able to reproduce this only when building for IA-32, not when building for Intel 64 (x64). Are you doing a 32-bit or 64-bit build?

Steve - Intel Developer Support

This was all with x64 builds; I haven't tried 32.

Mark,

What happens when you add "recursive"?

recursive pure function InviscidWallBC(...

Jim Dempsey

www.quickthreadprogramming.com

-parallel implies that.

Steve - Intel Developer Support

Mark,

Looking at your (incorrect) output, the entire first row containing all 4.00's versus the correct 2.00's is indicative of an incorrect index being used, as opposed to rounding error. If I were to guess (look), I would check whether i and j are atomically captured and advanced through the iteration space in a thread-safe way.

My compiler version is older than the current one, so I cannot test this hypothesis (by examining the Disassembly window).

Jim Dempsey

www.quickthreadprogramming.com

Jim, agreed that the indices look like they're being used out of order. i and j are local to the subroutine and are only ever used in the DO CONCURRENT, so they certainly *should* be thread-safe. I don't know nearly enough assembly to find what you're asking, so feel free to glance through the attached assembly output.

Attachments:

parallelbug-assem.txt (315.98 KB)

>>so they certainly *should* be thread safe

This depends on the implementation of the DO CONCURRENT. DO CONCURRENT can perform the permutations of the input indices (i and j) in any order. This is not the same as a PARALLEL DO with COLLAPSE(2) on DO loops over i and j. While DO CONCURRENT could be implemented using code equivalent to a PARALLEL DO with COLLAPSE(2) on DO loops over i and j, it can, and may, be implemented with an atomic pick-next scheme. The symptom you have is indicative of an incorrect atomic pick-next scheme. This is only an observation/supposition on my part; looking at the disassembly code would show for sure.

Jim Dempsey

www.quickthreadprogramming.com
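To illustrate the distinction above with plain arrays (names illustrative, not from the attachment): the OpenMP form below fixes an explicit worksharing schedule over the collapsed (j,i) space, while DO CONCURRENT asserts only that the iterations are independent, leaving the visit order and any pick-next machinery up to the compiler.

```fortran
program schedule_sketch
  implicit none
  integer, parameter :: il = 4, jl = 5
  real :: a(5,il-1,jl-1), b(5,il-1,jl-1)
  integer :: i, j

  call random_number(b)

  ! OpenMP: explicit worksharing loop over the collapsed (j,i) space;
  ! each thread receives a deterministic chunk of iterations.
  !$omp parallel do collapse(2) private(i,j)
  do j = 1, jl-1
    do i = 1, il-1
      a(:,i,j) = b(:,i,j)
    end do
  end do
  !$omp end parallel do

  ! DO CONCURRENT: the same independence guarantee, but the compiler may
  ! visit the (i,j) permutations in any order it likes -- including an
  ! atomic pick-next scheme, which is where the suspected bug lives.
  do concurrent (j = 1:jl-1, i = 1:il-1)
    a(:,i,j) = b(:,i,j)
  end do

  print *, all(a == b)   ! both forms should leave a identical to b
end program schedule_sketch
```

Build with -parallel for the auto-parallelizer or -openmp for the directive form; either way, a and b should always compare equal.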

In this context, it is the compiler's responsibility to make sure that the indices are thread-safe. My guess is that the real problem is that the array slice code is perhaps using static storage inappropriately. The developers will figure it out.

Steve - Intel Developer Support

Mark,

I am unable to reproduce this problem at my site; fortunately, Steve is. Additionally, for this app, the auto-parallelization is not producing parallel code for the DO CONCURRENT section, so further tests on my system are inconclusive.

Jim Dempsey

www.quickthreadprogramming.com
