Performance of double structure indirection

When I compile a 2D loop nest with a double structure indirection (a%b%c) in the innermost loop, ifort generates poorer code than I get by (i) explicitly hoisting one indirection or (ii) using the GNU 4.8 compiler. Should I use a particular flag, or is there a "performance bug" in ifort?
Thank you in advance
do i = 1, size_outer
   elt => elts(i)
   do j = 1, size_inner
      array (elt%p%val(j)) = 0.
   end do
end do

In the full application (with many such double indirections in hot routines), I got an overall 15-20% performance improvement by switching from Intel 14.0.1 to GNU 4.8.2 (or by hoisting the double indirections). Using ifort -O3 instead of -O2 does not help.
All required files are attached (.o and .asm uploads are not supported, so they are grouped in asm.tgz):
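For readers without the attachment, here is a minimal sketch of what "hoisting the double indirection" could look like. The type names (inner_t, elt_t), the routine name zero_hoisted, and the choice of a pointer component for val are assumptions made for illustration; the attached ddo.f90 may be organized differently.

```fortran
module hoist_sketch
   implicit none
   ! Hypothetical types: the real declarations live in ddo.f90 (not shown here).
   type :: inner_t
      integer, pointer :: val(:) => null()
   end type inner_t
   type :: elt_t
      type(inner_t), pointer :: p => null()
   end type elt_t
contains
   subroutine zero_hoisted(elts, array, size_inner)
      type(elt_t), intent(in) :: elts(:)
      real, intent(inout) :: array(:)
      integer, intent(in) :: size_inner
      integer, pointer :: val(:)
      integer :: i, j
      do i = 1, size(elts)
         ! Resolve both indirections once per outer iteration
         val => elts(i)%p%val
         do j = 1, size_inner
            array(val(j)) = 0.
         end do
      end do
   end subroutine zero_hoisted
end module hoist_sketch

program hoist_demo
   use hoist_sketch
   implicit none
   type(elt_t) :: elts(2)
   real :: array(8)
   integer :: i, j
   array = 1.
   do i = 1, 2
      allocate(elts(i)%p)
      allocate(elts(i)%p%val(3))
      ! Element 1 indexes entries 1..3, element 2 indexes entries 5..7
      elts(i)%p%val = [(4*(i-1) + j, j = 1, 3)]
   end do
   call zero_hoisted(elts, array, 3)
   ! Entries 4 and 8 are never indexed, so they must keep the value 1
   if (any(array /= [0., 0., 0., 1., 0., 0., 0., 1.])) stop 1
   print *, array
end program hoist_demo
```

The inner loop then only dereferences the local pointer val, so the compiler no longer has to prove that elt%p is unchanged across iterations.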

  • ddo.f90
  • ddo_intel.o: from ifort -fpp -g -O2 -c ddo.f90 -o ddo_intel.o
  • ddo_gnu.o: from gfortran -cpp -g -O2 -c ddo.f90 -o ddo_gnu.o
  • ddo_single_intel.o: from ifort -fpp -D SINGLE_INDIR -g -O2 -c ddo.f90 -o ddo_single.o
  • ddo_single_gnu.o: from gfortran -cpp -D SINGLE_INDIR -g -O2 -c ddo.f90 -o ddo_single_gcc.o
  • ddo_*.asm: from objdump -d ddo_*.o > ddo_*.asm (plus extracting the innermost loop)
Download ddo.f90 (736 bytes)
Download asm.tgz (5.19 KB)

Is my topic published (test) ???

I don't have time right now to investigate in detail, but what ifort should do with this is substitute a memset call to zero out the array section, while gfortran should generate in-line code (vectorized if you requested it). If you want in-line code with ifort, apply a directive such as !$omp simd; ifort should then apply whatever level of -opt-streaming-stores you have in effect (the default is auto, where it looks at the surrounding code and at any hints about loop count to decide whether to use non-temporal stores).

If your array sections are too small, the memset substitution incurs excessive overhead for the function call and for checking size and alignment. If they are large enough to benefit from non-temporal stores, memset won't switch to them until the sections are extremely large, as it has no way of checking nearby code to see whether you want the data to remain in cache. In many situations the memset is useful, e.g. for reducing generated code size, so the compiler tends to use memset as much as possible unless you guide it in another direction.

In any case, you should at a minimum examine the ifort opt-report to see whether the compiler is doing what you want and whether your directives are having the desired effect.

In the opt-report, with -DSINGLE_INDIR, "seems inefficient..." is given as the reason for not optimizing, apparently on the assumption that the loop count is about 3, which might be a bug associated with the MODULE function. Without -D, it claims to see possible aliasing, which also looks like a bug. The !$omp simd directive overrides both concerns (it is activated in 14.x by either -openmp or -openmp-simd).
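As a rough illustration of where the directive would go, here is a sketch of the double-indirection loop with !$omp simd on the inner loop. The type and routine names are hypothetical stand-ins for whatever ddo.f90 actually declares. Without an OpenMP option on the command line the directive is just a comment, so the code still compiles and runs unchanged.

```fortran
module simd_sketch
   implicit none
   ! Hypothetical types standing in for the declarations in ddo.f90
   type :: inner_t
      integer, pointer :: val(:) => null()
   end type inner_t
   type :: elt_t
      type(inner_t), pointer :: p => null()
   end type elt_t
contains
   subroutine zero_simd(elts, array, size_inner)
      type(elt_t), target, intent(inout) :: elts(:)
      real, intent(inout) :: array(:)
      integer, intent(in) :: size_inner
      type(elt_t), pointer :: elt
      integer :: i, j
      do i = 1, size(elts)
         elt => elts(i)
         ! Asks the compiler to vectorize the loop that follows
         !$omp simd
         do j = 1, size_inner
            array(elt%p%val(j)) = 0.
         end do
      end do
   end subroutine zero_simd
end module simd_sketch

program simd_demo
   use simd_sketch
   implicit none
   type(elt_t) :: elts(2)
   real :: array(8)
   integer :: i, j
   array = 1.
   do i = 1, 2
      allocate(elts(i)%p)
      allocate(elts(i)%p%val(3))
      elts(i)%p%val = [(4*(i-1) + j, j = 1, 3)]
   end do
   call zero_simd(elts, array, 3)
   ! Entries 4 and 8 are never indexed, so they must keep the value 1
   if (any(array /= [0., 0., 0., 1., 0., 0., 0., 1.])) stop 1
   print *, array
end program simd_demo
```

Note that the directive is a promise from the programmer that the iterations can run in any order. Here every store writes the same value, so repeated indices in val would be harmless, but for other loop bodies that promise has to be checked by hand.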

The compiler may require actual USE of a module to engage in normal default optimizations. Your indication that you tested performance implies that you did that, but your full test case would be needed to confirm it.

The gnu asm you attached doesn't show any evidence of optimization, although it's more concise than your Intel asm.



2.75x speedup by using single instead of double indirection (driver is enclosed):

$ ifort -fpp -o driver driver.f90 ddo.f90
$ time ./driver
user    0m1.100s

$ ifort -fpp -D SINGLE_INDIR -o driver_sgl driver.f90 ddo.f90
$ time ./driver_sgl
user    0m0.394s
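For anyone who wants to reproduce this kind of comparison without the attachments, a minimal timing sketch might look like the following. It is not the attached driver.f90: the types, routine names, and problem sizes (n_outer, n_inner, n_rep) are all assumptions, and the timings it prints will depend on compiler and machine.

```fortran
module indir_timing
   implicit none
   ! Hypothetical types standing in for the declarations in ddo.f90
   type :: inner_t
      integer, pointer :: val(:) => null()
   end type inner_t
   type :: elt_t
      type(inner_t), pointer :: p => null()
   end type elt_t
contains
   subroutine zero_double(elts, array, n)
      type(elt_t), target, intent(inout) :: elts(:)
      real, intent(inout) :: array(:)
      integer, intent(in) :: n
      type(elt_t), pointer :: elt
      integer :: i, j
      do i = 1, size(elts)
         elt => elts(i)
         do j = 1, n
            array(elt%p%val(j)) = 0.   ! double indirection in the inner loop
         end do
      end do
   end subroutine zero_double

   subroutine zero_single(elts, array, n)
      type(elt_t), intent(in) :: elts(:)
      real, intent(inout) :: array(:)
      integer, intent(in) :: n
      integer, pointer :: val(:)
      integer :: i, j
      do i = 1, size(elts)
         val => elts(i)%p%val          ! indirections hoisted out of the inner loop
         do j = 1, n
            array(val(j)) = 0.
         end do
      end do
   end subroutine zero_single
end module indir_timing

program time_indir
   use indir_timing
   implicit none
   integer, parameter :: i8 = selected_int_kind(18)
   integer, parameter :: n_outer = 1000, n_inner = 1000, n_rep = 100
   type(elt_t), allocatable :: elts(:)
   real, allocatable :: array(:)
   integer :: i, j, rep
   integer(i8) :: t0, t1, t2, rate

   allocate(elts(n_outer), array(n_outer*n_inner))
   do i = 1, n_outer
      allocate(elts(i)%p)
      allocate(elts(i)%p%val(n_inner))
      ! Element i indexes its own contiguous block of array
      elts(i)%p%val = [((i-1)*n_inner + j, j = 1, n_inner)]
   end do
   array = 1.

   call system_clock(t0, rate)
   do rep = 1, n_rep
      call zero_double(elts, array, n_inner)
   end do
   call system_clock(t1)
   do rep = 1, n_rep
      call zero_single(elts, array, n_inner)
   end do
   call system_clock(t2)

   ! Every entry is indexed exactly once, so the whole array must be zero
   if (any(array /= 0.)) stop 1
   print '(a,f8.3,a)', 'double indirection: ', real(t1 - t0)/real(rate), ' s'
   print '(a,f8.3,a)', 'single indirection: ', real(t2 - t1)/real(rate), ' s'
end program time_indir
```

Both variants live in one executable here for convenience, whereas the measurement above builds two separate binaries with and without -D SINGLE_INDIR.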

Missing enclosed file...


Download driver.f90 (496 bytes)

I submitted the performance issue to compiler engineering: case is DPD200252390

Thank you
