CORE_AVX2 vs. Fortran array assignments

I've noticed that the CORE-AVX2 or xHost option sometimes produces AVX-128 code (about equivalent to SSE) where the AVX option produces AVX-256. I've submitted a Premier Support report in case this may be accepted as a bug.

It seems I was over-confident in assuming that AVX2 should perform at least as well as AVX. That expectation holds more often with F77 source code, combined with testing the various directives (which means much ifdefing of directives by architecture). !dir$ simd or !dir$ vector aligned may work with an array assignment, but of course !$omp simd does not, since it has to be followed by a DO loop.
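
To make the two source forms concrete, here is a minimal sketch. It is a hypothetical illustration, not code from the lcd benchmark, and the module and routine names are invented for the example. The first routine is the F77-style loop, where !$omp simd is legal. The second is the F90 array assignment, where only an Intel-specific directive such as !dir$ simd can be tried. Whether the compiler honors that directive on an array assignment is exactly the "may work" case above. Compilers that do not recognize the directives treat them as comments.

```fortran
! Hypothetical illustration only; names are not from the lcd sources.
module axpy_forms
  implicit none
contains

  ! F77-style explicit loop. !$omp simd is allowed here because it
  ! is immediately followed by a DO loop.
  subroutine axpy_loop(n, a, x, y)
    integer, intent(in)    :: n
    real,    intent(in)    :: a, x(n)
    real,    intent(inout) :: y(n)
    integer :: i
    !$omp simd
    do i = 1, n
      y(i) = y(i) + a*x(i)
    end do
  end subroutine axpy_loop

  ! F90 array assignment of the same operation. !$omp simd cannot be
  ! placed here; an Intel-specific directive is the only candidate.
  subroutine axpy_array(n, a, x, y)
    integer, intent(in)    :: n
    real,    intent(in)    :: a, x(n)
    real,    intent(inout) :: y(n)
    !dir$ simd
    y = y + a*x
  end subroutine axpy_array

end module axpy_forms
```

Both routines compute the same result, so compiling the module with the AVX and CORE-AVX2 options and comparing the opt-report output for the two routines would show whether the array assignment form gets narrower vector code than the loop form.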

In some of these cases, /QaxAVX2 removes vectorization entirely even though /QaxAVX produces both AVX and SSE2 vector code.  The vector speedup estimate shows that SSE vectorization would kill performance (on some long-disappeared CPU?) and the vec-report advises use of directives.  Unfortunately, directives ruin performance sometimes when there is good vectorization without them.  Where I have to tinker with directives, the estimated vector speedup may be OK for one of the alternatives but wrong for others.

The vector speedup estimate comes from the vec-report7 option plus a Python script with compilers 13.1 and 14.0. It moves to opt-report4 with 15.0. I'm guessing the numbers quoted there may relate to the "seems inefficient" diagnostic issued when the compiler decides not to vectorize.

Hi, Tim. Is there something you'd like us to do here, or is this just informational? Since you've already entered a Premier Support case, I think that should be sufficient.

Steve - Intel Developer Support

Premier Support issue 6000058922 is stalled waiting for my response, but for the last 3 days the site has stopped responding when I get to the point of clicking the SAVE button. I was hoping that uploading the build steps produced by make -n might help clarify things. If the site won't let me upload that, it doesn't seem productive to create a Visual Studio project and wait for the site to accept it.

The point of this is to show the full test framework and to demonstrate that the one case of AVX2 code produced from F77 source is nearly twice as fast, at longer loop counts, as the SSE code produced from F90 array assignments. The code produced by /arch:AVX (where array assignments do result in AVX code) isn't better than SSE.

One more comment: make actually works better on Windows than on Linux in this case, but I know I won't get anywhere requesting that make be accepted in Windows reproducers.

The one improvement I'd like is a way of capturing the clock speed in the Makefile. A roundabout way is to install cygwin procps, run 'cat /proc/cpuinfo > /cygdrive/c/proc/cpuinfo' and grep the clock speed there, so as to be able to use the __rdtsc() wrapper supported by MSVC and ICL and scale it to real time. On Linux, system_clock does the job (with 64-bit integer arguments). QueryPerformance works, at least in gfortran, but it's not even portable among compilers on the same OS.
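
As a sketch of the system_clock route mentioned above: the module below is a hypothetical helper, not the timer used in the lcd sources, and its names are invented for illustration. It reads the clock with 64-bit integer arguments and converts a tick difference to seconds by dividing by the tick rate. It assumes the processor provides a clock (a nonzero count_rate) and ignores counter wraparound.

```fortran
! Hypothetical helper; not the timer from the lcd sources.
module wall_timer
  use, intrinsic :: iso_fortran_env, only: int64, real64
  implicit none
contains

  ! Current tick count. The 64-bit argument typically gives a finer
  ! tick than the default integer kind.
  function ticks_now() result(t)
    integer(int64) :: t
    call system_clock(count=t)
  end function ticks_now

  ! Ticks per second for the 64-bit clock (0 if there is no clock).
  function ticks_per_second() result(r)
    integer(int64) :: r
    call system_clock(count_rate=r)
  end function ticks_per_second

  ! Convert a tick interval to seconds. The same scaling applies to
  ! raw time-stamp-counter ticks if rate is the CPU clock in Hz.
  pure function ticks_to_seconds(t0, t1, rate) result(s)
    integer(int64), intent(in) :: t0, t1, rate
    real(real64) :: s
    s = real(t1 - t0, real64) / real(rate, real64)
  end function ticks_to_seconds

end module wall_timer
```

Typical use is to call ticks_now() before and after the timed loop and pass both values with ticks_per_second() to ticks_to_seconds. The same function covers the __rdtsc() case: with the -DCLOCK_RATE=2295000000 value from the build steps as the rate, 2295000000 ticks scale to 1 second.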

For those who want it, the source code is also at https://github.com/tprince/lcd.

ifort -O3 -Qopenmp -assume:protect_parens,underscore -QxHost -Qunroll4 -fpp -names:lowercase -Zi -align:array32byte -Qopt-report-file=lcdmod_opt.txt -Qopt-report4 -c lcdmod.f90
ifort -c -O3 -Qopenmp -assume:protect_parens,underscore -QxHost -Qunroll4 -fpp -names:lowercase -Zi -align:array32byte -Qopt-report-file=mains_opt.txt -Qopt-report4 -Qip- mains.F
ifort -O3 -Qopenmp -assume:protect_parens,underscore -QxHost -Qunroll4 -fpp -names:lowercase -Zi -align:array32byte -Qopt-report-file=loopsfv_opt.txt -Qopt-report4 -c loopsfv.F
cl  -c -DCLOCK_RATE=2295000000 f90_msrdtsc.c
ifort -O3 -Qopenmp -assume:protect_parens,underscore -QxHost -Qunroll4 -fpp -names:lowercase -Zi -align:array32byte -Qopt-report-file=_opt.txt -Qopt-report4 mains.obj loopsfv.obj f90_msrdtsc.obj /link /stack:80000000
mv mains.exe lcd_ffast.exe
mv mains.pdb lcd_ffast.pdb
ifort -O3 -Qopenmp -assume:protect_parens,underscore -QxHost -Qunroll4 -fpp -names:lowercase -Zi -align:array32byte -Qopt-report-file=loops90_opt.txt -Qopt-report4 -c loops90.F
ifort -O3 -Qopenmp -assume:protect_parens,underscore -QxHost -Qunroll4 -fpp -names:lowercase -Zi -align:array32byte -Qopt-report-file=_opt.txt -Qopt-report4 mains.obj loops90.obj f90_msrdtsc.obj /link /stack:80000000
mv mains.exe lcd_f90.exe
mv mains.pdb lcd_f90.pdb