-axAVX and vectorisation handicap

Hi,

I am trying to vectorise the attached code. Why is the compiler not able to vectorise it if I add AVX to the -ax option?

-vec-report=1 tells me that vectorization of the loop was successful if:
ifort -xAVX ...
ifort -xSSE4.2...
ifort -axSSE2,SSE4.2 ...

It does not report a successful vectorization when trying to:
ifort -axAVX ...
ifort -axAVX,SSE4.2

Can anyone shed some light on this?

The compiler version I am currently using is: ifort (IFORT) 12.1.0 20110811

Best regards
Andreas

Attachment: zero.f90 (459 bytes)

By the way, I also tried ifort (IFORT) 13.0.1 20121010. Same behaviour.

-ax...

          generate code specialized for processors specified by <codes>
          while also generating generic IA-32 instructions.

The particular code you provided, "A = cmplx(0.0, 0.0)", is likely being replaced with a call to _intel_fast_memset (or something to that effect) rather than being vectorised as an implied DO loop.
That subroutine will wipe the array at memory bandwidth (using vectorization inside the subroutine).
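A minimal sketch of the pattern under discussion (not the attached zero.f90 itself; the array size here is arbitrary): a whole-array assignment that the compiler may lower to a library call instead of an in-line vector loop.

```fortran
program zero_sketch
  implicit none
  complex, dimension(1024) :: A

  ! Whole-array assignment: a candidate for the memset substitution,
  ! rather than an in-line vectorised implied DO loop.
  A = cmplx(0.0, 0.0)

  print *, A(1)
end program zero_sketch
```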

Don't worry about the lack of vectorization in the report (unless you get an error message too).

Jim Dempsey

www.quickthreadprogramming.com

If you compile with -opt-report you should get some additional information.  I am seeing the comment "vectorization possible but seems inefficient" together with "memset generated", which (I suppose) means that the compiler recognizes that the array can be zeroed using the library memset function, and prefers that method to generating in-line code for several architectures.

You may see a difference if you compile strictly for AVX (-xAVX), since you then avoid spending time choosing among code versions when the AVX one may have no advantage; there is probably no advantage in an AVX version of code which does nothing but zero out an array, as true 256-bit stores don't come until Haswell.

I've been informed that !dir$ simd is supposed to suppress memset, but you can expect it to require an F77-style DO loop.  Further, I've seen the promise of no memset honored for C but not for Fortran (and I've raised this question in a premier issue).
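A sketch of that directive usage, assuming the explicit DO-loop form the directive requires (an array assignment immediately after the directive may instead draw warning #7866, as seen later in this thread):

```fortran
subroutine zero_simd(a, n)
  implicit none
  integer, intent(in)  :: n
  complex, intent(out) :: a(n)
  integer :: i

  ! The directive applies to the DO loop that immediately follows it,
  ! asking the compiler to vectorise it rather than substitute memset.
!DIR$ SIMD
  do i = 1, n
     a(i) = cmplx(0.0, 0.0)
  end do
end subroutine zero_simd
```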

memset should perform well if the array is big enough, particularly if it is so big that memset switches to nontemporal stores.  The automatic switch to nontemporal will not occur if there is a reasonable chance that you will want to keep cache warm for subsequent use of this array.

memset also can't discriminate according to array size and number of threads used.  Nontemporal can be an advantage for moderate array sizes when not using threads, in the case where you don't use the array again soon enough to want it to remain in cache.

I see that Jim was answering while I was finishing my reply.  I would agree with his "don't worry" unless you get to the point of profiling and find that too much time is spent in memset.  In my experience, the compiler may be too quick to make the memset substitution, but it's an obvious way to control code explosion when you ask for code targeting several architectures.

Jim,

I would be fine with that, but I see a performance drop on an AVX-capable machine if I do not get the vectorization message.
But you are right, the assembler shows the call to memset:
--
..B2.9:                         # Preds ..B2.8
        movq      %r14, %rdi                                    #20.4
        xorl      %esi, %esi                                    #20.4
        movq      152(%rsp), %rdx                               #20.4
        call      _intel_fast_memset                            #20.4
--

I am therefore even more surprised that I get the performance drop in the program where this code was taken from.

Is there usually a big penalty in calling functions versus having the vectorized statements directly in the code?

Andreas

Tim,

Thank you very much for the explanation.

It seems that the code I am benchmarking at the moment is struggling with exactly the issue you have mentioned.
The memset seems to be slower than the vectorized code.

I have already tried the SIMD directive, but it did not vectorise, even with explicit DO loops. One strange thing: I converted old fixed-format Fortran 77 code to Fortran 90 (2003). The old Fortran 77 code vectorises nicely with all the combinations of options, but since the Fortran 90 conversion it no longer plays nice.

Is there any way I can avoid going back to Fortran 77 and still tell the compiler to vectorise rather than use memset?

Andreas

I've found myself that the memset can produce an annoying drop in performance, particularly in the cases where nontemporal can pay off but memset doesn't see a large enough array to switch over.  I didn't make a detailed study, but I would prefer to avoid memset for arrays of less than 16KB; thus my interest in ways to avoid it without setting the global compile options.

Regarding the !DEC$ SIMD:
I tried it again with the source code I attached and the result is:
--
zero.F90(20): warning #7866: The statement following this DEC loop optimization directive must be an iterative do-stmt, a vector assignment, an OMP pdo-directive, or an OMP parallel-do-directive.
!DEC$ SIMD
------^
zero.F90(21): (col. 4) remark: SIMD LOOP WAS VECTORIZED.
zero.F90(9): (col. 12) remark: F has been targeted for automatic cpu dispatch.
--

I will run some benchmarks again with this. Not sure why this did not work before in the bigger program.

Andreas

"I would be fine with that, but I get a performance drop on an AVX capable machine if I do not get the vectorization message."

What portion of your program is spent initializing arrays to 0.0?
I would imagine it is less than 1%. Your optimization efforts will be better spent looking elsewhere.

Also, might I suggest making two versions of your executable: one for processors that predate the P4, and another for the P4 and later (or three versions: pre-P4, pre-AVX, and AVX and later).

If you are really nit-picking over _intel_fast_memset, then taking out additional code paths and tests should recoup similar performance differences elsewhere.

Jim Dempsey

www.quickthreadprogramming.com

Thank you very much everybody for the suggestions.

The directive "!DEC$ SIMD" seems to enforce auto-vectorization in most places.

The assignment to zero was only one of the loops/array expressions in the Fortran code that did not get auto-vectorised when using -axAVX. There are lots of assignments of entire arrays or array sections, such as A = B, A(b:e) = B, or A = B(b:e) (multi-dimensional arrays too). The compiler seems to replace those with memcpy in most cases if -axAVX is used; strangely enough, not if I specify -axSSE4.2.
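For illustration, the copy patterns in question look like the following (the names A, B and the bounds b, e are placeholders from the post, not taken from the actual program):

```fortran
program copy_sketch
  implicit none
  integer, parameter :: n = 64, b = 8, e = 23
  real :: A(n), B(n)

  B = 1.0
  A = B                ! whole-array copy: candidate for the memcpy substitution
  A(b:e) = B(b:e)      ! array-section copy
  A(1:e-b+1) = B(b:e)  ! section read into the leading part of A

  print *, A(1), A(n)
end program copy_sketch
```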

Andreas

Quote:

Andreas Klaedtke wrote:

Thank you very much everybody for the suggestions.

The directive "!DEC$ SIMD" seems to enforce auto-vectorization in most places.

The assignment to zero was only one of the loops/array expressions in the Fortran code that did not get auto-vectorised when using -axAVX. There are lots of assignments of entire arrays or array sections, such as A = B, A(b:e) = B, or A = B(b:e) (multi-dimensional arrays too). The compiler seems to replace those with memcpy in most cases if -axAVX is used; strangely enough, not if I specify -axSSE4.2.

Andreas

-axSSE4.2 doesn't request multiple code versions for copying data, as -axAVX appears to do. 

I find the cases more worthy of the "strange" epithet where !dir$ simd works as expected for -xAVX and apparently is ignored for -xSSE4.1.  I have seen no use for -xSSE4.2 in the past, as I don't know any cases where it would be superior to SSE4.1, unless on a CPU which also supports AVX.
