Stencil loop is not AVX vectorized

The attached example is vectorized in the SSE2 sense but not in the AVX sense: there are no ymm register references in the assembly output. When I replace the X-1 index with X in line 13, I get true AVX code. Is there something I overlooked? Thanks for any hints.


The command line is

/opt/intel/composer_xe_2013.2.146/bin/intel64/ifort -r8 -O3 -xAVX -vec-report6 -c -S test.f90

Attachment: test.f90 (807 bytes)

The current compiler reports 2 versions of the vectorized loop.

The compiler appears to have concluded that AVX-128 will be better optimized for Sandy Bridge than AVX-256, on account of the number of misalignments and variable strides.  CORE-AVX2 will generate AVX-256, with or without -no-fma.

Are you using assumed shape intentionally?

Your ix-1 and iy-1 references appear to be out of bounds; the -c and -S options conflict.


Thanks, you are right about the bounds violation; the loops should start at 2 instead of 1.
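As a hedged sketch (the attached test.f90 is not reproduced in the thread, so the array names, sizes, and stencil coefficients below are invented for illustration), the corrected loop bounds would look like:

```fortran
! Hypothetical 2-D stencil with the corrected lower bounds.
! Names and coefficients are assumptions, not the original test.f90.
subroutine stencil(u, v, nx, ny)
  integer, intent(in)  :: nx, ny
  real(8), intent(in)  :: u(nx, ny)
  real(8), intent(out) :: v(nx, ny)
  integer :: ix, iy
  do iy = 2, ny          ! start at 2 so iy-1 stays in bounds
    do ix = 2, nx        ! start at 2 so ix-1 stays in bounds
      v(ix, iy) = u(ix, iy) - 0.25d0 * (u(ix-1, iy) + u(ix, iy-1))
    end do
  end do
end subroutine stencil
```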

Regarding shape: in the real program the arrays are module-global ones. Does that make a difference?

I would hope that the compiler has better information available in a module procedure than with assumed shape.  Maybe it wouldn't require 2 versions of generated code.

I still would not be entirely surprised to see a choice of AVX-128.  I don't know whether Ivy Bridge CPUs, which have better support for 256-bit unaligned loads, would make those worth considering, but I don't think the compilers make such a distinction.

It is interesting how argument passing affects optimizations.  Our compiler developers explain exactly how passing arrays affects optimization in this document:

Putting data in modules and accessing it via USE provides the compiler with the most information about the data and allows the most opportunities for optimization.  It's quite easy to align your data with option -align array32byte (newer compilers) AND, in the module, use !DIR$ ATTRIBUTES ALIGN to tell all USE contexts that the arrays are aligned.  See this article:
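As a minimal sketch of that combination (module, array, and routine names are made up for illustration), a module array tagged with the alignment directive might look like:

```fortran
! Sketch: a module array aligned to 32 bytes so that every USE
! context sees the alignment.  Names here are illustrative only.
module fields
  implicit none
  real(8) :: a(1024)
  !DIR$ ATTRIBUTES ALIGN : 32 :: a
end module fields

subroutine scale(s)
  use fields
  implicit none
  real(8), intent(in) :: s
  integer :: i
  ! Because the directive is visible through USE, the compiler can
  ! emit aligned AVX loads/stores here without a runtime peel loop.
  do i = 1, 1024
    a(i) = s * a(i)
  end do
end subroutine scale
```

Compiling with -align array32byte in addition covers allocatable arrays and array temporaries created by the compiler.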

These 2 techniques will assist the compiler in optimization.  That said, if you're doing things like small trip counts, indirect memory references, non-predictable strides or have non-countable loops or early exit within a loop there's not much the compiler can do.


Thanks a lot, Ron, that's excellent material; I'll start working my way through it. I assume the compiler can hardly see at compile time whether or not the loop bounds match the array bounds, so it shouldn't matter for optimization if only a subset of an aligned array is treated. Is that correct?

How about multidimensional arrays, then? Theoretically the -align options should cause the first array dimension to be rounded up during ALLOCATE. I believe this would be necessary to maintain any alignment assumptions in nested loops. However, that creates unused holes in the array...

The compiler would assume loop lengths matching the declared bounds if those are available at compile time and the actual bounds aren't available.  In the assumed shape case posted here, it would make default assumptions such as length 100, besides not having needed information about strides.

Ron, you mentioned that some information about array arguments to a subprogram is "lost", while module arrays accessed in a subprogram provide a lot of information that the compiler can use for optimization.  Is it any better to pass arguments through derived types?  For example, rather than passing "bare" arrays like this:

subroutine add(a, b, c)
  real, dimension(100) :: a, b, c
  integer :: i
  do i = 1, 100
    c(i) = a(i) + b(i)
  end do
end subroutine

would it be better to do this?

type array
  real :: vals(100)
end type array

subroutine add(a, b, c)
  type(array) :: a, b, c
  integer :: i
  do i = 1, 100
    c%vals(i) = a%vals(i) + b%vals(i)
  end do
end subroutine

I'm assuming that part of the problem with passing arrays is that, even if the programmer makes sure all arrays are aligned, one could always pass a subroutine part of an array starting at an unaligned offset. The compiler thus has to be conservative and, at best, create multiple versions of the code that are selected by a run-time alignment test. When passing a derived type, the compiler is able to assume that the derived-type components are aligned to whatever address boundary is desired (via directives or compiler flags).

Is my description correct?  Does ifort presently optimize derived type arguments in this way?

In the case where you declare the bounds of the array as constants, and the loop bounds are constants, that gives the compiler the most information. I would not make it more complicated by putting the array in a derived type - that doesn't help. In cases where the bounds are not constants, there are various directives you can use to give the compiler hints as to loop trip counts, etc. Building with -ipo and/or profile-guided optimization can also provide more information.
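For instance (a hedged sketch; the routine name and typical trip count are invented), the LOOP COUNT directive can pass a typical trip count to the compiler when the bounds are not compile-time constants:

```fortran
! Sketch: hinting a typical trip count for a loop whose bound n is
! unknown at compile time.  The value 1000 is an assumption.
subroutine axpy(n, alpha, x, y)
  integer, intent(in)    :: n
  real(8), intent(in)    :: alpha, x(n)
  real(8), intent(inout) :: y(n)
  integer :: i
  !DIR$ LOOP COUNT (1000)
  do i = 1, n
    y(i) = y(i) + alpha * x(i)
  end do
end subroutine axpy
```

With the hint, the compiler can weigh vectorization overheads (peel and remainder loops, multiversioning) against the expected iteration count instead of guessing.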


Coming back to the question why AVX-256 loads are not generated I found the following:

  • For MODULE global arrays it is sufficient to compile with -align array32byte if the array has only 1 dimension. There is no need for using directives then.
  • For N-dimensional ones that does not help. Even if the first dimension is padded such that alignment is guaranteed for any slice and layer, there is no way to tell the compiler about it. For MIC there is such a compile switch, but not for Sandy Bridge.
  • The only way around is to pass 1D slices with assumed shape to subroutines. Inside the subroutine the assume_aligned directive can then be used, which requires appropriate padding in the allocation. 
  • Alternatively one can go back to the stone age and put everything in 1D arrays.
  • So the bottom line is that N-dimensional arrays currently lack full vectorization support.
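As a hedged sketch of the third point above (the routine name and the padding scheme are assumptions), passing a 1-D slice and asserting its alignment inside the callee might look like:

```fortran
! Sketch: asserting 32-byte alignment of an assumed-shape 1-D slice.
! This is only safe if the caller padded the leading dimension of
! the parent array so that every column starts on a 32-byte boundary
! (e.g. a leading dimension that is a multiple of 4 for real(8)).
subroutine smooth(row)
  real(8), intent(inout) :: row(:)
  integer :: i
  !DIR$ ASSUME_ALIGNED row: 32
  do i = 2, size(row)
    row(i) = 0.5d0 * (row(i) + row(i-1))
  end do
end subroutine smooth
```

The directive is a promise, not a check: if the slice is actually misaligned, the aligned AVX loads the compiler emits may fault or silently produce wrong results.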

Please correct me if I'm wrong. Besides that, I can strongly recommend the new Intel book on "Xeon Phi High-Performance Programming". Even if one only optimizes for Sandy Bridge, it helps a lot to understand how the compiler does things.

