Stencil loop is not AVX vectorized

The attached example is vectorized in the SSE2 sense (128-bit) but not in the AVX sense (256-bit): there are no ymm register references in the assembly source. When I replace the X-1 index with X in line 13, I get true AVX code. Is there something I overlooked? Thanks for any hints.

michael

The command line is

/opt/intel/composer_xe_2013.2.146/bin/intel64/ifort -r8 -O3 -xAVX -vec-report6 -c -S test.f90

Attachment: test.f90 (807 bytes)

The current compiler (13.1.2.183) reports 2 versions of vectorized loop.

The compiler appears to have concluded that AVX-128 will be better optimized for Sandy Bridge than AVX-256, on account of the number of misalignments and variable strides.  CORE-AVX2 will generate AVX-256, with or without -no-fma.

Are you using assumed shape intentionally?

Your ix-1 and iy-1 references appear to be out of bounds; the -c and -S options conflict.

Tim,

Thanks, you are right about the bounds violation; the loops should start at 2 instead of 1.

Regarding shape: in the real program the arrays are module-global. Does that make a difference?

I would hope that the compiler has better information available in a module procedure than with assumed shape.  Maybe it wouldn't require 2 versions of generated code.

I still would not be entirely surprised to see a choice of AVX-128.  I don't know whether Ivy Bridge CPUs, which have better support for 256-bit unaligned loads, would make those worth considering, but I don't think the compilers make such a distinction.

It is interesting how argument passing affects optimizations. Our compiler developers explain exactly how passing arrays affects optimization in this document: http://software.intel.com/en-us/articles/fortran-array-data-and-arguments-and-vectorization

Putting data in modules and USE provides the compiler with the most information about the data and allows the most opportunities for optimization.  It's quite easy to align your data with option -align array32byte (newer compilers) AND in the module use the DIR$ ATTRIBUTES ALIGN to tell all USE contexts that the arrays are aligned.  See this article:  http://software.intel.com/en-us/articles/data-alignment-to-assist-vectorization

These 2 techniques will assist the compiler in optimization.  That said, if you're doing things like small trip counts, indirect memory references, non-predictable strides or have non-countable loops or early exit within a loop there's not much the compiler can do.

ron
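As an illustration of the two techniques Ron describes, here is a minimal sketch, not taken from the original test.f90. The module name grid_data, the array names u and v, the length n, and the routine smooth are all hypothetical. The ALIGN directive is Intel-specific; other compilers treat it as a comment. Whether the compiler then picks 256-bit instructions still depends on its own cost model.

```fortran
module grid_data
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 1024            ! hypothetical array length
  real(dp) :: u(n), v(n)
  ! Intel-specific: request 32-byte alignment, visible in every USE context.
  !DIR$ ATTRIBUTES ALIGN : 32 :: u, v
contains
  subroutine smooth()
    integer :: i
    ! Bounds are compile-time constants; the loop starts at 2 so that
    ! u(i-1) stays inside the array.
    do i = 2, n
      v(i) = 0.5_dp * (u(i) + u(i-1))
    end do
  end subroutine smooth
end module grid_data

program demo
  use grid_data
  implicit none
  u = 1.0_dp
  v = 0.0_dp
  call smooth()
  print *, v(2), v(n)
end program demo
```

Note that u(i-1) is offset by one element from u(i), so at most one of the two loads can be 32-byte aligned in any given iteration, which is consistent with the misalignment Tim mentions above.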

Thanks a lot, Ron, that's excellent material. I'll start working my way through it. I assume that the compiler can hardly see at compile time whether or not the loop bounds match the array bounds, so it shouldn't matter for optimization if only a subset of an aligned array is processed. Is that correct?

How about multidimensional arrays, then? Theoretically the -align options should cause a round-up of the first array dimension during ALLOCATE. I believe this would be necessary to maintain any alignment assumptions in nested loops. However, that creates unused holes in the array...

The compiler would assume loop lengths matching the declared bounds if those are available at compile time and the actual bounds aren't available.  In the assumed shape case posted here, it would make default assumptions such as length 100, besides not having needed information about strides.

Ron, you mentioned that some information about array arguments to a subprogram is "lost", while module arrays accessed in a subprogram provide a lot of information that the compiler can use for optimization.  Is it any better to pass arguments through derived types?  For example, rather than passing "bare" arrays like this:

subroutine add(a, b, c)
  real, dimension(100) :: a, b, c
  integer :: i
  do i = 1, 100
    c(i) = a(i) + b(i)
  end do
end subroutine add

would it be better to do this?

type array
  real :: vals(100)
end type array
...
subroutine add(a, b, c)
  type (array) :: a, b, c
  integer :: i
  do i = 1, 100
    c%vals(i) = a%vals(i) + b%vals(i)
  end do
end subroutine add

I'm assuming that part of the problem with passing arrays is that, even if the programmer makes sure that all arrays are aligned, one could always pass to a subroutine part of an array starting at an unaligned offset.  The compiler thus has to be conservative and, at best, create multiple versions of the code that will be selected from by a run-time alignment test.  In the situation where one is passing a derived type, the compiler is able to assume that the derived type components are aligned to whatever address boundary is desired (via directives or compiler flags).

Is my description correct?  Does ifort presently optimize derived type arguments in this way?

In the case where you declare the bounds of the array as constants, and the loop bounds are constants, that gives the compiler the most information. I would not make it more complicated by putting the array in a derived type; that doesn't help. In cases where the bounds are not constants, there are various directives you can use to give the compiler hints as to loop trip counts, etc. Building with -ipo and/or profile-guided optimization can also provide more information.

Steve
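For the non-constant case Steve mentions, a trip-count hint might look like the sketch below. It is only an illustration: the extra argument n and the value 100 are assumptions, and LOOP COUNT is an Intel-specific directive that other compilers ignore as a comment. The hint affects code generation only; the loop still runs n iterations.

```fortran
module add_mod
  implicit none
contains
  subroutine add(a, b, c, n)
    integer, intent(in) :: n
    real, intent(in)    :: a(n), b(n)
    real, intent(out)   :: c(n)
    integer :: i
    ! Intel-specific hint: 100 is a hypothetical typical trip count.
    !DIR$ LOOP COUNT (100)
    do i = 1, n
      c(i) = a(i) + b(i)
    end do
  end subroutine add
end module add_mod

program demo
  use add_mod
  implicit none
  real :: a(100), b(100), c(100)
  a = 1.0
  b = 2.0
  call add(a, b, c, 100)
  print *, c(1), c(100)
end program demo
```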

Coming back to the question why AVX-256 loads are not generated I found the following:

  • For MODULE global arrays it is sufficient to compile with -align array32byte if the array has only 1 dimension. There is no need for using directives then.
  • For N-dimensional ones that does not help. Even if the first dimension is padded such that alignment is guaranteed for any slice and layer, there is no way to tell the compiler about it. For MIC there is such a compile switch, but not for Sandy Bridge.
  • The only way around is to pass 1D slices with assumed shape to subroutines. Inside the subroutine the assume_aligned directive can then be used, which requires appropriate padding in the allocation. 
  • Alternatively one can go back to the stone age and put everything in 1D arrays.
  • So the bottom line is that N-dimensional arrays currently lack full vectorization support.

Please correct me if I'm wrong. Besides that I can strongly recommend the new Intel book on "Xeon Phi High-Performance Programming". Even if one just optimizes for Sandy Bridge it helps a lot to understand how the compiler does things.

Michael
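The padded-slice workaround from the list above could be sketched roughly as follows. This is an illustration under assumptions, not Michael's actual code: the names stencil_sketch, add_columns, nx, ny and nxp are hypothetical, and for simplicity the dummy arguments are explicit-shape rather than assumed-shape. Both directives are Intel-specific and are ignored as comments by other compilers. The sketch does not verify which instructions the compiler finally emits.

```fortran
module stencil_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains
  ! Operates on 1D column slices. The caller must guarantee that each
  ! slice starts on a 32-byte boundary, otherwise the promise below is false.
  subroutine add_columns(n, left, centre, res)
    integer, intent(in)   :: n
    real(dp), intent(in)  :: left(n), centre(n)
    real(dp), intent(out) :: res(n)
    integer :: i
    !DIR$ ASSUME_ALIGNED left:32, centre:32, res:32
    do i = 1, n
      res(i) = centre(i) + left(i)
    end do
  end subroutine add_columns
end module stencil_sketch

program demo
  use stencil_sketch
  implicit none
  integer, parameter :: nx = 10, ny = 6     ! hypothetical grid size
  integer, parameter :: vlen = 4            ! 32 bytes / 8-byte reals
  ! Round the first dimension up to a multiple of vlen so that every
  ! column starts on a 32-byte boundary (here nxp = 12).
  integer, parameter :: nxp = ((nx + vlen - 1) / vlen) * vlen
  real(dp), allocatable :: u(:,:), v(:,:)
  !DIR$ ATTRIBUTES ALIGN : 32 :: u, v
  integer :: j

  allocate(u(nxp, ny), v(nxp, ny))
  u = 1.0_dp
  v = 0.0_dp
  ! Start at column 2 so that column j-1 exists.
  do j = 2, ny
    call add_columns(nx, u(:, j-1), u(:, j), v(:, j))
  end do
  print *, v(1, 2), v(nx, ny)
end program demo
```

The cost is the one noted earlier in the thread: rows nx+1 to nxp of each column are unused padding.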
