-xHost leads to "loop was not vectorized: unsupported data type"

Hi,

I encountered a weird problem when I was trying to use SIMD directives and AVX instructions to vectorize my loops.
I wrote the following small program to illustrate this problem:

Program AVX_test

  REAL(kind = 8), Allocatable :: time_g(:), timeOld_g(:), dt_g(:)
  INTEGER :: nCells, i

  nCells = 1000000

  if (.not. allocated(time_g)) allocate(time_g(nCells))
  if (.not. allocated(timeOld_g)) allocate(timeOld_g(nCells))
  if (.not. allocated(dt_g)) allocate(dt_g(nCells))

  !DIR$ SIMD
  do i = 1, nCells
    time_g(i) = timeOld_g(i) + dt_g(i)
  end do

End Program AVX_test

Case 1: The vectorization report looks good if I use only the -O2 option:

login2$ ifort -O2 -vec-report6 avxtest.f90 -o avxtest
avxtest.f90(14): (col. 5) remark: vectorization support: streaming store was generated for avx_test.
avxtest.f90(14): (col. 5) remark: vectorization support: streaming store was generated for avx_test.
avxtest.f90(14): (col. 32) remark: SIMD LOOP WAS VECTORIZED.

I found an addpd instruction in the assembly file for line 14, so I think the loop has been vectorized with SSE2 vector instructions.

Case 2: If I also add -xHost to use AVX instructions, the vectorization report complains:

login2$ ifort -O2 -xHost -vec-report6 avxtest.f90 -o avxtest
avxtest.f90(14): (col. 5) remark: vectorization support: streaming store was generated for avx_test.
avxtest.f90(14): (col. 32) remark: SIMD LOOP WAS VECTORIZED.
avxtest.f90(14): (col. 32) remark: loop was not vectorized: unsupported data type.
avxtest.f90(14): (col. 32) warning #13379: loop was not vectorized with "simd"

This is confusing to me because the report says both yes (SIMD LOOP WAS VECTORIZED) and no (unsupported data type). I checked the
assembly file and found a vaddpd instruction, but I am not sure whether this loop was actually vectorized in the end. The message
"unsupported data type" seems odd to me. The ifort version is 13.1.0.

Another quick question: in the assembly files I also found an addsd instruction in Case 1 and a vaddsd instruction in Case 2 for
line 14. These should be scalar instructions, right? If the loop has been vectorized, why are there scalar instructions? Is it because
of the remainder after loop unrolling?

I would truly appreciate your help and reply.

Best regards,
    Wentao

Hi,

   Although you see just a single loop, the code generated by the Intel compiler with optimization is more complex.

SIMD load and store instructions are more efficient when the data are aligned in memory: to a 16 byte boundary for Intel SSE, to a 32 byte boundary for Intel AVX, or to a 64 byte boundary for Intel MIC architecture. The compiler does not know the alignment of the allocatable arrays in your program, but it can arrange at runtime for accesses to one variable to be always aligned.

For the SSE instruction set, if the first element is aligned to an 8 byte boundary but not a 16 byte one, the compiler executes the first loop iteration in scalar mode. This is sometimes called loop peeling, or the loop prolog. The next memory access will then be aligned to a 16 byte boundary; this and subsequent loop iterations are executed in vector mode using packed SIMD instructions, and constitute the loop kernel. If there is one unpaired iteration left over at the end of the loop, it too is executed in scalar mode. This is called the remainder or loop epilog.

The two movsd instructions that you see correspond to the prolog and epilog, as do the scalar addsd instructions you asked about. Whether they are actually executed depends on the actual data alignment. You will still see movupd for the memory access instructions in the loop kernel, and not movapd. What counts is whether the actual data are aligned, not the instruction type: movupd (or movapd) on aligned data is faster than movupd on unaligned data.
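To make the prolog / kernel / epilog split concrete, here is a hand-written sketch of the same loop divided into three pieces. It only illustrates the idea and is not the code the compiler actually emits. The names vlen and misalign are hypothetical, and the starting misalignment is simply assumed rather than measured.

```fortran
program peel_kernel_remainder
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 1000000     ! same trip count as nCells in the question
  integer, parameter :: vlen = 4        ! doubles per 32-byte AVX vector
  integer, parameter :: misalign = 1    ! ASSUMED: first element lies 1 double past a 32-byte boundary
  real(dp), allocatable :: a(:), b(:), c(:), ref(:)
  integer :: i, npeel, nkernel, kend

  allocate(a(n), b(n), c(n), ref(n))
  do i = 1, n
    b(i) = real(i, dp)
    c(i) = 0.5_dp
  end do
  ref = b + c                           ! reference result from the plain array expression

  ! Scalar iterations needed to reach the next vector boundary
  npeel   = min(modulo(vlen - misalign, vlen), n)
  ! Elements that fit into whole vectors after the peel
  nkernel = ((n - npeel) / vlen) * vlen
  kend    = npeel + nkernel

  ! Peel loop (prolog): scalar
  do i = 1, npeel
    a(i) = b(i) + c(i)
  end do

  ! Kernel: one whole vector of vlen elements per trip
  do i = npeel + 1, kend, vlen
    a(i:i+vlen-1) = b(i:i+vlen-1) + c(i:i+vlen-1)
  end do

  ! Remainder loop (epilog): scalar
  do i = kend + 1, n
    a(i) = b(i) + c(i)
  end do

  print '(a,i0,a,i0,a,i0)', 'peel=', npeel, ' kernel=', nkernel, ' remainder=', n - kend
  if (any(a /= ref)) error stop 'split loop does not match reference'
end program peel_kernel_remainder
```

With the assumed misalignment of one element this prints peel=3 kernel=999996 remainder=1, so only 4 of the 1000000 iterations would run in scalar mode.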

Because the vector width is greater (32 bytes) for Intel AVX instructions, the prolog and epilog can each contain up to 3 iterations of the original loop; these may be called the peel loop and the remainder loop. In the 13.1 compiler that you are using, with -vec-report6, you see the vectorization messages for the remainder loop in addition to the vectorization messages for the loop kernel. The "SIMD LOOP WAS VECTORIZED" message applies to the loop kernel; the "unsupported data type" and "loop was not vectorized with simd" messages apply to the remainder loop. Most of the time, whether or not the remainder loop gets vectorized is not very important, and the messages may be more confusing than helpful. In more recent compilers, messages relating to remainder loops are either suppressed, or prefixed by the words "REMAINDER LOOP".

Finally, if you align your data and tell the compiler, it will not need to generate peel or remainder loops; it can generate more efficient code assuming alignment for all data. The simplest way to align your data is to build with -align array32byte; alternatively, you could use the directive   !DIR$ ATTRIBUTES ALIGN : 32 :: time_g, timeOld_g, dt_g
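If you want to confirm at run time whether an array really starts on a 32 byte boundary, one option is to print its address modulo 32. The sketch below is only an illustration: the program name is made up, and the TARGET attribute is added solely so that c_loc can be applied to the array. Building it with and without -align array32byte should show the difference.

```fortran
program check_alignment
  use, intrinsic :: iso_c_binding, only: c_loc, c_intptr_t, c_double
  implicit none
  ! TARGET is needed only for c_loc; it is not part of the original test program
  real(c_double), allocatable, target :: time_g(:)
  integer(c_intptr_t) :: addr

  allocate(time_g(1000000))

  ! Reinterpret the C address of the first element as an integer
  addr = transfer(c_loc(time_g(1)), addr)

  ! 0 means the first element sits on a 32-byte boundary
  print '(a,i0)', 'address modulo 32 = ', modulo(addr, 32_c_intptr_t)
end program check_alignment
```

A result of 0 means the array is suitably aligned for AVX; any other value is the byte offset past the previous 32 byte boundary.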

To tell the compiler, the simplest way is to put   !DIR$ VECTOR ALIGNED    ahead of !DIR$ SIMD. This asserts that all array accesses inside the loop are to aligned data. Alternatively, although I don't think the Intel SIMD directive takes an alignment clause, its OpenMP 4.0 equivalent does. You should then no longer see movsd instructions. If you use a version 14 compiler with -vec-report6, you should see a message that says explicitly whether the compiler thinks memory accesses are aligned or may be unaligned, e.g.

> ifort -O2 -vec-report6 avxtest.f90 -c -S -xhost -align array32byte
avxtest.f90(14): (col. 5) remark: vectorization support: reference avx_test has aligned access
avxtest.f90(14): (col. 5) remark: vectorization support: reference avx_test has aligned access
avxtest.f90(14): (col. 5) remark: vectorization support: reference avx_test has aligned access
avxtest.f90(14): (col. 5) remark: vectorization support: streaming store was generated for avx_test
avxtest.f90(14): (col. 32) remark: SIMD LOOP WAS VECTORIZED

> grep addpd avxtest.s
        vaddpd    8(%rdx,%rsi,8), %ymm0, %ymm1                  #14.5
> grep addsd avxtest.s
>
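Putting the two steps together, the test program might look roughly like the sketch below. The directives are the ones mentioned above; the initial values and the final check are placeholders added here so that the program has a defined result, and they are not part of the original test. Compilers other than Intel Fortran will treat the !DIR$ lines as ordinary comments.

```fortran
program avx_test_aligned
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  real(dp), allocatable :: time_g(:), timeOld_g(:), dt_g(:)
  ! Ask for 32-byte alignment of the allocated data (Intel-specific directive)
  !DIR$ ATTRIBUTES ALIGN : 32 :: time_g, timeOld_g, dt_g
  integer :: nCells, i

  nCells = 1000000
  allocate(time_g(nCells), timeOld_g(nCells), dt_g(nCells))

  ! Placeholder values (assumption, not from the original post)
  timeOld_g = 1.0_dp
  dt_g      = 0.25_dp

  ! Assert that every array access in the loop is aligned, then request SIMD
  !DIR$ VECTOR ALIGNED
  !DIR$ SIMD
  do i = 1, nCells
    time_g(i) = timeOld_g(i) + dt_g(i)
  end do

  ! OpenMP 4.0 alternative to the two directives above (untested here):
  !   !$omp simd aligned(time_g, timeOld_g, dt_g : 32)

  if (any(time_g /= 1.25_dp)) error stop 'unexpected result'
  print *, 'time_g(1) =', time_g(1)
end program avx_test_aligned
```

Note that !DIR$ VECTOR ALIGNED is a promise, not a request: if the data are not actually aligned, the aligned instructions the compiler generates may fault at run time.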

 

Quote:

Martyn Corden (Intel) wrote:

Although you see just a single loop, the code generated by the Intel compiler with optimization is more complex. [...]

 

Hi Martyn,

Your explanations REALLY helped me a lot! Many thanks!

Best regards,
    Wentao
