Requirements for Vectorizable Loops


For the Intel® C/C++ and Fortran compilers for IA-32 or Intel 64, “vectorization” of a loop means unrolling the loop so that it can take advantage of packed SIMD instructions to perform the same operation on multiple data elements in a single instruction. For example, where a non-vectorized “DAXPY” loop
for (i=0; i<MAX; i++) z[i] = a*x[i] + y[i];
might use scalar SIMD instructions such as addsd and mulsd, a vectorized loop would use the packed versions, addpd and mulpd. (In the penultimate character, s stands for “scalar” and p stands for “packed”. In the final character, s stands for single precision and d stands for double precision.) In the most recent Intel compilers, vectorization is one of many optimizations that are enabled by default.
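
As a minimal sketch, the DAXPY loop above can be wrapped in a self-contained C function. The function name, the explicit length parameter n (in place of MAX) and the restrict qualifiers are assumptions added for illustration; they are not part of the original example.

```c
#include <stddef.h>

/* DAXPY-style kernel from the text: z[i] = a * x[i] + y[i].
 * The restrict qualifiers are an assumption added here: they promise the
 * compiler that z does not overlap x or y, which removes one common
 * obstacle to automatic vectorization. */
void daxpy(size_t n, double a,
           const double *restrict x,
           const double *restrict y,
           double *restrict z)
{
    for (size_t i = 0; i < n; i++) {
        z[i] = a * x[i] + y[i];
    }
}
```

Whether a given compiler actually emits packed instructions for this loop depends on the optimization level and target, so the vectorization report remains the way to confirm it.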

Vectorization can be thought of as executing more than one consecutive iteration of the original loop at the same time. For processors supporting Streaming SIMD Extensions, this is usually 2 or 4 iterations, but potentially could be more, especially for integer arithmetic or for future instruction sets. This leads to some restrictions on the types of loop that can be vectorized. Additional requirements for effective vectorization come from the properties of the SIMD instructions themselves.
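
The idea of executing consecutive iterations together can be illustrated in plain scalar C. The sketch below is only an emulation, written by hand for a width of 2 (two doubles per 128-bit SSE register); the function name is hypothetical, and the code a compiler generates will differ in detail, for example in how it handles alignment.

```c
#include <stddef.h>

/* Scalar emulation of a 2-wide vectorized DAXPY loop (hypothetical helper). */
void daxpy_by_pairs(size_t n, double a,
                    const double *x, const double *y, double *z)
{
    size_t i = 0;

    /* Main body: each trip handles iterations i and i+1 together, which is
     * what one packed multiply followed by one packed add would do for
     * two doubles. */
    for (; i + 1 < n; i += 2) {
        double p0 = a * x[i];
        double p1 = a * x[i + 1];
        z[i]     = p0 + y[i];
        z[i + 1] = p1 + y[i + 1];
    }

    /* Remainder: at most one leftover iteration when n is odd. */
    for (; i < n; i++) {
        z[i] = a * x[i] + y[i];
    }
}
```

The remainder loop is the reason the trip count must be known before the loop starts: the split into full-width trips and leftover iterations has to be decided up front.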

Requirements for loop vectorization:

• If a loop is part of a loop nest, it must be the inner loop. Outer loops can be parallelized using OpenMP or autoparallelization (-parallel), but they cannot be vectorized unless the compiler is able either to fully unroll the inner loop, or to interchange the inner and outer loops. (Additional high-level loop transformations such as these may require -O3. This option is available for both Intel® and non-Intel microprocessors, but it may result in more optimizations for Intel microprocessors than for non-Intel microprocessors.)
• The loop must contain straight-line code (a single basic block). There should be no jumps or branches, but masked assignments are allowed.
• The loop must be countable, i.e. the number of iterations must be known before the loop starts to execute, though it need not be known at compile time. Consequently, there must be no data-dependent exit conditions.
• There should be no backward loop-carried dependencies. For example, the loop must not require statement 2 of iteration 1 to be executed before statement 1 of iteration 2 for correct results. This allows consecutive iterations of the original loop to be executed simultaneously in a single iteration of the unrolled, vectorized loop.

OK (vectorizable): a[i-1] is always computed before it is used:
for (i=1; i<MAX; i++) {
   a[i] = b[i] + c[i];
   d[i] = e[i] - a[i-1];
}

Not OK (unvectorizable): a[i-1] might be needed before it has been computed:
for (i=1; i<MAX; i++) {
   d[i] = e[i] - a[i-1];
   a[i] = b[i] + c[i];
}

• There should be no special operators and no function or subroutine calls, unless these are inlined, either manually or automatically by the compiler. Intrinsic math functions such as sin(), log(), fmax(), etc. are allowed, since the compiler runtime library contains vectorized versions of these functions. See the comments section for a more extensive list.
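
The difference between the two dependency examples above can be checked with a small simulation. The sketch below runs each loop twice: once in the original scalar order, and once in a "vector order" in which each statement is executed for a whole chunk of iterations before the next statement starts. The array length N, the chunk length VL, the function names and the test data are all assumptions chosen for illustration; this models the ordering only and is not how a compiler implements vectorization.

```c
#include <stddef.h>

enum { N = 9, VL = 4 };  /* illustrative sizes, not from the article */

typedef void (*loop_fn)(double *a, const double *b, const double *c,
                        double *d, const double *e);

/* "OK" loop, original scalar order. */
void ok_scalar(double *a, const double *b, const double *c,
               double *d, const double *e)
{
    for (size_t i = 1; i < N; i++) {
        a[i] = b[i] + c[i];
        d[i] = e[i] - a[i - 1];
    }
}

/* "OK" loop, simulated vector order: statement 1 for the whole chunk,
 * then statement 2 for the whole chunk. */
void ok_chunked(double *a, const double *b, const double *c,
                double *d, const double *e)
{
    for (size_t i = 1; i < N; i += VL) {
        size_t end = (i + VL < N) ? i + VL : N;
        for (size_t j = i; j < end; j++) a[j] = b[j] + c[j];
        for (size_t j = i; j < end; j++) d[j] = e[j] - a[j - 1];
    }
}

/* "Not OK" loop, original scalar order. */
void notok_scalar(double *a, const double *b, const double *c,
                  double *d, const double *e)
{
    for (size_t i = 1; i < N; i++) {
        d[i] = e[i] - a[i - 1];
        a[i] = b[i] + c[i];
    }
}

/* "Not OK" loop, simulated vector order: d[j] now reads a[j-1] before
 * the same chunk has written it. */
void notok_chunked(double *a, const double *b, const double *c,
                   double *d, const double *e)
{
    for (size_t i = 1; i < N; i += VL) {
        size_t end = (i + VL < N) ? i + VL : N;
        for (size_t j = i; j < end; j++) d[j] = e[j] - a[j - 1];
        for (size_t j = i; j < end; j++) a[j] = b[j] + c[j];
    }
}

/* Returns 1 if both execution orders produce the same d[], else 0. */
int orders_agree(loop_fn scalar, loop_fn chunked)
{
    double a1[N], a2[N], b[N], c[N], e[N], d1[N] = {0}, d2[N] = {0};
    for (size_t i = 0; i < N; i++) {
        a1[i] = a2[i] = -1.0;  /* stale value, visible if read too early */
        b[i] = (double)i;
        c[i] = 2.0 * (double)i;
        e[i] = 100.0;
    }
    scalar(a1, b, c, d1, e);
    chunked(a2, b, c, d2, e);
    for (size_t i = 1; i < N; i++) {
        if (d1[i] != d2[i]) return 0;
    }
    return 1;
}
```

With these inputs, orders_agree(ok_scalar, ok_chunked) returns 1 and orders_agree(notok_scalar, notok_chunked) returns 0, because the chunked "Not OK" loop reads the stale a[1] when computing d[2].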

Advice:

• Both reductions and vector assignments to arrays are allowed.
• Try to avoid mixing vectorizable data types in the same loop (except for integer arithmetic on array subscripts). Vectorization of type conversions may be either unsupported or inefficient. Support for the vectorization of loops containing mixed data types may be extended in a future version of the Intel compiler.
• Try to access contiguous memory locations. (So loop over the first array index in Fortran, or the last array index in C). Whilst the compiler may sometimes be able to vectorize loops with indirect or non-unit stride memory addressing, the cost of gathering data from or scattering back to memory is often too great to make vectorization worthwhile.
• The “ivdep” pragma or directive may be used to advise the compiler that there are no loop-carried dependencies that would make vectorization unsafe.
• The “vector always” pragma or directive may be used to override the compiler’s heuristics that determine whether vectorization of a loop is likely to yield a performance benefit.
• To see whether a loop was or was not vectorized, and why, look at the vectorization report. This may be enabled by the command line switch /Qvec-report3 (Windows*) or -vec-report3 (Linux* or Mac OS* X).

• For more information, see the main Intel Compiler documentation, under “Optimizing Applications”, “Using Parallelism: Automatic Vectorization”.
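
Several of the points above can be sketched in C: a reduction, a loop nest whose inner loop runs over the last array index, and the ivdep pragma. The function names and array sizes are hypothetical. The pragma is written in the Intel compiler spelling mentioned in the text; other compilers may ignore it, possibly with a warning, and it is only safe when the programmer knows that the accesses really do not overlap.

```c
#include <stddef.h>

enum { ROWS = 4, COLS = 8 };  /* illustrative sizes */

/* Reduction: every iteration accumulates into the same scalar. */
double sum_array(const double *x, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        s += x[i];
    }
    return s;
}

/* Contiguous access in C: the inner loop runs over the last index,
 * so consecutive iterations touch adjacent memory locations. */
void scale_rows(double m[ROWS][COLS], double f)
{
    for (size_t r = 0; r < ROWS; r++) {
        for (size_t c = 0; c < COLS; c++) {
            m[r][c] *= f;
        }
    }
}

/* Without further information the compiler must assume dst and src may
 * overlap. The pragma is the caller's promise that no such loop-carried
 * dependency exists. */
void add_into(double *dst, const double *src, size_t n)
{
#pragma ivdep
    for (size_t i = 0; i < n; i++) {
        dst[i] += src[i];
    }
}
```

The “vector always” pragma would be placed in the same position, immediately before the loop, but it only overrides the compiler's profitability heuristics; it does not assert anything about dependencies.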


Please refer to the Optimization Notice page for more detailed information regarding performance and optimization in Intel software products.