Requirements for Vectorizable Loops

Vectorization of Loops

For the Intel® C/C++ and Fortran compilers for IA-32 or Intel 64, “vectorization” of a loop means unrolling the loop so that it can take advantage of packed SIMD instructions to perform the same operation on multiple data elements in a single instruction. For example, where a non-vectorized "DAXPY" loop

for (i=0;i<MAX;i++) z[i]=a*x[i]+y[i]; 

might use scalar SIMD instructions such as addsd and mulsd, a vectorized loop would use the packed versions, addpd or mulpd. (In the penultimate character, s stands for “scalar” and p stands for “packed”. In the final character, s stands for single precision and d stands for double). In the most recent Intel compilers, vectorization is one of many optimizations that are enabled by default.

Vectorization can be thought of as executing more than one consecutive iteration of the original loop at the same time. For processors supporting Streaming SIMD Extensions, this is usually 2 or 4 iterations, but potentially could be more, especially for integer arithmetic or for more advanced instruction sets. This leads to some restrictions on the types of loop that can be vectorized. Additional requirements for effective vectorization come from the properties of the SIMD instructions themselves.

Requirements for loop vectorization:

• The loop must contain straight-line code (a single basic block). There should be no jumps or branches, but masked assignments are allowed, including if-then-else constructs that can be interpreted as masked assignments.
• The loop must be countable, i.e. the number of iterations must be known before the loop starts to execute, though it need not be known at compile time. Consequently, there must be no data-dependent exit conditions.
• There should be no backward loop-carried dependencies. For example, the loop must not require statement 2 of iteration 1 to be executed before statement 1 of iteration 2 for correct results. This allows consecutive iterations of the original loop to be executed simultaneously in a single iteration of the unrolled, vectorized loop.

OK (vectorizable):  a[i-1] is always computed before it is used:

for (i=1; i<MAX; i++) {
   a[i] = b[i] + c[i];
   d[i] = e[i] - a[i-1];
}


Not OK (unvectorizable): a[i-1] might be needed before it has been computed:

for (i=1; i<MAX; i++) {
   d[i] = e[i] - a[i-1];
   a[i] = b[i] + c[i];
}

• There should be no special operators and no function or subroutine calls, unless these are inlined, either manually or automatically by the compiler, or they are SIMD (vectorized) functions. Intrinsic math functions such as sin(), log(), fmax(), etc. are allowed, since the compiler runtime library contains SIMD (vectorized) versions of these functions. See the comments section for a more extensive list.

• If a loop is part of a loop nest, it should normally be the inner loop. Outer loops can be parallelized using OpenMP or autoparallelization (-parallel), but they can only rarely be auto-vectorized, unless the compiler is able either to fully unroll the inner loop, or to interchange the inner and outer loops. (Additional high-level loop transformations such as these may require -O3. This option is available for both Intel® and non-Intel microprocessors but it may result in more optimizations for Intel microprocessors than for non-Intel microprocessors). The SIMD pragma or directive can be used to ask the compiler to vectorize an outer loop. See http://software.intel.com/en-us/articles/requirements-for-vectorizing-loops-with-pragma-simd for more information about what sort of loops can be vectorized using #pragma simd, !DIR$ SIMD or the OpenMP 4.0 equivalents.


Advice:

• Both reductions and vector assignments to arrays are allowed.
• Try to avoid mixing vectorizable data types in the same loop (except for integer arithmetic on array subscripts). Vectorization of type conversions may be inefficient.
• Try to access contiguous memory locations. (So loop over the first array index in Fortran, or the last array index in C). Whilst the compiler is often able to vectorize loops with indirect or non-unit stride memory addressing, the cost of gathering data from or scattering back to memory may be too great to make vectorization worthwhile.
• The “ivdep” pragma or directive may be used to advise the compiler that there are no loop-carried dependencies that would make vectorization unsafe.
• The “vector always” pragma or directive may be used to override the compiler’s heuristics that determine whether vectorization of a loop is likely to yield a performance benefit. This pragma does not override the compiler's dependency analysis.
• To see whether a loop was or was not vectorized, and why, look at the vectorization report. This may be enabled by the command line switch /Qvec-report2 (Windows*) or -vec-report2 (Linux* or Mac OS* X). Additional information may be obtained by increasing the report level from 2 to 3 or 6.
• Explicit Vector Programming can make the vectorization of loops more predictable, through the use of Intel® Cilk™ Plus array notation, SIMD functions, and SIMD pragmas and directives.

• For more information, see the main Intel Compiler documentation, under “Key Features”, “Automatic Vectorization”.

 

Please refer to the Optimization Notice page for more details regarding performance and optimization in Intel software products.