Requirements for Vectorizing Loops with #pragma SIMD

Loop vectorization is key to performance on Intel® Architecture, and it grows in importance as the SIMD vector length increases. The types of loop that the Intel® C/C++ and Fortran compilers can vectorize automatically are described in the article /en-us/articles/requirements-for-vectorizable-loops. Various coding techniques, pragmas and command-line options are available to help the compiler vectorize, as described in the article /en-us/ and in the Intel Compiler User Guides. A recent addition to the programmer's toolbox is the SIMD pragma or directive, a component of Intel® Cilk™ Plus. This pragma, described in the compiler user guide for version 12 and later, asks the compiler to relax some of the above requirements and to make every possible effort to vectorize a loop. If an ASSERT clause is present, compilation fails if the loop is not successfully vectorized, which has led to the nickname "vectorize or die" pragma.

#pragma simd (!DIR$ SIMD for Fortran) behaves somewhat like a combination of #pragma vector always and #pragma ivdep, but is more powerful. The compiler does not try to assess whether vectorization is likely to improve performance, does not check for aliasing or dependencies that might cause incorrect results after vectorization, and does not protect against illegal memory references. #pragma ivdep overrides potential dependencies, but the compiler still performs a dependency analysis and will not vectorize if it finds a proven dependency that would affect results. With #pragma simd, the compiler does no such analysis and tries to vectorize regardless. It is the programmer's responsibility to ensure that there are no backward dependencies that might impact correctness.

The semantics of #pragma simd are rather similar to those of the OpenMP* pragma #pragma omp parallel for. It accepts optional clauses such as REDUCTION, PRIVATE, FIRSTPRIVATE and LASTPRIVATE. SIMD-specific clauses are VECTORLENGTH (which implies the loop unroll factor) and LINEAR, which can specify different strides for different variables. #pragma simd allows a wider variety of loops to be vectorized, including loops containing multiple branches or function calls. It is particularly powerful in conjunction with the vector functions of Intel Cilk Plus (see /en-us/articles/getting-started-with-intel-cilk-plus-simd-vectorization-and-elemental-functions).
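As a minimal sketch of the REDUCTION and LINEAR clauses mentioned above, the two functions below show how such loops might be annotated. The function names dot and gather_even are hypothetical and chosen only for illustration. A compiler that does not recognize #pragma simd will typically ignore it (possibly with a warning) and run the loops as ordinary scalar code.

```c
/* Dot product: every iteration updates the same accumulator, so the
   loop carries a dependence on sum. The REDUCTION clause tells the
   compiler that sum may be accumulated in separate vector lanes and
   combined at the end. */
float dot(const float *a, const float *b, int n)
{
    float sum = 0.0f;
#pragma simd reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}

/* Strided read: j advances by 2 on every iteration. The LINEAR clause
   states that stride explicitly, so the compiler does not have to
   derive it. The caller must guarantee that dst and src do not
   overlap, because #pragma simd does not check for aliasing. */
void gather_even(float *dst, const float *src, int n)
{
    int j = 0;
#pragma simd linear(j:2)
    for (int i = 0; i < n; i++) {
        dst[i] = src[j];
        j += 2;
    }
}
```

Adding the ASSERT clause (for example, #pragma simd assert reduction(+:sum)) would make the Intel compiler reject the build if the loop could not be vectorized.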

Nevertheless, the technology underlying the SIMD pragma/directive is still that of the compiler vectorizer, and some restrictions remain on what types of loop can be vectorized:

•  The loop must be countable, i.e. the number of iterations must be known before the loop starts to execute, though it need not be known at compile time. Consequently, there must be no data-dependent exit conditions, such as break (C/C++) or EXIT (Fortran) statements. This also excludes most "while" loops. Typical diagnostics:

     error: invalid simd pragma

     warning #8410: Directive SIMD must be followed by counted DO loop.

•  Certain special, non-mathematical operators are not supported, nor are certain combinations of operators and data types. Typical diagnostic messages are "operation not supported", "unsupported reduction" and "unsupported data type".

•  Very complex array subscripts or pointer arithmetic may not be vectorized. A typical diagnostic message is "dereference too complex".

•  Loops with very low trip counts may not be vectorized. Typical diagnostic:

     remark: loop was not vectorized: low trip count.

•  Extremely large loop bodies (very many lines and symbols) may not be vectorized. The compiler has internal limits that prevent it from vectorizing loops that would require a very large number of vector registers, with many spills and restores to and from memory.

•  SIMD directives may not be applied to Fortran 90 array assignments or to Intel Cilk Plus array notation.

•  SIMD directives may not be applied to loops containing C++ exception handling code.
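The countable-loop requirement in the list above can be illustrated with a small sketch. The first function below has a data-dependent break, so its trip count is not known when the loop starts. The second computes the trip count in a scalar pre-pass and then runs a countable loop that is a candidate for #pragma simd. Both function names are hypothetical, and this restructuring is one possible approach rather than a recommendation from the compiler documentation.

```c
/* Not eligible for #pragma simd: the loop exits as soon as a zero is
   found, so the number of iterations depends on the data. */
void scale_until_zero(float *x, int n, float s)
{
    for (int i = 0; i < n; i++) {
        if (x[i] == 0.0f)
            break;
        x[i] *= s;
    }
}

/* Eligible: a scalar pre-pass finds the trip count m, after which the
   main loop is countable (m is known before the loop starts, though
   not at compile time). */
void scale_until_zero_counted(float *x, int n, float s)
{
    int m = 0;
    while (m < n && x[m] != 0.0f)
        m++;

#pragma simd
    for (int i = 0; i < m; i++)
        x[i] *= s;
}
```

Both versions leave the array in the same state, because each element is tested before it is modified and no iteration reads an element written by another.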

A number of the requirements detailed in "Requirements for Vectorizable Loops" are relaxed for #pragma simd, in addition to the above-mentioned ones relating to dependencies and performance estimates. Non-inner loops may be vectorized in certain cases; more mixing of different data types is allowed; function calls are possible and more complex control flow is supported. Nevertheless, the advice in the above article should be followed where possible, since it is likely to improve performance.
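The following sketch shows the kind of loop body the paragraph above has in mind: it mixes int and float data, contains a branch, and calls a function. The names soft_clip and apply_gain are hypothetical. Whether a particular compiler version inlines the call and vectorizes the loop is not guaranteed; the diagnostics in the vectorization report are the way to find out.

```c
/* Hypothetical helper with its own control flow. A compiler may
   inline it into the calling loop before vectorizing. */
static float soft_clip(float v, float limit)
{
    if (v > limit)
        return limit;
    if (v < -limit)
        return -limit;
    return v;
}

/* Loop with mixed data types (int gain, float samples), a branch and
   a function call. The caller must ensure that out does not overlap
   in or gain, since #pragma simd performs no aliasing checks. */
void apply_gain(float *out, const float *in, const int *gain,
                int n, float limit)
{
#pragma simd
    for (int i = 0; i < n; i++) {
        if (gain[i] == 0)
            out[i] = 0.0f;
        else
            out[i] = soft_clip(in[i] * (float)gain[i], limit);
    }
}
```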

Side effects: with #pragma simd, loops are vectorized under the "fast" floating-point model, corresponding to /fp:fast (-fp-model fast). The command-line option /fp:precise (-fp-model precise) is not respected by a loop vectorized with #pragma simd, so such a loop might not give results identical to those of the same loop without #pragma simd. For further information about the floating-point model, see /en-us/articles/consistency-of-floating-point-results-using-the-intel-compiler.
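One reason results can differ is that a vectorized reduction adds the elements in a different order, and floating-point addition is not associative. The sketch below emulates this in plain scalar C, with no pragma involved: sum_sequential adds in source order, while sum_four_lanes keeps four partial sums, as a 4-wide vector loop might, and combines them at the end. Both function names, the lane count of 4 and the order of the final combination are assumptions made for illustration; the order a real compiler uses may differ.

```c
/* Adds the elements strictly in source order. */
float sum_sequential(const float *a, int n)
{
    float s = 0.0f;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Emulates a 4-lane vector reduction: element i is accumulated into
   lane i % 4, and the lanes are combined pairwise at the end.
   Assumes n is a multiple of 4 to keep the sketch short. */
float sum_four_lanes(const float *a, int n)
{
    float lane[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    for (int i = 0; i < n; i += 4)
        for (int k = 0; k < 4; k++)
            lane[k] += a[i + k];
    return (lane[0] + lane[1]) + (lane[2] + lane[3]);
}
```

For the input {1e8f, 1.0f, -1e8f, 1.0f}, evaluated in IEEE 754 single precision, sum_sequential returns 1.0f and sum_four_lanes returns 0.0f. In source order only the first 1.0f is absorbed by rounding when it is added to 1e8f; in the lane-wise order both are, because each is added to a value of magnitude 1e8.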

For further information about the SIMD pragma/directive and about Intel Cilk Plus, see the Intel Compiler User Guide.
For more complete information about compiler optimizations, see the Optimization Notice.