Developer Guide and Reference

Vectorization and Loops

This topic provides more information on the interaction between the auto-vectorizer and loops.
In some rare cases, a successful loop parallelization (either automatically or by means of OpenMP* directives) may affect the messages reported by the compiler for a non-vectorizable loop in a non-intuitive way.

Types of Vectorized Loops

For integer loops, the 128-bit Intel® Streaming SIMD Extensions (Intel® SSE) and the Intel® Advanced Vector Extensions (Intel® AVX) provide SIMD instructions for most arithmetic and logical operators on 32-bit, 16-bit, and 8-bit integer data types, with limited support for the 64-bit integer data type.
Vectorization may proceed if the final precision of integer wrap-around arithmetic is preserved. A 32-bit shift-right operator, for instance, is not vectorized in 16-bit mode if the final stored value is a 16-bit integer. Also, note that because the Intel® SSE and the Intel® AVX instruction sets are not fully orthogonal (shifts on byte operands, for instance, are not supported), not all integer operations can actually be vectorized.
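The wrap-around rule can be illustrated with a minimal sketch. The function names `average_u16` and `average_u16_narrow` are hypothetical and not part of the compiler or its libraries. The first function relies on C++ integer promotion, so the sum and the shift are evaluated in at least 32 bits. The second shows what a purely 16-bit evaluation would produce. Whether a given compiler version vectorizes the first loop, and at which element width, is not claimed here.

```cpp
#include <cstddef>
#include <cstdint>

// Average of two 16-bit values per element. The operands are promoted to int,
// so the addition and the right shift use at least 32 bits; only the final
// store narrows the result back to 16 bits.
void average_u16(const std::uint16_t* a, const std::uint16_t* b,
                 std::uint16_t* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = static_cast<std::uint16_t>((a[i] + b[i]) >> 1);
    }
}

// What a 16-bit-only evaluation would compute: the sum wraps modulo 2^16
// before the shift, so the carry bit is lost and the result differs.
std::uint16_t average_u16_narrow(std::uint16_t a, std::uint16_t b) {
    std::uint16_t sum = static_cast<std::uint16_t>(a + b);  // wraps here
    return static_cast<std::uint16_t>(sum >> 1);
}
```

For inputs whose sum exceeds 65535 the two functions disagree, which is why the shift cannot simply be carried out in 16-bit lanes when the stored value must match the scalar result.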
For loops that operate on 32-bit single-precision and 64-bit double-precision floating-point numbers, Intel® SSE provides SIMD instructions for the following arithmetic operators:
  • addition (+)
  • subtraction (-)
  • multiplication (*)
  • division (/)
Additionally, Intel® SSE provides SIMD instructions for the binary MIN and MAX and unary SQRT operators. SIMD versions of several other mathematical operators (like the trigonometric functions SIN, COS, and TAN) are supported in software in a vector mathematical run-time library that is provided with the Intel® oneAPI DPC++/C++ Compiler.
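As a rough illustration of the floating-point operators listed above, the following sketch combines MIN, MAX, SQRT, multiplication, and addition in one loop body. The function name `clamp_sqrt` and its parameters are assumptions made for this example only. Whether the loop is actually vectorized depends on the compiler version, the target options, and what the compiler can prove about pointer aliasing.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>

// Clamp each input to [lo, hi], take the square root, then scale and offset.
// Every operation in the body (max, min, sqrt, *, +) has a SIMD counterpart
// in the operator set described above.
void clamp_sqrt(const float* x, float* y, std::size_t n, float lo, float hi) {
    for (std::size_t i = 0; i < n; ++i) {
        float v = std::min(std::max(x[i], lo), hi);
        y[i] = std::sqrt(v) * 2.0f + 1.0f;
    }
}
```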
To be vectorizable, loops must be:
  • Countable:
    The loop trip count must be known at entry to the loop at runtime, though it need not be known at compile time (that is, the trip count can be a variable but the variable must remain constant for the duration of the loop). This implies that exit from the loop must not be data-dependent.
  • Single entry and single exit:
    This is implied by stating that the loop must be countable.
  • Contain straight-line code:
    SIMD instructions perform the same operation on data elements from multiple iterations of the original loop; therefore, different iterations cannot have different control flow, that is, they must not branch. It follows that switch statements are not allowed. However, if statements are allowed if they can be implemented as masked assignments, which is usually the case. The calculation is performed for all data elements, but the result is stored only for those elements for which the mask evaluates to true.
  • Innermost loop of a nest:
    The only exception is if an original outer loop is transformed into an inner loop as a result of some other prior optimization phase, such as unrolling, loop collapsing or interchange, or an original outermost loop is transformed to an innermost loop due to loop materialization.
  • Without function calls:
    Even a print statement is sufficient to prevent a loop from getting vectorized. The vectorization report message is typically: "non-standard loop is not a vectorization candidate". The two major exceptions are for intrinsic math functions and for functions that may be inlined.
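The criteria in the list above can be sketched with two small loops. Both function names, `scale_positive` and `find_first_negative`, are hypothetical. The first loop is countable, has a single entry and exit, and its if statement can be expressed as a masked assignment. The second exits as soon as it sees a negative value, so its trip count depends on the data and, by the criteria above, it is not a vectorization candidate. This is an illustration of the rules as stated, not a prediction of what a particular compiler release reports.

```cpp
#include <cstddef>

// Countable loop: the trip count n is fixed on entry. The branch only guards
// a store, so it can be implemented as a masked assignment: the product is
// computed for every element and stored only where the condition holds.
void scale_positive(const float* in, float* out, std::size_t n, float k) {
    for (std::size_t i = 0; i < n; ++i) {
        if (in[i] > 0.0f) {
            out[i] = in[i] * k;
        }
    }
}

// Data-dependent exit: the loop stops at the first negative element, so the
// number of iterations is not known when the loop is entered.
std::size_t find_first_negative(const float* in, std::size_t n) {
    std::size_t i = 0;
    for (; i < n; ++i) {
        if (in[i] < 0.0f) {
            break;
        }
    }
    return i;  // equals n when no negative element exists
}
```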
Intrinsic math functions are allowed, because the compiler runtime library contains vectorized versions of these functions. See the table below for a list of these functions; most exist in both float and double versions.
    acos     ceil     fabs     round
    acosh    cos      floor    sin
    asin     cosh     fmax     sinh
    asinh    erf      fmin     sqrt
    atan     erfc     log      tan
    atan2    erfinv   log10    tanh
    atanh    exp      log2     trunc
    cbrt     exp2     pow
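A loop that calls only math functions from the table might look like the following sketch. The function name `phase_to_xy` is an assumption for this example. The standard `std::cos` and `std::sin` calls are used as written in ordinary C++; mapping them to vectorized run-time library versions is the compiler's decision and depends on the compiler and its options.

```cpp
#include <cmath>
#include <cstddef>

// Convert angles (in radians) to unit-circle coordinates. The only calls in
// the loop body are cos and sin, both of which appear in the table above.
void phase_to_xy(const double* theta, double* x, double* y, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        x[i] = std::cos(theta[i]);
        y[i] = std::sin(theta[i]);
    }
}
```

Adding a call that is neither in the table nor inlinable, such as a print statement inside the body, would fall under the "Without function calls" restriction described earlier.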

Statements in the Loop Body

The vectorizable operations are different for floating-point and integer data.
Integer Array Operations
The statements within the loop body may contain char, unsigned char, short, unsigned short, int, and unsigned int. Calls to functions such as sqrt and fabs are also supported. Arithmetic operations are limited to addition, subtraction, bitwise AND, OR, and XOR operators, division (via run-time library call), multiplication, min, and max. You can mix data types, but doing so may lower efficiency. Some example operators where you can mix data types are multiplication, shift, or unary operators.
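The integer operations above can be sketched as follows. Both function names, `mask_and_offset` and `widen_multiply`, are hypothetical. The first loop stays within one 32-bit type and uses only bitwise AND, addition, and max. The second mixes 16-bit inputs with a 32-bit result, which is the kind of mixed-type multiplication the text mentions and which may cost some efficiency because the element widths differ.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// Single data type: bitwise AND, addition, and max on 32-bit integers.
void mask_and_offset(const std::int32_t* a, const std::int32_t* b,
                     std::int32_t* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        std::int32_t v = (a[i] & 0xFF) + b[i];
        out[i] = std::max(v, std::int32_t{0});
    }
}

// Mixed data types: 16-bit operands multiplied into a 32-bit result, so the
// full product is kept without 16-bit wrap-around.
void widen_multiply(const std::int16_t* a, const std::int16_t* b,
                    std::int32_t* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = static_cast<std::int32_t>(a[i]) * b[i];
    }
}
```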

Other Operations

No statements other than the preceding floating-point and integer operations are allowed. In particular, note that the special __m64, __m128, and __m256 data types are not vectorizable. The loop body cannot contain any function calls. Use of Intel® SSE intrinsics (for example, _mm_add_ps) or Intel® AVX intrinsics (for example, _mm256_add_ps) is not allowed.
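To make the restriction on intrinsics concrete, the sketch below writes the same element-wise addition twice. The function names `add_scalar` and `add_sse` are hypothetical. The scalar form uses only plain floating-point addition and is the kind of loop the auto-vectorizer considers. The second form is already hand-vectorized with `_mm_add_ps` and the `__m128` type, so by the rule above its loop is not an auto-vectorization candidate. The intrinsic version is guarded by `__SSE__`, on the assumption that the compiler defines this macro when Intel® SSE is enabled.

```cpp
#include <cstddef>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

// Plain scalar loop: only floating-point addition, no intrinsics or
// __m128 values in the body.
void add_scalar(const float* a, const float* b, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}

#if defined(__SSE__)
// Hand-vectorized loop: processes four floats per iteration with SSE
// intrinsics, then finishes any leftover elements one at a time.
void add_sse(const float* a, const float* b, float* out, std::size_t n) {
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);  // unaligned load of 4 floats
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
    }
    for (; i < n; ++i) {
        out[i] = a[i] + b[i];  // scalar remainder
    }
}
#endif
```

Both functions produce the same results; the difference is only in who performs the vectorization.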
Product and Performance Information
Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.
Notice revision #20201201
