Intel® C++ Compiler Classic Developer Guide and Reference

ID 767249
Date 12/16/2022
Public



User-mandated or SIMD Vectorization

User-mandated or SIMD vectorization supplements automatic vectorization just like OpenMP parallelization supplements automatic parallelization. The following figure illustrates this relationship. User-mandated vectorization is implemented as a single-instruction-multiple-data (SIMD) feature and is referred to as SIMD vectorization.

NOTE:

The SIMD vectorization feature is available for both Intel® microprocessors and non-Intel microprocessors. Vectorization may call library routines that can result in greater performance gains on Intel® microprocessors than on non-Intel microprocessors. Vectorization can also be affected by certain options, such as /arch (Windows), -m (Linux and macOS), or [Q]x.

The following figure illustrates how SIMD vectorization is positioned among the various approaches you can take to generate vector code that exploits vector hardware capabilities. Programs written with SIMD vectorization are very similar to those written using auto-vectorization hints. You can use SIMD vectorization to minimize the code changes needed to obtain vectorized code.

SIMD vectorization uses the #pragma omp simd pragma to effect loop vectorization. To vectorize a loop, add this pragma to the loop and recompile with the option -qopenmp-simd (Linux and macOS) or /Qopenmp-simd (Windows).

Consider an example where the function add_floats() uses too many unknown pointers for the compiler's automatic runtime independence check optimization to kick in. You can give a data dependence assertion using the auto-vectorization hint #pragma ivdep and let the compiler decide whether the auto-vectorization optimization should be applied to the loop, or you can enforce vectorization of this loop by using #pragma omp simd.
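For comparison, a minimal sketch of the #pragma ivdep form described above (not one of the numbered example files in this section) could look like the following; whether the loop is actually vectorized remains at the compiler's discretion.

// Sketch only: add_floats() with an auto-vectorization hint instead of #pragma omp simd.
void add_floats(float *a, float *b, float *c, float *d, float *e, int n) {
  int i;
  #pragma ivdep   // asserts that the pointers carry no loop dependences
  for (i = 0; i < n; i++) {
    a[i] = a[i] + b[i] + c[i] + d[i] + e[i];
  }
}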

Vectorization without #pragma omp simd:

[D:/simd] cat example1.c 
void add_floats(float *a, float *b, float *c, float *d, float *e, int n) {
 int i;
 for (i=0; i<n; i++){
  a[i] = a[i] + b[i] + c[i] + d[i] + e[i];
 } 
}

[D:/simd] icl -nologo -c -Qopt-report2 -Qopt-report-file=stderr -Qopt-report-phase=vec -Qopenmp-simd example1.c
example1.c

Begin optimization report for: add_floats(float *, float *, float *, float *, float *, int)

    Report from: Vector optimizations [vec]

LOOP BEGIN at C:\Users\test\run\example1.c(3,2)
   remark #15344: loop was not vectorized: vector dependence prevents vectorization. First dependence is shown below. Use level 5 report for details
   remark #15346: vector dependence: assumed FLOW dependence between a[i] (4:3) and b[i] (4:3)
LOOP END

LOOP BEGIN at C:\Users\test\run\example1.c(3,2)
<Remainder>
LOOP END
===========================================================================

Vectorization with #pragma omp simd:

[D:/simd] cat example1.c 
void add_floats(float *a, float *b, float *c, float *d, float *e, int n) {
	int i;
	#pragma omp simd
	for (i=0; i<n; i++){
		a[i] = a[i] + b[i] + c[i] + d[i] + e[i];
		} 
}

[D:/simd] icl -nologo -c -Qopt-report2 -Qopt-report-file=stderr -Qopt-report-phase=vec -Qopenmp-simd example1.c
example1.c

Begin optimization report for: add_floats(float *, float *, float *, float *, float *, int)

    Report from: Vector optimizations [vec]

LOOP BEGIN at C:\Users\test\run\example1.c(4,2)
<Peeled loop for vectorization>
LOOP END

LOOP BEGIN at C:\Users\test\run\example1.c(4,2)
   remark #15301: OpenMP SIMD LOOP WAS VECTORIZED
LOOP END

LOOP BEGIN at C:\Users\test\run\example1.c(4,2)
<Alternate Alignment Vectorized Loop>
LOOP END

LOOP BEGIN at C:\Users\test\run\example1.c(4,2)
<Remainder loop for vectorization>
   remark #15301: REMAINDER LOOP WAS VECTORIZED
LOOP END

LOOP BEGIN at C:\Users\test\run\example1.c(4,2)
<Remainder loop for vectorization>
LOOP END
===========================================================================

The difference between using #pragma omp simd and auto-vectorization hints is that with #pragma omp simd, the compiler generates a warning when it is unable to vectorize the loop. With auto-vectorization hints, actual vectorization is still at the discretion of the compiler, even when you use the #pragma vector always hint.
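For illustration only (a minimal sketch, not one of the example files above), an auto-vectorization hint is written as shown below; the hint overrides the compiler's efficiency heuristics, but the compiler may still decline to vectorize if it cannot prove the loop is safe.

// Sketch: auto-vectorization hint; vectorization remains at the compiler's discretion.
void scale(float *a, const float *b, int n) {
  #pragma vector always
  for (int i = 0; i < n; i++) {
    a[i] = 2.0f * b[i];
  }
}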

#pragma omp simd has optional clauses to guide the compiler on how vectorization must proceed. Use these clauses appropriately so that the compiler obtains enough information to generate correct vector code. For more information on the clauses, see the #pragma omp simd description.
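For illustration, a loop using some of these clauses might look like the following sketch. The clause names (private, reduction, linear) are standard #pragma omp simd clauses; the function simd_sum() itself is a hypothetical example.

// Sketch (hypothetical function): a #pragma omp simd loop with explicit clauses.
float simd_sum(float *a, const float *b, const float *c, int n) {
  float sum = 0.0f, t;
  int j = 0;
  #pragma omp simd private(t) reduction(+:sum) linear(j:1)
  for (int i = 0; i < n; i++) {
    t = b[i] * c[i];   // private: each iteration works on its own copy of t
    a[j] = t;          // linear: j advances by exactly 1 per iteration
    j++;
    sum += t;          // reduction: per-lane partial sums are combined after the loop
  }
  return sum;
}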

Additional Semantics

Note the following points when using the omp simd pragma.

  • A variable may belong to at most one of the following categories: private, linear, or reduction (see the sketch after this list).

  • Within the vector loop, an expression is evaluated as a vector value if it is private, linear, or reduction, or if it has a sub-expression that is evaluated to a vector value. Otherwise, it is evaluated as a scalar value (that is, the same value is broadcast to all iterations). A scalar value is not necessarily loop invariant, although loop-invariant values are the most common case.

  • A vector value may not be assigned to a scalar l-value; this is an error.

  • A scalar l-value may not be assigned under a vector condition; this is an error.

  • The switch statement is not supported.
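The following hypothetical sketch, referenced in the list above, illustrates the classification and the two error cases:

// Sketch (hypothetical): vector vs. scalar evaluation inside a #pragma omp simd loop.
float rules_sketch(const float *a, int n) {
  float s = 0.0f;   // reduction variable
  float g = 1.0f;   // plain scalar: neither private, linear, nor reduction
  #pragma omp simd reduction(+:s)
  for (int i = 0; i < n; i++) {
    float v = a[i] * g;          // v is a vector value; g is broadcast as a scalar
    s += v;                      // allowed: s is a reduction variable
    // g = v;                    // error: vector value assigned to a scalar l-value
    // if (v > 0.0f) g = 2.0f;   // error: scalar l-value assigned under a vector condition
  }
  return s;
}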

NOTE:

You may find it difficult to describe vector semantics using the SIMD pragma for some auto-vectorizable loops. One example is MIN/MAX reduction in C since the language does not have MIN/MAX operators.

Using vector Declaration

Consider the following example code with a loop containing the math function sinf().

NOTE:
All code examples in this section apply to C/C++ on Windows only.

Loop Where the Math Function Is Auto-vectorized

[D:/simd] cat example2.c 
void vsin(float *restrict a, float *restrict b, int n) { 
int i; 
for (i=0; i<n; i++) {
  a[i] = sinf(b[i]);
  } 
}

[D:/simd] icl -nologo -c -Qrestrict -Qopt-report2 -Qopt-report-file=stderr -Qopt-report-phase=vec example2.c
example2.c

Begin optimization report for: vsin(float *restrict, float *restrict, int)

    Report from: Vector optimizations [vec]


LOOP BEGIN at C:\Users\test\run\example2.c(3,1)
<Peeled loop for vectorization>
LOOP END

LOOP BEGIN at C:\Users\test\run\example2.c(3,1)
   remark #15300: LOOP WAS VECTORIZED
LOOP END

LOOP BEGIN at C:\Users\test\run\example2.c(3,1)
<Alternate Alignment Vectorized Loop>
LOOP END

LOOP BEGIN at C:\Users\test\run\example2.c(3,1)
<Remainder loop for vectorization>
LOOP END
===========================================================================

When you compile the above code, the loop with the sinf() call is auto-vectorized using the appropriate Short Vector Math Library (SVML) function provided by the Intel® C++ Compiler. The auto-vectorizer identifies the entry points, matches the scalar math library function to the SVML function, and invokes it.

However, if this loop contains a call to your own function foo(), which has the same prototype as sinf(), the auto-vectorizer fails to vectorize the loop because it does not know what foo() does unless it is inlined at this call site.

Loop With User-Defined Function Is Not Auto-vectorized

[D:/simd] cat example2.c 
float foo(float); 
void vfoo(float *restrict a, float *restrict b, int n){
	int i;
	for (i=0; i<n; i++){
		a[i] = foo(b[i]);
		}
}

[D:/simd] icl -nologo -c -Qrestrict -Qopt-report2 -Qopt-report-file=stderr -Qopt-report-phase=vec example2.c
example2.c

Begin optimization report for: vfoo(float *restrict, float *restrict, int)

    Report from: Vector optimizations [vec]

Non-optimizable loops:


LOOP BEGIN at C:\Users\test\run\example2.c(3,1)
   remark #15543: loop was not vectorized: loop with function call not considered an optimization candidate.
LOOP END

In such cases, you can add a vector declaration to the function to vectorize the loop, using either #pragma omp declare simd (as in the example below) or the __declspec(vector) (Windows) and __attribute__((vector)) (Linux) forms. Add the declaration to the function declaration, recompile both the caller and the callee code, and the loop and the function are vectorized.
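As a sketch of the attribute form (the example below uses the equivalent #pragma omp declare simd instead), the declaration is placed on the function prototype:

// Sketch: vector declaration on the prototype.
__declspec(vector) float foo(float);          // Windows
// __attribute__((vector)) float foo(float);  // Linux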

Loop with User-Defined Function with SIMD Declaration Is Vectorized

[D:/simd] cat example3.c 
// foo() and vfoo() do not have to be in the same compilation unit as long 
// as both see the same "#pragma omp declare simd" lines. 
#pragma omp declare simd
float foo(float); 
void vfoo(float *restrict a, float *restrict b, int n){
	int i; 
	for (i=0; i<n; i++) { a[i] = foo(b[i]); } 
} 

float foo(float x) { ... }

[D:/simd] icl -nologo -c -Qopenmp-simd -Qrestrict -Qopt-report1 -Qopt-report-file=stderr -Qopt-report-phase=vec example3.c
example3.c

Begin optimization report for: vfoo(float *restrict, float *restrict, int)

    Report from: Vector optimizations [vec]


LOOP BEGIN at C:\Users\test\run\example3.c(7,5)
<Peeled loop for vectorization>
LOOP END

LOOP BEGIN at C:\Users\test\run\example3.c(7,5)
   remark #15300: LOOP WAS VECTORIZED
LOOP END

LOOP BEGIN at C:\Users\test\run\example3.c(7,5)
<Alternate Alignment Vectorized Loop>
LOOP END

LOOP BEGIN at C:\Users\test\run\example3.c(7,5)
<Remainder loop for vectorization>
LOOP END
===========================================================================

Begin optimization report for: foo.._simdsimd3__xmm4nv(float)

    Report from: Vector optimizations [vec]

remark #15347: FUNCTION WAS VECTORIZED with xmm, simdlen=4, unmasked, formal parameter types: (vector)
===========================================================================

Begin optimization report for: foo.._simdsimd3__xmm4mv(float)

    Report from: Vector optimizations [vec]

remark #15347: FUNCTION WAS VECTORIZED with xmm, simdlen=4, masked, formal parameter types: (vector)
===========================================================================

Restrictions on Using a #pragma omp declare simd Declaration

Vectorization depends on two major factors: hardware and the style of source code. When using the vector declaration, the following features are not allowed:

  • Thread creation and joining through OpenMP parallel/for/sections/task/target/teams constructs, and explicit threading API calls.

  • Locks, barriers, atomic constructs, and critical sections (these are allowed inside #pragma omp ordered simd blocks).

  • Inline ASM code, VM, and Vector Intrinsics (for example, SVML intrinsics).

  • Using setjmp, longjmp, SEH (structured exception handling), and computed goto.

  • Exception handling (EH) is not allowed; all vector functions are considered noexcept.

  • The switch statement (in some cases this may be supported and converted to if statements, but this is not reliable).

  • The exit()/abort() calls.

Non-vector function calls are generally allowed within vector functions, but such calls are serialized lane by lane and so might perform poorly. Also, SIMD-enabled functions are not allowed to have side effects other than writes through their arguments. Non-vector function calls can violate this rule, so be careful when executing such calls in SIMD-enabled functions.
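For illustration, a SIMD-enabled function that follows these restrictions might look like the sketch below. The helper scale_one() and the function store_scaled() are hypothetical names; the call to the non-vector helper would be serialized lane by lane, and the only side effect is the write through the pointer argument.

float scale_one(float x);   // hypothetical non-vector helper: calls to it are serialized

#pragma omp declare simd uniform(out) linear(i:1)
void store_scaled(float *out, int i, float x) {
  out[i] = scale_one(x);    // only side effect: a write through the argument
}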

Formal parameters must be of one of the following data types (see the sketch after this list):

  • (un)signed 8, 16, 32, or 64-bit integer
  • 32- or 64-bit floating point
  • 64- or 128-bit complex
  • A pointer (C++ reference is considered a pointer data type)
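For example, a hypothetical SIMD-enabled prototype whose parameters are drawn from the allowed types might look like this:

// Sketch (hypothetical prototype): parameters drawn from the allowed types.
#pragma omp declare simd
double combine(int i32,              /* 32-bit signed integer */
               long long i64,        /* 64-bit signed integer */
               float f,              /* 32-bit floating point */
               double d,             /* 64-bit floating point */
               float _Complex c,     /* 64-bit complex        */
               const double *p);     /* pointer               */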