Developer Guide and Reference

Using Automatic Vectorization

Automatic vectorization is supported on IA-32 and Intel® 64 architectures. The information below will guide you in setting up the auto-vectorizer.

Vectorization Speed-up

Where does the vectorization speedup come from? Consider the following sample code fragment, where a, b, and c are integer arrays:
Sample Code Fragment
for (i=0; i<=MAX; i++) c[i]=a[i]+b[i];
If vectorization is not enabled, that is, you compile using the O1 or -no-vec (or /Qvec-) option, the compiler processes each iteration separately, leaving most of each SIMD register unused even though each register could hold three additional integers. If vectorization is enabled (compiled using O2 or higher options), the compiler may use that otherwise unused register space to perform four additions in a single instruction. The compiler looks for vectorization opportunities whenever you compile at default optimization (O2) or higher.
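To make the "four additions in a single instruction" idea concrete, the following is a minimal hand-written sketch of roughly what a vectorized version of the loop could look like. It is not the code the compiler actually emits. The function names add_scalar and add_four_wide are hypothetical, the sketch assumes a 32-bit int, and it uses SSE2 intrinsics only when the compiler defines __SSE2__, falling back to the scalar loop otherwise.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 integer intrinsics */
#endif

/* Scalar version: one 32-bit addition per iteration. */
void add_scalar(const int *a, const int *b, int *c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* Hand-written illustration (hypothetical name) of a four-wide loop. */
void add_four_wide(const int *a, const int *b, int *c, size_t n)
{
    /* The four-lane layout below assumes a 32-bit int. */
    assert(sizeof(int) == 4);

    size_t i = 0;
#if defined(__SSE2__)
    /* Main loop: each 128-bit register holds four 32-bit integers,
       so one add instruction performs four additions. */
    for (; i + 4 <= n; i += 4) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(c + i), _mm_add_epi32(va, vb));
    }
#endif
    /* Remainder loop: handles the last n % 4 elements, or every
       element when SSE2 is not available. */
    for (; i < n; i++)
        c[i] = a[i] + b[i];
}
```

Both functions compute the same result. The second one shows the two pieces a vectorized loop typically needs: a main loop that works on full registers and a scalar remainder loop for the leftover elements.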
The O2 option enables vectorization at default optimization levels for both Intel® microprocessors and non-Intel microprocessors. Vectorization may call library routines that can result in a greater performance gain on Intel® microprocessors than on non-Intel microprocessors.
Vectorization can also be affected by certain options, such as /arch (Windows*), -m (Linux* and macOS*), or [Q]x.
To allow comparisons between vectorized and non-vectorized code, disable vectorization using the /Qvec- (Windows*) or -no-vec (Linux* or macOS*) option; enable vectorization using the O2 option.
To find out whether a loop was vectorized, enable generation of the optimization report using the /Qopt-report:1 /Qopt-report-phase:vec (Windows*) or -qopt-report=1 -qopt-report-phase=vec (Linux* and macOS*) options. These options generate a separate report in an *.optrpt file that includes optimization messages. In Visual Studio*, the program source is annotated with the report's messages, or you can read the resulting .optrpt file using a text editor. A message appears for every loop that is vectorized, such as:
Example: Vectorization Report
> icl /Qopt-report:1 /Qopt-report-phase:vec Multiply.c
Multiply.c(92): (col. 5) remark: LOOP WAS VECTORIZED.
The source line number (92 in the above example) refers to either the beginning or the end of the loop.
To get details about the type of loop transformations and optimizations that took place, use the [Q]opt-report-phase option by itself or along with the [Q]opt-report option.
To find out whether a loop was vectorized using the Visual Studio* IDE, select Project > Properties > C/C++ > Diagnostics, then set Optimization Diagnostic Level to Level 1 (/Qopt-report:1) and Optimization Diagnostic Phase to Loop Nest Optimization (/Qopt-report-phase:loop). To get a diagnostic message for every loop that was not vectorized, with a brief explanation of why the loop was not vectorized, select /Qopt-report-phase:vec.
How significant is the performance enhancement? To evaluate it yourself, run vec_samples:
  1. Open an Intel® Compiler command line window.
    • On Windows*: Under the Start menu item for your Intel product, select an icon under Compiler and Performance Libraries > Command Prompt with Intel Compiler.
    • On Linux* and macOS*: Source an environment script such as compilervars.sh or compilervars.csh in the <installdir>/bin directory and pass the argument appropriate for the architecture.
  2. Navigate to the <install-dir>\Samples\<locale>\C++\ directory. On Windows, unzip the sample project vec_samples.zip to a writable directory. This small application multiplies a vector by a matrix using the following loop:
    Example: Vector Matrix Multiplication
    for (j = 0; j < size2; j++) { b[i] += a[i][j] * x[j]; }
  3. Build and run the application, first without enabling auto-vectorization. The default O2 optimization enables vectorization, so you need to disable it with a separate option. Note the time taken for the application to run.
    Example: Building and Running an Application without Auto-vectorization
    // (Linux* and macOS*)
    icc -O2 -no-vec Multiply.c -o NoVectMult
    ./NoVectMult
    // (Windows*)
    icl /O2 /Qvec- Multiply.c /FeNoVectMult
    NoVectMult
  4. Now build and run the application, this time with auto-vectorization. Note the time taken for the application to run.
    Example: Building and Running an Application with Auto-vectorization
    // (Linux* and macOS*)
    icc -O2 -qopt-report=1 -qopt-report-phase=vec Multiply.c -o VectMult
    ./VectMult
    // (Windows*)
    icl /O2 /Qopt-report:1 /Qopt-report-phase:vec Multiply.c /FeVectMult
    VectMult
When you compare the timing of the two runs, you may see that the vectorized version runs faster. The non-vectorized version is only slightly faster than it would be if compiled with the O1 option.
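If the vec_samples project is not at hand, a similar comparison can be made with a small stand-in. The sketch below is not the Multiply.c source from the sample. The function names matvec and time_matvec, the flat row-major storage, and the fill values are all assumptions chosen for illustration. It times repeated matrix-vector products with the standard clock() function.

```c
#include <assert.h>
#include <stdlib.h>
#include <time.h>

/* b[i] = sum over j of a[i][j] * x[j], with a stored row-major
   in a flat array (hypothetical stand-in for the sample's loop). */
void matvec(int rows, int cols, const double *a, const double *x, double *b)
{
    for (int i = 0; i < rows; i++) {
        double sum = 0.0;
        for (int j = 0; j < cols; j++)      /* unit-stride inner loop */
            sum += a[(size_t)i * cols + j] * x[j];
        b[i] = sum;
    }
}

/* Runs matvec `repeats` times and returns the elapsed CPU time in
   seconds, or -1.0 if memory allocation fails. A running checksum
   is written to *checksum so that every pass has a consumed result. */
double time_matvec(int rows, int cols, int repeats, double *checksum)
{
    assert(rows > 0 && cols > 0 && repeats > 0 && checksum != NULL);

    double *a = malloc((size_t)rows * cols * sizeof *a);
    double *x = malloc((size_t)cols * sizeof *x);
    double *b = malloc((size_t)rows * sizeof *b);
    if (!a || !x || !b) {
        free(a); free(x); free(b);
        return -1.0;
    }
    for (size_t k = 0; k < (size_t)rows * cols; k++) a[k] = 1.0;
    for (int j = 0; j < cols; j++) x[j] = 2.0;

    *checksum = 0.0;
    clock_t start = clock();
    for (int r = 0; r < repeats; r++) {
        matvec(rows, cols, a, x, b);
        *checksum += b[r % rows];           /* consume one result per pass */
    }
    clock_t end = clock();

    free(a); free(x); free(b);
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Calling time_matvec from a small main and building the program twice, once with the options from step 3 and once with those from step 4, gives two timings to compare. The sizes and repeat count need to be large enough for the run to take a measurable amount of time.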

Obstacles to Vectorization

The following do not always prevent vectorization, but frequently either prevent it or cause the compiler to decide that vectorization would not be worthwhile.
  • Non-contiguous memory access:
    Four consecutive 32-bit integers or single-precision floating-point values, or two consecutive doubles, may be loaded directly from memory in a single SSE instruction. But if the four integers are not adjacent, they must be loaded separately using multiple instructions, which is considerably less efficient. The most common examples of non-contiguous memory access are loops with non-unit stride or with indirect addressing, as in the examples below. The compiler rarely vectorizes such loops, unless the amount of computational work is large compared to the overhead from non-contiguous memory access.
    Example: Non-contiguous Memory Access
    // arrays accessed with stride 2
    for (int i=0; i<SIZE; i+=2) b[i] += a[i] * x[i];
    // inner loop accesses a with stride SIZE
    for (int j=0; j<SIZE; j++) { for (int i=0; i<SIZE; i++) b[i] += a[i][j] * x[j]; }
    // indirect addressing of x using index array
    for (int i=0; i<SIZE; i+=2) b[i] += a[i] * x[index[i]];
    The typical message from the vectorization report is: vectorization possible but seems inefficient, although indirect addressing may also result in the following report: Existence of vector dependence.
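One common way to deal with the stride-SIZE case is to interchange the two loops so that the inner loop walks along a row of a. The guide does not prescribe this technique, so the sketch below is only an illustration. The function names, the element type double, and the value of SIZE are assumptions. In the interchanged version, b[i] becomes a sum reduction over j, and whether the compiler vectorizes it still depends on the options in effect.

```c
#include <assert.h>

#define SIZE 64   /* assumed size, for illustration only */

/* Loop order from the stride-SIZE example: the inner loop walks down
   a column of a, so consecutive iterations touch elements that are
   SIZE doubles apart in memory. */
void matvec_strided(double a[SIZE][SIZE], const double x[SIZE], double b[SIZE])
{
    for (int j = 0; j < SIZE; j++)
        for (int i = 0; i < SIZE; i++)
            b[i] += a[i][j] * x[j];
}

/* Interchanged order: the inner loop walks along a row of a, so both
   a[i][j] and x[j] are accessed with unit stride. */
void matvec_unit_stride(double a[SIZE][SIZE], const double x[SIZE], double b[SIZE])
{
    for (int i = 0; i < SIZE; i++)
        for (int j = 0; j < SIZE; j++)
            b[i] += a[i][j] * x[j];
}

/* Fills small integer-valued inputs, runs both loop orders, and
   checks that they produce the same result. */
void check_orders_agree(void)
{
    static double a[SIZE][SIZE];
    double x[SIZE], b1[SIZE] = {0}, b2[SIZE] = {0};

    for (int i = 0; i < SIZE; i++) {
        x[i] = i % 7;
        for (int j = 0; j < SIZE; j++)
            a[i][j] = (i + j) % 5;
    }
    matvec_strided(a, x, b1);
    matvec_unit_stride(a, x, b2);
    for (int i = 0; i < SIZE; i++)
        assert(b1[i] == b2[i]);
}
```

Both orders add the same products to each b[i] in the same sequence, so the results match exactly. Only the memory access pattern of the inner loop changes.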