Developer Guide and Reference

Programming Guidelines for Vectorization

The goal of including the vectorizer component in the Intel® C++ Compiler is to exploit single-instruction multiple data (SIMD) processing automatically. Users can help by supplying the compiler with additional information; for example, by using auto-vectorizer hints or pragmas.
Vectorization is enabled at default optimization levels for both Intel® microprocessors and non-Intel microprocessors. Vectorization may call library routines that can result in a greater performance gain on Intel® microprocessors than on non-Intel microprocessors.
Vectorization can also be affected by certain options, such as /arch (Windows*), -m (Linux* and macOS*), or [Q]x.

Guidelines to Vectorize Innermost Loops

Follow these guidelines to vectorize innermost loop bodies.
Use:
  • straight-line code (a single basic block)
  • vector data only; that is, arrays and invariant expressions on the right-hand side of assignments.
    Array references can appear on the left-hand side of assignments.
  • only assignment statements.
Avoid:
  • function calls (other than math library calls)
  • non-vectorizable operations (either because the loop cannot be vectorized, or because an operation is emulated through a number of instructions)
  • mixing vectorizable types in the same loop (leads to lower resource utilization)
  • data-dependent loop exit conditions (leads to loss of vectorization)
To make your code vectorizable, you will often need to make some changes to your loops. You should only make changes needed to enable vectorization, and avoid these common changes:
  • loop unrolling, which the compiler performs automatically
  • decomposing one loop with several statements in the body into several single-statement loops
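
As an illustration of the guidelines above, the following is a minimal sketch of an innermost loop written in the recommended shape. The function name scale_add and its parameters are hypothetical and do not come from the compiler documentation. Whether the compiler actually vectorizes the loop depends on the options and target architecture in use.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: computes a[i] = b[i] * scale + c[i] for each element.
// The loop body is a single basic block containing one assignment, it reads
// only arrays and a loop-invariant scalar, and the trip count n does not
// change while the loop runs.
void scale_add(float* a, const float* b, const float* c,
               float scale, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        a[i] = b[i] * scale + c[i];
    }
}
```

The loop is left as one loop and is not unrolled by hand, since the text above notes that the compiler performs unrolling automatically.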

Restrictions

There are a number of restrictions that you should consider. Vectorization depends on two major factors: hardware and style of source code.
  • Hardware: The compiler is limited by restrictions imposed by the underlying hardware. In the case of Intel® Streaming SIMD Extensions (Intel® SSE), the vector memory operations are limited to stride-1 accesses with a preference for 16-byte-aligned memory references. This means that even if the compiler abstractly recognizes a loop as vectorizable, it still might not vectorize it for a specific target architecture.
  • Style of source code: The style in which you write source code can inhibit vectorization. For example, a common problem with global pointers is that they often prevent the compiler from being able to prove that two memory references refer to distinct locations. Consequently, this prevents certain reordering transformations.
    Many stylistic issues that prevent automatic vectorization by compilers are found in loop structures. The ambiguity arises from the complexity of the keywords, operators, data references, pointer arithmetic, and memory operations within the loop bodies.
By understanding these limitations and by knowing how to interpret diagnostic messages, you can modify your program to overcome the known limitations and enable effective vectorization.
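
The pointer problem described above can be sketched with two versions of the same loop. This is an illustrative sketch only: the function names are hypothetical, and __restrict is a compiler extension (accepted by several C++ compilers, not part of standard C++) that is used here as one possible way to tell the compiler that two pointers never overlap.

```cpp
#include <cassert>
#include <cstddef>

// Without extra information the compiler must assume that dst and src may
// overlap, so a store to dst[i] could change a value read later through src.
void add_one_may_alias(float* dst, const float* src, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        dst[i] = src[i] + 1.0f;
    }
}

// __restrict (a non-standard extension, assumed to be available) is a promise
// made by the caller that dst and src refer to distinct memory. Breaking that
// promise results in undefined behavior.
void add_one_no_alias(float* __restrict dst, const float* __restrict src,
                      std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        dst[i] = src[i] + 1.0f;
    }
}
```

Both functions compute the same result when the arrays are distinct; the second one simply removes a possible dependency that the compiler would otherwise have to rule out itself.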

Guidelines for Writing Vectorizable Code

Follow these guidelines to write vectorizable code:
  • Use simple for loops. Avoid complex loop termination conditions; the upper iteration limit must be invariant within the loop. For the innermost loop in a nest of loops, you could set the upper iteration limit to be a function of the outer loop indices.
  • Write straight-line code. Avoid branches such as switch, goto, or return statements; most function calls; or if constructs that cannot be treated as masked assignments.
  • Avoid dependencies between loop iterations or, at the least, avoid read-after-write dependencies.
  • Prefer array notation to pointers.
    C programs in particular impose very few restrictions on the use of pointers; aliased pointers may lead to unexpected dependencies.
    Without help, the compiler often cannot tell whether it is safe to vectorize code containing pointers.
  • Wherever possible, use the loop index directly in array subscripts instead of incrementing a separate counter for use as an array address.
  • Access memory efficiently:
    • Favor inner loops with unit stride.
    • Minimize indirect addressing.
    • Align your data to 16-byte boundaries (for Intel® SSE instructions).
  • Choose a suitable data layout with care. Most multimedia extension instruction sets are rather sensitive to alignment. The data movement instructions of Intel® SSE, for example, operate much more efficiently on data that is aligned at a 16-byte boundary in memory. Therefore, the success of a vectorizing compiler also depends on its ability to select an appropriate data layout which, in combination with code restructuring (like loop peeling), results in aligned memory accesses throughout the program.
  • Use aligned data structures: Data structure alignment is the adjustment of any data object in relation to other objects.
    You can use the declaration __declspec(align).
    Use this hint with care. Incorrect use of aligned data movements results in an exception when using Intel® SSE.
  • Use structure of arrays (SoA) instead of array of structures (AoS): An array is the most common type of data structure that contains a contiguous collection of data items that can be accessed by an ordinal index. You can organize this data as an array of structures (AoS) or as a structure of arrays (SoA). While AoS organization is excellent for encapsulation it can be a hindrance for use of vector processing. To make vectorization of the resulting code more effective, you can also select appropriate data structures.
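
The difference between the two layouts in the last guideline can be sketched as follows. The type and function names (PointAoS, PointsSoA, scale_x_aos, scale_x_soa) are hypothetical and chosen only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Array of structures (AoS): the x, y and z of one point are adjacent, so a
// loop that touches only x walks memory with a stride of sizeof(PointAoS).
struct PointAoS {
    float x, y, z;
};

void scale_x_aos(std::vector<PointAoS>& pts, float s) {
    for (std::size_t i = 0; i < pts.size(); ++i) {
        pts[i].x *= s;
    }
}

// Structure of arrays (SoA): all x values are contiguous, so the same
// operation becomes a unit-stride loop over a single array.
struct PointsSoA {
    std::vector<float> x, y, z;
};

void scale_x_soa(PointsSoA& pts, float s) {
    for (std::size_t i = 0; i < pts.x.size(); ++i) {
        pts.x[i] *= s;
    }
}
```

Both functions produce the same values; only the memory access pattern differs, which ties back to the guideline to favor inner loops with unit stride.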

Dynamic Alignment Optimizations

Dynamic alignment optimizations can improve the performance of vectorized code, especially for long trip count loops. Disabling such optimizations can decrease performance, but it may improve bitwise reproducibility of results, factoring out data location from possible sources of discrepancy.
To enable or disable dynamic data alignment optimizations, specify the option /Qopt-dynamic-align[-] (Windows) or -q[no-]opt-dynamic-align (Linux).

Using Aligned Data Structures

Data structure alignment is the adjustment of any data object with relation to other objects. The Intel® C++ Compiler may align individual variables to start at certain addresses to speed up memory access. Misaligned memory accesses can incur large performance losses on certain target processors that do not support them in hardware.
Alignment is a property of a memory address, expressed as the numeric address modulo a power of two. In addition to its address, a single datum also has a size. A datum is called 'naturally aligned' if its address is aligned to its size; otherwise it is called 'misaligned'. For example, an 8-byte floating-point datum is naturally aligned if the address used to identify it is aligned to eight (8).
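
The definition above can be checked directly in code. The following is a small sketch, not part of the compiler documentation: is_aligned and aligned_buffer are hypothetical names, and the standard C++11 alignas specifier is used in place of any compiler-specific declaration. It assumes a 4-byte float.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Returns true if the address p is a multiple of `alignment`.
// `alignment` is assumed to be a power of two.
bool is_aligned(const void* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// alignas is standard C++11; it requests that the array start on a
// 16-byte boundary.
alignas(16) static float aligned_buffer[8];
```

With a 4-byte float, aligned_buffer and &aligned_buffer[4] are 16-byte aligned, while &aligned_buffer[1] is only 4-byte aligned.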
A data structure is a way of storing data in a computer so that it can be used efficiently. Often, a carefully chosen data structure allows a more efficient algorithm to be used. A well-designed data structure allows a variety of critical operations to be performed using as few resources (both execution time and memory space) as possible.
Example
struct MyData{ short Data1; short Data2; short Data3; };
In the example data structure above, if the type short is stored in two bytes of memory, then each member of the data structure is aligned to a boundary of two bytes. Data1 would be at offset 0, Data2 at offset 2, and Data3 at offset 4. The size of this structure is six bytes. The type of each member of the structure usually has a required alignment, meaning that it is aligned on a pre-determined boundary, unless you request otherwise. In cases where the compiler has made sub-optimal alignment decisions, you can use the declaration __declspec(align(base,offset)), where 0 <= offset < base and base is a power of two, to allocate a data structure at offset from a certain base.
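
The offsets and size stated above can be verified at compile time. This sketch assumes a platform where short occupies two bytes. MyDataAligned is a hypothetical variant that uses the standard alignas specifier to request a stricter alignment; it is shown as a portable stand-in and is not the __declspec form described above.

```cpp
#include <cassert>
#include <cstddef>

struct MyData {
    short Data1;
    short Data2;
    short Data3;
};

// Layout checks; these hold only under the assumption of a 2-byte short.
static_assert(sizeof(short) == 2, "this sketch assumes a 2-byte short");
static_assert(offsetof(MyData, Data1) == 0, "Data1 at offset 0");
static_assert(offsetof(MyData, Data2) == 2, "Data2 at offset 2");
static_assert(offsetof(MyData, Data3) == 4, "Data3 at offset 4");
static_assert(sizeof(MyData) == 6, "three 2-byte members, no padding");

// Hypothetical variant: the same members, with a requested 16-byte alignment.
// The size is padded up to a multiple of the alignment.
struct alignas(16) MyDataAligned {
    short Data1;
    short Data2;
    short Data3;
};

static_assert(alignof(MyDataAligned) == 16, "requested alignment");
static_assert(sizeof(MyDataAligned) == 16, "padded to the alignment");
```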
Consider, as an example, that most of the execution time of an application is spent in a loop of the following form:
Example