Developer Guide and Reference

SIMD-Enabled Functions

SIMD-enabled functions (formerly called elemental functions) are a general language construct to express a data parallel algorithm. A SIMD-enabled function is written as a regular C/C++ function, and the algorithm within describes the operation on one element, using scalar syntax. The function can then be called as a regular C/C++ function to operate on a single element or it can be called in a data parallel context to operate on many elements.
If you are using SIMD-enabled functions and need to link a compiler object file with an object file from a previous version of the compiler (for example, 13.1), you need to use the [Q]vecabi compiler option, specifying the legacy keyword.

How SIMD-Enabled Functions Work

When you write a SIMD-enabled function, the compiler generates short vector variants of the function that you requested, which can perform your function's operation on multiple arguments in a single invocation. The short vector variant may be able to perform multiple operations as fast as the regular implementation performs a single one by utilizing the vector instruction set architecture (ISA) in the CPU. When a call to a SIMD-enabled function occurs in a SIMD loop or another SIMD-enabled function, the compiler replaces the scalar call with the best fit from the available short-vector variants of the function.
In addition, when invoked from a pragma omp construct, the compiler may assign different copies of the SIMD-enabled function to different threads (or workers), executing them concurrently. The end result is that your data parallel operation executes on the CPU utilizing both the parallelism available in the multiple cores and the parallelism available in the vector ISA. In other words, if the short vector function is called inside a parallel loop or a vectorized auto-parallelized loop, you can achieve both vector-level and thread-level parallelism.
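For example, in the following sketch (a hypothetical function and data set, shown only as an illustration and compiled with the [q or Q]openmp-simd option) the scalar-looking call to vadd inside the SIMD loop is replaced by a short vector variant:
Example:
#include <stdio.h>

// Hypothetical SIMD-enabled function; the name and data are illustrative only.
#pragma omp declare simd
float vadd(float a, float b)
{
    return a + b;   // scalar code; the compiler also emits short vector variants
}

int main()
{
    float x[1024], y[1024], z[1024];
    for (int i = 0; i < 1024; i++) {
        x[i] = (float)i;
        y[i] = 2.0f * (float)i;
    }

    // Inside this SIMD loop the scalar call to vadd is replaced with the
    // best-fitting short vector variant.
    #pragma omp simd
    for (int i = 0; i < 1024; i++)
        z[i] = vadd(x[i], y[i]);

    printf("%f\n", z[10]);
    return 0;
}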

Declaring a SIMD-Enabled Function

In order for the compiler to generate the short vector function, use one of the following declaration syntaxes in your code:
Windows*:
Use the __declspec(vector (clauses)) declaration, as follows:
__declspec(vector (clauses)) return_type simd_enabled_function_name(parameters)
Linux* and macOS*:
Use the __attribute__((vector (clauses))) declaration, as follows:
__attribute__((vector (clauses))) return_type simd_enabled_function_name(parameters)
Alternately, you can use the following OpenMP* pragma, which requires the [q or Q]openmp or [q or Q]openmp-simd compiler option:
#pragma omp declare simd clauses
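For illustration, a simple SIMD-enabled function (hypothetical name and signature, shown here only as a sketch) could be declared with any of these forms:
Example:
// Windows*:
__declspec(vector)
float scale(float a, float b);

// Linux* and macOS*:
__attribute__((vector))
float scale(float a, float b);

// OpenMP* (any operating system, with the [q or Q]openmp or [q or Q]openmp-simd option):
#pragma omp declare simd
float scale(float a, float b);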
The clauses in the vector declaration may be used to achieve better performance by overriding defaults. The clauses on a SIMD-enabled function definition declare one or several short vector variants for that function. Multiple vector declarations with different sets of clauses may be attached to one function in order to declare multiple different short vector variants for it. (A combined example follows the clause descriptions below.)
The clauses are defined as follows:
processor(cpuid)
Tells the compiler to generate a vector variant using the instructions, the caller/callee interface, and the default vectorlength selection scheme suitable for the specified processor. Use of this clause is highly recommended, especially for processors with wider vector register support (that is, core_2nd_gen_avx and newer). cpuid takes one of the following values:
  • core_4th_gen_avx_tsx
  • core_4th_gen_avx
  • core_3rd_gen_avx
  • core_2nd_gen_avx
  • core_aes_pclmulqdq
  • core_i7_sse4_2
  • atom
  • core_2_duo_sse4_1
  • core_2_duo_ssse3
  • pentium_4_sse3
  • pentium_m
  • pentium_4
  • haswell
  • broadwell
  • skylake
  • skylake_avx512
  • knl
  • knm
vectorlength(n) / simdlen(n) (for omp declare simd)
Where n is a vector length that is a power of 2, no greater than 32.
The simdlen clause tells the compiler that each routine invocation at the call site should execute the computation equivalent to n times the scalar function execution. When omitted, the compiler selects the vector length automatically, depending on the routine return value, parameters, and/or the processor clause. When multiple vector variants are called from one vectorization context (for example, two different functions called from the same vector loop), explicit use of identical simdlen values is advised to achieve good performance.
linear(list_item[, list_item...])
Where list_item is one of: param[:step], val(param[:step]), ref(param[:step]), or uval(param[:step]).
The linear clause tells the compiler that for each consecutive invocation of the routine in a serial execution, the value of param is incremented by step, where param is a formal parameter of the specified function or the C++ keyword this. The linear clause can be used on parameters that are either scalar (non-arrays of non-structured types), pointers, or C++ references. step is a compile-time integer constant expression, which defaults to 1 if omitted.
If more than one step is specified for a particular parameter, a compile-time error occurs. Multiple linear clauses will be merged as a union.
The meaning of each variant of the clause is as follows:
linear(param[:step])
For parameters that are not C++ references: the clause tells the compiler that, on each iteration of the loop from which the routine is called, the value of the parameter will be incremented by step. The clause can also be used for C++ references for backward compatibility, but this is not recommended.
linear(val(param[:step]))
For parameters that are C++ references: the clause tells the compiler that, on each iteration of the loop from which the routine is called, the referenced value of the parameter will be incremented by step.
linear(uval(param[:step]))
For C++ references: means the same as linear(val()). The difference is that with linear(val()) a vector of references is passed to the vector variant of the routine, whereas with linear(uval()) only one reference is passed; linear(uval()) is therefore preferable in terms of performance.
linear(ref(param[:step]))
For C++ references: means that the reference itself is linear, that is, the referenced values (which form a vector for the calculation) are located sequentially, as in an array, with the distance between elements equal to step.
uniform(param[, param,]…)
Where param is a formal parameter of the specified function or the C++ keyword this.
The uniform clause tells the compiler that the values of the specified arguments can be broadcast to all iterations as a performance optimization. It is often useful for generating more favorable vector memory references. On the other hand, the lack of a uniform clause may allow broadcast operations to be hoisted out of the caller loop, so evaluate the performance implications carefully. Multiple uniform clauses are merged as a union.
mask / nomask
The mask and nomask clauses tell the compiler to generate only masked or unmasked (respectively) vector variants of the routine. When omitted, both masked and unmasked variants are generated. The masked variant is used when the routine is called conditionally.
inbranch / notinbranch
The inbranch and notinbranch clauses are used with #pragma omp declare simd. The inbranch clause works the same as the mask clause above, and the notinbranch clause works the same as the nomask clause above.
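The following sketch ties several of these clauses together. The function name, parameter names, and clause choices are hypothetical and chosen only for illustration; they are not a recommendation from this guide.
Example:
// Unmasked variant: scale is broadcast, the pointer x advances by one
// element per iteration, and each vector invocation covers 8 scalar calls.
#pragma omp declare simd uniform(scale) linear(x:1) simdlen(8) notinbranch
// Masked variant for conditional call sites.
#pragma omp declare simd uniform(scale) linear(x:1) simdlen(8) inbranch
float axpy(float scale, float *x, float y)
{
    return scale * (*x) + y;
}

// The same kind of variant can be requested with the vector attribute
// syntax (Linux*/macOS* form shown), including a processor clause:
__attribute__((vector(uniform(scale), linear(x:1), processor(core_2nd_gen_avx))))
float axpy2(float scale, float *x, float y);

When a vector loop calls axpy(s, &a[i], b[i]) unconditionally, the compiler selects the unmasked (notinbranch) variant; a call made under an if statement uses the masked (inbranch) variant.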
Write the code inside your function using existing C/C++ syntax and relevant built-in functions (see the section on __intel_simd_lane() below).

Usage of Vector Function Specifications

You may define several vector variants for one routine, with each variant reflecting a possible usage of the routine. Encountering a call, the compiler matches the vector variants against the actual parameter kinds and chooses the best match. Matching is done by priorities. In other words, if an actual parameter is loop invariant and the uniform clause was specified for the corresponding formal parameter, then the variant with the uniform clause has a higher priority. Linear specifications have the following order, from high priority to low: linear(uval()), linear(), linear(val()), linear(ref()). Consider the following example loops with calls to the same routine.
Example: OpenMP*
// routine prototype
#pragma omp declare simd                                           // universal but slowest definition matches the use in all three loops
#pragma omp declare simd linear(in1) linear(ref(in2)) uniform(mul) // matches the use in the first loop
#pragma omp declare simd linear(ref(in2))                          // matches the use in the second and the third loops
#pragma omp declare simd linear(ref(in2)) linear(mul)              // matches the use in the second loop
#pragma omp declare simd linear(val(in2:2))                        // matc