Developer Guide

Single Work-item Kernel Design Guidelines

If your DPC++ kernels contain loop structures, follow the Intel®-recommended guidelines to construct the kernels in a way that allows the Intel® oneAPI DPC++/C++ Compiler to analyze them effectively. Well-structured loops are particularly important because they help the compiler generate a pipeline-parallel datapath for loops.

Avoid Pointer Aliasing

If your DPC++ kernels have pointer arguments, you can improve the throughput of the design if the Intel® oneAPI DPC++/C++ Compiler can prove that those arguments never point to the same memory location. You can provide the compiler with this information about pointer arguments in kernels. For more information, refer to Ignoring Dependencies Between Accessor Arguments.
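As a host-side analogy only, the following sketch shows what a no-alias promise looks like in ordinary C++. The function name vector_add is hypothetical, and __restrict is a compiler extension (accepted by GCC, Clang, and MSVC), not standard C++ and not the kernel-level mechanism that the referenced section describes.

```cpp
#include <cstddef>

// Hypothetical helper, not taken from the guide.
// __restrict (a compiler extension) promises that a, b, and out never
// overlap, so loads from a and b cannot depend on stores to out.
void vector_add(const int* __restrict a, const int* __restrict b,
                int* __restrict out, std::size_t n) {
  for (std::size_t i = 0; i < n; i++) {
    out[i] = a[i] + b[i];
  }
}
```

The promise is the caller's responsibility: passing overlapping buffers to such a function results in undefined behavior.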

Construct "Well-Formed" Loops

A well-formed loop has an exit condition that compares against an integer bound and has a simple induction increment. Including well-formed loops in your kernel improves performance because the Intel® oneAPI DPC++/C++ Compiler can analyze these loops efficiently.
The following example is a well-formed loop:
for (i = 0; i < N; i++) {
  // statements
}
Well-formed nested loops also contribute to maximizing kernel performance.
The following example is a well-formed nested loop structure:
for (i = 0; i < N; i++) {
  // statements
  for (j = 0; j < M; j++) {
    // statements
  }
}
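A compilable version of the nested pattern might look like the following sketch. The sizes kN and kM and the function sum_grid are hypothetical names chosen for illustration; the point is only that each loop compares an integer induction variable against an integer bound and increments it by one.

```cpp
// Hypothetical sizes, chosen for illustration only.
constexpr int kN = 3;
constexpr int kM = 4;

// Both loops are well formed: integer bound, increment of one,
// and no other writes to i or j inside the loop bodies.
int sum_grid(const int (&grid)[kN][kM]) {
  int total = 0;
  for (int i = 0; i < kN; i++) {
    for (int j = 0; j < kM; j++) {
      total += grid[i][j];
    }
  }
  return total;
}
```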

Minimize Loop-Carried Dependencies

The following loop structure creates a loop-carried dependence because each loop iteration reads data written by the previous iteration:
for (int i = 1; i < N; i++) {
  A[i] = A[i - 1] + i;
}
As a result, each read operation cannot proceed until the write operation from the previous iteration completes. The presence of loop-carried dependencies decreases the extent of pipeline parallelism that the Intel® oneAPI DPC++/C++ Compiler can achieve, which reduces kernel performance.
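When the recurrence has a closed form, one way to remove the dependence is to compute each element directly from the loop index. The sketch below assumes that approach; the function names are hypothetical, and the rewrite applies only to this particular recurrence, where A[i] equals A[0] + i(i + 1)/2.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helpers. Both fill a[1..n-1] starting from a[0].

// Loop-carried memory dependence: iteration i loads a[i - 1],
// which iteration i - 1 stored.
void fill_dependent(std::vector<long>& a) {
  for (std::size_t i = 1; i < a.size(); i++) {
    a[i] = a[i - 1] + static_cast<long>(i);
  }
}

// Same values from the closed form a[0] + i * (i + 1) / 2.
// Each iteration depends only on a[0] and i, not on a previous store.
void fill_independent(std::vector<long>& a) {
  if (a.empty()) return;
  const long base = a[0];
  for (std::size_t i = 1; i < a.size(); i++) {
    const long k = static_cast<long>(i);
    a[i] = base + k * (k + 1) / 2;
  }
}
```

Not every recurrence can be rewritten this way. When it cannot, the dependence remains and the guidelines in this section help the compiler analyze it as precisely as possible.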
The Intel® oneAPI DPC++/C++ Compiler performs a static memory dependence analysis on loops to determine the extent of parallelism that it can achieve. If the compiler cannot resolve the dependencies at compilation time because of unknown variables or complex indexing expressions, it might assume a loop-carried dependence between two array accesses and, as a result, extract less pipeline parallelism.
To minimize loop-carried dependencies, follow these guidelines whenever possible:
  • Avoid pointer arithmetic.
    Compiler output is suboptimal when the kernel accesses arrays by dereferencing pointer values derived from arithmetic operations. For example, avoid accessing an array in the following manner:
    for (int i = 0; i < N; i++) {
      int t = *(A++);
      *A = t;
    }
  • Introduce simple, affine array indexes.
    Avoid the following types of complex array indexes because the Intel® oneAPI DPC++/C++ Compiler cannot analyze them effectively, which might lead to suboptimal compiler output:
    • Non-constants in array indexes. For example, A[K + i], where i is the loop index variable and K is an unknown variable.
    • Multiple index variables in the same subscript location. For example, A[i + 2 * j], where i and j are the loop index variables of a doubly nested loop.
The Intel® oneAPI DPC++/C++ Compiler can analyze the array index A[i][j] effectively because the index variables are in different subscripts.
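As an illustration of the first guideline, the pointer-arithmetic loop shown above can be expressed with a plain affine index. This is a minimal sketch with hypothetical function names; both versions still carry a dependence from one iteration to the next, but in the indexed version the accessed locations are visible directly in the subscripts.

```cpp
// Hypothetical helpers. a must hold at least n + 1 elements.

// Pointer-arithmetic form from the guideline (to be avoided).
void shift_copy_pointer(int* a, int n) {
  for (int i = 0; i < n; i++) {
    int t = *(a++);
    *a = t;
  }
}

// Equivalent form with a simple affine index.
void shift_copy_indexed(int* a, int n) {
  for (int i = 0; i < n; i++) {
    a[i + 1] = a[i];
  }
}
```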

Avoid Complex Loop Exit Conditions

The Intel® oneAPI DPC++/C++ Compiler evaluates exit conditions to determine if subsequent loop iterations can enter the loop pipeline. Occasionally, the Intel® oneAPI DPC++/C++ Compiler requires memory accesses or complex operations to evaluate the exit condition. In these cases, subsequent iterations cannot launch until the evaluation completes, which decreases the overall loop performance.
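One common way to simplify an exit condition is to read the bound once before the loop. The sketch below assumes the bound does not change while the loop runs; the function names are hypothetical.

```cpp
// Hypothetical helpers. Both sum the first *count elements of data.

// The exit test loads *count from memory on every iteration.
int sum_reload_bound(const int* data, const int* count) {
  int sum = 0;
  for (int i = 0; i < *count; i++) {
    sum += data[i];
  }
  return sum;
}

// The bound is loaded once, so the exit test is a comparison
// against a loop-invariant integer.
int sum_hoisted_bound(const int* data, const int* count) {
  const int n = *count;
  int sum = 0;
  for (int i = 0; i < n; i++) {
    sum += data[i];
  }
  return sum;
}
```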

Convert Nested Loops into a Single Loop

To maximize performance, combine nested loops into a single loop whenever possible. Restructuring nested loops into a single loop reduces hardware footprint and computational overhead between loop iterations.
The following code examples illustrate the conversion of a nested loop into a single loop:
Table 8. Conversion of a Nested Loop into a Single Loop
Nested Loop:
for (i = 0; i < N; i++) {
  // statements
  for (j = 0; j < M; j++) {
    // statements
  }
  // statements
}
Converted Single Loop:
for (i = 0; i < N*M; i++) {
  // statements
}
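The converted loop usually still needs the original row and column positions. One possible approach, sketched here with hypothetical names and sizes, is to maintain the two counters manually inside the single loop instead of deriving them with division and modulo.

```cpp
// Hypothetical sizes and helpers for illustration.
constexpr int kRows = 3;
constexpr int kCols = 4;

// Original nested form.
int weighted_sum_nested(const int (&grid)[kRows][kCols]) {
  int total = 0;
  for (int i = 0; i < kRows; i++) {
    for (int j = 0; j < kCols; j++) {
      total += grid[i][j] * (i + 1);
    }
  }
  return total;
}

// Single loop over kRows * kCols iterations. The counters i and j
// are updated by hand so that they match the nested version.
int weighted_sum_single(const int (&grid)[kRows][kCols]) {
  int total = 0;
  int i = 0;
  int j = 0;
  for (int k = 0; k < kRows * kCols; k++) {
    total += grid[i][j] * (i + 1);
    j++;
    if (j == kCols) {
      j = 0;
      i++;
    }
  }
  return total;
}
```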

Avoid Conditional Loops

To maximize performance, avoid declaring conditional loops. Conditional loops are tuples of loops that are declared within conditional statements such that one and only one of the loops is expected to be reached. These loops cannot be efficiently parallelized and result in a serialized implementation.
The following code examples illustrate the conversion of conditional loops to a more optimal implementation:
Conversion of a Conditional Loop to an Optimized Loop
Conditional Loops:
if (condition) {
  for (int i = 0; i < m; i++) {
    // statements
  }
} else {
  for (int i = 0; i < m; i++) {
    // statements
  }
}
Converted Loop:
for (int i = 0; i < m; i++) {
  if (condition) {
    // statements
  } else {
    // statements
  }
}
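A concrete instance of this conversion might look like the following sketch. The function names and the two loop bodies (doubling or incrementing each element) are hypothetical; the conversion is valid here because both loops have the same trip count and the condition does not change inside the loop.

```cpp
// Hypothetical helpers. Both produce the same result in data.

// Two loops, only one of which is ever reached.
void update_conditional_loops(int* data, int m, bool doubled) {
  if (doubled) {
    for (int i = 0; i < m; i++) {
      data[i] *= 2;
    }
  } else {
    for (int i = 0; i < m; i++) {
      data[i] += 1;
    }
  }
}

// One loop with the condition moved into the body.
void update_single_loop(int* data, int m, bool doubled) {
  for (int i = 0; i < m; i++) {
    if (doubled) {
      data[i] *= 2;
    } else {
      data[i] += 1;
    }
  }
}
```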

Declare Variables in the Deepest Scope Possible

To reduce the hardware resources necessary for implementing a variable, declare the variable immediately before its use in a loop. Declaring variables in the deepest scope possible minimizes data dependencies and hardware use because the Intel® oneAPI DPC++/C++ Compiler does not need to preserve the variable data across loops that do not use the variable.
Consider the following example:
int a[N];
for (int i = 0; i < m; ++i) {
  int b[N];
  for (int j = 0; j < n; ++j) {
    // statements
  }
}
The array a requires more resources to implement than the array b. To reduce hardware use, declare array a inside the outer loop, just outside the inner loop (as array b is), unless it is necessary to maintain the data through iterations of the outer loop.
Overwriting all values of a variable in the deepest scope possible also reduces the resources necessary to represent the variable.
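The following sketch contrasts the two placements for a scratch array that does not need to survive across outer-loop iterations. All names and sizes are hypothetical. The two functions return the same value; the difference is only in how long the array's contents must formally be kept alive.

```cpp
// Hypothetical sizes and helpers for illustration.
constexpr int kOuter = 3;
constexpr int kLen = 4;

// scratch is declared outside both loops, so its contents formally
// persist across every iteration of the outer loop.
int doubled_sum_wide_scope(const int (&in)[kOuter][kLen]) {
  int scratch[kLen];
  int total = 0;
  for (int i = 0; i < kOuter; ++i) {
    for (int j = 0; j < kLen; ++j) scratch[j] = 2 * in[i][j];
    for (int j = 0; j < kLen; ++j) total += scratch[j];
  }
  return total;
}

// scratch is declared inside the outer loop and fully overwritten in
// each iteration, so nothing has to be preserved between iterations.
int doubled_sum_deep_scope(const int (&in)[kOuter][kLen]) {
  int total = 0;
  for (int i = 0; i < kOuter; ++i) {
    int scratch[kLen];
    for (int j = 0; j < kLen; ++j) scratch[j] = 2 * in[i][j];
    for (int j = 0; j < kLen; ++j) total += scratch[j];
  }
  return total;
}
```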
