Developer Guide

Loop Fusion

Loop fusion is a compiler transformation in which two adjacent loops are merged into a single loop over the same index range. This transformation is typically applied to reduce loop overhead and improve run-time performance.
The following example shows the effects of fusing loops in a simple case:
Unfused Loops

    for (int i = 0; i < 10; i++) {
      for (int j = 0; j < 300; j++) {
        a[j] = localMem[j] + 3;
      }
      for (int k = 0; k < 300; k++) {
        b[k] = localMem[k] + 4;
      }
    }

Fused Loops

    for (int i = 0; i < 10; i++) {
      for (int j = 0; j < 300; j++) {
        int localMemVal = localMem[j];
        a[j] = localMemVal + 3;
        b[j] = localMemVal + 4;
      }
    }

Loop control structures represent a significant overhead. By fusing two loops, the number of control structures needed for the loops is reduced from two to one, reducing this overhead. The main goal of reducing the number of control structures is to save FPGA area for your design while still maintaining (ideally increasing) kernel throughput.
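
As a quick sanity check on the example above, the following host-side sketch runs both versions in plain C++ and compares their outputs. It is illustrative only: the function names (`unfused`, `fused`, `fusion_preserves_results`) and the use of `std::vector` are assumptions made for this sketch, and it says nothing about FPGA area or throughput.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Unfused version: two inner loops, each with its own loop control.
void unfused(const std::vector<int>& localMem, std::vector<int>& a,
             std::vector<int>& b) {
  assert(a.size() == localMem.size() && b.size() == localMem.size());
  for (int i = 0; i < 10; i++) {
    for (std::size_t j = 0; j < localMem.size(); j++) {
      a[j] = localMem[j] + 3;
    }
    for (std::size_t k = 0; k < localMem.size(); k++) {
      b[k] = localMem[k] + 4;
    }
  }
}

// Fused version: one inner loop that performs both assignments and
// reads localMem[j] only once per iteration.
void fused(const std::vector<int>& localMem, std::vector<int>& a,
           std::vector<int>& b) {
  assert(a.size() == localMem.size() && b.size() == localMem.size());
  for (int i = 0; i < 10; i++) {
    for (std::size_t j = 0; j < localMem.size(); j++) {
      int localMemVal = localMem[j];
      a[j] = localMemVal + 3;
      b[j] = localMemVal + 4;
    }
  }
}

// Hypothetical helper: returns true when both versions fill a and b
// with identical values for a 300-element input.
bool fusion_preserves_results() {
  std::vector<int> localMem(300);
  for (std::size_t j = 0; j < localMem.size(); j++) {
    localMem[j] = static_cast<int>(j);
  }
  std::vector<int> a1(300), b1(300), a2(300), b2(300);
  unfused(localMem, a1, b1);
  fused(localMem, a2, b2);
  return a1 == a2 && b1 == b2;
}
```

Because neither loop reads a value that the other loop writes, the two versions are interchangeable here, which is what makes this pair a straightforward fusion candidate.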
Fusing outer loops introduces concurrency where there was previously none. Combining the bodies of two adjacent loops (L_j and L_k) forms a single loop (L_f) with a loop body that spans the bodies of L_j and L_k. This combined loop body creates an opportunity for operations that were serialized across a given iteration of L_j and L_k to execute concurrently. In effect, the two loops now execute as one, reducing latency.
If inner loops are fused, concurrency is already achieved by pipelined execution of the outer loop iteration. In these cases, the concurrency effect of loop fusion is diminished.
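
The outer-loop case can be sketched as follows. This is a plain C++ illustration of the loop structure only: the row-sum workload, the `Matrix` alias, and all function names are assumptions chosen for the sketch, and any actual concurrency depends on how the FPGA compiler schedules the fused body.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<int>>;

// Two adjacent outer loops: every row sum of x is finished before the
// first row sum of y starts.
void row_sums_unfused(const Matrix& x, const Matrix& y,
                      std::vector<int>& sx, std::vector<int>& sy) {
  assert(x.size() == y.size());
  sx.assign(x.size(), 0);
  sy.assign(y.size(), 0);
  for (std::size_t i = 0; i < x.size(); i++) {  // first outer loop
    for (int v : x[i]) sx[i] += v;
  }
  for (std::size_t i = 0; i < y.size(); i++) {  // second outer loop
    for (int v : y[i]) sy[i] += v;
  }
}

// Fused outer loop: each iteration holds both inner loops, so row i of
// x and row i of y sit in the same loop body. The two inner loops are
// independent of each other.
void row_sums_fused(const Matrix& x, const Matrix& y,
                    std::vector<int>& sx, std::vector<int>& sy) {
  assert(x.size() == y.size());
  sx.assign(x.size(), 0);
  sy.assign(y.size(), 0);
  for (std::size_t i = 0; i < x.size(); i++) {  // fused outer loop
    for (int v : x[i]) sx[i] += v;
    for (int v : y[i]) sy[i] += v;
  }
}

// Hypothetical helper: checks that both versions agree on a small input.
bool outer_fusion_preserves_results() {
  Matrix x{{1, 2, 3}, {4, 5, 6}};
  Matrix y{{7, 8, 9}, {10, 11, 12}};
  std::vector<int> sx1, sy1, sx2, sy2;
  row_sums_unfused(x, y, sx1, sy1);
  row_sums_fused(x, y, sx2, sy2);
  return sx1 == sx2 && sy1 == sy2;
}
```

In the unfused form, work on `y` cannot begin until the first outer loop has finished. In the fused form, the two inner loops share one outer iteration, which is the structural change that gives a compiler room to overlap them.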

Fusion Criteria

The compiler considers the fusion of two loops (L_j and L_k) to be valid if the loops meet the following criteria:
  • Loops must be adjacent. That is, you cannot have a statement S_i with side effects such that S_i executes after L_j and before L_k.
  • Each loop must have a single entry point and a single exit point. For example, loops that contain break statements are not considered for fusion.
  • Loops must have no negative-distance dependencies. That is, for loops L_j and L_k where L_j is defined before L_k, iteration m of loop L_k does not depend on values calculated in iteration m+n (where n>0) of loop L_j.
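
The negative-distance criterion is the least intuitive of the three, so here is a minimal sketch of a loop pair that violates it. The function names and data values are assumptions made for illustration. The second loop's iteration k reads `a[k + 1]`, which the first loop computes in iteration k+1 (so n = 1).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Original program: the first loop runs to completion before the second
// loop starts, so every a[k + 1] read below sees its final value
// (except the last element, which the first loop never writes).
std::vector<int> run_unfused(std::size_t n) {
  assert(n > 0);
  std::vector<int> a(n + 1, 0), b(n, 0);
  for (std::size_t j = 0; j < n; j++) {
    a[j] = 2 * static_cast<int>(j + 1);
  }
  for (std::size_t k = 0; k < n; k++) {
    b[k] = a[k + 1] + 1;  // depends on iteration k + 1 of the first loop
  }
  return b;
}

// Naive fusion of the same two loops: iteration j now reads a[j + 1]
// before the fused loop has written it, so b receives stale values.
std::vector<int> run_naively_fused(std::size_t n) {
  assert(n > 0);
  std::vector<int> a(n + 1, 0), b(n, 0);
  for (std::size_t j = 0; j < n; j++) {
    a[j] = 2 * static_cast<int>(j + 1);
    b[j] = a[j + 1] + 1;  // a[j + 1] still holds its initial value 0
  }
  return b;
}
```

Calling both functions with the same `n` (for example, 8) yields different `b` vectors, which is why a loop pair with this dependence pattern does not meet the criterion.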

Automatic Loop Fusion

The Intel® oneAPI DPC++/C++ Compiler fuses loops with the same trip counts automatically if the compiler analysis of your kernel determines that fusing the loops is profitable.
In some situations, fusing loops is a valid transformation (based on the earlier criteria) but the compiler does not consider it profitable. Examples include the following:
  • One of the two loops, but not both, is annotated with the ivdep attribute.
  • One of the two loops, but not both, contains stall-free logic.
The Loop Analysis report indicates when loops were fused.
