Developer Guide

Refactor the Loop-Carried Data Dependency

Based on feedback from the optimization report, you can restructure your program to shorten the critical path of a pipelined loop. The key is to reduce the distance between the first use of a loop-updated variable in the loop body and its last definition within a single iteration.
Consider the following code:
constexpr int N = 128;
queue.submit([&](handler &cgh) {
  auto A = A_buf.get_access<access::mode::read>(cgh);
  auto B = B_buf.get_access<access::mode::read>(cgh);
  auto Result = Result_buf.get_access<access::mode::write>(cgh);
  cgh.single_task<class unoptimized>([=]() {
    int sum = 0;
    for (unsigned i = 0; i < N; i++) {
      for (unsigned j = 0; j < N; j++) {
        sum += A[i * N + j];
      }
      sum += B[i];
    }
    Result[0] = sum;
  });
});
  • The report indicates that the Intel® oneAPI DPC++/C++ Compiler successfully infers pipelined execution for the outer loop, and a new loop iteration launches every other cycle.
  • The "only a single loop iteration will execute" message in the first row of the Details pane indicates that the outer loop executes only one iteration at a time across the subloop because of the data dependency on the variable sum. This dependency exists because each outer loop iteration requires the final value of sum from the previous iteration before its inner loop can start executing. The serialization is enforced by a critical path that spans from the first use of sum in the body of loop i to the last definition of sum at the end of the body of loop j.
  • The second row of the report notifies you that the inner loop executes in a pipelined manner with no performance-limiting loop-carried dependencies.
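The serialization the report describes can be seen even in a plain C++ model of the kernel. The following is a minimal host-side sketch (standard C++ rather than SYCL, with a hypothetical function name and test data) of the same computation; note that sum is read at the top of every inner-loop iteration and written at the bottom, which is the loop-carried dependency chain the hardware must honor:

```cpp
#include <vector>

// Host-side model of the unoptimized kernel: the value of `sum`
// entering each inner loop depends on the final value from the
// previous outer iteration, so pipelined hardware cannot overlap
// outer iterations across the subloop.
int unoptimized_sum(const std::vector<int> &A,
                    const std::vector<int> &B, int N) {
  int sum = 0;
  for (int i = 0; i < N; i++) {
    for (int j = 0; j < N; j++) {
      sum += A[i * N + j]; // reads and writes `sum` every iteration
    }
    sum += B[i]; // extends the same dependency chain
  }
  return sum;
}
```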
For recommendations on how to structure your single work-item kernel, refer to Single Work-item Kernel Design Guidelines.
To optimize performance of this kernel, reduce the length of the critical path induced by the data dependency on variable sum so that the outer loop iterations do not execute serially across the subloop. Perform the following tasks to decouple the computations involving sum in the two loops:
  1. Define a local variable (for example, sum2) for use in the inner loop only.
  2. Use the local variable from Step 1 to accumulate the values of A[i*N + j] as the inner loop iterates.
  3. In the outer loop, use the variable sum to accumulate the values of B[i] and the value stored in the local variable.
The following code illustrates the restructured, optimized kernel:
constexpr int N = 128;
queue.submit([&](handler &cgh) {
  auto A = A_buf.get_access<access::mode::read>(cgh);
  auto B = B_buf.get_access<access::mode::read>(cgh);
  auto Result = Result_buf.get_access<access::mode::write>(cgh);
  cgh.single_task<class optimized>([=]() {
    int sum = 0;
    for (unsigned i = 0; i < N; i++) {
      // Step 1: Definition
      int sum2 = 0;
      // Step 2: Accumulation of array A values for one outer
      // loop iteration
      for (unsigned j = 0; j < N; j++) {
        sum2 += A[i * N + j];
      }
      // Step 3: Addition of array B value for an outer loop iteration
      sum += sum2;
      sum += B[i];
    }
    Result[0] = sum;
  });
});
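Functionally, the restructured kernel computes the same result as the original; only the dependency structure changes. A small host-side sketch (plain C++, with a hypothetical function name) makes this easy to check: sum is now read and written only once per outer iteration, while the inner loop carries its dependency on the short-lived local sum2 instead:

```cpp
#include <vector>

// Host-side model of the optimized kernel: `sum2` accumulates the
// inner loop locally, so the loop-carried path through `sum` is
// reduced to one read-modify-write per outer iteration.
int optimized_sum(const std::vector<int> &A,
                  const std::vector<int> &B, int N) {
  int sum = 0;
  for (int i = 0; i < N; i++) {
    int sum2 = 0; // Step 1: local accumulator for the inner loop
    for (int j = 0; j < N; j++) {
      sum2 += A[i * N + j]; // Step 2: inner-loop accumulation
    }
    sum += sum2; // Step 3: fold the inner-loop total into `sum` once
    sum += B[i];
  }
  return sum;
}
```

Comparing the two models on the same inputs confirms that the refactoring preserves semantics while decoupling the inner loop from sum.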

Product and Performance Information

1 Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804