Developer Guide

Contents

Relax Loop-Carried Dependency

Example 1: Multiply Chain

Based on the feedback from the optimization report, you can relax a loop-carried dependency by allowing the compiler to infer a shift register and increase the dependence distance. Increase the dependence distance by increasing the number of loop iterations that occur between the generation of a loop-carried value and its use.
Consider the following code example:
constexpr int N = 128; queue.submit([&](handler &cgh) { accessor A(A_buf, cgh, read_only); accessor result(result_buf, cgh, write_only); cgh.single_task<class unoptimized>([=]() { float mul = 1.0f; for (int i = 0; i < N; i++) mul *= A[i]; result[0] = mul; }); });
The optimization report shows that the 
Intel® oneAPI
DPC++/C++
Compiler
infers pipelined execution for the loop successfully. However, the loop-carried dependency on the variable 
mul
 causes loop iterations to launch every six cycles. In this case, the floating-point multiplication operation on line 8 (that is, 
mul *= A[i]
) contributes the largest delay to the computation of the variable 
mul
.
To relax the loop-carried data dependency, instead of using a single variable to store the multiplication results, operate on 
M
 copies of the variable, and use one copy every 
M
 iterations:
  1. Declare multiple copies of the variable 
    mul
     (for example, in an array called 
    mul_copies
    ).
  2. Initialize all copies of 
    mul_copies
    .
  3. Use the last copy in the array in the multiplication operation.
  4. Perform a shift operation to pass the last value of the array back to the beginning of the shift register.
  5. Reduce all copies to 
    mul
     and write the final value to the result.
The following code illustrates the restructured kernel:
constexpr int N = 128; constexpr int M = 8; queue.submit([&](handler &cgh) { accessor A(A_buf, cgh, read_only); accessor result(result_buf, cgh, write_only); cgh.single_task<class optimized>([=]() { float mul = 1.0f; // Step 1: Declare multiple copies of variable mul float mul_copies[M]; // Step 2: Initialize all copies for (int i = 0; i < M; i++) mul_copies[i] = 1.0f; for (int i = 0; i < N; i++) { // Step 3: Perform multiplication on the last copy float cur = mul_copies[M-1] * A[i]; // Step 4a: Shift copies #pragma unroll for (int j = M-1; j > 0; j--) mul_copies[j] = mul_copies[j-1]; // Step 4b: Insert updated copy at the beginning mul_copies[0] = cur; } // Step 5: Perform reduction on copies #pragma unroll for (int i = 0; i < M; i++) mul *= mul_copies[i]; result[0] = mul; }); });

Example 2: Accumulator

Similar to Example 1, this example also applies the same technique and demonstrates how to write single work-item kernels that carry out double precision floating-point operations efficiently by inferring a shift register.
Consider the following example:
queue.submit([&](handler &cgh) { accessor arr(arr_buf, cgh, read_only); accessor result(result_buf, cgh, write_only); accessor N(N_buf, cgh, read_only); cgh.single_task<class unoptimized>([=]() { double temp_sum = 0; for (int i = 0; i < *N; ++i) temp_sum += arr[i]; result[0] = temp_sum; }); });
The kernel
unoptimized
is an accumulator that sums the elements of a double precision floating-point array
arr[i]
. For each loop iteration, the
Intel® oneAPI
DPC++/C++
Compiler
takes 10 cycles to compute the result of the addition and then stores it in the variable
temp_sum
. Each loop iteration requires the value of
temp_sum
from the previous loop iteration, which creates a data dependency on
temp_sum
.
To relax the data dependency, infer the array
arr[i]
as a shift register.
The following is the restructured kernel
optimized
:
//Shift register size must be statically determinable constexpr int II_CYCLES = 12; queue.submit([&](handler &cgh) { accessor arr(arr_buf, cgh, read_only); accessor result(result_buf, cgh, write_only); accessor N(N_buf, cgh, read_only); cgh.single_task<class optimized>([=]() { //Create shift register with II_CYCLE+1 elements double shift_reg[II_CYCLES+1]; //Initialize all elements of the register to 0 for (int i = 0; i < II_CYCLES + 1; i++) { shift_reg[i] = 0; } //Iterate through every element of input array for(int i = 0; i < N; ++i){ //Load ith element into end of shift register //if N > II_CYCLE, add to shift_reg[0] to preserve values shift_reg[II_CYCLES] = shift_reg[0] + arr[i]; #pragma unroll //Shift every element of shift register for(int j = 0; j < II_CYCLES; ++j) { shift_reg[j] = shift_reg[j + 1]; } } //Sum every element of shift register double temp_sum = 0; #pragma unroll for(int i = 0; i < II_CYCLES; ++i) { temp_sum += shift_reg[i]; } result[0] = temp_sum; }); });

Product and Performance Information

1

Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.