Relax Loop-Carried Dependency
Example 1: Multiply Chain
Based on the feedback from the optimization report, you can relax a loop-carried dependency by allowing the compiler to infer a shift register and increase the dependence distance. Increase the dependence distance by increasing the number of loop iterations that occur between the generation of a loop-carried value and its use.
Consider the following code example:
1 constexpr int N = 128; 2 queue.submit([&](handler &cgh) { 3 auto A = A_buf.get_access<access::mode::read>(cgh); 4 auto result = result_buf.get_access<access::mode::write>(cgh); 5 cgh.single_task<class unoptimized>([=]() { 6 float mul = 1.0f; 7 for (unsigned i = 0; i < N; i++) 8 mul *= A[i]; 9 10 result[0] = mul; 11 }); 12 });
The optimization report
shows that the
Intel® oneAPI
infers pipelined execution for the loop successfully. However, the loop-carried dependency on the variable DPC++/C++
Compilermul
causes loop iterations to launch every six cycles. In this case, the floating-point multiplication operation on line 8 (that is, mul *= A[i]
) contributes the largest delay to the computation of the variable mul
.
To relax the loop-carried data dependency, instead of using a single variable to store the multiplication results, operate on
M
copies of the variable, and use one copy every M
iterations:
- Declare multiple copies of the variablemul(for example, in an array calledmul_copies).
- Initialize all copies ofmul_copies.
- Use the last copy in the array in the multiplication operation.
- Perform a shift operation to pass the last value of the array back to the beginning of the shift register.
- Reduce all copies tomuland write the final value to the result.
The following code illustrates the restructured kernel:
1 constexpr int N = 128; 2 constexpr int M = 8; 3 queue.submit([&](handler &cgh) { 4 auto A = A_buf.get_access<access::mode::read>(cgh); 5 auto result = result_buf.get_access<access::mode::write>(cgh); 6 cgh.single_task<class optimized>([=]() { 7 float mul = 1.0f; 8 9 // Step 1: Declare multiple copies of variable mul 10 float mul_copies[M]; 11 12 // Step 2: Initialize all copies 13 for (unsigned i = 0; i < M; i++) 14 mul_copies[i] = 1.0f; 15 16 for (unsigned i = 0; i < N; i++) { 17 // Step 3: Perform multiplication on the last copy 18 float cur = mul_copies[M-1] * A[i]; 19 20 // Step 4a: Shift copies 21 #pragma unroll 22 for (unsigned j = M-1; j > 0; j--) 23 mul_copies[j] = mul_copies[j-1]; 24 25 // Step 4b: Insert updated copy at the beginning 26 mul_copies[0] = cur; 27 } 28 29 // Step 5: Perform reduction on copies 30 #pragma unroll 31 for (unsigned i = 0; i < M; i++) 32 mul *= mul_copies[i]; 33 34 result[0] = mul; 35 }); 36 });
Example 2: Accumulator
Similar to Example 1, this example also applies the same technique and demonstrates how to write single work-item kernels that carry out double precision floating-point operations efficiently by inferring a shift register.
Consider the following example:
1 queue.submit([&](handler &cgh) { 2 auto arr = arr_buf.get_access<access::mode::read>(cgh); 3 auto result = result_buf.get_access<access::mode::write>(cgh); 4 auto N = N_buf.get_access<access::mode::read>(cgh); 5 cgh.single_task<class unoptimized>([=]() { 6 double temp_sum = 0; 7 8 for (int i = 0; i < *N; ++i) 9 temp_sum += arr[i]; 10 11 result[0] = temp_sum; 12 }); 13 });
The kernel
unoptimized
is an accumulator that sums the elements of a double precision floating-point array
arr[i]
. For each loop iteration, the
Intel® oneAPI
takes 10 cycles to compute the result of the addition and then stores it in the variable
DPC++/C++
Compilertemp_sum
. Each loop iteration requires the value of
temp_sum
from the previous loop iteration, which creates a data dependency on
temp_sum
.
To relax the data dependency, infer the array
arr[i]
as a shift register.
The following is the restructured kernel
optimized
:
1 //Shift register size must be statically determinable 2 constexpr int II_CYCLES = 12; 3 queue.submit([&](handler &cgh) { 4 auto arr = arr_buf.get_access<access::mode::read>(cgh); 5 auto result = result_buf.get_access<access::mode::write>(cgh); 6 auto N = N_buf.get_access<access::mode::read>(cgh); 7 cgh.single_task<class optimized>([=]() { 8 //Create shift register with II_CYCLE+1 elements 9 double shift_reg[II_CYCLES+1]; 10 11 //Initialize all elements of the register to 0 12 for (int i = 0; i < II_CYCLES + 1; i++) 13 { 14 shift_reg[i] = 0; 15 } 16 17 //Iterate through every element of input array 18 for(int i = 0; i < N; ++i) 19 { 20 //Load ith element into end of shift register 21 //if N > II_CYCLE, add to shift_reg[0] to preserve values 22 shift_reg[II_CYCLES] = shift_reg[0] + arr[i]; 23 24 #pragma unroll 25 //Shift every element of shift register 26 for(int j = 0; j < II_CYCLES; ++j) 27 { 28 shift_reg[j] = shift_reg[j + 1]; 29 } 30 } 31 32 //Sum every element of shift register 33 double temp_sum = 0; 34 35 #pragma unroll 36 for(int i = 0; i < II_CYCLES; ++i) 37 { 38 temp_sum += shift_reg[i]; 39 } 40 41 result[0] = temp_sum; 42 }); 43 });