Contents

# Relax Loop-Carried Dependency

## Example 1: Multiply Chain

Based on the feedback from the optimization report, you can relax a loop-carried dependency by allowing the compiler to infer a shift register and increase the dependence distance. Increase the dependence distance by increasing the number of loop iterations that occur between the generation of a loop-carried value and its use.
Consider the following code example:
``````constexpr int N = 128;
queue.submit([&](handler &cgh) {
accessor result(result_buf, cgh, write_only);
float mul = 1.0f;
for (int i = 0; i < N; i++)
mul *= A[i];

result = mul;
});
});``````
The optimization report shows that the
Intel® oneAPI
DPC++/C++
Compiler
infers pipelined execution for the loop successfully. However, the loop-carried dependency on the variable
mul
causes loop iterations to launch every six cycles. In this case, the floating-point multiplication operation on line 8 (that is,
mul *= A[i]
) contributes the largest delay to the computation of the variable
mul
.
To relax the loop-carried data dependency, instead of using a single variable to store the multiplication results, operate on
M
copies of the variable, and use one copy every
M
iterations:
1. Declare multiple copies of the variable
mul
(for example, in an array called
mul_copies
).
2. Initialize all copies of
mul_copies
.
3. Use the last copy in the array in the multiplication operation.
4. Perform a shift operation to pass the last value of the array back to the beginning of the shift register.
5. Reduce all copies to
mul
and write the final value to the result.
The following code illustrates the restructured kernel:
``````constexpr int N = 128;
constexpr int M = 8;
queue.submit([&](handler &cgh) {
accessor result(result_buf, cgh, write_only);
float mul = 1.0f;

// Step 1: Declare multiple copies of variable mul
float mul_copies[M];

// Step 2: Initialize all copies
for (int i = 0; i < M; i++)
mul_copies[i] = 1.0f;

for (int i = 0; i < N; i++) {
// Step 3: Perform multiplication on the last copy
float cur = mul_copies[M-1] * A[i];

// Step 4a: Shift copies
#pragma unroll
for (int j = M-1; j > 0; j--)
mul_copies[j] = mul_copies[j-1];

// Step 4b: Insert updated copy at the beginning
mul_copies = cur;
}

// Step 5: Perform reduction on copies
#pragma unroll
for (int i = 0; i < M; i++)
mul *= mul_copies[i];

result = mul;
});
});``````

## Example 2: Accumulator

Similar to Example 1, this example also applies the same technique and demonstrates how to write single work-item kernels that carry out double precision floating-point operations efficiently by inferring a shift register.
Consider the following example:
``````queue.submit([&](handler &cgh) {
accessor result(result_buf, cgh, write_only);
double temp_sum = 0;
for (int i = 0; i < *N; ++i)
temp_sum += arr[i];
result = temp_sum;
});
});``````
The kernel
unoptimized
is an accumulator that sums the elements of a double precision floating-point array
arr[i]
. For each loop iteration, the
Intel® oneAPI
DPC++/C++
Compiler
takes 10 cycles to compute the result of the addition and then stores it in the variable
temp_sum
. Each loop iteration requires the value of
temp_sum
from the previous loop iteration, which creates a data dependency on
temp_sum
.
To relax the data dependency, infer the array
arr[i]
as a shift register.
The following is the restructured kernel
optimized
:
``````//Shift register size must be statically determinable
constexpr int II_CYCLES = 12;
queue.submit([&](handler &cgh) {
accessor result(result_buf, cgh, write_only);
//Create shift register with II_CYCLE+1 elements
double shift_reg[II_CYCLES+1];

//Initialize all elements of the register to 0
for (int i = 0; i < II_CYCLES + 1; i++) {
shift_reg[i] = 0;
}

//Iterate through every element of input array
for(int i = 0; i < N; ++i){
//Load ith element into end of shift register
//if N > II_CYCLE, add to shift_reg to preserve values
shift_reg[II_CYCLES] = shift_reg + arr[i];

#pragma unroll
//Shift every element of shift register
for(int j = 0; j < II_CYCLES; ++j)
{
shift_reg[j] = shift_reg[j + 1];
}
}

//Sum every element of shift register
double temp_sum = 0;

#pragma unroll
for(int i = 0; i < II_CYCLES; ++i)
{
temp_sum += shift_reg[i];
}

result = temp_sum;
});
}); ``````

#### Product and Performance Information

1

Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.