Developer Guide

Transfer Loop-Carried Dependency to Local Memory

Loop-carried dependencies can adversely affect the loop initiation interval or II (refer to Pipelining Across Multiple Work Items). For a loop-carried dependency that you cannot remove, improve the II by moving the array with the loop-carried dependency from global memory to local memory.
Consider the following example:
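constexpr int N = 128;
queue.submit([&](handler &cgh) {
  accessor A(A_buf, cgh, read_write);
  cgh.single_task<class unoptimized>([=]() {
    // Every iteration reads and writes A in global memory, so each
    // iteration may depend on a store made by an earlier one.
    for (int i = 0; i < N; i++)
      A[N - 1 - i] = A[i];  // N - 1 - i stays in bounds if A_buf holds N elements
  });
});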
Global memory accesses have long latencies. In this example, the loop-carried dependency on the array A[i] causes long latency. The optimization report reflects this latency with an II of 185. To reduce the II value by transferring the loop-carried dependency from global memory to local memory, perform the following tasks:
  1. Copy the array with the loop-carried dependency to local memory. In this example, array A[i] becomes array B[i] in local memory.
  2. Execute the loop with the loop-carried dependency on array B[i].
  3. Copy the array back to global memory.
When you transfer array A[i] to local memory and it becomes array B[i], the loop-carried dependency is now on B[i]. Because local memory has a much lower latency than global memory, the II value improves.
The following is the restructured kernel, optimized:
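constexpr int N = 128;
queue.submit([&](handler &cgh) {
  accessor A(A_buf, cgh, read_write);
  cgh.single_task<class optimized>([=]() {
    int B[N];  // local copy of A

    // 1. Copy the array from global memory to local memory.
    for (int i = 0; i < N; i++)
      B[i] = A[i];

    // 2. Run the loop with the loop-carried dependency on B.
    for (int i = 0; i < N; i++)
      B[N - 1 - i] = B[i];  // N - 1 - i stays within the N elements of B

    // 3. Copy the result back to global memory.
    for (int i = 0; i < N; i++)
      A[i] = B[i];
  });
});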
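Before relying on a restructuring like this, it can help to confirm on the host that the three-step version produces the same values as the in-place loop. The following is a minimal plain C++ sketch of that check, with no SYCL involved. The names run_in_place, run_with_local_copy, make_input, and check_results_match are hypothetical, and the index N - 1 - i assumes the buffer holds exactly N elements, which this guide does not state.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t N = 128;

// Mirrors the original loop: reads and writes the same array in place.
std::array<int, N> run_in_place(std::array<int, N> a) {
  for (std::size_t i = 0; i < N; i++)
    a[N - 1 - i] = a[i];
  return a;
}

// Mirrors the restructured loop: copy in, compute on the copy, copy out.
std::array<int, N> run_with_local_copy(std::array<int, N> a) {
  int b[N];
  for (std::size_t i = 0; i < N; i++)
    b[i] = a[i];
  for (std::size_t i = 0; i < N; i++)
    b[N - 1 - i] = b[i];
  for (std::size_t i = 0; i < N; i++)
    a[i] = b[i];
  return a;
}

// Hypothetical test input: distinct values so that a wrong index shows up.
std::array<int, N> make_input() {
  std::array<int, N> a{};
  for (std::size_t i = 0; i < N; i++)
    a[i] = static_cast<int>(3 * i + 1);
  return a;
}

// Aborts (in a build with assertions enabled) if the two versions disagree.
void check_results_match() {
  const std::array<int, N> input = make_input();
  assert(run_in_place(input) == run_with_local_copy(input));
}
```

Calling check_results_match() from a host-side test is enough to exercise the comparison. It checks functional equivalence only and says nothing about the II that the compiler achieves.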
