Developer Guide

Transfer Loop-Carried Dependency to Local Memory

Loop-carried dependencies can adversely affect the loop initiation interval (II). Refer to Pipelining Across Multiple Work Items for details. For a loop-carried dependency that you cannot remove, you can improve the II by moving the array with the loop-carried dependency from global memory to local memory.
Consider the following example:
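    constexpr int N = 128;
    queue.submit([&](handler &cgh) {
      auto A = A_buf.get_access<access::mode::read_write>(cgh);
      cgh.single_task<class unoptimized>([=]() {
        // Each iteration writes an element that a later iteration reads.
        // The index N - 1 - i keeps every access inside an N-element array.
        for (unsigned i = 0; i < N; i++)
          A[N - 1 - i] = A[i];
      });
    });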
Global memory accesses have long latencies. In this example, the loop-carried dependency on the array A[i] causes long latency. The optimization report reflects this latency with an II of 185. To reduce the II value by transferring the loop-carried dependency from global memory to local memory, perform the following tasks:
  1. Copy the array with the loop-carried dependency to local memory. In this example, array A[i] becomes array B[i] in local memory.
  2. Execute the loop with the loop-carried dependency on array B[i].
  3. Copy the array back to global memory.
When you transfer array A[i] to local memory and it becomes array B[i], the loop-carried dependency is now on B[i]. Because local memory has a much lower latency than global memory, the II value improves.
The following is the restructured kernel, optimized:
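    constexpr int N = 128;
    queue.submit([&](handler &cgh) {
      auto A = A_buf.get_access<access::mode::read_write>(cgh);
      cgh.single_task<class optimized>([=]() {
        int B[N];

        // 1. Copy the array from global memory to local memory.
        for (unsigned i = 0; i < N; i++)
          B[i] = A[i];

        // 2. Run the loop with the loop-carried dependency on B.
        //    The index N - 1 - i keeps every access inside B.
        for (unsigned i = 0; i < N; i++)
          B[N - 1 - i] = B[i];

        // 3. Copy the result back to global memory.
        for (unsigned i = 0; i < N; i++)
          A[i] = B[i];
      });
    });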
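When restructuring a kernel this way, it can help to confirm on the host that the three-loop version produces the same result as the original loop. The following is a minimal sketch in plain C++ with no SYCL runtime. The function names run_in_place, run_with_local_copy, and make_input are hypothetical, and the sketch uses the index N - 1 - i so that every access stays within an N-element array.

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t N = 128;

// Reference behavior: the loop with the dependency, run directly on the array.
std::array<int, N> run_in_place(std::array<int, N> a) {
    for (std::size_t i = 0; i < N; i++)
        a[N - 1 - i] = a[i];
    return a;
}

// Restructured behavior: copy in, run the dependent loop on the copy, copy out.
std::array<int, N> run_with_local_copy(std::array<int, N> a) {
    int b[N];
    for (std::size_t i = 0; i < N; i++)
        b[i] = a[i];
    for (std::size_t i = 0; i < N; i++)
        b[N - 1 - i] = b[i];
    for (std::size_t i = 0; i < N; i++)
        a[i] = b[i];
    return a;
}

// Hypothetical test input: distinct values so that mismatches are visible.
std::array<int, N> make_input() {
    std::array<int, N> a{};
    for (std::size_t i = 0; i < N; i++)
        a[i] = static_cast<int>(i * 3 + 1);
    return a;
}
```

Comparing run_in_place(make_input()) with run_with_local_copy(make_input()) using operator== checks that the restructuring preserves the result. It says nothing about II, which only the optimization report can show.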

Product and Performance Information

Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.