Developer Guide

Contents

Unroll Loops

You can control the way the 
Intel® oneAPI
DPC++/C++
Compiler
translates DPC++ kernel descriptions to hardware resources. If your DPC++ kernel contains loop iterations, increase performance by unrolling the loop. Loop unrolling decreases the number of iterations that the
Intel® oneAPI
DPC++/C++
Compiler
executes at the expense of increased hardware resource consumption.
Consider the DPC++ code for a parallel application in which each work-item is responsible for computing the accumulation of four elements in an array:
1 queue.submit([&](handler &cgh) { 2 auto x = x_buf.get_access<access::mode::read>(cgh); 3 auto sum = sum_buf.get_access<access::mode::write>(cgh); 4 cgh.single_task<class unoptimzed>([=]() { 5 int accum = 0; 6 7 for (size_t i = 0; i < 4; i++) { 8 accum += x[i + get_global_id(0) * 4]; 9 } 10 11 sum[get_global_id(0)] = accum; 12 }); 13 });
Observe the following three main operations that occur in this kernel:
  • Load operations from input
    x
  • Accumulation
  • Store operations to output sum
The
Intel® oneAPI
DPC++/C++
Compiler
arranges these operations in a pipeline according to the data flow semantics of the DPC++ kernel code. For example, the
Intel® oneAPI
DPC++/C++
Compiler
implements loops by forwarding the results from the end of the pipeline to the top of the pipeline, depending on the loop exit condition.
The DPC++ kernel performs one loop iteration of each work-item per clock cycle. With sufficient hardware resources, you can increase kernel performance by unrolling the loop, which decreases the number of iterations that the kernel executes. To unroll a loop, add a 
#pragma unroll
 directive to the main loop, as shown in the following code example:
Loop unrolling significantly changes the structure of the compute unit that the
Intel® oneAPI
DPC++/C++
Compiler
creates.
1 queue.submit([&](handler &cgh) { 2 auto x = x_buf.get_access<access::mode::read>(cgh); 3 auto sum = sum_buf.get_access<access::mode::write>(cgh); 4 cgh.single_task<class unoptimzed>([=]() { 5 int accum = 0; 6 7 #pragma unroll 8 for (size_t i = 0; i < 4; i++) { 9 accum += x[i + get_global_id(0) * 4]; 10 } 11 12 sum[get_global_id(0)] = accum; 13 }); 14 });
In this example, the 
#pragma unroll
 directive causes the
Intel® oneAPI
DPC++/C++
Compiler
to unroll four iterations of the loop completely. To accomplish the unrolling, the
Intel® oneAPI
DPC++/C++
Compiler
expands the pipeline by tripling the number of addition operations and loading four times more data. With the removal of the loop, the compute unit assumes a feed-forward structure. As a result, the compute unit can store the 
sum
 elements in every clock cycle after the completion of the initial load operations and additions. The
Intel® oneAPI
DPC++/C++
Compiler
further optimizes this kernel by coalescing the four load operations so that the compute unit can load all necessary input data to calculate a result in one load operation.
Avoid nested looping structures. Instead, implement a large single loop or unroll inner loops by adding the 
#pragma unroll
 directive whenever possible.
For example, if you compile a kernel that has a heavily nested loop structure, wherein each loop includes a 
#pragma unroll
 directive, you might experience a long compilation time. The 
Intel® oneAPI
DPC++/C++
Compiler
might fail to meet scheduling because it cannot unroll this nested loop structure easily, resulting in a high II. In this case, the
Intel® oneAPI
DPC++/C++
Compiler
issues the following error message along with the line number of the outermost loop:
Kernel <function> exceeded the Max II. The Kernel's resource usage is estimated to be much larger than FPGA capacity. It will perform poorly even if it fits. Reduce resource utilization of the kernel by reducing loop unroll factors within it (if any) or otherwise reduce amount of computation within the kernel.
Unrolling the loop and coalescing load operations from global memory allow the hardware implementation of the kernel to perform more operations per clock cycle.
The
Intel® oneAPI
DPC++/C++
Compiler
might not be able to unroll a loop completely under the following circumstances:
  • You specify complete unrolling of a data-dependent loop with a very large number of iterations. Consequently, the hardware implementation of your kernel might not fit into the FPGA.
  • You specify complete unrolling and the loop bounds are not constants.
  • The loop consists of complex control flows (for example, a loop containing complex array indexes or exit conditions that are unknown at compilation time).
For the last two cases listed above, the
Intel® oneAPI
DPC++/C++
Compiler
issues the following warning:
Full unrolling of the loop is requested but the loop bounds cannot be determined. The loop is not unrolled.
To enable loop unrolling in these situations, specify the 
#pragma unroll <N>
 directive, where 
<N>
 is the unroll factor. The unroll factor limits the number of iterations that the
Intel® oneAPI
DPC++/C++
Compiler
unrolls. Refer to Single Work-item Kernel Design Guidelines for tips on constructing well-structured loops.

Product and Performance Information

1

Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.