Developer Guide

Pipelining

Similar to the implementation of a CPU with multiple pipeline stages, the compiler generates a deeply-pipelined hardware datapath. For more information, refer to Concepts of FPGA Hardware Design and How Source Code Becomes a Custom Hardware Datapath. Pipelining allows for many data items to be processed concurrently (in the same clock cycle) while making efficient use of the hardware in the datapath by keeping it occupied.
Example of Vectorization of the Datapath vs. Pipelining the Datapath
Consider the following example of code mapping to hardware:
Figure 1: Example Code Mapping to Hardware
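The figure itself is not reproduced here. As a stand-in, the following minimal C++ sketch (the names and operations are assumptions, not the figure's actual contents) shows straight-line code in which each operation becomes its own functional unit in the generated datapath, with registers between the units forming the pipeline stages:

    // Each operation maps to its own hardware unit; registers
    // between the units form the pipeline stages.
    int datapath(int a, int b, int c) {
      int product = a * b;    // stage 1: multiplier
      int sum = product + c;  // stage 2: adder fed by the multiplier
      return sum;             // stage 3: result register
    }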
A naïve approach to parallelizing this code on FPGAs would be to vectorize it by replicating the datapath or by explicitly writing instructions that operate on vector types, as illustrated in the following figure:
Figure 2: Vectorizing the Datapath Resulting in High Throughput but Low Occupancy
In the above diagram, t0, t1, t2, and so on represent different data items flowing through the datapath. However, this vectorization results in inefficient use of the hardware because many of the operations sit idle while other parts of the datapath operate on the data. An alternative approach is to increase the occupancy of the existing hardware by sending a new data item into the datapath on every clock cycle, as illustrated in the following figure:
Figure 3: Pipelining the Datapath Results in High Throughput and High Occupancy
In this example, both approaches achieve the same throughput of one data item per cycle. However, the pipelined implementation uses one third of the hardware and keeps that hardware utilized more effectively.
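As a hedged contrast of the two approaches (the function names and the four-wide vector width are illustrative assumptions, not values taken from the figures), explicit vectorization and pipelined scalar code can be written as follows:

    #include <sycl/sycl.hpp>

    // Vectorized: one wide operation consumes four data items per
    // cycle, but requires four copies of the underlying hardware.
    sycl::int4 vectorized(sycl::int4 a, sycl::int4 b, sycl::int4 c) {
      return a * b + c;
    }

    // Pipelined: a single copy of the hardware accepts a new data
    // item every clock cycle, keeping each unit occupied and
    // matching the vectorized throughput with a fraction of the area.
    int pipelined(int a, int b, int c) {
      return a * b + c;
    }

Both functions describe the same arithmetic; the difference lies in how the compiler is asked to lay down hardware for it.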
Understanding where the data you need to pipeline is coming from is key to achieving high performance designs on the FPGA. You can use the following sources of data to take advantage of pipelining:
  • Work items
  • Loop iterations

Pipelining Loops Within a Single Work Item

Within a single work item kernel, loops are the primary source of pipeline parallelism. When the Intel® oneAPI DPC++/C++ Compiler pipelines a loop, it attempts to schedule the loop's execution such that the next iteration of the loop enters the pipeline before the previous iteration has completed. This pipelining of loop iterations can lead to higher performance.
The number of clock cycles between iterations of the loop is called the Initiation Interval (II). For the highest performance, a new iteration of the loop starts every clock cycle, corresponding to an II of 1. Data dependencies that are carried from one iteration of the loop to the next can affect the ability to achieve an II of 1. These dependencies are called loop-carried dependencies. The II of the loop must be high enough to accommodate all loop-carried dependencies.
The II required to satisfy this constraint is a function of the fMAX of the design. If the fMAX is lower, the II might also be lower. Conversely, if the fMAX is higher, a higher II might be required. For example, if resolving a loop-carried dependency takes 10 ns of logic delay, it fits within a single clock cycle at 100 MHz (an II of 1), but at 300 MHz the same dependency spans three clock cycles (an II of 3).
The Intel® oneAPI DPC++/C++ Compiler automatically identifies these dependencies and builds hardware to resolve them while minimizing the II, subject to the target fMAX. Refer to Figure 4, where the source code describes a dependency between loop iterations (that is, accesses to array c) that can be satisfied within one clock cycle. This allows the loop to achieve an II of 1 and maximize performance.
Figure 4: Pipelining a Datapath with Loop Iteration
The dependency on the value stored to c in the previous iteration is resolved in a single clock cycle, so an II of 1 is achieved for the loop even though the iterations are not independent.
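The source in Figure 4 is not reproduced here; the following single work-item kernel is a minimal sketch of a loop with the same kind of dependency on array c (the buffer names, data type, and trip count are assumptions):

    #include <sycl/sycl.hpp>
    #include <vector>

    int main() {
      constexpr int N = 1024;
      std::vector<float> a(N, 1.0f), c(N, 0.0f);
      sycl::queue q;
      {
        sycl::buffer a_buf(a), c_buf(c);
        q.submit([&](sycl::handler &h) {
          sycl::accessor a_acc(a_buf, h, sycl::read_only);
          sycl::accessor c_acc(c_buf, h, sycl::read_write);
          h.single_task([=] {
            // Each iteration reads c[i - 1], which the previous
            // iteration wrote: a loop-carried dependency. If the
            // read, add, and write resolve within one clock cycle
            // at the target fMAX, the loop still achieves an II of 1.
            for (int i = 1; i < N; i++) {
              c_acc[i] = c_acc[i - 1] + a_acc[i];
            }
          });
        });
      } // buffers synchronize back to the host vectors here
      return 0;
    }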
In cases where the Intel® oneAPI DPC++/C++ Compiler cannot initially achieve an II of 1, you can choose from several optimization strategies:
  • Interleaving: When a loop nest contains an inner loop with an II greater than 1, the Intel® oneAPI DPC++/C++ Compiler can attempt to interleave iterations of the outer loop into iterations of the inner loop to better utilize the hardware resources and achieve higher throughput.
    Figure 5: Interleaving
  • Speculative Execution: In cases where the critical path affecting II is not a loop-carried dependency but the computation of the loop's exit condition, the Intel® oneAPI DPC++/C++ Compiler can attempt to relax this scheduling constraint by speculatively continuing to execute iterations of the loop while the exit condition is being computed. If the exit condition turns out to be satisfied, the effects of these extra iterations are suppressed. This can achieve a lower II and higher throughput, but it adds overhead between invocations of the loop equivalent to the number of speculated iterations; see the sketch after this list. A larger loop trip count helps to minimize this overhead.
    Figure 6: Speculative Execution
These optimizations are applied automatically by the Intel® oneAPI DPC++/C++ Compiler, and they can additionally be controlled through pragma statements and loop attributes in the design. For additional information, refer to speculated_iterations Attribute.
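As an illustration of attribute-based control, the following loop caps speculation with the speculated_iterations attribute. This is a sketch only; it assumes a single work-item loop like the one in the earlier example, and the value 2 is an arbitrary choice:

    h.single_task([=] {
      // Permit at most two speculated iterations for this loop.
      // Fewer speculated iterations reduce the overhead between
      // invocations of the loop, possibly at the cost of a higher II.
      [[intel::speculated_iterations(2)]]
      for (int i = 1; i < N; i++) {
        c_acc[i] = c_acc[i - 1] + a_acc[i];
      }
    });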

Pipelining Across Multiple Work Items

A range of work items represents a large set of independent data that can be processed in parallel. Since there are no implicit dependencies between work items, the next work item can always enter the kernel's datapath on every clock cycle, before previous work items have completed, unless there is a dynamic stall in the datapath. The following figure illustrates the pipeline in Figure 1 filled with work items:
Note: Loops are not pipelined for kernels that use more than one work item in the current version of the compiler. This restriction will be relaxed in a future release.
Figure 7: Pipelining a Datapath with DPC++ Work Items
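A minimal sketch of such a kernel follows (the buffer names and element count are assumptions). Because each work item touches only its own element, there is no loop-carried state, and a new work item can enter the datapath every cycle:

    #include <sycl/sycl.hpp>
    #include <vector>

    int main() {
      constexpr int N = 1024;
      std::vector<float> x(N, 2.0f), y(N, 3.0f), out(N);
      sycl::queue q;
      {
        sycl::buffer x_buf(x), y_buf(y), out_buf(out);
        q.submit([&](sycl::handler &h) {
          sycl::accessor xa(x_buf, h, sycl::read_only);
          sycl::accessor ya(y_buf, h, sycl::read_only);
          sycl::accessor oa(out_buf, h, sycl::write_only, sycl::no_init);
          // Work items are independent, so the pipeline can accept
          // a new one on every clock cycle.
          h.parallel_for(sycl::range<1>(N), [=](sycl::id<1> i) {
            oa[i] = xa[i] * ya[i];
          });
        });
      }
      return 0;
    }

Because the work items are independent, the compiler needs no special dependency analysis to keep this pipeline full.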
