Developer Guide


Improve Loop Performance by Caching On-Chip Memory

In DPC++ task kernels for FPGA, the main objective is to achieve an initiation interval (II) of 1 on performance-critical loops. This means that a new loop iteration is launched on every clock cycle, thereby maximizing the loop's throughput. When the loop contains a loop-carried variable implemented in on-chip memory, the Intel® oneAPI DPC++/C++ Compiler often cannot achieve II=1 because the memory access takes more than one clock cycle. If the updated memory location is needed on the next loop iteration, that iteration must be delayed to allow time for the update, hence II > 1.
The on-chip memory cache technique breaks this dependency by storing recently-accessed values in a cache capable of a one-cycle read-modify-write operation. The cache is implemented in FPGA registers rather than on-chip memory. By pulling memory accesses preferentially from the register cache, the loop-carried dependency is broken.

When is the On-chip Memory Cache Technique Applicable?

You can apply the on-chip memory cache technique in the following situations:
  • Failure to achieve II=1 because of a loop-carried memory dependency in on-chip memory
    The on-chip memory cache technique is applicable if the compiler could not pipeline a loop with II=1 because of an on-chip memory dependency. If the compiler could not achieve II=1 because of a global memory dependency, this technique does not apply as the access latencies are too great.
    To check this for a given design, view the Loop Analysis report in the design's optimization report. The Loop Analysis report lists the II of all loops and explains why a lower II is not achievable. Check whether the reason given resembles "the compiler failed to schedule this loop with smaller II due to memory dependency". The report also describes the "most critical loop feedback path during scheduling"; check whether this path includes on-chip memory load/store operations.
  • An II=1 loop with a load operation of latency 1
    The compiler can reduce the latency of on-chip memory accesses to achieve II=1. In doing so, it makes a trade-off: it sacrifices fMAX to improve the II.
    In a design whose critical loops have II=1 but whose fMAX is lower than desired, the on-chip memory cache technique might still be applicable. It can help recover fMAX by enabling the compiler to achieve II=1 with a higher-latency memory access. To check whether this is the case for a given design, view the Kernel Memory Viewer report in the design's optimization report. Select the desired on-chip memory from the Kernel Memory List and mouse over the load operation to check its latency. If the latency of the load operation is 1, this is a clear sign that the compiler has sacrificed fMAX to improve the loop II.

Implement the On-chip Memory Cache Technique

Consider the following FPGA design example, which demonstrates the technique using a program that computes a histogram. The histogram operation accepts an input vector of values, separates the values into buckets, and counts the number of values per bucket. For each input value, an output bucket is determined and the count for that bucket is incremented. Because the count is stored in on-chip memory, the increment requires reading from memory, performing the increment, and storing the result. This read-modify-write operation is the critical path that can result in II > 1.
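As a reference point, the problematic loop can be sketched in plain standard C++ (names such as NaiveHistogram and kNumBuckets are illustrative, not taken from the sample):

```cpp
#include <array>
#include <cstddef>
#include <vector>

constexpr std::size_t kNumBuckets = 64;  // illustrative bucket count

// In the FPGA kernel, `histogram` would live in on-chip memory. The
// load-increment-store performed on every iteration is the loop-carried
// dependency that can force II > 1.
std::array<unsigned, kNumBuckets> NaiveHistogram(
    const std::vector<unsigned>& input) {
  std::array<unsigned, kNumBuckets> histogram{};
  for (unsigned value : input) {
    std::size_t bucket = value % kNumBuckets;  // select the output bucket
    histogram[bucket] += 1;                    // read-modify-write
  }
  return histogram;
}
```

The loop is functionally correct everywhere; the issue arises only when it is compiled to FPGA hardware, where the memory read and write cannot both complete in one cycle.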
To reduce II, the idea is to store recently accessed values in an FPGA register-implemented cache that is capable of a one-cycle read-modify-write operation. If the memory location required on a given iteration exists in the cache, it is pulled from there. The updated count is written back to both the cache and the on-chip memory. The ivdep attribute is added to inform the compiler that if a loop-carried variable (namely, the variable storing the histogram output) is required within CACHE_DEPTH iterations, it is guaranteed to be available right away.
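A minimal plain-C++ model of the cached loop is sketched below. The shift-register cache and the depth constant mirror the structure used on FPGA, but all names are illustrative, and the [[intel::ivdep]] attribute appears only as a comment because it is meaningful only to the FPGA compiler:

```cpp
#include <array>
#include <cstddef>
#include <vector>

constexpr std::size_t kNumBuckets = 64;  // illustrative bucket count
constexpr int kCacheDepth = 5;           // must roughly cover the load latency

// Register-cache version of the histogram. On FPGA, the outer loop would
// carry [[intel::ivdep(kCacheDepth)]] to promise the compiler that any
// dependency on a histogram location spans at least kCacheDepth iterations.
std::array<unsigned, kNumBuckets> CachedHistogram(
    const std::vector<unsigned>& input) {
  std::array<unsigned, kNumBuckets> histogram{};

  // Cache of the last kCacheDepth + 1 updates. Implemented in FPGA
  // registers, it supports a one-cycle read-modify-write.
  std::array<unsigned, kCacheDepth + 1> cache_value{};
  std::array<std::size_t, kCacheDepth + 1> cache_tag{};
  cache_tag.fill(kNumBuckets);  // sentinel tag: entry invalid

  for (unsigned value : input) {
    std::size_t bucket = value % kNumBuckets;

    // Read the count, preferring the most recent cached copy (index 0,
    // checked last) over the possibly stale on-chip memory copy.
    unsigned count = histogram[bucket];
    for (int i = kCacheDepth; i >= 0; --i) {
      if (cache_tag[i] == bucket) count = cache_value[i];
    }
    ++count;

    // Write back to both the on-chip memory and the cache, shifting the
    // cache so the newest entry sits at index 0.
    histogram[bucket] = count;
    for (int i = kCacheDepth; i > 0; --i) {
      cache_value[i] = cache_value[i - 1];
      cache_tag[i] = cache_tag[i - 1];
    }
    cache_value[0] = count;
    cache_tag[0] = bucket;
  }
  return histogram;
}
```

In this sequential model the cache is functionally redundant, since the memory write lands immediately; on FPGA it is what breaks the loop-carried dependency while the slower on-chip memory write is still in flight.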

Select the Cache Depth

While any value of CACHE_DEPTH results in functional hardware, the ideal value of CACHE_DEPTH requires some experimentation. The depth of the cache must roughly cover the latency of the on-chip memory access. To determine the correct value, Intel® recommends starting with a value of 2 and then increasing it until both II = 1 and load latency > 1. In the histogram example, a CACHE_DEPTH of 5 is necessary. It is important to find the minimal value of CACHE_DEPTH that yields the maximal performance increase: unnecessarily large values of CACHE_DEPTH consume extra FPGA resources and can reduce fMAX. Therefore, once a CACHE_DEPTH achieves II = 1 with load latency > 1 and further increases to CACHE_DEPTH show no improvement, do not increase CACHE_DEPTH any further.
For additional information, refer to the FPGA tutorial sample "Onchip Memory Cache" listed in the Intel® oneAPI Samples Browser on Linux* or Windows*, or access the code sample on GitHub.
