Developer Guide

Global Memory Accesses Optimization

The Intel® oneAPI DPC++/C++ Compiler uses SDRAM as global memory. By default, the compiler configures global memory in a burst-interleaved configuration and interleaves it across each of the external memory banks.
In most circumstances, the default burst-interleaved configuration leads to the best load balancing between memory banks. However, in some cases, you might want to partition the banks manually as two non-interleaved (and contiguous) memory regions to achieve better load balancing.
The following figure illustrates the difference in memory mapping patterns between burst-interleaved and non-interleaved memory partitions:
Global Memory Partitions

Global Memory Bandwidth Use Calculation

To ensure that your kernel fully uses the global memory bandwidth listed in the board specification file, it is helpful to calculate the kernel's bandwidth use. The report.html file also displays the kernel bandwidth values in the global memory view of the System Viewer. The following formulas show how you can calculate this value on a per-LSU basis:
Formulas for Calculating Kernel Bandwidth Use
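The following is a rough sketch of those formulas, reconstructed from the variable descriptions in the table below rather than taken verbatim from the figure; confirm the exact expressions in the global memory view of the System Viewer in the report.html file:
\[
\begin{aligned}
BW_{LSU} &= \min(BW_1,\ BW_2,\ BW_3)\\
BW_1 &\approx KWIDTH \times f_{MAX}\\
BW_2 &\approx MWIDTH \times f_{MAX}\\
BW_3 &\approx MaxBandwidth \times \frac{NUM\_INTERLEAVING\_CHANNELS}{NUM\_CHANNELS}
\end{aligned}
\]
With the widths in bytes and fMAX in MHz, each term is in MB/s, matching the units of MaxBandwidth.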
The LSU bandwidth is the minimum of the three bottlenecks and is the value you need to calculate global memory bandwidth use. The remaining equations represent the three bottlenecks that can limit the LSU bandwidth. These formulas give the theoretical maximum bandwidth an LSU can consume, ignoring all other LSUs. The actual bandwidth depends on the LSU's access pattern and on how the interconnect arbitrates between all LSUs. To get an estimate of the overall bandwidth, a sum of the LSU bandwidths is available in the controller of the global memory view of the System Viewer.
The following table describes the variables used in the above equations:
Variables Used in Calculating Kernel Bandwidth
KWIDTH: Byte-width of the LSU on the kernel side. In the report.html file, it is referred to as WIDTH.
MWIDTH: Byte-width of the LSU facing the external memory. In the report.html file, it is referred to as <Memory Name>_Width.
FMAX: Clock speed of the kernel in MHz. In the report.html file, this is the design's clock speed.
MaxBandwidth: Maximum bandwidth (measured in MB/s) the global memory can achieve. You can find this value in the board_spec.xml file for the specific global memory.
NUM_CHANNELS: Number of interfaces the external memory has. You can find this value by counting the interfaces listed under that memory in the board_spec.xml file.
NUM_INTERLEAVING_CHANNELS: Number of channels when interleaving is enabled; otherwise, this value is 1.
BW1: Bottleneck at the kernel boundary. BW1 uses only kernel values, that is, values you can change by optimizing your design. If BW1 limits the overall bandwidth use, changing your design can relieve this bottleneck.
BW2: Bottleneck at the memory interface to the kernel. BW2 uses the size of the memory interface and the fMAX, so either improving the fMAX of your design or switching to a board with a wider memory interface can improve bandwidth use.
BW3: Bottleneck in the external memory. BW3 uses external memory properties exclusively. If BW3 limits your design, you have utilized the board bandwidth completely.
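As an illustration only, plug hypothetical values into the sketch above: KWIDTH = 32 bytes, MWIDTH = 64 bytes, FMAX = 300 MHz, MaxBandwidth = 38400 MB/s, NUM_CHANNELS = 2, and interleaving enabled (NUM_INTERLEAVING_CHANNELS = 2):
\[
\begin{aligned}
BW_1 &\approx 32 \times 300 = 9600\ \text{MB/s}\\
BW_2 &\approx 64 \times 300 = 19200\ \text{MB/s}\\
BW_3 &\approx 38400 \times \tfrac{2}{2} = 38400\ \text{MB/s}\\
BW_{LSU} &\approx \min(9600,\ 19200,\ 38400) = 9600\ \text{MB/s}
\end{aligned}
\]
In this hypothetical case, BW1 limits the LSU, so widening the kernel-side access (for example, through vectorization) is the change most likely to improve bandwidth use.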

Manual Partition of Global Memory

You can partition the memory manually so that each buffer occupies a different memory bank.
The default burst-interleaved configuration of the global memory prevents load imbalance by ensuring that memory accesses do not favor one external memory bank over another. However, you have the option to control the memory bandwidth across a group of buffers by partitioning your data manually.
The Intel® oneAPI DPC++/C++ Compiler cannot burst-interleave across different memory types. To manually partition a specific type of global memory, compile your DPC++ kernels with the -Xsno-interleaving=<global_memory_type> flag to configure each bank of that memory type as a non-interleaved bank.
If your kernel accesses two buffers of equal size in memory, you can distribute your data to both memory banks simultaneously regardless of dynamic scheduling between the loads. This optimization step might increase your apparent memory bandwidth.
If your kernel accesses heterogeneous global memory types, include the -Xsno-interleaving=<global_memory_type> option in the clang++ command for each memory type that you want to partition manually.
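For example, assuming a board whose board_spec.xml defines a global memory type named DDR (the memory type name, the source file name, and the exact compiler driver depend on your board and toolchain version), a hardware compile that disables interleaving for that memory might look like the following:
dpcpp -fintelfpga -Xshardware -Xsno-interleaving=DDR vadd.cpp -o vadd.fpga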

Partitioning Buffers Across Different Memory Types (Heterogeneous Memory)

The board support package for your FPGA board can assemble a global memory space consisting of different memory technologies (for example, DRAM or SRAM). The board support package designates one such memory (which might consist of multiple interfaces) as the default memory. Unless you specify otherwise, all buffers reside in this default memory.
To use the heterogeneous memory, perform the following steps to modify the code in your source file:
  1. Determine the names of the global memory types available on your FPGA board using one of the following methods:
    • Refer to the board vendor's documentation for information.
    • Identify the index of the global memory type in the board_spec.xml file of your Custom Platform. The index starts at 0 and follows the order in which the global memory types appear in the board_spec.xml file. For example, the first global memory type in the XML file has an index of 0, the second has an index of 1, and so on. For more information, refer to global_mem in the Intel® FPGA SDK for OpenCL™ Pro Edition Custom Platform Toolkit User Guide.
  2. To instruct the host to allocate a buffer to a specific global memory type, insert the buffer_location<index> property in the accessor's property list.
    For example:
    ONEAPI::accessor_property_list PL{INTEL::buffer_location<2>};
    accessor acc(buffer, cgh, read_only, PL);
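A slightly fuller sketch, assuming a board that exposes at least two global memory types (the indices 0 and 1, the buffer names, and the kernel name below are illustrative, and q, a_buf, b_buf, and N are assumed to be defined as in the other examples in this guide):
q.submit([&](handler &cgh) {
  // Hypothetical indices: they must match global memory entries in your board_spec.xml.
  ONEAPI::accessor_property_list PL_in{INTEL::buffer_location<0>};
  ONEAPI::accessor_property_list PL_out{INTEL::buffer_location<1>};
  accessor in_acc(a_buf, cgh, read_only, PL_in);
  accessor out_acc(b_buf, cgh, write_only, PL_out);
  // Copy from the memory at index 0 to the memory at index 1.
  cgh.single_task<class CopyBetweenMemories>([=]() {
    for (int i = 0; i < N; i++)
      out_acc[i] = in_acc[i];
  });
});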
If you do not specify the buffer_location property, the host allocates the buffer to the default memory type automatically. To determine the default memory type, consult the documentation provided by your board vendor. Alternatively, in the board_spec.xml file of your Custom Platform, search for the memory type that is defined first or has the attribute default=1 assigned to it. For more information, refer to the Intel® FPGA SDK for OpenCL™ Pro Edition Custom Platform Toolkit User Guide.
Streams support is limited if the target FPGA is used in an FPGA board with heterogeneous memory. With SYCL streams, all work items write to the same stream buffer in parallel, so atomics are used to prevent race conditions. Atomics support for boards that have heterogeneous memory is limited. For more information about atomics and streams, refer to the SYCL specifications available at https://www.khronos.org/registry/SYCL/specs/sycl-2020-provisional.pdf and https://www.khronos.org/registry/SYCL/specs/sycl-1.2.1.pdf.

Ignore Dependencies Between Accessor Arguments

You can direct the Intel® oneAPI DPC++/C++ Compiler to ignore dependencies between accessor arguments in a DPC++ kernel using one of the following methods. Both methods allow the compiler to analyze dependencies between kernel memory operations more accurately, which can result in higher performance.
Method 1: Add the [[intel::kernel_args_restrict]] Attribute to Your Kernel
To direct the Intel® oneAPI DPC++/C++ Compiler to ignore dependencies between accessor arguments in a DPC++ kernel, add the [[intel::kernel_args_restrict]] attribute to your kernel. You can apply the [[intel::kernel_args_restrict]] attribute at a more fine-grained level, to individual kernels in the source code.
Example
#include <CL/sycl/INTEL/fpga_extensions.hpp>
...
event event_restrict = device_queue.submit([&](handler& cgh) {
  // create accessors from global memory
  accessor in_accessor(in_buf, cgh, read_only);
  accessor out_accessor(out_buf, cgh, write_only);
  // run the task (note the use of the attribute here)
  cgh.single_task<KernelArgsRestrict>([=]() [[intel::kernel_args_restrict]] {
    for (int i = 0; i < N; i++) {
      out_accessor[i] = in_accessor[i];
    }
  });
});
This attribute is an assurance to the compiler that accessors in the kernel arguments (and accessors derived from them) never point to the same memory location as any other accessor. It is up to you to ensure that this property is true.
For additional information, refer to the FPGA tutorial sample "Kernel Args Restrict" listed in the Intel® oneAPI Samples Browser on Linux* or Intel® oneAPI Samples Browser on Windows*.
Method 2: Add the no_alias Property to an Accessor's Property List
The no_alias property notifies the Intel® oneAPI DPC++/C++ Compiler that all modifications to the memory locations accessed (directly or indirectly) by an accessor during kernel execution are made through that same accessor (directly or indirectly) and not through any other accessor or USM pointer in the kernel. This is an unchecked assertion by the programmer, and violating it results in undefined behavior. Effectively, applying no_alias to all accessors of a kernel is equivalent to applying the [[intel::kernel_args_restrict]] attribute to the kernel, unless the kernel uses USM. You cannot apply the no_alias property to a USM pointer.
Example
ONEAPI::accessor_property_list PL{ONEAPI::no_alias};
accessor acc(buffer, cgh, PL);

Contiguous Memory Accesses

The Intel® oneAPI DPC++/C++ Compiler attempts to dynamically coalesce accesses to adjacent memory locations to improve global memory efficiency. This is effective if consecutive work items access consecutive memory locations in a given load or store operation. The same is true in a single_task invocation if consecutive loop iterations access consecutive memory locations.
Consider the following code example:
q.submit([&](handler &cgh) {
  accessor a(a_buf, cgh, read_only);
  accessor b(b_buf, cgh, read_only);
  accessor c(c_buf, cgh, write_only, noinit);
  cgh.parallel_for<class SimpleVadd>(N, [=](id<1> ID) {
    c[ID] = a[ID] + b[ID];
  });
});
The load operation from the accessor a uses an index that is a direct function of the work-item global ID. By basing the accessor index on the work-item global ID, the Intel® oneAPI DPC++/C++ Compiler can ensure contiguous load operations. These load operations retrieve the data sequentially from the input array and send the read data to the pipeline as required. Contiguous store operations then store the elements of the result that exit the computation pipeline in sequential locations within global memory.
The following figure illustrates an example of the contiguous memory access optimization:
Contiguous Memory Access
Contiguous load and store operations improve memory access efficiency because they lead to increased access speeds and reduced hardware resource needs. The data travels in and out of the computational portion of the pipeline concurrently, allowing overlaps between computation and memory accesses. Where possible, use work-item IDs that index accesses to arrays in global memory to maximize memory bandwidth efficiency.
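For contrast, the following hypothetical kernel (not part of the original example set) indexes the load with a stride rather than directly with the work-item global ID. Consecutive work items then read non-consecutive locations, so the loads cannot be coalesced and memory efficiency drops, even though the stores remain contiguous:
q.submit([&](handler &cgh) {
  accessor a(a_buf, cgh, read_only);
  accessor c(c_buf, cgh, write_only, noinit);
  cgh.parallel_for<class StridedCopy>(N, [=](id<1> ID) {
    // Stride-8 read: work items 0, 1, 2, ... read elements 0, 8, 16, ...
    c[ID] = a[(ID * 8) % N];
  });
});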

Static Memory Coalescing

Static memory coalescing is an Intel® oneAPI DPC++/C++ Compiler optimization step that merges contiguous accesses to global memory into a single wide access. A similar optimization is applied to on-chip memory.
The figure below shows a common case where kernel performance might benefit from static memory coalescing:
Static Memory Coalescing
Consider the following vectorized kernel:
q.submit([&](handler &cgh) {
  accessor a(a_buf, cgh, read_only);
  accessor b(b_buf, cgh, write_only, noinit);
  cgh.single_task<class coalesced>([=]() {
    #pragma unroll
    for (int i = 0; i < 4; i++)
      b[i] = a[i];
  });
});
The DPC++ kernel performs four load operations from buffer a that access consecutive locations in memory. Instead of performing four memory accesses to competing locations, the compiler coalesces the four loads into a single, wider vector load. This optimization reduces the number of accesses to a memory system and potentially leads to better memory access patterns.
Although the compiler performs static memory coalescing automatically, you should use wide vector loads and stores in your DPC++ code whenever possible to ensure efficient memory accesses.
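One way to make the wide access explicit, as a sketch rather than a required pattern (the buffer names, kernel name, and element type are illustrative), is to operate on sycl::vec elements so that each load and store moves several values at once:
// Assumes a_buf and b_buf are buffers holding N float4 elements (16 bytes each),
// so every access below is already a wide, easily coalesced operation.
q.submit([&](handler &cgh) {
  accessor a(a_buf, cgh, read_only);
  accessor b(b_buf, cgh, write_only, noinit);
  cgh.single_task<class WideCopy>([=]() {
    for (int i = 0; i < N; i++)
      b[i] = a[i];   // one float4 load and one float4 store per iteration
  });
});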
To allow static memory coalescing, you must write your code in such a way that the compiler can identify a sequential access pattern during compilation. The original kernel code shown in the figure above can benefit from static memory coalescing because all indexes into buffers a and b increment with offsets that are known at compilation time. In contrast, the following code does not allow static memory coalescing to occur:
q.submit([&](handler &cgh) {
  accessor a(a_buf, cgh, read_only);
  accessor b(b_buf, cgh, write_only);
  accessor offsets(offsets_buf, cgh, read_only);
  cgh.single_task<class not_coalesced>([=]() {
    #pragma unroll
    for (int i = 0; i < 4; i++)
      b[i] = a[offsets[i]];
  });
});
The value offsets[i] is unknown at compilation time. As a result, the Intel® oneAPI DPC++/C++ Compiler cannot statically coalesce the read accesses to buffer a.
For more information, refer to Local and Private Memory Accesses Optimization.
