Developer Guide

Contents

Global Memory Accesses Optimization

The Intel® oneAPI DPC++/C++ Compiler uses SDRAM as global memory. By default, the compiler configures global memory in a burst-interleaved configuration: it interleaves global memory across each of the external memory banks.
In most circumstances, the default burst-interleaved configuration leads to the best load balancing between memory banks. However, in some cases, you might want to partition the banks manually as two non-interleaved (and contiguous) memory regions to achieve better load balancing.
The following figure illustrates the difference in memory mapping patterns between burst-interleaved and non-interleaved memory partitions:
Global Memory Partitions

Manual Partition of Global Memory

You can partition the memory manually so that each buffer occupies a different memory bank.
The default burst-interleaved configuration of the global memory prevents load imbalance by ensuring that memory accesses do not favor one external memory bank over another. However, you have the option to control the memory bandwidth across a group of buffers by partitioning your data manually.
The Intel® oneAPI DPC++/C++ Compiler cannot burst-interleave across different memory types. To manually partition a specific type of global memory, compile your DPC++ kernels with the -Xsno-interleaving=<global_memory_type> flag, which configures each bank of that memory type as a separate, non-interleaved bank.
If your kernel accesses two buffers of equal size in memory, you can distribute your data to both memory banks simultaneously regardless of dynamic scheduling between the loads. This optimization step might increase your apparent memory bandwidth.
If your kernel accesses heterogeneous global memory types, include the -Xsno-interleaving=<global_memory_type> option in the clang++ command once for each memory type that you want to partition manually.
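For example, a compile command that partitions two memory types might look like the following sketch. The memory-type names (DDR and QDR) are placeholders; substitute the names defined for your board:

```shell
# Hypothetical compile command; DDR and QDR are placeholder memory-type
# names taken from the board's board_spec.xml. Each -Xsno-interleaving
# option configures that memory type's banks as non-interleaved regions.
clang++ -fsycl -fintelfpga -Xshardware \
        -Xsno-interleaving=DDR \
        -Xsno-interleaving=QDR \
        kernel.cpp -o kernel.fpga
```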

Partitioning Buffers Across Different Memory Types (Heterogeneous Memory)

The board support package for your FPGA board can assemble a global memory space consisting of different memory technologies (for example, DRAM or SRAM). The board support package designates one such memory (which might consist of multiple interfaces) as the default memory. All buffers reside in this heterogeneous memory.
To use the heterogeneous memory, perform the following steps to modify the code in your source file:
  1. Determine the names of global memory types available on your FPGA board using one of the following methods:
    • Refer to the board vendor's documentation for information.
    • Identify the index of the global memory type in the board_spec.xml file of your Custom Platform. The index starts at 0 and follows the order in which the global memory types appear in board_spec.xml. For example, the first global memory type in the XML file has index 0, the second has index 1, and so on. For more information, refer to global_mem in the Intel® FPGA SDK for OpenCL™ Pro Edition Custom Platform Toolkit User Guide.
  2. To instruct the host to allocate a buffer to a specific global memory type, insert the buffer_location<index> attribute in the accessor's property list. For example:

     sycl::ONEAPI::accessor_property_list PL{sycl::INTEL::buffer_location<2>};
     sycl::accessor accessor(buffer, cgh, sycl::read_only, PL);
If you do not specify the buffer_location attribute, the host allocates the buffer to the default memory type automatically. To determine the default memory type, consult the documentation provided by your board vendor. Alternatively, in the board_spec.xml file of your Custom Platform, search for the memory type that is defined first or that has the attribute default=1 assigned to it. For more information, refer to the Intel® FPGA SDK for OpenCL™ Pro Edition Custom Platform Toolkit User Guide.
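Putting the steps above together, the following is a minimal sketch of a host program that places a buffer in a specific global memory type. It assumes a board whose board_spec.xml defines a memory type at index 2, uses the sycl::ONEAPI and sycl::INTEL namespaces shown in the accessor example above (older oneAPI releases; newer releases spell these differently), and uses illustrative buffer sizes and kernel names:

```cpp
// Sketch only: requires the Intel oneAPI DPC++/C++ Compiler and an FPGA
// board whose board_spec.xml defines a global memory type at index 2.
#include <CL/sycl.hpp>
#include <vector>

int main() {
  sycl::queue q;
  std::vector<int> data(1024, 0);
  sycl::buffer<int, 1> buf(data.data(), sycl::range<1>(1024));

  q.submit([&](sycl::handler &cgh) {
    // Request placement in the global memory type with index 2.
    sycl::ONEAPI::accessor_property_list PL{sycl::INTEL::buffer_location<2>};
    sycl::accessor acc(buf, cgh, sycl::read_write, PL);

    cgh.single_task<class init>([=]() {
      for (int i = 0; i < 1024; i++)
        acc[i] = i;
    });
  });
  q.wait();
  return 0;
}
```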

Contiguous Memory Accesses

The Intel® oneAPI DPC++/C++ Compiler attempts to dynamically coalesce accesses to adjacent memory locations to improve global memory efficiency. This is effective if consecutive work-items access consecutive memory locations in a given load or store operation. The same is true in a single_task invocation if consecutive loop iterations access consecutive memory locations.
Consider the following code example:
void sum(nd_item<1> it,
         accessor<int, access::mode::read, access::target::global_buffer> a,
         accessor<int, access::mode::read, access::target::global_buffer> b,
         accessor<int, access::mode::write, access::target::global_buffer> c) {
  int gid = it.get_global_id(0);
  c[gid] = a[gid] + b[gid];
}
The load operation from array a uses an index that is a direct function of the work-item global ID. By basing the array index on the work-item global ID, the Intel® oneAPI DPC++/C++ Compiler can ensure contiguous load operations. These load operations retrieve the data sequentially from the input array and send the read data to the pipeline as required. Contiguous store operations then store the elements of the result that exit the computation pipeline in sequential locations within global memory.
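The same optimization applies in a single_task kernel: when consecutive loop iterations access consecutive memory locations, the compiler can likewise infer contiguous loads and stores. A minimal sketch in the same style as the sum kernel above (the function name and parameter n are illustrative):

```cpp
// Sketch: inside a single_task kernel, the index i advances by 1 each
// iteration, so accesses to a, b, and c are contiguous and the compiler
// can infer contiguous load and store operations.
void sum_single_task(
    accessor<int, access::mode::read, access::target::global_buffer> a,
    accessor<int, access::mode::read, access::target::global_buffer> b,
    accessor<int, access::mode::write, access::target::global_buffer> c,
    int n) {
  for (int i = 0; i < n; i++)
    c[i] = a[i] + b[i];
}
```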
The following figure illustrates an example of the contiguous memory access optimization:
Contiguous Memory Access
Contiguous load and store operations improve memory access efficiency because they lead to increased access speeds and reduced hardware resource needs. The data travels in and out of the computational portion of the pipeline concurrently, allowing overlaps between computation and memory accesses. Where possible, use work-item IDs that index accesses to arrays in global memory to maximize memory bandwidth efficiency.

Static Memory Coalescing

Static memory coalescing is an Intel® oneAPI DPC++/C++ Compiler optimization step that merges contiguous accesses to global memory into a single wide access. A similar optimization is applied to on-chip memory.
The figure below shows a common case where kernel performance might benefit from static memory coalescing:
Static Memory Coalescing
Consider the following vectorized kernel:
void coalesced(
    accessor<int, access::mode::read, access::target::global_buffer> a,
    accessor<int, access::mode::write, access::target::global_buffer> b) {
  #pragma unroll
  for (int i = 0; i < 4; i++)
    b[i] = a[i];
}
The DPC++ kernel performs four load operations from buffer a that access consecutive locations in memory. Instead of performing four memory accesses to competing locations, the compiler coalesces the four loads into a single, wider vector load. This optimization reduces the number of accesses to a memory system and potentially leads to better memory access patterns.
Although the compiler performs static memory coalescing automatically, you should use wide vector loads and stores in your DPC++ code whenever possible to ensure efficient memory accesses.
To allow static memory coalescing, you must write your code in such a way that the compiler can identify a sequential access pattern during compilation. The original kernel code shown in the figure above can benefit from static memory coalescing because all indexes into buffers a and b increment with offsets that are known at compilation time. In contrast, the following code does not allow static memory coalescing to occur:
void not_coalesced(
    accessor<int, access::mode::read, access::target::global_buffer> a,
    accessor<int, access::mode::write, access::target::global_buffer> b,
    accessor<int, access::mode::read, access::target::global_buffer> offsets) {
  #pragma unroll
  for (int i = 0; i < 4; i++)
    b[i] = a[offsets[i]];
}
The value offsets[i] is unknown at compilation time. As a result, the Intel® oneAPI DPC++/C++ Compiler cannot statically coalesce the read accesses to buffer a.
For more information, refer to Local and Private Memory Accesses Optimization.

Product and Performance Information

1. Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804