Developer Guide


Kernel Memory

If you declare a private array, local array, or local accessor in your source code, the Intel® oneAPI DPC++/C++ Compiler creates a kernel memory in hardware. Kernel memory is sometimes referred to as local memory or on-chip memory because it is created from memory resources (such as RAM blocks) available on the FPGA. The following source code snippet illustrates both a kernel memory and a global memory and their accesses:
    deviceQueue.submit([&](handler &cgh) {
      // creating accessor for device global memory from host-allocated buffer
      auto accessorA = bufferA.get_access<sycl_write>(cgh);
      cgh.single_task<class kernelCompute>([=]() {
        int T[SIZE]; // declaring a kernel memory
        for (unsigned i = 0; i < SIZE; i++) {
          T[i] = i; // writing to kernel memory
        }
        for (int i = 0; i < N; i += 2) {
          // reading from a kernel memory and writing to global memory
          // through its accessor
          accessorA[i] = T[i] + T[i+1];
        }
      });
    });
The compiler performs the following to build a memory system:
  • Builds a kernel memory from FPGA memory resources (such as block RAMs) and presents it to the datapath as a single memory.
  • Maps an array access to a load-store unit (LSU) in the datapath that transacts with the kernel memory through its ports.
  • Builds the kernel memory and LSUs and retains complete control over their structure.
  • Automatically optimizes the kernel memory geometry to maximize the bandwidth available to loads and stores in the datapath.
  • Attempts to guarantee that kernel memory accesses never stall.
These are discussed in detail in later sections of this guide.

Stallable and Stall-Free Memory Systems

Accesses to a memory (read or write) can be stall-free or stallable:
  • Stall-free memory access: A memory access is stall-free if it has contention-free access to a memory port. A memory system is stall-free if each of its memory operations has contention-free access to a memory port.
  • Stallable memory access: A memory access is stallable if it does not have contention-free access to a memory port. When two memory accesses want to access one memory port in the same clock cycle, one of them is delayed (or stalled) until the memory port in contention is available.
As much as possible, the Intel® oneAPI DPC++/C++ Compiler attempts to create stall-free memory systems for your kernel.
A read or write is stall-free if it has contention-free access to a memory port, as shown in the following figure:
Examples of Stall-free and Stallable Memory Systems
Figure 1 shows the following example memory systems:
  • A: A stall-free memory system
    This memory system is stall-free because, even though the reads are scheduled in the same cycle, they are mapped to different ports. There is no contention for access to the memory system.
  • B: A stall-free memory system
    This memory system is stall-free because the two reads are statically scheduled to occur in different clock cycles. The two reads can share a memory port without any contention for the read access.
  • C: A stallable memory system
    This memory system is stallable because two reads are mapped to the same port and occur in the same cycle. These reads require collision arbitration to manage their port access requests, and arbitration can reduce throughput.
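The three cases can be restated as a small check in software. The following is a minimal sketch, not a description of how the compiler schedules accesses: the `Access` struct, the `isStallFree` function, and the cycle and port numbers assigned to systems A, B, and C are all hypothetical and chosen only to mirror the descriptions above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model of one memory access: the clock cycle in which it is
// scheduled and the memory port it is mapped to.
struct Access {
    int cycle;
    int port;
};

// Returns true if no two accesses use the same port in the same cycle,
// that is, if every access has contention-free access to its port.
bool isStallFree(const std::vector<Access>& accesses) {
    for (std::size_t i = 0; i < accesses.size(); ++i) {
        // Cycle and port indices are assumed to be non-negative.
        assert(accesses[i].cycle >= 0 && accesses[i].port >= 0);
        for (std::size_t j = i + 1; j < accesses.size(); ++j) {
            if (accesses[i].cycle == accesses[j].cycle &&
                accesses[i].port == accesses[j].port) {
                return false;  // port contention: arbitration would be needed
            }
        }
    }
    return true;
}

// The three example systems, encoded with assumed cycle and port numbers.
const std::vector<Access> systemA = {{0, 0}, {0, 1}};  // same cycle, different ports
const std::vector<Access> systemB = {{0, 0}, {1, 0}};  // same port, different cycles
const std::vector<Access> systemC = {{0, 0}, {0, 0}};  // same port, same cycle
```

Under this simplified model, `isStallFree(systemA)` and `isStallFree(systemB)` return true, and `isStallFree(systemC)` returns false. Real hardware decides contention from the schedule the compiler generates, so treat the model as an aid to reading the figure only.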
A kernel memory system consists of the following parts:
  • Port: A memory port corresponds to either a read operation or a write operation in the datapath. A port is connected to one or more load-store units (LSUs). An LSU can be connected to multiple ports.
  • Bank: A memory bank is a division of the kernel memory system that contains a subset of the data stored. That is, all of the data stored for a kernel is split across banks, with each bank containing a unique piece of the stored data. A memory system always has at least one bank.
  • Replicate: A memory bank replicate is a copy of the data in the memory bank with its own ports. All replicates in a bank contain the same data. Each replicate can be accessed independently of the others. A memory bank always has at least one replicate.
  • Private copy: A private copy is a copy of the data in a replicate that is created for nested loops to enable concurrent iterations of the outer loop. A replicate can comprise multiple private copies, with each iteration of an outer loop having its own private copy. Because each outer loop iteration has its own private copy, private copies are not expected to contain the same data.
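To make the bank and replicate definitions concrete, the following is a minimal software sketch. It is not what the compiler generates. The `KernelMemoryModel` class and its member names are hypothetical, and the address-to-bank mapping (consecutive addresses interleaved across banks) is an assumption made for illustration. Private copies are not modeled.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model of a kernel memory split into banks, where each bank
// holds one or more replicates of its share of the data.
class KernelMemoryModel {
public:
    KernelMemoryModel(std::size_t numBanks, std::size_t numReplicates,
                      std::size_t wordsPerBank)
        : numBanks_(numBanks),
          wordsPerBank_(wordsPerBank),
          banks_(numBanks,
                 std::vector<std::vector<int>>(
                     numReplicates, std::vector<int>(wordsPerBank, 0))) {
        // A memory system has at least one bank; a bank has at least one replicate.
        assert(numBanks >= 1 && numReplicates >= 1);
    }

    // Assumed mapping: consecutive addresses are interleaved across banks,
    // so each bank holds a unique subset of the data.
    std::size_t bankOf(std::size_t address) const { return address % numBanks_; }

    // A write updates every replicate of the selected bank, so all
    // replicates in that bank keep the same data.
    void write(std::size_t address, int value) {
        const std::size_t offset = address / numBanks_;
        assert(offset < wordsPerBank_);
        for (auto& replicate : banks_[bankOf(address)]) {
            replicate[offset] = value;
        }
    }

    // A read can be served by any replicate of the selected bank.
    int read(std::size_t address, std::size_t replicate) const {
        const std::size_t offset = address / numBanks_;
        const auto& bank = banks_[bankOf(address)];
        assert(replicate < bank.size() && offset < wordsPerBank_);
        return bank[replicate][offset];
    }

private:
    std::size_t numBanks_;
    std::size_t wordsPerBank_;
    std::vector<std::vector<std::vector<int>>> banks_;  // [bank][replicate][word]
};
```

For example, with two banks and two replicates per bank, a write to address 5 lands in bank 1 and is then readable from either replicate of that bank, while bank 0 is untouched. Replication costs extra memory in exchange for extra read ports, which is why the model stores one full copy of the bank's data per replicate.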
The following figure illustrates the relationship between banks, replicates, ports, and private copies:
Schematic Representation of Kernel Memories Showing the Relationship between Banks, Replicates, Ports, and Private Copies

Strategies That Enable Concurrent Stall-Free Memory Accesses

The compiler uses a variety of strategies to ensure that concurrent accesses are stall-free, including organizing the kernel memory into the banks, replicates, and private copies described above.
Despite the compiler’s best efforts, the kernel memory system can still be stallable. This might happen due to resource constraints or memory attributes defined in your source code. In that case, the compiler tries to minimize the hardware resources consumed by the arbitrated memory system.
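Memory attributes are written as C++ attributes on the array declaration. The following is a minimal sketch, assuming the `[[intel::numbanks]]` and `[[intel::max_replicates]]` attributes from the Intel FPGA memory attribute set. The names `sumPairs` and `kSize` are hypothetical, and the function is a plain C++ stand-in for a kernel body rather than a complete SYCL kernel. A standard C++ compiler ignores these attributes (possibly with a warning), so the code runs anywhere, but only an FPGA compile would act on them.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kSize = 8;  // hypothetical array size

// Plain C++ stand-in for the body of a single_task kernel.
int sumPairs() {
    // The pairwise loop below assumes an even element count.
    assert(kSize % 2 == 0);

    // Assumed request: split the memory into two banks and allow at most
    // one replicate per bank. Outside an FPGA compile these attributes
    // have no effect.
    [[intel::numbanks(2)]] [[intel::max_replicates(1)]] int T[kSize];

    for (std::size_t i = 0; i < kSize; ++i) {
        T[i] = static_cast<int>(i);  // writing to the kernel memory
    }

    int sum = 0;
    for (std::size_t i = 0; i + 1 < kSize; i += 2) {
        // Two reads per iteration: an even index and the next odd index.
        sum += T[i] + T[i + 1];
    }
    return sum;
}
```

If the banks are selected by the lowest address bit, which is an assumption here, `T[i]` and `T[i + 1]` fall in different banks and the two reads may proceed in the same cycle without contention. Capping the replicate count limits the read ports the compiler can add, so a more restrictive combination of attributes might leave the memory system stallable.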
