Developer Guide


Kernel Memory

If you declare a private array, local array, or local accessor in your source code, the Intel® oneAPI compiler creates a kernel memory in hardware. Kernel memory is sometimes referred to as local memory or on-chip memory because it is created from memory resources (such as RAM blocks) available on the FPGA. The following source code snippet illustrates both a kernel memory and a global memory and their accesses:
deviceQueue.submit([&](handler &cgh) {
  // creating accessor for device global memory from host-allocated buffer
  auto accessorA = bufferA.get_access<sycl_write>(cgh);
  cgh.single_task<class kernelCompute>([=]() {
    int T[SIZE]; // declaring a kernel memory
    for (unsigned i = 0; i < SIZE; i++) {
      T[i] = i; // writing to kernel memory
    }
    for (int i = 0; i < N; i+=2) {
      // reading from a kernel memory and writing to global memory
      // through its accessor
      accessorA[i] = T[i] + T[i+1];
    }
  });
});
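To check what the kernel should produce, it can help to have a host-side reference model. The following is a minimal sketch in plain C++ (no SYCL), assuming N is even and no larger than SIZE so that the read of T[i+1] stays in range. The function name reference_kernel and the parameters size and n are illustrative and do not appear in the guide.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side reference model of the kernel (plain C++, no SYCL).
// `size` plays the role of SIZE and `n` the role of N; both names are
// illustrative assumptions.
std::vector<int> reference_kernel(std::size_t size, std::size_t n) {
    // The kernel reads T[i + 1], so every index it touches must be in range.
    assert(n <= size && n % 2 == 0);

    std::vector<int> T(size);        // stands in for the kernel memory
    for (std::size_t i = 0; i < size; ++i) {
        T[i] = static_cast<int>(i);  // write to kernel memory
    }

    std::vector<int> A(n, 0);        // stands in for the global memory buffer
    for (std::size_t i = 0; i < n; i += 2) {
        A[i] = T[i] + T[i + 1];      // read kernel memory, write global memory
    }
    return A;  // odd indices stay 0 here; the real kernel leaves them unwritten
}
```

For example, reference_kernel(8, 8) writes 2*i + 1 at each even index i, which can be compared against the buffer contents after the kernel runs.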
The compiler performs the following to build a memory system:
  • Maps each array access to a load-store unit (LSU) in the datapath that transacts with the kernel memory through its ports.
  • Builds the kernel memory and LSUs and retains complete control over their structure.
  • Automatically optimizes the kernel memory geometry to maximize the bandwidth available to loads and stores in the datapath.
  • Attempts to guarantee that kernel memory accesses never stall.
These are discussed in detail in later sections of this guide.

Stallable and Stall-Free Memory Systems

Accesses to a memory (read or write) can be stall-free or stallable:
  • Stall-free: A memory access is stall-free if it has contention-free access to a memory port. This is illustrated in Figure 1. A memory system is stall-free if each of its memory operations has contention-free access to a memory port.
  • Stallable: A memory access is stallable if it does not have contention-free access to a memory port. When two datapath LSUs attempt to transact with a memory port in the same clock cycle, one of those memory accesses is delayed (or stalled) until the memory port in contention becomes available.
As much as possible, the Intel® oneAPI compiler attempts to create stall-free memory systems for your kernel.
A read or write is stall-free if it has contention-free access to a memory port, as shown in the following figure:
Figure 1. Examples of Stall-Free and Stallable Memory Systems
Figure 1 shows the following example memory systems:
  • A: A stall-free memory system. This memory system is stall-free because, even though the reads are scheduled in the same cycle, they are mapped to different ports. There is no contention for access to the memory system.
  • B: A stall-free memory system. This memory system is stall-free because the two reads are statically scheduled to occur in different clock cycles. The two reads can share a memory port without any contention for the read access.
  • C: A stallable memory system. This memory system is stallable because two reads are mapped to the same port in the same cycle. These reads require collision arbitration to manage their port access requests, and arbitration can reduce throughput.
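The distinction between the three example systems can be sketched as a small model in plain C++. This is an illustration under stated assumptions, not how the compiler analyzes memory systems: each access is reduced to a (cycle, port) pair, and the names Access, count_contended, is_stall_free, and the kSystem constants are hypothetical.

```cpp
#include <cassert>
#include <set>
#include <utility>
#include <vector>

// One scheduled memory access: the clock cycle in which it is issued and
// the memory port it is mapped to. Hypothetical model type, not a compiler API.
struct Access {
    int cycle;
    int port;
};

// Counts accesses that find their port already claimed in the same cycle.
// Simplification: a delayed access is only counted, not rescheduled.
int count_contended(const std::vector<Access>& accesses) {
    std::set<std::pair<int, int>> claimed;  // (cycle, port) pairs already in use
    int contended = 0;
    for (const Access& a : accesses) {
        assert(a.cycle >= 0 && a.port >= 0);
        const bool port_was_free =
            claimed.insert(std::make_pair(a.cycle, a.port)).second;
        if (!port_was_free) {
            ++contended;
        }
    }
    return contended;
}

// A memory system is stall-free when no access contends for a port.
bool is_stall_free(const std::vector<Access>& accesses) {
    return count_contended(accesses) == 0;
}

// The three example systems described above, each with two reads.
const std::vector<Access> kSystemA = {{0, 0}, {0, 1}};  // same cycle, different ports
const std::vector<Access> kSystemB = {{0, 0}, {1, 0}};  // different cycles, shared port
const std::vector<Access> kSystemC = {{0, 0}, {0, 0}};  // same cycle, same port
```

In this model, is_stall_free returns true for kSystemA and kSystemB and false for kSystemC, where one of the two reads has to wait for the port.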
A kernel memory system consists of the following parts:
  • Port: A memory port is a physical access point into a memory. A port is connected to one or more load-store units (LSUs) in the datapath, and an LSU can connect to one or more ports.
  • Bank: A memory bank is a division of the kernel memory system that contains a subset of the data stored. That is, all of the data stored for a kernel is split across banks, with each bank containing a unique piece of the stored data. A memory system always has at least one bank.
  • Replicate: A memory bank replicate is a copy of the data in the memory bank with its own ports. All replicates in a bank contain the same data. Each replicate can be accessed independently of the others. A memory bank always has at least one replicate.
  • Private copy: A private copy is a copy of the data in a replicate that is created for nested loops to enable concurrent iterations of the outer loop. A replicate can comprise multiple private copies, with each iteration of an outer loop having its own private copy. Because each outer loop iteration has its own private copy, private copies are not expected to contain the same data.
The following figure illustrates the relationship between banks, replicates, ports, and private copies:
Schematic Representation of Kernel Memories Showing the Relationship between Banks, Replicates, Ports, and Private Copies
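The relationship between banks and replicates can be sketched as a toy software model in plain C++. The sketch assumes that the bank is selected by address % num_banks; that mapping is chosen here for illustration only, and the guide does not say how the compiler distributes data across banks. The class name KernelMemoryModel and its methods are hypothetical, and private copies are not modeled.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model of a banked, replicated kernel memory (not a compiler API).
// Assumption: the bank is chosen by address % num_banks.
class KernelMemoryModel {
public:
    KernelMemoryModel(std::size_t num_banks, std::size_t num_replicates,
                      std::size_t words_per_bank)
        : num_banks_(num_banks),
          banks_(num_banks,
                 std::vector<std::vector<int>>(
                     num_replicates, std::vector<int>(words_per_bank, 0))) {
        // A memory system has at least one bank; a bank has at least one replicate.
        assert(num_banks >= 1 && num_replicates >= 1);
    }

    // A write updates every replicate of the selected bank, so all
    // replicates in a bank keep holding the same data.
    void write(std::size_t address, int value) {
        for (std::vector<int>& replicate : banks_.at(address % num_banks_)) {
            replicate.at(address / num_banks_) = value;
        }
    }

    // A read can be served by any single replicate of the selected bank.
    int read(std::size_t address, std::size_t replicate) const {
        return banks_.at(address % num_banks_)
            .at(replicate)
            .at(address / num_banks_);
    }

private:
    std::size_t num_banks_;
    // Indexed as banks_[bank][replicate][word]; each bank holds a unique
    // slice of the data.
    std::vector<std::vector<std::vector<int>>> banks_;
};
```

With two banks and two replicates, a call such as write(5, 42) lands in bank 1, and read(5, 0) and read(5, 1) both return 42 because each replicate received the write. Two reads served by different replicates use different ports, which is why replication can remove contention.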

Strategies That Enable Concurrent Stall-Free Memory Accesses

The compiler uses a variety of strategies to ensure that concurrent accesses are stall-free.
Despite the compiler’s best efforts, the kernel memory system can still be stallable. This might happen due to resource constraints or memory attributes defined in your source code. In that case, the compiler tries to minimize the hardware resources consumed by the arbitrated memory system.
