Developer Guide

Load-Store Unit Styles

The Intel® oneAPI DPC++/C++ Compiler generates different styles of load-store units (LSUs) based on:
  • Inferred memory access pattern
  • Types of memory available on the target platform
  • Whether the memory accesses are to the local or global memory
The Intel® oneAPI DPC++/C++ Compiler can generate the following styles of LSUs:
  • Burst-coalesced load-store units
  • Prefetching load-store units
  • Pipelined load-store units

Burst-Coalesced Load-Store Units

A burst-coalesced LSU is the default LSU style instantiated by the compiler for accessing global memory. It buffers contiguous memory requests until the largest possible burst can be made. For noncontiguous memory requests, a burst-coalesced LSU flushes the buffer between requests.
A burst-coalesced LSU provides efficient, variable-latency access to global memory, but it requires a considerable amount of FPGA resources.
The following example code results in the compiler instantiating burst-coalesced LSUs:
    cgh.single_task<class Kernel>([=] {
      auto input_ptr = input_accessor.get_pointer();
      auto output_ptr = output_accessor.get_pointer();
      int x = input_ptr[RandomIndex];  // burst-coalesced
      output_ptr[0] = x;
    });
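To make the buffering behavior described above more concrete, the following is a minimal software sketch of burst coalescing. It is a toy model, not the hardware the compiler generates: the function name CountBursts and the max_burst_words parameter are hypothetical, and the model assumes that a request extends the current burst only when its address directly follows the previous one.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy software model of burst coalescing (NOT the generated hardware).
// Counts how many bursts are issued for a sequence of word addresses.
// max_burst_words is a hypothetical parameter: the largest burst, in words.
std::size_t CountBursts(const std::vector<std::uint64_t>& word_addresses,
                        std::size_t max_burst_words) {
  assert(max_burst_words > 0);
  std::size_t bursts = 0;
  std::size_t buffered = 0;         // requests currently held in the buffer
  std::uint64_t next_expected = 0;  // address that would extend the burst

  for (std::uint64_t addr : word_addresses) {
    const bool contiguous = buffered > 0 && addr == next_expected;
    // Flush when the request breaks contiguity or the buffer is full.
    if (buffered > 0 && (!contiguous || buffered == max_burst_words)) {
      ++bursts;
      buffered = 0;
    }
    ++buffered;
    next_expected = addr + 1;
  }
  if (buffered > 0) {
    ++bursts;  // flush whatever remains at the end
  }
  return bursts;
}
```

In this model, eight contiguous addresses with a maximum burst of four words produce two bursts, while four scattered addresses produce four bursts, one per request. This mirrors why a random index such as RandomIndex gains little from coalescing.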
Depending on the memory access pattern and other attributes, the compiler might modify a burst-coalesced LSU in the following ways:

Prefetching Load-Store Units

A prefetching LSU instantiates a FIFO that burst-reads large blocks of memory. It assumes contiguous reads and uses the previous address to keep the FIFO full of valid data. Noncontiguous reads are supported, but each one incurs a penalty to flush and refill the FIFO. A prefetching LSU is inferred only for nonvolatile global pointers.
The following example code results in the compiler instantiating prefetching LSUs to access global memory:
    cgh.single_task<class Kernel>([=] {
      auto input_ptr = input_accessor.get_pointer();
      auto output_ptr = output_accessor.get_pointer();
      int x = 1;
      for (int i = 0; i < VectorSize; i++) {
        x = x + input_ptr[i];  // prefetching
      }
      output_ptr[0] = x;
    });
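The flush-and-refill penalty for noncontiguous reads can be illustrated with a small software sketch. This is a toy model under stated assumptions, not a description of the actual hardware: the names CountPrefetchFlushes and EstimateReadCycles are hypothetical, the prefetcher is assumed to always expect "previous address + 1", and flush_penalty_cycles is an illustrative number rather than a measured figure.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy model only: counts FIFO flushes for a sequence of read addresses,
// assuming the prefetcher always expects "previous address + 1".
std::size_t CountPrefetchFlushes(const std::vector<std::uint64_t>& reads) {
  std::size_t flushes = 0;
  for (std::size_t i = 1; i < reads.size(); ++i) {
    if (reads[i] != reads[i - 1] + 1) {
      ++flushes;  // misprediction: flush and refill the FIFO
    }
  }
  // At most one flush per read after the first one.
  assert(reads.empty() || flushes <= reads.size() - 1);
  return flushes;
}

// Rough cost estimate: one cycle per read plus a fixed penalty per flush.
// flush_penalty_cycles is a hypothetical number, not a hardware figure.
std::uint64_t EstimateReadCycles(const std::vector<std::uint64_t>& reads,
                                 std::uint64_t flush_penalty_cycles) {
  return reads.size() + CountPrefetchFlushes(reads) * flush_penalty_cycles;
}
```

A sequential loop such as the one in the example above never triggers a flush in this model, whereas a single jump in the address stream adds one full penalty.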

Pipelined Load-Store Units

A pipelined LSU is used for accessing local memory. Memory requests are submitted immediately after they are received. Memory accesses are pipelined, so multiple requests can be in flight at a time. If there is no arbitration between the LSU and the local memory, a pipelined never-stall LSU is created.
The following example code results in the compiler instantiating a pipelined LSU to access local memory:
    cgh.single_task<class Kernel>([=] {
      auto input_ptr = input_accessor.get_pointer();
      auto output_ptr = output_accessor.get_pointer();
      const unsigned LMEM_SIZE = 128;
      int lmem[LMEM_SIZE];
      for (int i = 0; i < LMEM_SIZE; i++) {
        lmem[i] = i * 100;  // pipelined
      }
      output_ptr[0] = lmem[input_ptr[0]];
    });
The compiler might convert a local pipelined LSU into a never-stall LSU. For more details, refer to Never-stall.
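The benefit of keeping multiple requests in flight can be estimated with simple arithmetic. The sketch below is an idealized model, not a timing guarantee: the function names are hypothetical, the latency value is an assumption supplied by the caller, and the pipelined case assumes one new request can be submitted every cycle with no stalls.

```cpp
#include <cassert>
#include <cstdint>

// Cycles to complete n requests when each takes `latency` cycles and a new
// request can be submitted every cycle (idealized: fully pipelined, no stalls).
std::uint64_t PipelinedCycles(std::uint64_t n, std::uint64_t latency) {
  assert(latency > 0);
  return n == 0 ? 0 : latency + (n - 1);
}

// Same workload when each request must finish before the next one starts.
std::uint64_t SerializedCycles(std::uint64_t n, std::uint64_t latency) {
  assert(latency > 0);
  return n * latency;
}
```

For example, with 128 requests (the LMEM_SIZE used above) and an assumed latency of 3 cycles, the pipelined model gives 3 + 127 = 130 cycles, compared with 128 × 3 = 384 cycles when requests are serialized.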
The Intel® oneAPI DPC++/C++ Compiler may also infer a pipelined LSU for global memory accesses that can be proven to be infrequent. The compiler uses a pipelined LSU for such accesses because a pipelined LSU is smaller than other LSU styles. While a pipelined LSU might have lower throughput, this throughput tradeoff is acceptable because memory accesses are infrequent.
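The selection rules described in this section can be summarized as a small decision function. This is only a sketch of the rules as stated here, not the compiler's actual heuristics: every type and function name below is hypothetical, the restriction of prefetching to loads is inferred from the text's references to reads, and the order in which the checks are applied is an assumption.

```cpp
#include <cassert>

enum class MemorySpace { kLocal, kGlobal };
enum class LsuStyle { kBurstCoalesced, kPrefetching, kPipelined };

// Hypothetical summary of what the compiler is assumed to know about an access.
struct AccessInfo {
  MemorySpace space;
  bool is_load;      // read access
  bool contiguous;   // inferred access pattern is sequential
  bool is_volatile;  // pointer is declared volatile
  bool infrequent;   // access can be proven to be infrequent
};

// Sketch of the rules in this section; check order is an assumption.
LsuStyle ChooseLsuStyle(const AccessInfo& a) {
  if (a.space == MemorySpace::kLocal) return LsuStyle::kPipelined;
  if (a.infrequent) return LsuStyle::kPipelined;  // smaller LSU, lower throughput
  if (a.is_load && a.contiguous && !a.is_volatile) return LsuStyle::kPrefetching;
  return LsuStyle::kBurstCoalesced;  // default for global memory
}

// Human-readable name for a style.
const char* LsuStyleName(LsuStyle s) {
  switch (s) {
    case LsuStyle::kBurstCoalesced: return "burst-coalesced";
    case LsuStyle::kPrefetching:    return "prefetching";
    case LsuStyle::kPipelined:      return "pipelined";
  }
  assert(false && "unknown LsuStyle");
  return "";
}
```

Under these assumptions, a random-index global load maps to a burst-coalesced LSU, a sequential nonvolatile global load maps to a prefetching LSU, and any local access maps to a pipelined LSU, matching the three examples in this section.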
