Developer Guide

Prepinning Memory

When optimizing kernel memory accesses, consider how data is transferred from the host to the device. For designs whose data transfer time exceeds the compute time, the transfer can become the bottleneck. On devices that support a transfer rate greater than PCIe Gen3 x8, prepinning the host memory before the transfer allows the data to move at a higher bandwidth. For example, on the Intel® FPGA Programmable Acceleration Card (PAC) D5005 (previously known as Intel® FPGA Programmable Acceleration Card (PAC) with Intel® Stratix® 10 SX FPGA), which has a PCIe Gen3 x16 transfer rate, memory transfer with prepinning achieves approximately 12 GB/s in half-duplex and 21 GB/s in full-duplex.
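To judge whether a design is transfer-bound, a rough estimate of the transfer time can be compared against the measured kernel time. The following is a minimal sketch in plain C++, not part of any SYCL API. The function names `transfer_seconds` and `transfer_bound` are hypothetical, and the bandwidth and compute time are inputs you must measure on your own system.

```cpp
#include <cstdint>

// Hypothetical helper: estimated time in seconds to move `bytes` over a
// link that sustains `gigabytes_per_second` (1 GB taken as 1e9 bytes).
double transfer_seconds(std::uint64_t bytes, double gigabytes_per_second) {
    return static_cast<double>(bytes) / (gigabytes_per_second * 1e9);
}

// Hypothetical helper: true when the estimated host-to-device transfer
// takes longer than the kernel compute time, which suggests the transfer
// is the likely bottleneck.
bool transfer_bound(std::uint64_t bytes, double gigabytes_per_second,
                    double compute_seconds) {
    return transfer_seconds(bytes, gigabytes_per_second) > compute_seconds;
}
```

For instance, at the approximately 12 GB/s half-duplex rate quoted above, a 1.2 GB buffer would take roughly 0.1 s to transfer, so a kernel that finishes in less than that time would be limited by the transfer rather than by compute.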
The following is an example of how to allocate buffers on prepinned memory:
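    // Select the FPGA device and create a queue for it.
    INTEL::fpga_selector device_selector;
    auto device_queue = queue(device_selector);

    // Allocate 1024 ints of pinned host memory through restricted USM.
    int* in = malloc_host<int>(1024, device_queue.get_context());

    // Create a buffer that uses the pinned allocation as its host memory.
    auto buf_pinned = buffer<int, 1>(in, 1024);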
Prepinned memory is allocated through the restricted unified shared memory (USM) call malloc_host(). Hence, prepinned memory is available only on devices that support restricted USM.
Pinned memory is a scarce resource on the system, so carefully consider which buffers you want to pin to avoid exceeding the system limit. In addition, pinning itself is an expensive operation, so for optimal performance, ensure that the creation of pinned buffers takes place outside the main compute loop.

Product and Performance Information

Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.