You must consider how the transfer of data from the host to the device occurs when optimizing kernel memory accesses. For designs that have longer data transfer times than the compute time, the data transfer time may be a bottleneck. On devices supporting greater than a PCIe Gen3 x 8 transfer rates, prepinning the memory that is on the host prior to its transfer allows for it to transfer at a higher bandwidth. For example, on the Intel® FPGA Programmable Acceleration Card (PAC) D5005 (previously known as
Intel® FPGA Programmable Acceleration Card (PAC) with Intel® Stratix® 10 SX FPGA
) that has a PCIe Gen3 x16 transfer rate, memory transfer with prepinning achieves approximately 12 GB/s in half-duplex and 21 GB/s in full-duplex.
The following is an example of how to allocate buffers on prepinned memory:
auto device_queue = queue(device_selector);
int* in = malloc_host<int>(1024, device_queue.get_context());
auto buf_pinned = buffer<int, 1>(in, 1024);
Pinned memory is a scarce resource on the system, so carefully consider which buffers you want to pin to avoid exceeding the system limit. In addition, pinning itself is an expensive operation, so for optimal performance, ensure that the creation of pinned buffers takes place outside the main compute loop.