• 2019 Update 4
  • 03/20/2019

Kernel Memory Access Optimization Summary

A kernel should access at least 32 bits of data at a time, from addresses that are aligned to 32-bit boundaries. A char4, short2, int, or float counts as 32 bits of data. If you can, load two, three, or four 32-bit quantities at a time, which may improve performance. Loading more than four 32-bit quantities at a time may reduce performance.
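The width and alignment guidance above can be written down as a small host-side check. This is only a sketch in plain C, not OpenCL code: the `access_class` labels and the `classify_access` function are hypothetical names introduced for illustration, and treating sizes that are not a multiple of 4 bytes as discouraged is a simplifying assumption of this sketch rather than a statement from the guide.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical labels for this sketch; they are not part of any OpenCL API. */
typedef enum {
    ACCESS_DISCOURAGED, /* narrower than 32 bits, or not 32-bit aligned */
    ACCESS_BASELINE,    /* exactly one aligned 32-bit quantity */
    ACCESS_PREFERRED,   /* two, three, or four aligned 32-bit quantities */
    ACCESS_TOO_WIDE     /* more than four 32-bit quantities */
} access_class;

/* Classify one per-work-item access by its byte address and size. */
access_class classify_access(uintptr_t byte_address, size_t bytes_per_access)
{
    assert(bytes_per_access > 0);

    /* Assumption of this sketch: sizes that are not a whole number of
       32-bit quantities are grouped with the discouraged cases. */
    if (bytes_per_access % 4 != 0 || byte_address % 4 != 0) {
        return ACCESS_DISCOURAGED;
    }

    size_t quantities = bytes_per_access / 4; /* number of 32-bit quantities */
    if (quantities == 1) {
        return ACCESS_BASELINE;
    }
    if (quantities <= 4) {
        return ACCESS_PREFERRED;
    }
    return ACCESS_TOO_WIDE;
}
```

For example, a single char read classifies as discouraged, an aligned float as baseline, an aligned float4 (16 bytes) as preferred, and an aligned float8 (32 bytes) as too wide.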
Optimize __global memory and __constant memory accesses to minimize the number of cache lines read from the L3 cache. This typically involves carefully choosing your work-group dimensions and how your array indices are computed from the work-item local or global ID.
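To see how index computation affects the number of cache lines read, the following plain C model counts the distinct lines touched by a group of work-items. It is a sketch under stated assumptions: `cache_lines_touched` is a hypothetical helper, the base address is assumed to be line-aligned, and the 64-byte line size and 16 work-items used in the example are assumptions rather than values given in this section.

```c
#include <assert.h>
#include <stddef.h>

/* Count the distinct cache lines touched when work-items 0..work_items-1
   each read elem_bytes bytes at byte offset id * stride_bytes.
   Offsets never decrease as id grows, so a line is new exactly when it
   lies beyond the highest line counted so far. */
size_t cache_lines_touched(size_t work_items, size_t elem_bytes,
                           size_t stride_bytes, size_t line_bytes)
{
    assert(elem_bytes > 0 && line_bytes > 0);

    size_t count = 0;
    size_t last_line = 0; /* highest line index counted so far */

    for (size_t id = 0; id < work_items; ++id) {
        size_t first = (id * stride_bytes) / line_bytes;
        size_t last = (id * stride_bytes + elem_bytes - 1) / line_bytes;
        for (size_t line = first; line <= last; ++line) {
            if (count == 0 || line > last_line) {
                ++count;
                last_line = line;
            }
        }
    }
    return count;
}
```

Under these assumptions, 16 work-items reading consecutive 4-byte values (stride 4) touch one 64-byte line, a stride of 8 bytes touches two lines, and a stride of 64 bytes touches 16 lines, one per work-item.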
If you cannot access __global memory or __constant memory in an optimal manner, consider moving part of your data to __local memory, where more access patterns can execute with full performance.
Local memory is most beneficial when the access pattern favors the banked nature of the shared local memory (SLM) hardware.
Optimize __local memory accesses to minimize the number of bank conflicts. Reading the same address from the same bank is OK, but reading different addresses from the same bank results in a bank conflict. Writes to the same bank always result in a bank conflict, even if the writes are going to the same address. Consider adding a column to two-dimensional local memory arrays if it avoids bank conflicts when accessing columns of data.
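The effect of adding a padding column can be illustrated with a simple bank model in plain C. This is a sketch, not a description of the actual hardware: it assumes consecutive 32-bit words map to consecutive banks and uses 16 banks in the example, both of which are assumptions, and `worst_bank_load` and `MAX_BANKS` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_BANKS 64 /* upper bound for this sketch only */

/* Work-item i reads 32-bit word [i][column] of a row-major array with
   row_pitch words per row. Returns the largest number of work-items
   that land in the same bank. Every work-item reads a different
   address here, so any result above 1 means a bank conflict. */
size_t worst_bank_load(size_t work_items, size_t row_pitch,
                       size_t column, size_t bank_count)
{
    assert(row_pitch > 0);
    assert(bank_count > 0 && bank_count <= MAX_BANKS);

    size_t hits[MAX_BANKS] = {0};
    size_t worst = 0;

    for (size_t i = 0; i < work_items; ++i) {
        size_t word = i * row_pitch + column; /* word index in the array */
        size_t bank = word % bank_count;      /* assumed bank mapping */
        hits[bank] += 1;
        if (hits[bank] > worst) {
            worst = hits[bank];
        }
    }
    return worst;
}
```

In this model, 16 work-items reading one column of a 16-word-wide array all hit the same bank (`worst_bank_load(16, 16, 3, 16)` is 16). Widening each row to 17 words spreads the same column read across 16 different banks (`worst_bank_load(16, 17, 3, 16)` is 1), at the cost of one unused word per row.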
Avoid dynamically indexed __private arrays if possible.

Product and Performance Information

Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.