Optimizing memory accesses is the first step to achieving high performance with OpenCL™ on Intel® Graphics. Tune your kernel to access memory at an optimal granularity and with optimal addresses.
The OpenCL™ implementation for Intel® Graphics primarily accesses global and constant memory through the following caches:
- GPU-specific L3 cache
- CPU and GPU shared Last Level Cache (LLC).
Of these two caches, the L3 cache is the more important one to optimize memory accesses for. An L3 cache line is 64 bytes.
Finally, there are L1 and L2 caches that are specific to the sampler and renderer.
Accesses to __global memory and __constant memory go through the L3 cache and LLC. In addition, __private memory that spills from registers does the same. If multiple OpenCL work-items in the same hardware thread make requests to the same L3 cache line, these requests are collapsed into a single request. This means that the effective __global memory, __constant memory, and __private memory bandwidth is determined by the number of L3 cache lines that are accessed.
For example, if work-items in the same hardware thread access two different L3 cache lines, the effective memory bandwidth is one half of what it is when they access only one L3 cache line.
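The effect of collapsing requests can be estimated on the host before tuning a kernel. The following is a minimal sketch in plain C, not part of the OpenCL API: `count_l3_lines` is a hypothetical helper that counts how many distinct 64-byte L3 cache lines are touched when each work-item of one hardware thread reads `elem_bytes` bytes at `base + i * stride_bytes`. The number of work-items per hardware thread is passed in by the caller. The value 16 used in the note after the code is an assumption for illustration only.

```c
#include <stddef.h>

/* From the text: an L3 cache line is 64 bytes. */
#define L3_LINE_BYTES 64u
/* Illustrative limit on how many distinct lines this sketch tracks. */
#define MAX_TRACKED_LINES 128u

/* Hypothetical helper: returns the number of distinct L3 cache lines touched
 * when work-item i accesses elem_bytes bytes at base + i * stride_bytes.
 * The result saturates at MAX_TRACKED_LINES. */
size_t count_l3_lines(size_t base, size_t stride_bytes,
                      size_t elem_bytes, size_t work_items)
{
    size_t seen[MAX_TRACKED_LINES];
    size_t count = 0;

    if (elem_bytes == 0) {
        return 0;
    }

    for (size_t i = 0; i < work_items; ++i) {
        size_t addr  = base + i * stride_bytes;
        size_t first = addr / L3_LINE_BYTES;
        size_t last  = (addr + elem_bytes - 1) / L3_LINE_BYTES;

        /* An access may straddle a line boundary, so visit every line it covers. */
        for (size_t line = first; line <= last; ++line) {
            int found = 0;
            for (size_t k = 0; k < count; ++k) {
                if (seen[k] == line) {
                    found = 1;
                    break;
                }
            }
            if (!found) {
                if (count == MAX_TRACKED_LINES) {
                    return count; /* table full: report the saturated value */
                }
                seen[count++] = line;
            }
        }
    }
    return count;
}
```

Under the assumption of 16 work-items per hardware thread, `count_l3_lines(0, 4, 4, 16)` models consecutive 32-bit reads from a line-aligned base and touches one line, while `count_l3_lines(4, 4, 4, 16)` shifts the base by 4 bytes and touches two lines, which corresponds to the half-bandwidth case described above. A stride of 64 bytes, `count_l3_lines(0, 64, 4, 16)`, touches a separate line for every work-item.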
__local memory is allocated directly from the L3 cache, and is divided into 16 banks at a 32-bit granularity. Because it is so highly banked, it is more important to minimize bank conflicts when accessing __local memory than to minimize the number of L3 cache line accesses.
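The banking described above can also be modeled on the host. The sketch below is plain C with hypothetical helper names: `local_bank` maps a byte offset in __local memory to one of the 16 banks, and `max_bank_conflict` reports the largest number of distinct 32-bit words that fall into the same bank when work-item `i` accesses word `i * stride_words`. It assumes that consecutive 32-bit words map to consecutive banks and that accesses to the same word do not conflict with each other. Neither detail is stated in the text, so treat the result as an estimate.

```c
#include <stddef.h>

/* From the text: 16 banks at 32-bit (4-byte) granularity. */
#define LOCAL_BANKS 16u
#define BANK_WIDTH_BYTES 4u
/* Illustrative limit on the number of work-items this sketch models. */
#define MAX_WORK_ITEMS 64u

/* Hypothetical helper: bank index for a byte offset into __local memory,
 * assuming consecutive 32-bit words map to consecutive banks. */
size_t local_bank(size_t byte_offset)
{
    return (byte_offset / BANK_WIDTH_BYTES) % LOCAL_BANKS;
}

/* Hypothetical helper: worst-case number of distinct 32-bit words that land
 * in one bank when work-item i accesses word i * stride_words.
 * Assumption: work-items that access the same word do not conflict. */
size_t max_bank_conflict(size_t stride_words, size_t work_items)
{
    size_t words[MAX_WORK_ITEMS];
    size_t per_bank[LOCAL_BANKS] = {0};
    size_t n = 0;
    size_t worst = 0;

    if (work_items > MAX_WORK_ITEMS) {
        work_items = MAX_WORK_ITEMS;
    }

    for (size_t i = 0; i < work_items; ++i) {
        size_t word = i * stride_words;
        int duplicate = 0;

        /* Skip words that another work-item already accessed. */
        for (size_t k = 0; k < n; ++k) {
            if (words[k] == word) {
                duplicate = 1;
                break;
            }
        }
        if (duplicate) {
            continue;
        }

        words[n++] = word;
        size_t bank = word % LOCAL_BANKS;
        per_bank[bank]++;
        if (per_bank[bank] > worst) {
            worst = per_bank[bank];
        }
    }
    return worst;
}
```

In this model, a stride of one word across 16 work-items gives a worst case of 1 (no conflicts), a stride of 2 words gives 2, and a stride of 16 words places every access in bank 0 for a worst case of 16. Padding the stride to 17 words brings the worst case back to 1, which is why padding rows of a __local array by one element is a common way to avoid conflicts.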
All memory can be accessed in 8-bit, 16-bit, or 32-bit quantities. 32-bit quantities can be accessed as vectors of one, two, three, or four components.