Memory Access Overview

Optimizing memory accesses is the first step toward achieving high performance with OpenCL* on Intel® Processor Graphics. Tune your kernels to access memory at an optimal granularity and from optimally aligned addresses.

The OpenCL* implementation for Intel® Processor Graphics primarily accesses memory through the following caches:

  • GPU-specific L3 cache
  • CPU and GPU shared Last Level Cache (LLC)

L1 and L2 caches are specific to the sampler and renderer.

Granularity

For all memory address spaces, to optimize performance, a kernel must access data in at least 32-bit quantities, from addresses that are aligned to 32-bit boundaries. A 32-bit quantity can consist of any type, for example: a single int, uint, or float, or a vector of smaller types such as char4 or short2.
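
For illustration, here is a minimal sketch (the kernel and buffer names are hypothetical) in which each work-item performs a single aligned 32-bit load and store; reading the same bytes as four separate uchar accesses would fall below this granularity guideline:

    __kernel void copy32(__global const uint* src, __global uint* dst)
    {
        size_t i = get_global_id(0);
        dst[i] = src[i];    // one aligned 32-bit load and one 32-bit store
    }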

__global Memory and __constant Memory

To optimize performance when accessing __global memory and __constant memory, a kernel must minimize the number of cache lines that are accessed.

Note, however, that if many work-items access the same __global or __constant memory array element, memory performance may be reduced.

For this reason, move frequently accessed __global or __constant data, such as look-up tables or filter coefficients, into __local memory to improve performance.
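
The following is a minimal sketch of this technique, assuming a hypothetical 256-entry look-up table: each work-group cooperatively copies the table into __local memory once, and then all of its work-items read the local copy:

    __kernel void apply_lut(__global const uchar* src,
                            __global uchar* dst,
                            __constant uchar* lut)   // 256 entries assumed
    {
        __local uchar local_lut[256];

        // Cooperative copy: the work-items stride through the table together.
        for (size_t i = get_local_id(0); i < 256; i += get_local_size(0))
            local_lut[i] = lut[i];
        barrier(CLK_LOCAL_MEM_FENCE);   // wait until the copy is complete

        size_t gid = get_global_id(0);
        dst[gid] = local_lut[src[gid]];
    }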

__private Memory

Since each work-item has its own __private memory, there is no locality among __private memory accesses, and each work-item often touches a unique cache line on every access to __private memory. For this reason, accesses to __private memory are very slow, and you should avoid indexed (dynamically addressed) private arrays where possible.
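
The contrast below is an illustrative sketch (names are hypothetical): a dynamically indexed __private array may be spilled to memory, while a vector type typically stays in registers:

    // Indexed private array: the compiler may spill acc[] to memory.
    __kernel void indexed_private(__global const float* in, __global float* out)
    {
        float acc[4];
        for (int i = 0; i < 4; ++i)
            acc[i] = in[get_global_id(0) * 4 + i];
        out[get_global_id(0)] = acc[0] + acc[1] + acc[2] + acc[3];
    }

    // Vector type instead of an indexed array: likely to stay in registers.
    __kernel void vector_private(__global const float4* in, __global float* out)
    {
        float4 acc = in[get_global_id(0)];
        out[get_global_id(0)] = acc.x + acc.y + acc.z + acc.w;
    }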

__local Memory

To optimize performance when accessing __local memory, a kernel must minimize the number of bank conflicts. As long as each work-item accesses __local memory at an address in a unique bank, the access proceeds at full bandwidth. Work-items can read from the same address within a bank with no penalty, because the value is broadcast, but accesses to different addresses within the same bank are serialized, which reduces performance.
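
As a sketch of a conflict-free pattern (the kernel name and the work-group size of 64 are assumptions), consecutive work-items below access consecutive 32-bit words, so each lands in its own bank:

    __kernel void scale(__global float* data, float factor)
    {
        __local float tile[64];              // assumes a work-group size of 64
        size_t lid = get_local_id(0);

        tile[lid] = data[get_global_id(0)];  // unit stride: one bank per work-item
        barrier(CLK_LOCAL_MEM_FENCE);

        // A strided index such as tile[lid * 2] could map several work-items
        // to the same bank and serialize their accesses.
        data[get_global_id(0)] = tile[lid] * factor;
    }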

Using Loops

The Intel® Processor Graphics device is optimized for code that does not branch or loop. When a loop in a kernel is unavoidable, minimize its overhead by unrolling it, either partially or completely, directly in the code or with macros, and minimize the memory accesses inside the loop.
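
Below is a minimal sketch of partial unrolling (the unroll factor and names are illustrative); some OpenCL compilers also accept a #pragma unroll hint on the loop:

    __kernel void sum16(__global const float* in, __global float* out)
    {
        size_t base = get_global_id(0) * 16;
        float acc = 0.0f;

        // Unrolled by a factor of 4: fewer branch evaluations per element.
        for (int i = 0; i < 16; i += 4) {
            acc += in[base + i];
            acc += in[base + i + 1];
            acc += in[base + i + 2];
            acc += in[base + i + 3];
        }
        out[get_global_id(0)] = acc;
    }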

Host-Side Timing

The following code snippet is a host-side timing routine around a kernel call (error handling is omitted):

    float start = …;                        // get the first time-stamp
    clEnqueueNDRangeKernel(g_cmd_queue, …);
    clFinish(g_cmd_queue);                  // make sure the kernel has completed
    float end = …;                          // get the last time-stamp
    float time = (end - start);

In this example, host-side timing is implemented using the following functions:

  • clEnqueueNDRangeKernel — submits the kernel to the command queue and returns immediately, without waiting for execution to finish
  • clFinish — blocks until all previously queued commands in the command queue have completed, so the measured interval covers the full kernel execution
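
A more complete sketch of the same routine is shown below, assuming a Windows host and using QueryPerformanceCounter as the time-stamp source; queue and kernel creation, as well as error handling, are omitted:

    #include <CL/cl.h>
    #include <windows.h>

    // Returns the elapsed wall-clock time of one kernel execution, in seconds.
    double time_kernel(cl_command_queue queue, cl_kernel kernel,
                       size_t global_size)
    {
        LARGE_INTEGER freq, start, end;
        QueryPerformanceFrequency(&freq);

        QueryPerformanceCounter(&start);            // first time-stamp
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                               &global_size, NULL, 0, NULL, NULL);
        clFinish(queue);                            // kernel has completed
        QueryPerformanceCounter(&end);              // last time-stamp

        return (double)(end.QuadPart - start.QuadPart) / (double)freq.QuadPart;
    }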