Prefer Row-Wise Data Accesses
OpenCL™ enables you to submit kernels over a one-, two-, or three-dimensional
index space. Consider using one-dimensional ranges, which improve cache
locality and save index computations.
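As a rough illustration of the saved index computation, the sketch below compares the linear index a work-item must rebuild in a two-dimensional launch with the index it gets directly in a one-dimensional launch. It is plain C rather than kernel code, and the function names are hypothetical; it assumes a row-major image covered by image_width * image_height work-items.

```c
#include <stddef.h>

/* Two-dimensional launch: each work-item rebuilds its linear index
   from two IDs (one multiply and one add per work-item). */
size_t linear_index_2d(size_t x, size_t y, size_t image_width)
{
    return y * image_width + x;
}

/* One-dimensional launch (assumed to span image_width * image_height
   work-items): the global ID already is the linear index. */
size_t linear_index_1d(size_t gid)
{
    return gid;
}
```

Both functions address the same element when gid equals y * image_width + x; the one-dimensional form simply skips the arithmetic.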
If a two- or three-dimensional range naturally fits your data dimensions,
try to keep work-items scanning along rows, not columns. For example,
the following code is not optimized (it might trigger gather instructions):
__kernel void smooth(__constant float* input,
                     uint image_width, uint image_height,
                     __global float* output)
{
    int myX = get_global_id(1);
    int myY = get_global_id(0);
    int myPixel = myY * image_width + myX;
    float data = input[myPixel];
    …
}
In this code example, the image height is the first dimension and the
image width is the second dimension. The resulting column-wise data access
is inefficient, since the Intel® OpenCL™ implementation iterates over the
first dimension first, so neighboring work-items read addresses that are
image_width elements apart.
The following version is more efficient because its memory access is
sequential:
__kernel void smooth(__constant float* input,
                     uint image_width, uint image_height,
                     __global float* output)
{
    int myX = get_global_id(0);
    int myY = get_global_id(1);
    int myPixel = myY * image_width + myX;
    float data = input[myPixel];
    …
}
In the example above, the first dimension is the image width and the
second is the image height.
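To make the difference between the two kernels concrete, the following sketch reproduces their myPixel arithmetic in plain C and measures how far apart, in elements, two work-items that are neighbors along dimension zero read. The helper names are hypothetical, and gid0 and gid1 stand in for get_global_id(0) and get_global_id(1).

```c
#include <stddef.h>

/* First kernel: dimension 0 is mapped to myY (column-wise access). */
size_t index_column_wise(size_t gid0, size_t gid1, size_t image_width)
{
    size_t myX = gid1;
    size_t myY = gid0;
    return myY * image_width + myX;
}

/* Second kernel: dimension 0 is mapped to myX (row-wise access). */
size_t index_row_wise(size_t gid0, size_t gid1, size_t image_width)
{
    size_t myX = gid0;
    size_t myY = gid1;
    return myY * image_width + myX;
}

/* Distance, in elements, between the reads of work-items (0, 0) and (1, 0),
   which are neighbors along dimension 0. */
size_t stride_along_dim0(size_t (*index_of)(size_t, size_t, size_t),
                         size_t image_width)
{
    return index_of(1, 0, image_width) - index_of(0, 0, image_width);
}
```

For an assumed image width of 640, stride_along_dim0(index_column_wise, 640) yields 640, while stride_along_dim0(index_row_wise, 640) yields 1, which is the sequential pattern the second kernel aims for.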
The same rule applies if each work-item calculates several elements.
To optimize performance, make sure work-items read from consecutive memory
addresses.
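One possible reading of this rule, assuming it concerns neighboring work-items at the same loop step, is sketched below in plain C. It contrasts two hypothetical ways of assigning several elements to each work-item; neither name nor layout comes from the original text.

```c
#include <stddef.h>

/* Interleaved assignment: at step k, work-item gid reads element
   k * global_size + gid, so neighboring work-items read adjacent elements. */
size_t interleaved_index(size_t gid, size_t k, size_t global_size)
{
    return k * global_size + gid;
}

/* Blocked assignment: work-item gid owns a contiguous chunk of
   elems_per_item elements, so neighboring work-items read addresses
   that are elems_per_item elements apart at every step. */
size_t blocked_index(size_t gid, size_t k, size_t elems_per_item)
{
    return gid * elems_per_item + k;
}
```

Under that assumption, the interleaved scheme keeps the reads of adjacent work-items consecutive at each step, whereas the blocked scheme strides them apart. Which scheme performs better on a given device should be confirmed by measurement.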
Finally, if you run a two-dimensional NDRange, arrange the data access
so that it is consecutive along dimension zero.