Avoid Handling Edge Conditions in Kernels
To understand how to avoid handling edge conditions in kernels, consider
the following smoothing 2x2 filter:
__kernel void smooth(const __global float* input, __global float* output) { const int myX = get_global_id(0); const int myY = get_global_id(1); const int image_width = get_global_size(0); uint neighbors = 1; float sum = 0.0f; if ((myX + 1) < (image_width-1)) { sum += input[myY * image_width + (myX + 1)]; ++neighbors; } if (myX > 0) { sum += input[myY * image_width + (myX - 1)]; ++neighbors; } if ((myY + 1) < (image_height-1)) { sum += input[(myY + 1) * image_width + myX]; ++neighbors; } if (myY > 0) { sum += input[(myY – 1) * image_width + myX]; ++neighbors; } sum += input[myY * image_width + myX]; output[myY * image_width + myX] = sum / (float)neighbors; }
Assume that you have a full HD image with size of 1920x1080 pixels.
The four edge
if
conditions are tested for every pixel, that
is, roughly two million times.However, they are only relevant for the 6000 pixels on the image edges,
which is 0.2% of the pixels. For the remaining 99.8% work-items, the edge
condition check is a waste of time. See the following optimized code:
__kernel void smooth(const __global float* input, __global float* output) { const int myX = get_global_id(0); const int myY = get_global_id(1); const int image_width = get_global_size(0); float sum = 0.0f; sum += input[myY * image_width + (myX + 1)]; sum += input[myY * image_width + (myX - 1)]; sum += input[(myY + 1) * image_width + myX]; sum += input[(myY – 1) * image_width + myX]; sum += input[myY * image_width + myX]; output[myY * image_width + myX] = sum / 5.0f; }
This code requires padding (enlarging) the input buffer appropriately,
while using the original global size. This way querying the neighbors
for the border pixels does not result in buffer overrun. When you process
several frames this way, do this padding exactly once (which means you
initially allocate frames of the proper size), instead of copying each
frame to larger buffer before processing, and then copying the result
back to smaller (original size) buffer.
If padding through larger input is not possible, make sure you use the
min
or max
built-ins, so checking a work-item
does not access outside the actual image adds only four lines. Consider
the following example:__kernel void smooth(const __global float* input, __global float* output) { const int image_width = get_global_size(0); const int image_height = get_global_size(0); int myX = get_global_id(0); //since for myX== image_width–1 the (myX+1) is incorrect myX = min(myX, image_width -2); //since for myX==0 the (myX-1) is incorrect myX = max(myX, 1); int myY = get_global_id(1); //since for myY== image_height-1 the (myY+1) is incorrect myY = min(myY, image_height -2); //since for myY==0 the (myY-1) is incorrect myY = max(myY - 1, 0); float sum = 0.0f; sum += input[myY * image_width + (myX + 1)]; sum += input[myY * image_width + (myX - 1)]; sum += input[(myY + 1) * image_width + myX]; sum += input[(myY – 1) * image_width + myX]; sum += input[myY * image_width + myX]; output[myY * image_width + myX] = sum / 5.0f; }
At a cost of duplicating calculations for border work-items, this code
avoids testing for the edge conditions, which is otherwise needed to perform
for all work-items.