• 2019 Update 4
  • 03/20/2019
  • Public Content
Contents

Avoiding Handling Edge Conditions in Kernels

Consider this smoothing 2x2 filter:
__kernel void smooth(const __global float* input, __global float* output) { const int myX = get_global_id(0); const int myY = get_global_id(1); const int image_width = get_global_size(0); uint neighbors = 1; float sum = 0.0f; if ((myX + 1) < (image_width-1)) { sum += input[myY * image_width + (myX + 1)]; ++neighbors; } if (myX > 0) { sum += input[myY * image_width + (myX - 1)]; ++neighbors; } if ((myY + 1) < (image_height-1)) { sum += input[(myY + 1) * image_width + myX]; ++neighbors; } if (myY > 0) { sum += input[(myY – 1) * image_width + myX]; ++neighbors; } sum += input[myY * image_width + myX]; output[myY * image_width + myX] = sum / (float)neighbors; }
Assume that you have a full HD image with size of
1920x1080
pixels. The four edge
if
conditions are executed for every pixel, that is, roughly two million times.
However, they are only relevant for the
6000
pixels on the image edges, which make 0.2% of all the pixels. For the remaining 99.8% work-items, the edge condition check is a waste of time. Also compare how shorter and easier to perceive the following code, which does not perform any edge check:
__kernel void smooth(const __global float* input, __global float* output) { const int myX = get_global_id(0); const int myY = get_global_id(1); const int image_width = get_global_size(0); float sum = 0.0f; sum += input[myY * image_width + (myX + 1)]; sum += input[myY * image_width + (myX - 1)]; sum += input[(myY + 1) * image_width + myX]; sum += input[(myY – 1) * image_width + myX]; sum += input[myY * image_width + myX]; output[myY * image_width + myX] = sum / 5.0f; }
This code requires padding (enlarging) the input buffer appropriately, if using the original global size. This way querying the neighbors for the border pixels does not result in buffer overrun.
If padding through larger input is not possible, make sure you use the
min
and
max
built-in functions, so that checking a work-item does not access outside the actual image and adds only four lines:
__kernel void smooth(const __global float* input, __global float* output) { const int image_width = get_global_size(0); const int image_height = get_global_size(0); int myX = get_global_id(0); //since for myX== image_width–1 the (myX+1) is incorrect
myX = min(myX, image_width -2);
//since for myX==0 the (myX-1) is incorrect
myX = max(myX, 1);
int myY = get_global_id(1); //since for myY== image_height-1 the (myY+1) is incorrect
myY = min(myY, image_height -2);
//since for myY==0 the (myY-1) is incorrect
myY = max(myY , 1);
float sum = 0.0f; sum += input[myY * image_width + (myX + 1)]; sum += input[myY * image_width + (myX - 1)]; sum += input[(myY + 1) * image_width + myX]; sum += input[(myY – 1) * image_width + myX]; sum += input[myY * image_width + myX]; output[myY * image_width + myX] = sum / 5.0f; }
At a cost of duplicating calculations for border work-items this code avoids testing for the edge conditions, which is otherwise necessary to perform for the all work-items.
One more approach is to ignore the pixels on the edge, for example, by executing the kernel on a
1918x1078
sub-region within the buffer. OpenCL™ 1.2 and higher enables you to use
global_work_offset
parameter with
clEnqueueNDRangeKernel
to implement this behavior. However, use
1912
for first dimension of the global size, as
1918
is not a multiple of 8, which means potential underutilization of the SIMD units. Notice that OpenCL 2.0 offers “non-uniform work-groups” feature which handles global sizes that are not multiple of underlying SIMD in the very efficient way. Refer to the
See Also
section below for details.
Note
Using image types along with the appropriate sampler (
CL_ADDRESS_REPEAT
or
CLAMP
) also automates edge condition checks for data reads. Refer to the "Using Buffers and Images Appropriately" section for pros and contras of this approach.

Product and Performance Information

1

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804