• 10/30/2018
  • Public Content
Contents

Avoid Handling Edge Conditions in Kernels

To understand how to avoid handling edge conditions in kernels, consider the following smoothing 2x2 filter:
__kernel void smooth(const __global float* input,                      __global float* output) {   const int myX = get_global_id(0);   const int myY = get_global_id(1);   const int image_width = get_global_size(0);   uint neighbors = 1;   float sum = 0.0f;   if ((myX + 1) < (image_width-1))   {     sum += input[myY * image_width + (myX + 1)];     ++neighbors;   }   if (myX > 0)   {     sum += input[myY * image_width + (myX - 1)];     ++neighbors;   }   if ((myY + 1) < (image_height-1))   {     sum += input[(myY + 1) * image_width + myX];     ++neighbors;   }   if (myY > 0)   {     sum += input[(myY – 1) * image_width + myX];     ++neighbors;   }   sum += input[myY * image_width + myX];   output[myY * image_width + myX] = sum / (float)neighbors; }
Assume that you have a full HD image with size of 1920x1080 pixels. The four edge
if
conditions are tested for every pixel, that is, roughly two million times.
However, they are only relevant for the 6000 pixels on the image edges, which is 0.2% of the pixels. For the remaining 99.8% work-items, the edge condition check is a waste of time. See the following optimized code:
__kernel void smooth(const __global float* input,                      __global float* output) {   const int myX = get_global_id(0);   const int myY = get_global_id(1);   const int image_width = get_global_size(0);   float sum = 0.0f;   sum += input[myY * image_width + (myX + 1)];   sum += input[myY * image_width + (myX - 1)];   sum += input[(myY + 1) * image_width + myX];   sum += input[(myY – 1) * image_width + myX];   sum += input[myY * image_width + myX];   output[myY * image_width + myX] = sum / 5.0f; }
This code requires padding (enlarging) the input buffer appropriately, while using the original global size. This way querying the neighbors for the border pixels does not result in buffer overrun. When you process several frames this way, do this padding exactly once (which means you initially allocate frames of the proper size), instead of copying each frame to larger buffer before processing, and then copying the result back to smaller (original size) buffer.
If padding through larger input is not possible, make sure you use the
min
or
max
built-ins, so checking a work-item does not access outside the actual image adds only four lines. Consider the following example:
__kernel void smooth(const __global float* input,                      __global float* output) {   const int image_width = get_global_size(0);   const int image_height = get_global_size(0);   int myX = get_global_id(0);   //since for myX== image_width–1 the (myX+1) is incorrect   myX =  min(myX, image_width -2);   //since for myX==0 the (myX-1) is incorrect   myX =  max(myX, 1);   int myY = get_global_id(1);   //since for myY== image_height-1 the (myY+1) is incorrect   myY =  min(myY, image_height -2);   //since for myY==0 the (myY-1) is incorrect   myY =  max(myY - 1, 0);   float sum = 0.0f;   sum += input[myY * image_width + (myX + 1)];   sum += input[myY * image_width + (myX - 1)];   sum += input[(myY + 1) * image_width + myX];   sum += input[(myY – 1) * image_width + myX];   sum += input[myY * image_width + myX];   output[myY * image_width + myX] = sum / 5.0f; }
At a cost of duplicating calculations for border work-items, this code avoids testing for the edge conditions, which is otherwise needed to perform for all work-items.

See Also

Product and Performance Information

1

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804