Avoiding Handling Edge Conditions in Kernels

Consider this smoothing 2x2 filter:

__kernel void smooth(const __global float* input, 
                     __global float* output)
{
  const int myX = get_global_id(0);
  const int myY = get_global_id(1);
  const int image_width = get_global_size(0);
  uint neighbors = 1;
  float sum = 0.0f;
  if ((myX + 1) < (image_width-1))
  {
    sum += input[myY * image_width + (myX + 1)];
    ++neighbors;
  }
  if (myX > 0)
  {
    sum += input[myY * image_width + (myX - 1)];
    ++neighbors;
  }
  if ((myY + 1) < (image_height-1))
  {
    sum += input[(myY + 1) * image_width + myX];
    ++neighbors;
  }
  if (myY > 0)
  {
    sum += input[(myY – 1) * image_width + myX];
    ++neighbors;
  }
  sum += input[myY * image_width + myX];
  output[myY * image_width + myX] = sum / (float)neighbors;
}

Assume that you have a full HD image with size of 1920x1080 pixels. The four edge if conditions are executed for every pixel, that is, roughly two million times.

However, they are only relevant for the 6000 pixels on the image edges, which make 0.2% of all the pixels. For the remaining 99.8% work-items, the edge condition check is a waste of time. Also compare how shorter and easier to perceive the following code, which does not perform any edge check:

__kernel void smooth(const __global float* input, 
                     __global float* output)
{
  const int myX = get_global_id(0);
  const int myY = get_global_id(1);
  const int image_width = get_global_size(0);
  float sum = 0.0f;
  sum += input[myY * image_width + (myX + 1)];
  sum += input[myY * image_width + (myX - 1)];
  sum += input[(myY + 1) * image_width + myX];
  sum += input[(myY – 1) * image_width + myX];
  sum += input[myY * image_width + myX];
  output[myY * image_width + myX] = sum / 5.0f;
}

This code requires padding (enlarging) the input buffer appropriately, if using the original global size. This way querying the neighbors for the border pixels does not result in buffer overrun.

If padding through larger input is not possible, make sure you use the min and max built-in functions, so that checking a work-item does not access outside the actual image and adds only four lines:

__kernel void smooth(const __global float* input, 
                     __global float* output)
{
  const int image_width = get_global_size(0);
  const int image_height = get_global_size(0);
  int myX = get_global_id(0);
  //since for myX== image_width–1 the (myX+1) is incorrect 
  myX =  min(myX, image_width -2);
  //since for myX==0 the (myX-1) is incorrect
  myX =  max(myX, 1);
  int myY = get_global_id(1);
  //since for myY== image_height-1 the (myY+1) is incorrect 
  myY =  min(myY, image_height -2);
  //since for myY==0 the (myY-1) is incorrect
  myY =  max(myY , 1);
  float sum = 0.0f;
  sum += input[myY * image_width + (myX + 1)];
  sum += input[myY * image_width + (myX - 1)];
  sum += input[(myY + 1) * image_width + myX];
  sum += input[(myY – 1) * image_width + myX];
  sum += input[myY * image_width + myX];
  output[myY * image_width + myX] = sum / 5.0f;
}

At a cost of duplicating calculations for border work-items this code avoids testing for the edge conditions, which is otherwise necessary to perform for the all work-items.

One more approach is to ignore the pixels on the edge, for example, by executing the kernel on a 1918x1078 sub-region within the buffer. OpenCL™ 1.2 and higher enables you to use global_work_offset parameter with clEnqueueNDRangeKernel to implement this behavior. However, use 1912 for first dimension of the global size, as 1918 is not a multiple of 8, which means potential underutilization of the SIMD units. Notice that OpenCL 2.0 offers “non-uniform work-groups” feature which handles global sizes that are not multiple of underlying SIMD in the very efficient way. Refer to the See Also section below for details.

Note

Using image types along with the appropriate sampler (CL_ADDRESS_REPEAT or CLAMP) also automates edge condition checks for data reads. Refer to the "Using Buffers and Images Appropriately" section for pros and contras of this approach.

See Also

Using Buffers and Images Appropriately
OpenCL™ 2.0 Non-Uniform Work-Groups

For more complete information about compiler optimizations, see our Optimization Notice.