Avoiding Handling Edge Conditions in Kernels

OpenCL™ Developer Guide for Intel® Processor Graphics

Download PDF

ID 773088

Date 3/20/2019

Version 2019.4

Public

Visible to Intel only — GUID: GUID-7FA272F8-40D6-4717-BDDD-2D7E0EA5E595

View Details

Avoiding Handling Edge Conditions in Kernels

Consider this smoothing 2x2 filter:

__kernel void smooth(const __global float* input, 
                     __global float* output)
{
  const int myX = get_global_id(0);
  const int myY = get_global_id(1);
  const int image_width = get_global_size(0);
  uint neighbors = 1;
  float sum = 0.0f;
  if ((myX + 1) < (image_width-1))
  {
    sum += input[myY * image_width + (myX + 1)];
    ++neighbors;
  }
  if (myX > 0)
  {
    sum += input[myY * image_width + (myX - 1)];
    ++neighbors;
  }
  if ((myY + 1) < (image_height-1))
  {
    sum += input[(myY + 1) * image_width + myX];
    ++neighbors;
  }
  if (myY > 0)
  {
    sum += input[(myY – 1) * image_width + myX];
    ++neighbors;
  }
  sum += input[myY * image_width + myX];
  output[myY * image_width + myX] = sum / (float)neighbors;
}

Assume that you have a full HD image with size of 1920x1080 pixels. The four edge if conditions are executed for every pixel, that is, roughly two million times.

However, they are only relevant for the 6000 pixels on the image edges, which make 0.2% of all the pixels. For the remaining 99.8% work-items, the edge condition check is a waste of time. Also compare how shorter and easier to perceive the following code, which does not perform any edge check:

__kernel void smooth(const __global float* input, 
                     __global float* output)
{
  const int myX = get_global_id(0);
  const int myY = get_global_id(1);
  const int image_width = get_global_size(0);
  float sum = 0.0f;
  sum += input[myY * image_width + (myX + 1)];
  sum += input[myY * image_width + (myX - 1)];
  sum += input[(myY + 1) * image_width + myX];
  sum += input[(myY – 1) * image_width + myX];
  sum += input[myY * image_width + myX];
  output[myY * image_width + myX] = sum / 5.0f;
}

This code requires padding (enlarging) the input buffer appropriately, if using the original global size. This way querying the neighbors for the border pixels does not result in buffer overrun.

If padding through larger input is not possible, make sure you use the min and max built-in functions, so that checking a work-item does not access outside the actual image and adds only four lines:

__kernel void smooth(const __global float* input, 
                     __global float* output)
{
  const int image_width = get_global_size(0);
  const int image_height = get_global_size(0);
  int myX = get_global_id(0);
  //since for myX== image_width–1 the (myX+1) is incorrect
 myX =  min(myX, image_width -2);
  //since for myX==0 the (myX-1) is incorrect
 myX =  max(myX, 1);
  int myY = get_global_id(1);
  //since for myY== image_height-1 the (myY+1) is incorrect
 myY =  min(myY, image_height -2);
  //since for myY==0 the (myY-1) is incorrect
 myY =  max(myY , 1);
  float sum = 0.0f;
  sum += input[myY * image_width + (myX + 1)];
  sum += input[myY * image_width + (myX - 1)];
  sum += input[(myY + 1) * image_width + myX];
  sum += input[(myY – 1) * image_width + myX];
  sum += input[myY * image_width + myX];
  output[myY * image_width + myX] = sum / 5.0f;
}

At a cost of duplicating calculations for border work-items this code avoids testing for the edge conditions, which is otherwise necessary to perform for the all work-items.

One more approach is to ignore the pixels on the edge, for example, by executing the kernel on a 1918x1078 sub-region within the buffer. OpenCL™ 1.2 and higher enables you to use global_work_offset parameter with clEnqueueNDRangeKernel to implement this behavior. However, use 1912 for first dimension of the global size, as 1918 is not a multiple of 8, which means potential underutilization of the SIMD units. Notice that OpenCL 2.0 offers “non-uniform work-groups” feature which handles global sizes that are not multiple of underlying SIMD in the very efficient way. Refer to the See Also section below for details.

NOTE:

Using image types along with the appropriate sampler (CL_ADDRESS_REPEAT or CLAMP) also automates edge condition checks for data reads. Refer to the "Using Buffers and Images Appropriately" section for pros and contras of this approach.

Select Your Language

Using Intel.com Search

Quick Links

Recent Searches

Advanced Search

Only search in

OpenCL™ Developer Guide for Intel® Processor Graphics

Avoiding Handling Edge Conditions in Kernels

See Also