Incorrect kernel execution with barrier(CLK_LOCAL_MEM_FENCE)

Incorrect kernel execution with barrier(CLK_LOCAL_MEM_FENCE)

Consider the following kernel:

__kernel void test(__global float2 *output, __global float2 *input)
{
    __local float lmem[8];
    float2 a;
    const size_t tid = get_global_id(0);
    if(tid / 8 == 0)
    {
        a = input[tid];
    }
    else
    {
        return;
    }
    lmem[tid] = -a.x;
    barrier(CLK_LOCAL_MEM_FENCE);
    a.x = lmem[tid];
    barrier(CLK_LOCAL_MEM_FENCE);
    output[tid] = a;
}

If I execute it with global size == local_size == 16 and pass an array of 16 float2 elements as input:

input = [  0.+0.j   1.+0.j   2.+0.j   3.+0.j   4.+0.j   5.+0.j   6.+0.j   7.+0.j
   8.+0.j   9.+0.j  10.+0.j  11.+0.j  12.+0.j  13.+0.j  14.+0.j  15.+0.j]

and a zero-filled buffer as output, I expect the first 8 elements of the output to have their real parts negated in the output array, while the rest of it remaining untouched:

output = [-0.+0.j -1.+0.j -2.+0.j -3.+0.j -4.+0.j -5.+0.j -6.+0.j -7.+0.j 0.+0.j
  0.+0.j  0.+0.j  0.+0.j  0.+0.j  0.+0.j  0.+0.j  0.+0.j]

This is what happens on Ubuntu 12.04 x64, nVidia CUDA 5 platform, Tesla C2050 device. But on the same operating system, Intel OpenCL XE SDK 2013 3.0.67279, and Intel Xeon E5620 the whole resulting buffer remains untouched:

output = [ 0.+0.j  0.+0.j  0.+0.j  0.+0.j  0.+0.j  0.+0.j  0.+0.j  0.+0.j  0.+0.j
  0.+0.j  0.+0.j  0.+0.j  0.+0.j  0.+0.j  0.+0.j  0.+0.j]

The output coincides with the reference CUDA output if I do any of the following:

  1. Comment out the barriers;
  2. Use float arrays instead of float2
  3. Initialize "a" inside the kernel instead of reading it from input (i.e. as "a = (float2)(tid, 0)").

Has anyone encountered such behavior? Is it a bug, or am I making incorrect assumptions about how barriers work?

3 posts / 0 new
Last post
For more complete information about compiler optimizations, see our Optimization Notice.
Best Reply

Hello vladimirsson!

According to OpenCL specification and common sense barrier() must be encountered by all work-items in a work-group executing the kernel or not encountered at all. From OpenCL specification:

This function must be encountered by all work-items in a work-group executing the kernel. If barrier is inside a conditional statement, then all work-items must enter the conditional if any work-item enters the conditional statement and executes the barrier. If barrier is inside a loop, all work-items must execute the barrier for each iteration of the loop before any are allowed to continue execution beyond the barrier.

In your kernel only part of work-items in work-group call barrier(), because remaining work-items finish execution inside if statement. So behavior is undefined. 

Thank you, I must have missed that part in the specs. Something to look out for in other kernels too.

Leave a Comment

Please sign in to add a comment. Not a member? Join today