Strange crash... access to __local memory

Strange crash... access to __local memory

Hi,I have implement a 'scan' algorithm in OpenCL. Because it is an open source library I test in on several machines and OpenCL-SDK.But, it crash with the Intel SDK ! (Not with the other ones).What I have discover is that the problem is maybe related to the "__local" memory !I have the following kernel : __kernelvoid kernel__ExclusivePrefixScan(...,__local T* localBuffer,...)And I set up my buffer with the following command : clStatus = clSetKernelArg(_kernel_Scan, 2, _workgroupSize * 2 * sizeof(int), 0); checkCLStatus(clStatus); <= CL_SUCCESS !!!Where _workgroupSize= 128; So, I reserve 1024 bytes only !!!You can find the code at :

22 posts / 0 new
Last post
For more complete information about compiler optimizations, see our Optimization Notice.

Is the crash on compile or execution of kernel?
I looked at the link you gave and didnt find anything in downloads tab.
Where should we look for sources?
Thanks, Shiri

Hi Shiri,The crash is when executing the kernel.You can download the code using SVN at the following address :

Hi Shiri,Have you take a look ? I'm still unable to fix it with Intel SDK !!!!!!!

Your input is very important and we are investigating it.
I will return to you with our findings.

Best Reply


I investigatedon rev41.

Well, the issue here is not the __local memory but the memory overrun during write in line:
line 175, : blockSums[bid] = localBuffer[localBufferFullSize-1];

Looking on the host code I saw that you allocate memory buffer that is not sufficient for the operation of the algorithm.

One of the issues is buffer size calculation (line 219, clppScan.cpp). You are use workgroup size of 128, while providing local size of 64 (line 72) to NDRange. Thus, causes number of workgroups to be greater than size of the allocated buffer and as a result you have memory overrun.

After the change first NDRange passed, I added clFinish() after it, but then the next NDRange failed. This is because the same reason. The intermidiate buffer size doesn't match the number of work groups, probably you should decrease the global size in the next pass.


Thanks a lot,

You're right, I have miss theses error. First I was using the wrong size for my buffers and secondly I have forgot to put the work-size into the loop !

So, no I have no more crash but still unable to get the correct values :-P

Hope that a day we will have an debugger for visual studio by example :-p

probably debuger will help.
For now i can only advice you to use printf().

I have fixed the bug. At least with the AMD SDK, on CPU and GPU !

But not with the intel sdk :-(

I don't understand how I can have correct result in AMD but not with Intel, if the error is on my side !
Do you have an advice about this ? Or maybe there is an issue with the intel SDK ?

Also, it will be interesting to use the AVX instructions, I have see that some peoples are using float4 to scan, maybe it will be great to have a version that is aware of the new AVX instructions set ?

I have notice something strange with the Intel SDK.Sometimes (rarely) it gives me the right values...But most of the time not !So, I have check all the values manually and most of the time the scan is 2x faster than with the AMD SDK. (So, either the Intel SDK is really really fast, either we don't process everything... it is what I suspect !)Even, when the results are correct with the Intel SDK... the time is the same than with the AMD SDK !The problem seems to be in the "addition" phase (I suppose for now). A lot of values are correct, but not everything !Do you think that you can help me again for this problem ? It is strange to have random results only with the Intel SDK (NVidia and AMD are working without problems).RegardsKrys

Hi Krys,

I propose you to check for the event status after NDRange completion. Is it SUCCESS, or has an error?

Do you have C reference code that validates the result?


HiEvgeny,Yes, I already check the NDRange status, and everything return me a CL_SUCCESS.To check the result, I have a simple C++ scan algorithm and then I compare the 2 results.What is strange is that I only got this problem with the Intel SDK and even sometimes the results are correct !!It sounds like a SDK bug... but maybe it is something else !!!ThanksKrys

Sounds great,

Could you please add this validation code to the project and report the validation result?


Hi Evgeny,

The validation code is already on the benchmark.cpp class. It is just a scan in c++ (very simple) and then I compare the 2 results !

The results are wrong with the Intel SDK !!


The C++ code does not use the same algorithm, it just create the same result.

So, I think that there is some error in the Intel SDK, because NVidia GPU, AMD GPU+CPU are giving me the right results. And even, sometimes Intel give me the right result. So, the result with the Intel SDK are RANDOM!!! It is not normal !!!!

I know that you have expert in scan algorithm at Intel (Intel Parallel primitives), so maybe they can help you ? It is just an idea !!

Right ?


From your question it sounds like you're utilising local memory in your algorithm. Local memory has some surprising properties according to the OpenCL spec, and tends to not behave the same between different vendor implementations.

What you could try to hunt for is a missing barrier() built-in instruction. For the sake of debugging, add a barrier (CLK_LOCAL_MEM) after every write to local memory and before every read. It will hurt performance, but it will add some predictability to the kernel behaviour. If the SDK works 100% okay after this modification, consider maybe one of these barriers was actually required (remember: read after write is not guaranteed to be up to date if you don't issue a mem_fence or a barrier).
If this doesn't solve the problem, you could try an additional debug step of running everything in a single work-group, by defining the local size as equal to the global size. You could also try disabling the vectorization module by using the vec_type_hint (see the optimization guide for more details on this) and if after all of these steps the SDK's behaviour is still unpredictable, please try and create a reproduction and send it to us.

Doron Singer

Thanks for your councils,

So, I have isolate the problem.

The scan is done in 2 phase, the second phase is the "uniform addition", a simple way to write it is :

uint gid = get_global_id(0) * 2;

const uint blockId = get_group_id(0);

output[gid] += blockSums[blockId];

output[gid+1] += blockSums[blockId];

This version works, but is slow. So, I try the following :

uint gid = get_global_id(0) * 2;

const uint tid = get_local_id(0);

const uint blockId = get_group_id(0);

__local T localBuffer[1];

if (tid < 1)

localBuffer[0] = blockSums[blockId];


if (gid < outputSize)

output[gid] += localBuffer[0];


if (gid < outputSize)

output[gid] += localBuffer[0];

And then, I have the problem ! I use a local buffer that is initialized only when tid < 1.BUT this local buffer is also available to the other work-items and after the barrier I should use the same value for all the work-items.

So, for me the bug is in the Intel SDK, the "__local T localBuffer[1];" is not shared between all the work items ! It is the difference between "__local int b[1];" and "int b[1];"

Do you agree that it is a SDK bug ?

Thanks for your help


We'll have a definitive answer by Sunday. Thanks for investing the time in getting to the bottom of this.

Hi Krys,

Could you please add validation phase to your code, against native C/C++?
Please report the validation result on program termination.


HiEvgeny,there are some validation !!! If the result is incorrect it write a message in the output "Algorithm FAILED : Scan" (See benchmark.cpp).Is it ok for you ?Krys

Hi Krys,

I was capable to reproduce the issue.
The fix will be published in the next public release.

Thank you again,

Great,Thanks for your support

Leave a Comment

Please sign in to add a comment. Not a member? Join today