I found that the reduction algorithm from the NVIDIA SDK works on Intel HD Graphics 4400 but does not work on an Intel Core i5 CPU.
I expected the NVIDIA algorithm either to work everywhere or to work only on NVIDIA hardware, so the difference in behavior between the CPU and the GPU on the same machine looks strange to me.
The reduction algorithm and a C# + OpenCL.NET unit test are attached. The unit test fails on the Intel CPU with size = 4.
What differences in kernel execution exist between the CPU and the GPU? How can I fix the problem?
P.S. I can attach a sample VS 2012 project if needed.