For an OpenCL kernel running on a PC to achieve optimal performance on a quad-core machine, it would use a global_work_size of 4 with a local_work_size of 1, and have a vectorized for-loop inside the kernel. Each of the four launched work-items would then process one quarter of the array(s) on its own thread.
I ran some comparisons for the c[i] = a[i] + b[i] kernel. The Intel C++ compiler with threading still runs this code about 5x faster than the current OpenCL implementation. Specifically, kernel-internal for-loops like:
__kernel void test(....)
{
    for (int i = 0; i < Len; i++)
        c[i] = a[i] + b[i];
}
are not vectorized, which is the main reason for this gap. If the internal for-loop is removed and global_work_size is increased instead (one work-item per element), there is function-call overhead for each kernel invocation, which basically defeats SSE, and the roughly 5x slowdown shows up as expected.
Are there any plans for Intel's OpenCL driver to vectorize kernel-internal for-loops?
(That would bring the code's speed on par with C++.)