I'm wondering how threads are dispatched over SIMD units of the intel Ivy Bridge HD 4000 GPU, I tested many configurations and I'm blocked by some strange behaviours:
I use a simple kernel that compute N times the same "MAD" operation, I launch this kernel with global_size=local_size=1 , for the best to my knowledge I assume that the GPU will launch one thread on one EU ? is it correct ? the strange behaviours that I'm encountering : when I use the computation in my kernel as a scalar (float) I have about 2GFlops of performance, But when I try to use "MAD" as a vector (float2,float4, float8 or foat16) the performance falls dramatically to 0.1 Gflops , am i missing something ? can any one help me to understand ?