I measured the minimum time to run a kernel (which defines the smallest meaningful job size) for the AMD and Intel OpenCL drivers. I get about 55us for the AMD (CPU) driver and 155us for the Intel (CPU) driver. I am somewhat stumped by these delays, as the GPU has only a 15us overhead even though there is a PCI bus in between.
I have also tested that I can get 2us thread start/stop times on the CPU using C++. Two microseconds is about equal to the thread slice time. It would have been reasonable for an OpenCL kernel launch to take 4-8us on the CPU, for example, but 155us is a lot. (Times were measured by averaging the execution time of 3000 kernels, without copying buffers but including 8 calls to clSetKernelArg.)
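For reference, here is a minimal sketch of the kind of averaging harness I mean, applied to the thread start/stop case. The names `average_us` and `spawn_and_join` are made up for this post and are not from any library. For the OpenCL numbers, the job would instead be the 8 `clSetKernelArg` calls plus `clEnqueueNDRangeKernel`; whether `clFinish` is called inside the loop or once after it changes what is being measured, so treat that part as an assumption.

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Average wall-clock time of one call to `job`, in microseconds,
// measured over `iterations` back-to-back calls.
// (Hypothetical helper, named for illustration only.)
double average_us(const std::function<void()>& job, int iterations) {
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    for (int i = 0; i < iterations; ++i) {
        job();
    }
    const auto stop = clock::now();
    const std::chrono::duration<double, std::micro> total = stop - start;
    return total.count() / iterations;
}

// Start a thread that does nothing and wait for it to finish.
// This stands in for the "thread start/stop" measurement.
void spawn_and_join() {
    std::thread t([] {});
    t.join();
}
```

Calling `average_us(spawn_and_join, 3000)` from `main` gives the average start/join time in microseconds. A short warm-up run before the timed one (for example 100 iterations whose result is discarded) helps keep one-time setup costs out of the average.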
Is there some way to improve on this measurement?