Please allow kernels to be called directly through a function pointer, without going through clEnqueueNDRangeKernel. This would enable the following:
1.) Zero thread start/stop overhead, because the kernel runs on the calling application thread. This makes it possible to accelerate "short" kernels that may only need to run on one thread.
2.) Allows custom threading to be implemented by the caller.
3.) Allows Intel to inject the latest high-performance instructions into any language via the OpenCL interface while keeping the DLL-style calling approach. People write performance-sensitive code once in OpenCL, and for the next generation of CPUs Intel simply releases a driver update.
4.) Makes the kernels debuggable with the full range of Intel debugging tools.
5.) Provides a method to properly debug any OpenCL code.
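
To make the request concrete, here is a rough sketch in plain C of what calling a kernel through a function pointer could look like from the application side. Everything in it is hypothetical: kernel_fn_t, get_kernel_function_pointer, and run_range are names invented for illustration and are not part of OpenCL or any Intel driver, and vec_add_kernel is only a stand-in for what the compiled kernel body might be. The caller decides how the work-item range is split, which is where custom threading would plug in.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical signature a driver might expose for a compiled kernel:
   one call processes one work-item. Not part of any OpenCL specification. */
typedef void (*kernel_fn_t)(const float *a, const float *b, float *out, size_t gid);

/* Stand-in for the compiled body of an OpenCL kernel equivalent to:
   out[gid] = a[gid] + b[gid]; */
static void vec_add_kernel(const float *a, const float *b, float *out, size_t gid)
{
    out[gid] = a[gid] + b[gid];
}

/* Hypothetical stand-in for the lookup the driver would provide. */
static kernel_fn_t get_kernel_function_pointer(void)
{
    return vec_add_kernel;
}

/* Caller-side scheduling: run work-items [begin, end) on the current thread.
   A custom thread pool could hand each worker its own sub-range. */
static void run_range(kernel_fn_t fn, const float *a, const float *b,
                      float *out, size_t begin, size_t end)
{
    assert(fn != NULL);
    for (size_t gid = begin; gid < end; ++gid)
        fn(a, b, out, gid);
}

/* Returns 1 if all four work-items produced the expected sums, else 0. */
int demo_direct_call(void)
{
    const float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    const float b[4] = {10.0f, 20.0f, 30.0f, 40.0f};
    float out[4] = {0.0f, 0.0f, 0.0f, 0.0f};

    kernel_fn_t fn = get_kernel_function_pointer();

    /* Two halves run back to back here; each could go to a different thread. */
    run_range(fn, a, b, out, 0, 2);
    run_range(fn, a, b, out, 2, 4);

    return out[0] == 11.0f && out[1] == 22.0f &&
           out[2] == 33.0f && out[3] == 44.0f;
}
```

Because the kernel here is an ordinary function on the calling thread, a normal debugger can set a breakpoint in it and step through it, which is the point of items 4 and 5.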