When comparing OpenCL™ kernel performance with native code (for example, C or Intel® Streaming SIMD Extensions (Intel® SSE) intrinsics), make sure that both versions are as similar as possible:
- Wrap exactly the same set of operations.
- Do not include program build time in the kernel execution time. You can avoid paying this cost on every run by precompiling the program (refer to clCreateProgramWithBinary).
- Track data transfer costs separately. Also, use data mapping when possible, since this is closer to the way data is passed in native code (by pointers). Refer to the “Mapping Memory Objects” section for more information.
- Ensure the working set is identical for native and OpenCL code. Similarly, for a correct performance comparison, the access patterns should be the same (for example, do not compare row-wise traversal in one version with column-wise traversal in the other).
- Demand the same accuracy. For example, the OpenCL rsqrt(x) built-in is inherently more accurate than the _mm_rsqrt_ps SSE intrinsic. To use the same accuracy in native code and OpenCL code, do one of the following:
  - Equip _mm_rsqrt_ps in your native code with a couple of additional Newton-Raphson iterations to match the precision of OpenCL rsqrt.
  - Use native_rsqrt in your OpenCL kernel, which maps to the rsqrtps instruction in the final assembly code.
  - Use the relaxed-math compilation flag (-cl-fast-relaxed-math) to enable similar accuracy for the whole program. Similarly to rsqrt, there are relaxed versions for rcp, sqrt, and so on. Refer to the User Manual - OpenCL™ Code Builder for the full list.
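The Newton-Raphson option above can be sketched in native code roughly as follows. This is a minimal illustration, not a reference implementation: the helper names rsqrt_nr_step, rsqrt_refined, and rel_error are hypothetical, and the number of iterations needed to match the OpenCL rsqrt precision of a particular implementation is an assumption you should verify for your target.

```c
#include <assert.h>
#include <xmmintrin.h>  /* SSE intrinsics: _mm_rsqrt_ps, _mm_mul_ps, ... */

/* One Newton-Raphson step for y ~ 1/sqrt(x):
   y_next = y * (1.5 - 0.5 * x * y * y). */
static __m128 rsqrt_nr_step(__m128 x, __m128 y)
{
    const __m128 half = _mm_set1_ps(0.5f);
    const __m128 three_halves = _mm_set1_ps(1.5f);
    __m128 xyy = _mm_mul_ps(_mm_mul_ps(x, y), y);
    return _mm_mul_ps(y, _mm_sub_ps(three_halves, _mm_mul_ps(half, xyy)));
}

/* Hypothetical helper: start from the hardware estimate and refine it
   with the requested number of Newton-Raphson steps. */
static __m128 rsqrt_refined(__m128 x, int iterations)
{
    assert(iterations >= 0);
    __m128 y = _mm_rsqrt_ps(x);  /* low-precision hardware estimate */
    for (int i = 0; i < iterations; ++i)
        y = rsqrt_nr_step(x, y);
    return y;
}

/* Hypothetical helper: approximate relative error of y as an estimate of
   1/sqrt(x), using the residual x*y*y - 1 ~ 2*e for a small error e. */
static double rel_error(float x, float y)
{
    double r = (double)x * y * y - 1.0;
    return (r < 0.0 ? -r : r) * 0.5;
}
```

To use it, load four floats with _mm_loadu_ps, call rsqrt_refined(x, 2), and store the result with _mm_storeu_ps. Comparing rel_error for the raw _mm_rsqrt_ps output and for the refined output shows how much accuracy the extra iterations buy. Remember to include the refinement in the timed native code, so that both versions do equivalent work.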