Comparing OpenCL™ Kernel Performance with Performance of Native Code

When comparing OpenCL™ kernel performance with native code (for example, C or Intel® Streaming SIMD Extensions (Intel® SSE) intrinsics), make sure that both versions are as similar as possible:

  • Time exactly the same set of operations in both versions.
  • Do not include program build time in the kernel execution time. You can amortize this step by precompiling the program (refer to clCreateProgramWithBinary).
  • Track data transfer costs separately. Also, use data mapping when possible, since this is closer to the way data is passed in native code (by pointers). Refer to the “Mapping Memory Objects” section for more information.
  • Ensure the working set is identical for native and OpenCL code. Similarly, for a fair performance comparison, access patterns should be the same (for example, do not compare row-wise traversal in one version with column-wise traversal in the other).
  • Demand the same accuracy. For example, rsqrt(x) is inherently of higher accuracy than the _mm_rsqrt_ps SSE intrinsic. To use the same accuracy in native code and OpenCL code, do one of the following:
    • Equip _mm_rsqrt_ps in your native code with a couple of additional Newton-Raphson iterations to match the precision of OpenCL rsqrt.
    • Use native_rsqrt in your OpenCL kernel, which maps to the rsqrtps instruction in the final assembly code.
    • Use the relaxed-math compilation flag (-cl-fast-relaxed-math) to enable similar accuracy for the whole program. As with rsqrt, there are relaxed versions of rcp, sqrt, and other functions. Refer to the User Manual - OpenCL™ Code Builder for the full list.
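
The Newton-Raphson option above can be sketched in native C as follows. This is a minimal illustration, not a reference implementation: the helper names rsqrt_refined and max_rel_error are hypothetical, and how many iterations are needed to match OpenCL rsqrt on a given device is an assumption you should verify by measuring the error yourself. Each step applies y' = y * (1.5 - 0.5 * x * y * y) to the low-precision estimate returned by _mm_rsqrt_ps.

```c
#include <assert.h>
#include <xmmintrin.h>  /* SSE intrinsics: _mm_rsqrt_ps, _mm_sqrt_ps, ... */

/* Hypothetical helper: start from the low-precision _mm_rsqrt_ps estimate
   and refine it with Newton-Raphson steps y' = y * (1.5 - 0.5 * x * y * y). */
static __m128 rsqrt_refined(__m128 x, int iterations)
{
    assert(iterations >= 0);
    const __m128 half = _mm_set1_ps(0.5f);
    const __m128 three_halves = _mm_set1_ps(1.5f);
    __m128 y = _mm_rsqrt_ps(x);
    for (int i = 0; i < iterations; ++i) {
        __m128 xyy = _mm_mul_ps(_mm_mul_ps(x, y), y);
        y = _mm_mul_ps(y, _mm_sub_ps(three_halves, _mm_mul_ps(half, xyy)));
    }
    return y;
}

/* Hypothetical check: largest relative error over four lanes, measured
   against the full-precision SSE reference 1 / sqrt(x). */
static float max_rel_error(const float in[4], int iterations)
{
    float approx[4], ref[4];
    __m128 x = _mm_loadu_ps(in);
    _mm_storeu_ps(approx, rsqrt_refined(x, iterations));
    _mm_storeu_ps(ref, _mm_div_ps(_mm_set1_ps(1.0f), _mm_sqrt_ps(x)));
    float worst = 0.0f;
    for (int i = 0; i < 4; ++i) {
        float err = (approx[i] - ref[i]) / ref[i];
        if (err < 0.0f) err = -err;
        if (err > worst) worst = err;
    }
    return worst;
}
```

Calling max_rel_error with 0, 1, and 2 iterations on representative inputs shows how quickly the error shrinks, which helps decide how much refinement the native version needs before its timing is comparable with a kernel that calls rsqrt.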

See Also

Mapping Memory Objects
Considering native_ Versions of Math Built-Ins
User Manual - OpenCL™ Code Builder
