• 2019 Update 4
  • 03/20/2019

Comparing OpenCL™ Kernel Performance with Performance of Native Code

When comparing OpenCL™ kernel performance with native code, for example, C or Intel® Streaming SIMD Extensions (Intel® SSE) intrinsics, make sure that both versions are as similar as possible:
  • Wrap exactly the same set of operations.
  • Do not include program build time in the kernel execution time. You can amortize this step by precompiling the program (refer to clCreateProgramWithBinary).
  • Track data transfer costs separately. Also, use data mapping when possible, since this is closer to the way data is passed in native code (by pointers). Refer to the “Mapping Memory Objects” section for more information.
  • Ensure the working set is identical for native and OpenCL code. Similarly, for a correct performance comparison, access patterns should be the same (for example, row-wise versus column-wise traversal).
  • Demand the same accuracy. For example, rsqrt(x) is inherently of higher accuracy than the _mm_rsqrt_ps SSE intrinsic. To use the same accuracy in native code and OpenCL code, do one of the following:
    • Equip _mm_rsqrt_ps in your native code with a couple of additional Newton-Raphson iterations to match the precision of OpenCL rsqrt.
    • Use native_rsqrt in your OpenCL kernel, which maps to the rsqrtps instruction in the final assembly code.
    • Use the relaxed-math compilation flag (-cl-fast-relaxed-math) to enable similar accuracy for the whole program. Similarly to rsqrt, there are relaxed versions for rcp, sqrt, etc. Refer to the User Manual - OpenCL™ Code Builder for the full list.
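
The first accuracy option above, refining the SSE reciprocal square root estimate with Newton-Raphson iterations, can be sketched in C as follows. This is a minimal illustration, not a reference implementation: it assumes an x86 compiler with SSE enabled, and the helper names (rsqrt_nr_step, rsqrt_refined, rsqrt_max_rel_error) are hypothetical. The number of iterations needed to match OpenCL rsqrt on a given device is something to verify by measurement rather than take from this sketch.

```c
#include <xmmintrin.h> /* SSE intrinsics; assumes an x86 target with SSE enabled */

/* One Newton-Raphson step for y ~ 1/sqrt(x): y_next = y * (1.5 - 0.5 * x * y * y). */
static __m128 rsqrt_nr_step(__m128 x, __m128 y)
{
    const __m128 half = _mm_set1_ps(0.5f);
    const __m128 three_halves = _mm_set1_ps(1.5f);
    __m128 xyy = _mm_mul_ps(_mm_mul_ps(x, y), y);
    return _mm_mul_ps(y, _mm_sub_ps(three_halves, _mm_mul_ps(half, xyy)));
}

/* Hypothetical helper: hardware estimate followed by `steps` refinement steps. */
static __m128 rsqrt_refined(__m128 x, int steps)
{
    __m128 y = _mm_rsqrt_ps(x); /* low-precision estimate (rsqrtps) */
    for (int i = 0; i < steps; ++i)
        y = rsqrt_nr_step(x, y);
    return y;
}

/* Hypothetical helper: largest estimated relative error of rsqrt_refined over
 * n positive inputs (n must be a multiple of 4). For y = (1 + e) / sqrt(x),
 * x * y * y - 1 = 2e + e^2, so half the residual approximates e. */
static double rsqrt_max_rel_error(const float *in, int n, int steps)
{
    double worst = 0.0;
    for (int i = 0; i < n; i += 4) {
        float out[4];
        _mm_storeu_ps(out, rsqrt_refined(_mm_loadu_ps(in + i), steps));
        for (int k = 0; k < 4; ++k) {
            double y = (double)out[k];
            double err = 0.5 * ((double)in[i + k] * y * y - 1.0);
            if (err < 0.0)
                err = -err;
            if (err > worst)
                worst = err;
        }
    }
    return worst;
}
```

Calling rsqrt_max_rel_error on the same input array with steps set to 0 and then to 2 gives a quick way to check how far the raw estimate is from the refined result before timing the native version against the OpenCL kernel.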