Table of Contents
1. Host-Side Timing
2. Wrapping the Right Set of Operations
3. Profiling Operations Using OpenCL Profiling Events
4. Comparing OpenCL Kernel Performance with Performance of Native Code
5. Getting Credible Performance Numbers
6. Using Tools
You can measure performance of applications in many ways; in particular for OpenCL™ kernels. For example there are host-side timing mechanisms like QueryPerformanceCounter or rdtsc. Still those “wall-clock” measurements might not provide any insights into the actual cost breakdown. We start this section with discussion of OpenCL™ profiling events. This article is equal to the chapter 4 of the Writing Optimal OpenCL™ Code with Intel® OpenCL SDK document. For code examples see the OpenCL Optimizations Tutorial sample from the SDK, located in and the Tips and Tricks for Kernel Development page.
Host-Side Timing
We will not discuss QueryPerformanceCounter API or other host-side timing mechanisms here. Refer to the OpenCL Optimizations Tutorial sample.
Below is trivial host-side timing routine around kernel call (error handling is omitted for simplicity):
float start = …;//getting the first time-stamp clEnqueueNDRangeKernel(g_cmd_queue, …); clFinish(g_cmd_queue);// to make sure the kernel completed float end = …;//getting the last time-stamp float time = (end-start);
A couple of things to pay attention to:
clEnqueueNDRangeKernelonly puts a kernel to a queue and immediately returns.- Thus, to measure kernel execution time, you need to explicitly sync on kernel completion via call to
clFinishorclWaitForEvents.
Wrapping the Right Set of Operations
When using any host-side routine for evaluating the performance of your kernel, please ensure you wrapped the proper set of operations.
For example avoid including various printf calls, file i/o operations and other potentially costly and/or serializing routine.
Profiling Operations using OpenCL Profiling Events
The next piece of code measures the kernel execution via profiling events. Again, error handling is omitted:
g_cmd_queue = clCreateCommandQueue(…CL_QUEUE_PROFILING_ENABLE, NULL); clEnqueueNDRangeKernel(g_cmd_queue,…, &perf_event); clWaitForEvents(1, &perf_event); cl_ulong start = 0, end = 0; clGetEventProfilingInfo(perf_event, CL_PROFILING_COMMAND_START, sizeof(cl_ulong), &start, NULL); clGetEventProfilingInfo(perf_event, CL_PROFILING_COMMAND_END, sizeof(cl_ulong), &end, NULL); //END-START gives you hints on kind of “pure HW execution time” //the resolution of the events is 1e-09 sec g_NDRangePureExecTimeMs = (cl_double)(end - start)*(cl_double)(1e-06);
Important caveats:
- The queue should be enabled for profiling (
CL_QUEUE_PROFILING_ENABLEproperty) in creation time. - You need to explicitly sync via
clWaitForEvents. The reason is that device time counters (for the command being profiled) are associated with the specified event.
This way you can profile operations on both Memory Objects and Kernels. Refer to section 5.12 of the OpenCL 1.1 standard for the detailed description of profiling events. Notice that host-side wall-clock time might return different results. For CPU the difference is typically negligible though.
Comparing OpenCL Kernel Performance with Performance of Native Code
When comparing the OpenCL kernel performance with native code (e.g., C or Intel® SSE), make sure that you wrapped exactly the same set of operations. For example:
- Don’t include program build time in the kernel execution time.
- This build step can be amortized well via program pre-compilation (refer to
clCreateProgramFromBinary).
- This build step can be amortized well via program pre-compilation (refer to
- Track data transfers costs separately.
- Also prefer data-mapping, section 4.1 of the Writing Optimal OpenCL™ Code with Intel® OpenCL SDK document; this is closer to the way data is passed in native code (i.e., by pointers).
Also ensure the working set is identical for native/OpenCL code. Similarly, for correct performance comparison, access patterns should be the same (e.g. rows vs. columns).
Finally make sure you’re demanding the same accuracy. For example rsqrt(x) built-in is inherently of the higher accuracy than __mm_rsqrt_ps SSE intrinsic. There are 2 options for more fair performance in this particular case:
- Either equip
__mm_rsqrt_psin your native code with couple of additional Newton-Raphson iterations, to match the precision of OpenCL’srsqrt. - Alternatively you can use
native_rsqrtin your kernel, which would map exactly torsqrtpsinstruction in the final assembly.- Yet another way to enable similar optimization for the whole program is using
relaxed-mathcompilation flag; refer to the Writing Optimal OpenCL™ Code with Intel® OpenCL SDK document.
- Yet another way to enable similar optimization for the whole program is using
Similarly to rsqrt, there are relaxed versions for rcp, sqrt, etc, refer to the “Working with the -cl-fast-relaxed-math Flag” of the Intel® OpenCL SDK User's Guide for the full list.
Getting Credible Performance Numbers
In the world of computing, performance conclusions are typically deduced from sufficiently large number of invocations of the same routine. Since the first iteration is almost exclusively slower than later iterations, minimum (or average, geomean, etc.) value for the execution time is usually used for final projections. A simple alternative to having a loop that calls your kernel a zillion times is having a single "warming" run, as explained in this section.
A "warming" run is especially helpful for small/lightweight kernels for which one-time overheads (like some "lazy" object creations, delayed initializations and other costs potentially incurred by the OpenCL run time) might really cost something. "Warming" run also brings data in the cache. Thus, for bandwidth-limited kernels operating on the data that doesn't fit last-level cache, the "warming" run is unlikely to help.
If your kernel is just a small number of instructions executed over a small data set, then even an infinitely precise measurement mechanism is very unlikely to yield a reliable result. This is due to the influence of the OS, cache, threading, etc. Consider having the kernel run for at least 20 milliseconds.
The bottom line is that you need to build your performance conclusions on reproducible data. If the "warming" run doesn't help and/or execution time still varies, you can try to run a large number of iterations and then average the results (for time values that range too much, geomean is preferable).
Remember that kernels that are too lightweight wouldn't give you reliable data, so making them artificially heavier could give you important insights into the hotspots. Examples are adding loop into the kernel, or replicating its heavy pieces.
Using Tools
Once you get the stable/reproducible performance numbers, the next question would be about what to optimize first.
Unless you suspect some specific parts of the kernel (e.g. heavy math built-ins), we strongly recommend using Intel® VTune™ Amplifier XE to determine hot-spots as described in Working With the Intel® VTune™ Amplifier XE 2011 article.
Remember that tuning the kernel itself might in turn require tweaking the run-time parameters as well, e.g., increasing work-group size once your kernel gets faster (larger work-groups would help to amortize run-time overheads). That is why the best practice is letting the run-time decide on optimal local size as described above.
You can also check the overall CPU utilization and job distribution with Intel® GPA as detailed in Optimize OpenCL™ code with the Intel® Graphics Performance Analyzers.
Use the Offline Compiler to inspect resulting assembly as described in Inspect your code with the Intel® OpenCL SDK Offline Compiler. Check whether your kernel is vectorized as you expect it to be, especially if you're trying to compare to your hand-tuned Intel® SSE instructions.
OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos.

Comments
This is very nice and intro article for the newcomers .
Thank you for this article, it's very usefull.