Intel® SDK for OpenCL* Applications - Performance Debugging Intro

To the Intel® OpenCL SDK page

Table of Contents

1. Host-Side Timing
2. Wrapping the Right Set of Operations
3. Profiling Operations Using OpenCL Profiling Events
4. Comparing OpenCL Kernel Performance with Performance of Native Code
5. Getting Credible Performance Numbers
6. Using Tools

You can measure performance of applications in many ways; in particular for OpenCL™ kernels. For example there are host-side timing mechanisms like QueryPerformanceCounter or rdtsc. Still those “wall-clock” measurements might not provide any insights into the actual cost breakdown. We start this section with discussion of OpenCL™ profiling events. This article is equal to the chapter 4 of the Writing Optimal OpenCL™ Code with Intel® OpenCL SDK document. For code examples see the OpenCL Optimizations Tutorial sample from the SDK, located in \samples\SimpleOptimizations and the Tips and Tricks for Kernel Development page.

Host-Side Timing

Back to top

We will not discuss QueryPerformanceCounter API or other host-side timing mechanisms here. Refer to the OpenCL Optimizations Tutorial sample.
Below is trivial host-side timing routine around kernel call (error handling is omitted for simplicity):

float start = …;//getting the first time-stamp
	clEnqueueNDRangeKernel(g_cmd_queue, …);
	clFinish(g_cmd_queue);// to make sure the kernel completed
float end = …;//getting the last time-stamp
float time = (end-start);

A couple of things to pay attention to:

  • clEnqueueNDRangeKernel only puts a kernel to a queue and immediately returns.
  • Thus, to measure kernel execution time, you need to explicitly sync on kernel completion via call to clFinish or clWaitForEvents.

Wrapping the Right Set of Operations

Back to top

When using any host-side routine for evaluating the performance of your kernel, please ensure you wrapped the proper set of operations.

For example avoid including various printf calls, file i/o operations and other potentially costly and/or serializing routine.

Profiling Operations using OpenCL Profiling Events

Back to top

The next piece of code measures the kernel execution via profiling events. Again, error handling is omitted:

g_cmd_queue = clCreateCommandQueue(…CL_QUEUE_PROFILING_ENABLE, NULL);
clEnqueueNDRangeKernel(g_cmd_queue,…, &perf_event);
clWaitForEvents(1, &perf_event);
cl_ulong start = 0, end = 0;

clGetEventProfilingInfo(perf_event, CL_PROFILING_COMMAND_START, sizeof(cl_ulong), &start, NULL);
clGetEventProfilingInfo(perf_event, CL_PROFILING_COMMAND_END, sizeof(cl_ulong), &end, NULL);

//END-START gives you hints on kind of “pure HW execution time”
//the resolution of the events is 1e-09 sec
g_NDRangePureExecTimeMs = (cl_double)(end - start)*(cl_double)(1e-06); 

Important caveats:

  • The queue should be enabled for profiling (CL_QUEUE_PROFILING_ENABLE property) in creation time.
  • You need to explicitly sync via clWaitForEvents. The reason is that device time counters (for the command being profiled) are associated with the specified event.

This way you can profile operations on both Memory Objects and Kernels. Refer to section 5.12 of the OpenCL 1.1 standard for the detailed description of profiling events. Notice that host-side wall-clock time might return different results. For CPU the difference is typically negligible though.

Comparing OpenCL Kernel Performance with Performance of Native Code

Back to top

When comparing the OpenCL kernel performance with native code (e.g., C or Intel® SSE), make sure that you wrapped exactly the same set of operations. For example:

  • Don’t include program build time in the kernel execution time.
    • This build step can be amortized well via program pre-compilation (refer to clCreateProgramFromBinary).
  • Track data transfers costs separately.

Also ensure the working set is identical for native/OpenCL code. Similarly, for correct performance comparison, access patterns should be the same (e.g. rows vs. columns).

Finally make sure you’re demanding the same accuracy. For example rsqrt(x) built-in is inherently of the higher accuracy than __mm_rsqrt_ps SSE intrinsic. There are 2 options for more fair performance in this particular case:

  • Either equip __mm_rsqrt_ps in your native code with couple of additional Newton-Raphson iterations, to match the precision of OpenCL’s rsqrt.
  • Alternatively you can use native_rsqrt in your kernel, which would map exactly to rsqrtps instruction in the final assembly.

Similarly to rsqrt, there are relaxed versions for rcp, sqrt, etc, refer to the “Working with the -cl-fast-relaxed-math Flag” of the Intel® OpenCL SDK User's Guide for the full list.

Getting Credible Performance Numbers

Back to top

In the world of computing, performance conclusions are typically deduced from sufficiently large number of invocations of the same routine. Since the first iteration is almost exclusively slower than later iterations, minimum (or average, geomean, etc.) value for the execution time is usually used for final projections. A simple alternative to having a loop that calls your kernel a zillion times is having a single "warming" run, as explained in this section.

A "warming" run is especially helpful for small/lightweight kernels for which one-time overheads (like some "lazy" object creations, delayed initializations and other costs potentially incurred by the OpenCL run time) might really cost something. "Warming" run also brings data in the cache. Thus, for bandwidth-limited kernels operating on the data that doesn't fit last-level cache, the "warming" run is unlikely to help.

If your kernel is just a small number of instructions executed over a small data set, then even an infinitely precise measurement mechanism is very unlikely to yield a reliable result. This is due to the influence of the OS, cache, threading, etc. Consider having the kernel run for at least 20 milliseconds.

The bottom line is that you need to build your performance conclusions on reproducible data. If the "warming" run doesn't help and/or execution time still varies, you can try to run a large number of iterations and then average the results (for time values that range too much, geomean is preferable).

Remember that kernels that are too lightweight wouldn't give you reliable data, so making them artificially heavier could give you important insights into the hotspots. Examples are adding loop into the kernel, or replicating its heavy pieces.

Using Tools

Back to top

Once you get the stable/reproducible performance numbers, the next question would be about what to optimize first.

Unless you suspect some specific parts of the kernel (e.g. heavy math built-ins), we strongly recommend using Intel® VTune™ Amplifier XE to determine hot-spots as described in Working With the Intel® VTune™ Amplifier XE 2011 article.

Remember that tuning the kernel itself might in turn require tweaking the run-time parameters as well, e.g., increasing work-group size once your kernel gets faster (larger work-groups would help to amortize run-time overheads). That is why the best practice is letting the run-time decide on optimal local size as described above.

You can also check the overall CPU utilization and job distribution with Intel® GPA as detailed in Optimize OpenCL™ code with the Intel® Graphics Performance Analyzers.

Use the Offline Compiler to inspect resulting assembly as described in Inspect your code with the Intel® OpenCL SDK Offline Compiler. Check whether your kernel is vectorized as you expect it to be, especially if you're trying to compare to your hand-tuned Intel® SSE instructions.






OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos.
For more complete information about compiler optimizations, see our Optimization Notice.

Comments

raja38's picture

This is very nice and intro article for the newcomers .

Mohamed Amine BERGACH's picture

Thank you for this article, it's very usefull.