Intel® VTune™ Amplifier XE: Getting started with OpenCL* performance analysis on Intel® HD Graphics

Download PDF [1MB]


If you are looking to profile applications that use Intel® HD Graphics or Intel® Iris™ Graphics (referred to as GPU in this article) for rendering, video processing, and computations, the Intel® VTune™ Amplifier XE is the performance profiler for you. The Intel VTune Amplifier XE can monitor, analyze, and correlate activity on both the CPU and GPU.

The VTune Amplifier tracks the overall GPU activity (graphics, media, and compute), collects the Intel® Integrated Graphics hardware metrics, details OpenCL™ activity on the GPU, and then presents them correlated with the CPU processes and threads. Figure 1 shows CPU and GPU activities presented by the VTune Amplifier.

VTune Amplifier Activity
Figure 1.The VTune™ Amplifier view representing CPU/GPU activity correlation and GPU OpenCL™ kernels of the Media-OpenCL interoperability sample application executing on the 4th generation Intel® Core™ processor.

Tracking GPU activity helps to identify:

  • whether the workload is CPU or GPU bound
  • what are the most time-consuming OpenCL kernels running on the GPU
  • if the GPU is being used at its full potential and what is the headroom for improvement, if any

Tuning an application is a combination of optimizing the CPU and GPU and the interactions between them. So, the application optimization typically starts with an API-level analysis to sanitize the general application flow and improve overall CPU and GPU utilization, followed by GPU-side analysis of the OpenCL kernels execution. This paper just briefly touches on CPU / GPU interactions, but it focuses on the analysis and optimization of GPU OpenCL kernels.

To get the most out of this discussion, you should be familiar with VTune Amplifier XE, its workflow, and main concepts. You can refer to the VTune Amplifier help for further information.

Currently this functionality is supported only on Windows* OSs. General GPU busyness data is available on any Intel® processor running Windows Vista, Windows 7, or Windows 8, but the Intel Integrated Graphics hardware metrics and GPU OpenCL data are only available on the 3rd and 4th generation Intel® Core™ processors and the Intel® Core™ M processor.

Configuring Collection

Important: Similar to running graphics applications with hardware acceleration, you cannot run GPU data collection via a Remote Desktop connection. To run the GPU data collection, you must be running the VTune Amplifier from the target computer’s console or accessing the computer through VNC.

To monitor general GPU busyness over time, run the VTune Amplifier as Administrator.

From the New Analysis dialog use the Advanced Hotpots (BasicHotspots, Concurrency, or Locks and Waits) analysis to monitor GPU activities. GPU collection is controlled by two settings:Analyze GPU Usage and Analyze Processor Graphics events as shown in Figure 2.

We suggest you configure collection with Analyze GPU Usage enabled and choose Overview at Analyze Processor Graphics events.

Collection Configuration
Figure 2. Collection configuration dialog with widgets enabling GPU collection

To enable OpenCL profiling on the GPU, select the option Trace OpenCL kernels on Processor Graphics.

When collection and post-processing have completed, switch to the Graphics tab to see details of GPU activity, also correlated with CPU processes and threads as shown in Figure 1.


This paper focuses on GPU OpenCL application analysis. But all GPU OpenCL applications do a lot of work on the CPU so the analysis must include both the CPU and GPU.

The GPU OpenCL command-queue is created through the dedicated OpenCL API call on the CPU. GPU programs are built on the CPU before the resulting kernels are submitted to the GPU for execution. OpenCL calls can take a long time to execute and thus influence the application’s progress. Refer to the next sections explaining how the GPU OpenCL command-queue and OpenCL API calls are presented by the VTune Amplifier.

GPU performance analysis and possible optimization require an understanding of OpenCL concepts and how they map to the underlying GPU architecture. Refer to the OpenCL* optimization guide, which provides the key details. The next two sections give the basics of Intel Integrated Graphics architecture, some mapping between GPU metrics and OpenCL primitives and how this information is presented by the VTune Amplifier.

Hot GPU OpenCL* kernels

As shown in Figure 1, switching Grouping to Computing Tasks Purpose / Computing Task (GPU) / Instance provides information about all OpenCL kernels executed on the GPU as well as OpenCL memory commands.

The corresponding columns show the overall execution time for a particular kernel on the GPU and the average time for a single invocation (corresponding to one call of clEnqueueNDRangeKernel), work-group sizes, as well as averaged GPU hardware metrics collected for a kernel.

Every clCreateKernelresults in one entry (a line) in the table under the Compute node. If two different kernels with the same name (even from the same source file) were created through two clCreateKernel calls (and then invoked through two or more clEnqueueNDRangeKernel calls), two lines with the same kernel name will appear in the table. Computing Task Purpose / Source Compute Task (GPU) grouping is useful to aggregate data per the same kernel source.

You should analyze and optimize hot kernels with the longest overall times first. These include kernels characterized by long average times and kernels whose average times are not long, but they are invoked more frequently than the others. Both groups deserve attention.

Memory commands that transfer data from, to, or between memory objects, or map and unmap memory objects also contribute to the overall time on CPU and GPU. They are shown in the table under the Transfer node.

OpenCL™ API calls and GPU Command-queue

Host OpenCL API calls are shown on the corresponding CPU threads on the Graphics tabTimeline.

VTune Amplifier API Calls
Figure 3.The VTune™ Amplifier view representing GPU OpenCL™ command-queue and OpenCL API calls. Notice ProcessUV kernel selected in the GPU OpenCL command-queue lane is also automatically highlighted in the Grid. GPU HW metrics lanes, except GPU Execution Units metrics, are hidden.

A GPU OpenCL command-queue is shown as one of the Timeline lanes. It holds a kernel or a memory command from the moment it is queued to the command-queue until it executed to the completion on the GPU. The GPU OpenCL command-queue lane enables inspecting the state of the queue at every moment as well as command durations on the GPU.

Different OpenCL kernels and OpenCL API calls are shown in different colors. Memory commands are marked with a hatching pattern. Figure 3 presents a fragment of the application with OpenCL API calls and GPU OpenCL command-queue on the Timeline.

A correlation of OpenCL API calls on the CPU and GPU commands in the OpenCL command-queue helps detect the situation when the GPU is idle because it is waiting for some long calls executing on the CPU. It is also important to be aware of memory commands and the time they take. To avoid copying data, align buffers when allocating them on the CPU to be shared with OpenCL context via CL_MEM_USE_HOST_PTR or let OpenCL runtime efficiently allocate host shareable memory objects with CL_MEM_ALLOC_HOST_PTR.

A few words about GPU architecture

A GPU is a highly parallel machine where computational work is done by an array of small cores, a.k.a. execution units (EUs). Each EU simultaneously runs several light-weight threads. When one of those threads is picked up for execution, it can hide stalls in the other threads if the other threads are stalled waiting for data from memory or something else.

To use the full potential of the GPU, parallel applications should enable the scheduling of as many threads as possible and minimize idle cycles. Minimizing stall cycles is also very important for GPU applications.

Figure 4.Schematic view of Intel® Integrated Graphics hardware metrics with respect to GPU functional units

Intel GPU hardware metrics provide information about integral GPU resource usage over a sampled period, for example, the ratio of cycles when EUs were idle, stalled, or active as well as statistics on memory accesses and other functional units. Figure 4 gives a schematic view of the metrics VTune can display collected across different parts of the Intel GPU.

If you want to better understand the details of Intel® processor graphics architecture, please read the Graphics Architecture Guide

What do the metrics tell?

GPU hardware metrics show if the GPU hardware resources are being used efficiently and if any performance improvements are possible.

Many metrics are represented as a ratio of cycles when the GPU functional unit(s) is in a specific state over all the cycles available for a sampling period. The three major examples of this are given below.

The metric describing EU activity, EU Array Active, is

The metric describing EU stalls (EU Array Stalled) is

The metric describing EU idle (EU Array Idle) is

Possible values for these three metrics are between 0 and 1.

Using metrics for optimization

The first metric to look at is EU Array Idle. Idle cycles are wasted cycles. No threads are scheduled and the EUs’ precious computational resources are wasted. If EU Array Idle is zero, the GPU is reasonably loaded and all EUs have threads scheduled on them. If it is non-zero, it usually means one of two things: either there are problems in thread scheduling (e.g., an insufficient number of work-groups to run in parallel; thread dispatch serialization is preventing threads from being scheduled on some EUs for many cycles) or there is an imbalance between threads within a work-group. Thread dispatch serialization becomes a gating factor when a kernel has insufficient work per a work-item. An imbalance occurs when some threads in a work-group have already finished their work while others are still running, so another work-group cannot start until all threads of the previous one are completed.

When some of the kernel GPU hardware metrics signal performance issue(s), the VTune Amplifier highlights the corresponding metric value(s) in pink in the table. Hover the mouse on that cell and the performance analyzer will show a hint explaining the problem and possible fix.

In most cases the optimization strategy is to minimize the EU Array Stalled metric and maximize the EU Array Active. The exception is memory bandwidth-bound algorithms where optimization should strive to achieve a memory bandwidth close to the peak for the specific platform (rather than maximize EU Array Active).

Memory accesses are the most frequent reason for stalls. The importance of memory layout and carefully designed memory access cannot be overestimated. If the EU Array Stalled is non-zero and correlates with the GPU L3 Misses, and if the algorithm is not memory bandwidth-bound, you should try to optimize memory accesses and layout.

One of the optimization techniques is to use shared local memory (SLM). To do this, copy the contents of global buffers to SLM so the contents can be collectively reused by work-items within a work-group. You should use SLM with an understanding that it decreases the effective L3 space if a significant amount of SLM space is requested, and the number of work-groups running in parallel may be also limited.

A good experimentation technique is to reserve a portion of SLM in a kernel source simply as
local float temp[BUFF_SIZE];
where 4*BUFF_SIZE is the SLM size you are going to use. Then check if the kernel average time and metrics significantly change as a result. If they do not change, this means that the SLM size you need does not significantly affect the kernel L3 footprint and the work-groups running in parallel, so SLM-based optimization might be successful.

Kernel SLM accesses are measured by GPU Shared Local Memory Read/Write metrics.

Sampler accesses are expensive and can easily cause stalls. They are induced by read_image<> calls within a kernel. They result in Sampler Is Bottleneck and Sampler Busy metrics. Whenever possible, avoid sampler accesses and use simple memory buffers instead of read_image<> calls. However, this is not always possible. In cases when Sampler Is Bottleneck is non-zero, the best way to decrease stalls may be by having adjacent work-items request adjacent pixels in your read_image<> calls, if this wasn’t already done. Adjacent pixels ensure a good locality and, as a result, the Sampler returns data faster.

Please refer to the Appendix for the detailed list of all Intel® VTune™ Amplifier XE metrics.


This paper presents an overview of the VTune Amplifier capabilities to analyze and optimize GPU OpenCL applications.

About the Intel® SDK for OpenCL Applications

TheIntel® SDK for OpenCL™ Applications 2014 is the latest version of the Intel SDK for OpenCL applications products. The SDK is a comprehensive software development environment for OpenCL applications on 3rd and 4th generation Intel Core processors supporting OpenCL 1.2, and on the latest Intel® Core™ M processor supporting OpenCL 2.0 for Windows 7 and Windows 8 operating systems. This SDK provides developers with tools for the building, debugging, and tuning stages of OpenCL application development. For the best development experience, we recommend you use the Intel SDK for OpenCL in conjunction with the Intel VTune Amplifier XE.

The SDK is available free of charge at


GPU – graphics processing unit

EU – execution unit, a single core in the array that does all computational work on the GPU

SLM – shared local memory, software controlled memory shared by working items of one OpenCL working group

LLC – last level cache, a cache located in the CPU uncore that is shared by CPU cores and integrated GPU

VTune analyzer metrics for Intel Integrated Graphics hardware

Metric. Preset 1,2Metric descriptionHow to use

EU Array Active

The normalized sum of all cycles on all cores spent actively executing instructions.

For a compute bound code, it should be as close to 1 as possible.

EU Array Stalled

The normalized sum of all cycles on all cores spent stalled. At least one thread is loaded, but the core is stalled for some reason.

If non zero, look for possible causes of the stall (e.g., memory or sampler accesses).

EU Array Idle

The normalized sum of all cycles on all cores when no threads were scheduled on a core. By definition, it equals (1 – (EU Active + EU Stalled)).

Identify imbalance or thread scheduling problems. Ideally, it should be 0.

Compute Shader Threads

Number of threads started across all EUs for compute work.

Helps distinguish compute work from graphics work. For pure graphic activity (without compute shaders), this metric will be zero.

Metric. Preset 1Metric descriptionHow to use

GPU Memory Reads/Writes

Reads/Writes from GPU from/to chip uncore (LLC) and memory. Those are all memory accesses that miss in internal GPU L3 cache and are serviced either from uncore or main memory.

Accesses serviced from uncore (LLC) or main memory often cause stalls (observed as EU Stalled != 0) as they have longer latencies.
But these latencies can be hidden if there is enough computational work in a kernel, which results in no stalls (EU Stalled ~ 0).
The optimization of memory accesses should be considered only when EU Stalled !=0 together with the presence of a significant amount of L3 misses. For compute and memory latency bound code, try to improve memory layout and the access pattern (e.g., use SLM, re-use data in L3). For memory bandwidth bound code, try to achieve the maximal bandwidth regardless of the presence of EU stalls (e.g., put more work into a single work-item to decrease the number of the work-groups)

L3 Cache Misses

All read and write misses in GPU L3 cache.

Sampler Busy

The normalized sum of all cycles on all cores when the Sampler was busy (e.g., due to read_image).

If ~ 1, this might be a cause of the stalls. Decrease the use of Sampler (e.g., use simple buffers created with clCreateBuffer).


Sampler stalls EUs due to the full input fifo queue, and starves the output fifo, so EUs need to wait to submit requests to Sampler.

If >~ 0.01, might be a cause of the stalls. Decrease use of Sampler or access it with a better locality.

Texture Read

Sampler unit misses in sampler cache.

Might be a cause of the stalls. Decrease use of Sampler or access it with a better locality.

Metric. Preset 2Metric descriptionHow to use

Untyped Memory Reads/Writes

Memory accesses to buffer created with clCreateBuffer

Counts all accesses

Typed Memory Reads/Writes

Memory accesses to typed buffers, e.g., writes to buffers created with clCreateImage. However, reads from images are counted by Sampler accesses and Texture Read.

Counts all accesses

SLM Reads/Writes

Memory accesses to Shared Local Memory.

Counts all accesses to SLM and reports SLM bandwidth. Useful to see how close SLM has achieved its theoretical bandwidth peak.

Any software source code reprinted in this document is furnished under a software license and may only be used or copied in accordance with the terms of that license.

Intel, the Intel logo, Core, and VTune are trademarks of Intel Corporation in the US and/or other countries.

Copyright © 2014 Intel Corporation. All rights reserved.

*Other names and brands may be claimed as the property of others.

OpenCL and the OpenCL logo are trademarks of Apple Inc and are used by permission by Khronos.

PDF icon VTune-GPU-Getting-Started.pdf3.44 MB
For more complete information about compiler optimizations, see our Optimization Notice.