Download Collecting OpenCL*-related Metrics with Intel® Graphics Performance Analyzers [PDF 440KB]
Intel® SDK for OpenCL* Applications 2012 is a comprehensive software development environment for OpenCL visual computing applications on 3rd generation Intel® Core™ processor family-based platforms.
Starting with the 2012 release, Intel SDK for OpenCL Applications supports the OpenCL 1.1 full profile on 3rd gen Intel Core processors with Intel® HD Graphics 4000/2500. The SDK is available at www.intel.com/software/opencl.
Optimizing performance for Intel HD Graphics devices is discussed in the Intel® SDK for OpenCL* Applications 2012 OpenCL* Optimization Guide. The SDK also provides integration with the Intel® Graphics Performance Analyzers (Intel® GPA), which enables you to optimize and analyze your OpenCL code in visual computing applications.
Intel GPA supports various metrics for Intel CPU and HD Graphics devices. Some metrics are specific to the rendering (Microsoft DirectX* API) pipeline, while others are more general and can be associated with OpenCL execution.
With Intel GPA you can also inspect various important hardware counters for Intel CPU and HD Graphics devices in real time, for example:
- Utilization of CPU cores and the execution units in Intel HD Graphics devices
- Memory traffic for Intel HD Graphics devices
- Power consumption, and so on
This article provides an overview of the OpenCL-related metrics for Intel HD Graphics devices. For CPU metrics, finer-grained methods exist, for example, using Intel® VTune™ Amplifier XE software. For details on profiling OpenCL on the CPU, see the Tuning with the Intel® VTune™ Amplifier XE chapter of the Intel® SDK for OpenCL* Applications User’s Guide. You can also check the overall CPU utilization and job distribution with Intel GPA, as detailed in the section Using the Intel® Graphics Performance Analyzers (Intel® GPA) Platform Analyzer of the same User’s Guide.
Intel® GPA System Analyzer
To get access to metrics, first run your application with Intel® GPA Monitor. Then use the Intel® GPA System Analyzer Heads-up Display (HUD), which displays metrics in panels on top of your application (Figure 1, left). Alternatively, you can open the standalone version of the System Analyzer in a separate window (Figure 2). See the Intel® Graphics Performance Analyzers 2012 R3 Online help for details.
When you run the HUD version of the Intel GPA System Analyzer (by pressing Ctrl-F1), you can display only four metrics simultaneously. You can select the metrics to display in the Profile settings of the Intel GPA Monitor (Figure 1, right).
When you run the standalone version of the Intel GPA System Analyzer (Figure 2), you can display many metrics at the same time. The standalone version does not load the GPU with its own GUI rendering, which makes it preferable when the overhead of the System Analyzer itself is a concern. You can use it over a network connection, although local profiling (with dual computer monitors or non-full screen applications) is also possible. Be aware that the overhead of collecting the metrics may affect your measurements, so check any data you collect carefully to make sure it makes sense. See the Running the Intel® GPA System Analyzer in Local and Network Modes section of the Intel® Graphics Performance Analyzers 2012 R3 Online help for details.
Figure 1. Left: Intel® GPA System Analyzer HUD (press Ctrl-F1 to switch the modes). Right: HUD metrics for a selected Intel® GPA Monitor profile.
Figure 2. Standalone version of the Intel® GPA System Analyzer. Left: connection to the machine (use “localhost” to connect to the local machine). Right: List of applications available for profiling.
After you connect to the machine and select the application to profile, the main window of the analyzer appears (Figure 3).
Figure 3. Intel® GPA System Analyzer in action
When profiling non-DirectX applications, only CPU metrics are available. The reason is that even general GPU metrics are queried as deltas within each Direct3D* frame. Thus, you cannot view any GPU–related metrics for console applications.
The following types of OpenCL-related metrics are supported:
- CPU-specific metrics, such as core utilization
- HD Graphics execution units (EUs) metrics, of which GPU EUs active/idle/stalled are the most important
- Memory metrics, such as GPU memory reads/writes
- Power metrics for CPUs, Intel HD Graphics devices, and the whole package
You can locate them in the metrics tree at the left side of the Intel GPA System Analyzer (Figure 4). The “GPU busy” metric does not include general computations, like OpenCL, so it actually means “GPU Busy within 3D rendering contexts.” See the Intel® Graphics Performance Analyzers 2012 R3 Online help for details.
Figure 4. List of metrics available in Intel® GPA System Analyzer
Commonly Used Metrics
The screenshot of the System Analyzer tool in Figure 3 shows the available metrics on the left. The active metrics (dragged and dropped from the list) are displayed in charts. This section describes some of the metrics you will commonly use.
Isolating OpenCL* Tasks
Generally, metrics are collected for the entire system. Hence, tasks like rendering, desktop/Windows* updates, and general computations with OpenCL are measured together. Try to minimize the influence of rendering when debugging OpenCL performance for Intel HD Graphics devices. Various video activities (for example, Intel® Quick Sync Video or hardware-assisted transcoding with Intel® Media SDK) also influence the metrics; see “Using Intel® Graphics Performance Analyzers (GPA) to analyze Intel® Media Software Development Kit-enabled applications” for an example analysis.
You can isolate the OpenCL tasks in several ways:
- Temporarily remove/comment expensive Microsoft DirectX calls.
- If possible, skip intensive rendering with a simple change of camera direction or position, for example, so that most of the scene falls outside the view frustum.
- Zoom out so that only a few pixels are processed by expensive pixel shaders.
- Estimate or minimize the number of rendered pixels with the “Pixels Rendered” metric.
Similarly, you can minimize the influence of video transcoding by disabling the decoding (and working with a single frame), etc.
See the OpenCL* and Intel® Media SDK Interoperability sample, which enables you to pause/resume decoding and processing with OpenCL through a simple GUI. You can experiment with it using the System Analyzer to understand how the sample’s different pipeline stages contribute to the metrics.
The frame time metric represents the instantaneous frame time in microseconds. When you choose this metric, information on frame duration is displayed (see the screenshot in Figure 5).
Figure 5. Frame time metric
The Frame Time metric applies to the DirectX frame (for example, the time between Present calls). Refer to the section above (“Isolating OpenCL* Tasks”) for general recommendations on a careful breakdown of frame costs.
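Because the metric is reported in microseconds, it can help to convert a reading to milliseconds or frames per second before comparing runs. The following is a minimal sketch; the function names and the sample value are hypothetical and are not taken from Figure 5 or from Intel GPA itself.

```python
def frame_time_to_ms(frame_time_us: float) -> float:
    """Convert a frame time in microseconds to milliseconds."""
    return frame_time_us / 1000.0


def frame_time_to_fps(frame_time_us: float) -> float:
    """Convert a frame time in microseconds to frames per second."""
    if frame_time_us <= 0:
        raise ValueError("frame time must be positive")
    return 1_000_000.0 / frame_time_us


if __name__ == "__main__":
    sample_us = 20_000  # hypothetical reading, for illustration only
    print(frame_time_to_ms(sample_us), "ms")
    print(frame_time_to_fps(sample_us), "fps")
```

Comparing these values with and without the OpenCL work enabled gives a rough estimate of what the OpenCL stage contributes to the frame.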
General Execution Units Metrics
Unless your algorithm is memory-bound, the execution units (EUs) are likely the performance bottleneck for your application and the EU metrics can provide information on these bottlenecks. The goal is to maximize the utilization of EUs with useful computations. Refer to the OpenCL* Optimization Guide for tips and tricks.
The following information briefly describes the EU-related metrics:
- GPU EUs Active represents the percentage of time when the GPU execution units (EUs) were actively executing instructions.
- GPU EUs Idle represents the percentage of time when the EUs were idle. An EU is idle when it is neither actively executing instructions nor stalled trying to execute shader instructions (see the next item).
- GPU EUs Stalled represents the percentage of time when the EUs were stalled. An EU becomes stalled when all of its threads are waiting for results from fixed function units, for example, when requesting data from the texture sampler. Sampler stalls can be examined in more detail with the dedicated “EUs Stalled on Samp.” metric.
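The definitions above imply that the three states partition EU time: whatever is neither active nor stalled is idle. The sketch below works through that arithmetic, assuming the three percentages add up to 100%. The function names and sample readings are hypothetical, and the "dominant state" is only a first hint on where to look, not a diagnosis.

```python
def eu_idle_percent(active_pct: float, stalled_pct: float) -> float:
    """Idle share of EU time, assuming active + stalled + idle = 100%."""
    if active_pct < 0 or stalled_pct < 0 or active_pct + stalled_pct > 100:
        raise ValueError("percentages must be non-negative and sum to at most 100")
    return 100.0 - active_pct - stalled_pct


def dominant_eu_state(active_pct: float, stalled_pct: float) -> str:
    """Return the state ('active', 'stalled' or 'idle') with the largest share."""
    states = {
        "active": active_pct,
        "stalled": stalled_pct,
        "idle": eu_idle_percent(active_pct, stalled_pct),
    }
    return max(states, key=states.get)


if __name__ == "__main__":
    # Hypothetical readings, for illustration only.
    print(eu_idle_percent(35.0, 50.0))    # idle share in percent
    print(dominant_eu_state(35.0, 50.0))  # which state dominates
```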
In the screenshot in Figure 6, the percentage of GPU EUs Stalled is quite high. This might indicate inefficient memory bandwidth usage (for example, suboptimal data access granularity or cache thrashing, so that the GPU waits for data to arrive). If you suspect data stalls, you could change the kernel to increase the operations/byte ratio.
Figure 6. General Execution Units Metrics
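The operations/byte ratio can be estimated by hand from the kernel source. The sketch below is one way to do that bookkeeping; the function name and the example kernel (a per-element `y[i] = a * x[i] + y[i]` on 32-bit floats) are assumptions made for illustration, and the estimate ignores caching.

```python
def ops_per_byte(ops_per_item: int,
                 bytes_read_per_item: int,
                 bytes_written_per_item: int) -> float:
    """Arithmetic operations performed per byte of memory traffic."""
    total_bytes = bytes_read_per_item + bytes_written_per_item
    if total_bytes <= 0:
        raise ValueError("a work-item must move at least one byte")
    return ops_per_item / total_bytes


if __name__ == "__main__":
    # Hypothetical kernel: one multiply and one add per element,
    # two 4-byte reads (x[i], y[i]) and one 4-byte write (y[i]).
    baseline = ops_per_byte(2, 8, 4)
    # Same traffic, but more work fused into a single pass over the data.
    fused = ops_per_byte(8, 8, 4)
    print(baseline, fused)
```

Fusing several passes over the same data into one kernel is one common way to raise the ratio, because the traffic stays the same while the amount of useful work grows.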
Stalling might also result from contention for fixed function units, for example, the mathbox, which is used for operations such as transcendental math. This type of stall can be examined with the “EUs Stalled on Math” metric. Try relaxed or native precision in this case. See the Intel® Graphics Performance Analyzers 2012 R3 Online help for details.
Finally, if the number of work-groups in flight is insufficient, EU utilization might be very low (GPU EUs Idle will be high). Too small a local size passed to the clEnqueueNDRangeKernel call can also leave units idle. See the “Work-Group Size Recommendations Summary” section in the OpenCL* Optimization Guide for details.
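To see how the local size affects the number of work-groups, the arithmetic can be sketched as follows. OpenCL 1.1 requires the global size to be evenly divisible by the local size in each dimension, so the global size is commonly rounded up and the kernel guards against the padded work-items. The helper names and the sizes below are hypothetical; appropriate local sizes for a given device are covered in the Optimization Guide.

```python
def padded_global_size(work_items: int, local_size: int) -> int:
    """Round work_items up to the next multiple of local_size."""
    if work_items <= 0 or local_size <= 0:
        raise ValueError("sizes must be positive")
    return ((work_items + local_size - 1) // local_size) * local_size


def work_group_count(global_size: int, local_size: int) -> int:
    """Number of work-groups launched in one dimension."""
    if local_size <= 0 or global_size % local_size != 0:
        raise ValueError("global size must be a positive multiple of local size")
    return global_size // local_size


if __name__ == "__main__":
    items = 1000   # hypothetical problem size
    local = 64     # hypothetical local size
    padded = padded_global_size(items, local)
    print(padded, work_group_count(padded, local))
```

The same calculation is applied independently to each dimension of a two- or three-dimensional range.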
Among the memory-related metrics, GPU Memory Reads/Writes are the ones that can be attributed to OpenCL kernel execution. When you choose these metrics, the memory traffic of your application is displayed, similar to the screenshot in Figure 7.
Figure 7. Memory metrics
For this specific example, the total (reads+writes) traffic is rather intensive, totaling ~13 GBytes/sec, which indicates a reasonable level of bandwidth saturation. See the OpenCL* Optimization Guide for theoretical memory performance and hints on saturating the bandwidth.
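A total like the one above is simply reads plus writes over a time interval. The sketch below shows that arithmetic, assuming you have byte counts for an interval measured in microseconds and that 1 GByte means 10^9 bytes. The function name and sample numbers are hypothetical; they are chosen to land at 13 GBytes/sec and are not read from Figure 7.

```python
def memory_traffic_gbytes_per_sec(bytes_read: int,
                                  bytes_written: int,
                                  interval_us: float) -> float:
    """Total memory traffic in GBytes/sec (1 GByte = 10**9 bytes)."""
    if interval_us <= 0:
        raise ValueError("interval must be positive")
    # bytes per microsecond equals MBytes/sec; divide by 1000 for GBytes/sec.
    return (bytes_read + bytes_written) / (interval_us * 1e3)


if __name__ == "__main__":
    # Hypothetical counts for one 10 ms interval.
    print(memory_traffic_gbytes_per_sec(100_000_000, 30_000_000, 10_000))
```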
Try to access data in large (128-bit) chunks. See the OpenCL* Optimization Guide for details. Another good optimization to relax the bandwidth pressure is using local memory.
The key points covered in this paper are:
- Intel GPA System Analyzer provides a convenient way to track a selected set of metrics (counters), in either the HUD or the standalone version.
- System Analyzer provides a set of OpenCL-related metrics that cover Intel CPU and HD Graphics devices and socket (system) utilization.
- For Intel HD Graphics devices, metrics for execution units and memory utilization are the most important.
- Refer to the OpenCL* Optimization Guide for more tips on metric values.