User Guide

Contents

Window: Summary - GPU Analysis

Use the
Summary
window as your starting point of the GPU Offload or GPU Compute/Media Hotspots performance analysis of the
Intel® VTune™
Profiler
. To access this window, click the
Summary
sub-tab in the result tab.
Use the Elapsed Time metric as your primary indicator and a baseline for comparison of results before and after optimization. Note that for multithreaded applications, the CPU Time is different from the Elapsed Time since the CPU Time is the sum of CPU time for all application threads.
Depending on the selected GPU analysis type, the following statistics is available in the
Summary
window:
You may click the
Copy to Clipboard
button to copy the content of the selected summary section to the clipboard.

GPU Utilization

If your system satisfies configuration requirements for GPU analysis (i915 ftrace event collection is supported), VTune Profiler displays detailed
GPU Utilization
analysis data across all engines that had at least one DMA packet executed. By default, the
VTune
Profiler
flags the GPU utilization less than 80% as a performance issue. In the example below, 85.9% of the application elapsed time was utilized by GPU engines.
GPU Utilization
Depending on the target platform used for GPU analysis, the
GPU Utilization
section in the Summary window shows the time (in seconds) used by GPU engines. Note that GPU engines may work in parallel and the total time taken by GPU engines does not necessarily equal the application Elapsed time.
You may correlate GPU Time data with the Elapsed Time metric. The GPU Time value shows a share of the Elapsed time used by a particular GPU engine. If the GPU Time takes a significant portion of the Elapsed Time, it clearly indicates that the application is GPU-bound.
If your system does not support i915 ftrace event collection, all the GPU Utilization statistics will be calculated based on the hardware events and attributed to the
Render and GPGPU
engine.
The
Summary
view provides the
Packet Queue Depth Histogram
that helps you estimate the GPU software queue depth per GPU engine during the target run:
Packer Queue Depth Histogram
Ideally, your goal is an effective GPU engine utilization with evenly loaded queues and minimal duration for the zero queue depth.
For a high-level view of the DMA packet execution during the target run, review the
Packet Duration Histogram
:
Packet Duration Histogram
Select a required packet type from the drop-down menu and identify how effectively these packets were executed on the GPU. Having high Packet Count values for the minimal duration is optimal.
To get detailed information on the packet queues and execution, switch to the Platform tab and analyze the GPU software queue on the timeline.
For OpenCL™ applications, explore the
Hottest GPU Computing Tasks
section that helps you understand which OpenCL kernels had performance issues:
Hottest GPU Computing Tasks
Mouse over a flagged computing task for details on a performance issue. For example, for the
Intersect
computing task a significant portion of the GPU time was spent in stalls, which may result from frequent sampler or memory accesses. You may click a hot GPU computing task to open the Graphics window with this computing task pre-selected for your convenience.

EU Array Stalled/Idle

For the compute-bound workloads, explore the
EU Array Stalled/Idle
section that shows the most typical reasons why the execution units could be waiting. This section shows up for the analysis that collects Intel® HD Graphics and Intel® Iris® Graphics hardware events for the GPU Compute/Media Hotspots.
Depending on the event preset you used for the configuration, the
VTune
Profiler
analyzes metrics for stalled/idle executions units. The GPU Compute/Media Hotspots analysis by default collects the Overview preset including the metrics that track general GPU memory accesses, such as Sampler Busy and Sampler Is Bottleneck, and GPU L3 bandwidth. As a result, the
EU Array Stalled/Idle
section displays the Sampler Busy section with a list of GPU computing tasks with frequent access to the Sampler and hottest GPU computing tasks bound by GPU L3 bandwidth:
Sampler Busy
If you select the Compute Basic preset during the analysis configuration, the
VTune
Profiler
analyzes metrics that distinguish accessing different types of data on a GPU and displays the Occupancy section that helps you identify GPU tasks with low occupancy:
Low Occupancy
If the occupancy is flagged as a problem for your application, consider changing the size of computing tasks since too large or too small tasks typically make the EU array idle.
The
Compute Basic
preset also enables an analysis of the DRAM bandwidth usage. If the GPU workload is DRAM bandwidth-bound, the corresponding metric value is flagged. You can explore the table with GPU computing tasks heavily using the DRAM bandwidth during execution.
If you select the
Full Compute
preset and multiple run mode during the analysis configuration, the
VTune
Profiler
will use both
Overview
and
Compute Basic
event groups for data collection and provide all types of reasons for the EU array stalled/idle issues in the same view.
To analyze Intel® HD Graphics and Intel® Iris® Graphics hardware events, make sure to set up your system for GPU analysis

FPU Utilization

If your application execution takes more than 80% of collection time heavily utilizing both floating point units (FPUs), the
VTune
Profiler
highlights such a value as an issue and lists the kernels that overutilized the FPUs:
Click a flagged kernel to switch to the
Graphics
tab >
Timeline
pane, explore the distribution of the
GPU EU Instructions
metric that shows the FPU usage during the analysis run, and identify time ranges with the highest metric values. To address high FPU utilization issue for your code, consider reducing computations.

Bandwidth Utilization

For memory-bound applications, explore the
Bandwidth Utilization Histogram
section that includes statistics on the average system bandwidth and a Bandwidth Utilization histogram that shows how intensively your application was using each bandwidth domain:

Collection and Platform Info

Explore the platform information including GPU and CPU data. The last four GPU characteristics are specific to Intel® HD Graphics and Intel® Iris® Graphics.

Product and Performance Information

1

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804