GPU Compute/Media Hotspots Analysis

Run the GPU Compute/Media Hotspots analysis for applications that use a Graphics Processing Unit (GPU) for rendering, video processing, and computations with explicit support of Intel® Media SDK and OpenCL™ software technology.


Use the GPU Compute/Media Hotspots analysis to:

  • Find GPU tasks with high GPU utilization, estimate the effectiveness of this utilization, and identify possible reasons for stalls or low occupancy

  • Identify how effectively your application uses OpenCL kernels (for Linux* and Windows* targets only)

  • Analyze execution of Intel Media SDK tasks over time (for Linux targets only)

  • Explore GPU usage and analyze a software queue for GPU engines at each moment in time

  • Explore the performance of your application per selected GPU metrics over time

The tool infrastructure automatically aligns clocks across all cores in the entire system so that you can analyze some CPU-based workloads together with GPU-based workloads within a unified time domain.

Prerequisites

To analyze hardware events on Intel HD Graphics and Intel Iris Graphics (hereafter, Intel Graphics), your system needs to have the Intel Metrics Discovery API library installed. Depending on your target analysis system, do the following:

  • Windows* systems: the Intel Metrics Discovery API library is part of the official Intel Graphics driver package. You can install a driver for your system from https://downloadcenter.intel.com.

  • Linux* systems: installation of the Intel Metrics Discovery API library depends on your system and kernel version.

    • For CentOS* 7, make sure to install a proper version of the Intel Media Server Studio (starting with version 2015 R5) and build the kernel driver as described in the Getting Started Guide.

    • For other Linux systems, check the kernel version:

      • Kernel 4.14 and higher: to analyze GPU usage, configure your kernel with CONFIG_EXPERT=y and CONFIG_DRM_I915_LOW_LEVEL_TRACEPOINTS=y.

        If the VTune Amplifier cannot collect GPU hardware metrics and reports a corresponding error message, make sure you have correctly installed the Metrics Discovery API library from https://github.com/intel/metrics-discovery. See the Troubleshooting topic for details.

      • Kernel 4.4 or 4.7: install the Intel OpenCL SDK to patch and rebuild the kernel as described in the Getting Started Guide.

      • Kernel 4.4.111, 4.4, or 3.14.5: install the Intel Media Server Studio version applicable to your kernel version:

        • Intel Media Server Studio 2018 for 4.4.111

        • Intel Media Server Studio 2017 for 4.4

        • Intel Media Server Studio 2016 for 3.14.5

        Then build the kernel driver as described in the Generic section of the Getting Started Guide.
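For kernels 4.14 and higher, the two configuration options listed above can be checked before running the analysis. The following is a minimal POSIX-shell sketch, assuming the kernel build configuration is available as a plain-text file (many distributions install it as /boot/config-<release>, but that is a convention, not a guarantee) and that GNU sort -V is available. The functions version_ge and check_gpu_kernel_config are hypothetical helpers, not part of VTune Amplifier.

```shell
#!/bin/sh
# Hypothetical helpers (not part of VTune Amplifier) that check the kernel
# configuration options required for GPU usage analysis on kernels 4.14+.

# version_ge A B: succeed if version A is greater than or equal to version B.
# Relies on GNU "sort -V" for version-aware ordering.
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# check_gpu_kernel_config RELEASE CONFIG_FILE
# Returns 0 if both options are set to y, 1 if any of them is not,
# and 2 if the kernel is older than 4.14 (other prerequisites apply).
check_gpu_kernel_config() {
  release=$1
  config_file=$2
  # Drop a distribution suffix such as "-42-generic" before comparing.
  if ! version_ge "${release%%-*}" 4.14; then
    echo "kernel $release is older than 4.14; see the other kernel versions above"
    return 2
  fi
  status=0
  for opt in CONFIG_EXPERT CONFIG_DRM_I915_LOW_LEVEL_TRACEPOINTS; do
    if grep -q "^${opt}=y\$" "$config_file"; then
      echo "$opt=y"
    else
      echo "$opt is not set to y"
      status=1
    fi
  done
  return "$status"
}
```

On a typical distribution you might call it as check_gpu_kernel_config "$(uname -r)" "/boot/config-$(uname -r)". Kernels built with CONFIG_IKCONFIG_PROC expose the configuration as /proc/config.gz instead, which would need to be decompressed first.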

If you run GPU analysis via a Remote Desktop connection, make sure your software meets these requirements:

  • VTune Amplifier 2015 Update 2 or higher

  • Intel® Graphics driver version 15.36.14.64.4080 or higher

  • the target application can run via Remote Desktop Connection (RDC)

Otherwise, run the VTune Amplifier from the target computer's console or access the computer via VNC.

How It Works: Intel Graphics Render Engine and Hardware Metrics

A GPU is a highly parallel machine where graphical or computational work is done by an array of small cores, or execution units (EUs). Each EU runs several lightweight threads concurrently. When some of these threads are stalled waiting for data from memory or other units, the EU can pick another thread for execution, which hides the stalls of the waiting threads.

To use the full potential of the GPU, applications should enable the scheduling of as many threads as possible and minimize idle cycles. Minimizing stalls is also very important for graphics and general purpose computing GPU applications.
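The stall-hiding effect described above can be estimated with a deliberately simplified model (an illustration only, not how Intel Graphics actually schedules threads or how VTune Amplifier computes its metrics): each thread alternates C cycles of computation with M cycles of waiting, and the EU issues from any thread that is not waiting, so with N threads the EU is busy for roughly min(1, N × C / (C + M)) of the cycles. The function name eu_busy_fraction is hypothetical.

```shell
#!/bin/sh
# Simplified latency-hiding model (illustrative only):
#   busy fraction ~= min(1, threads * compute / (compute + stall))
# eu_busy_fraction THREADS COMPUTE_CYCLES STALL_CYCLES
eu_busy_fraction() {
  awk -v n="$1" -v c="$2" -v m="$3" 'BEGIN {
    f = n * c / (c + m)
    if (f > 1) f = 1          # an EU cannot be more than 100% busy
    printf "%.2f\n", f
  }'
}

# One thread that computes for 10 cycles and then waits 30 cycles keeps the EU
# busy only a quarter of the time; with four or more such threads the model
# predicts that the waiting is fully hidden.
for threads in 1 2 4 8; do
  echo "threads=$threads busy=$(eu_busy_fraction "$threads" 10 30)"
done
```

The model ignores contention for memory bandwidth and other shared units, so real occupancy gains flatten out earlier than it suggests.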

VTune Amplifier provides the Analyze Processor Graphics hardware events option to monitor Intel GPU hardware events and display metrics about integral GPU resource usage over a sampled period: for example, the ratio of cycles when EUs were idle, stalled, or active, as well as statistics on memory accesses and other functional units. If the VTune Amplifier traces the execution of OpenCL kernels on the GPU, it annotates each kernel with GPU metrics.

The scheme below displays metrics collected by the VTune Amplifier across different parts of the Intel Graphics:

GPU metrics help identify how efficiently GPU hardware resources are used and whether any performance improvements are possible. Many metrics are represented as a ratio of cycles when the GPU functional unit(s) is in a specific state over all the cycles available for a sampling period. For example, the VTune Amplifier collects data for the following basic GPU hardware metrics:

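  • EU Array Active = (sum over all EUs of the cycles spent actively executing instructions) / (number of EUs × cycles in the sampling period)

  • EU Array Stalled = (sum over all EUs of the cycles when at least one thread is loaded but the EU is stalled) / (number of EUs × cycles in the sampling period)

  • EU Array Idle = (sum over all EUs of the cycles when no threads are scheduled on the EU) / (number of EUs × cycles in the sampling period)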
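As an illustration of the ratio-of-cycles definition given above, the sketch below computes the three EU Array metrics from per-EU cycle counts. The input format (one line per EU with active, stalled, and idle cycle counts for one sampling period) and the function name eu_array_metrics are hypothetical; VTune Amplifier reads the real values from hardware counters, and its exact normalization may differ.

```shell
#!/bin/sh
# eu_array_metrics PERIOD_CYCLES < per_eu_counts
# Input (hypothetical): one line per EU: "<active> <stalled> <idle>" cycles.
# Each metric = cycles in that state summed over all EUs,
#               divided by (number of EUs * cycles in the sampling period).
eu_array_metrics() {
  awk -v period="$1" '
    NF == 3 { active += $1; stalled += $2; idle += $3; eus++ }
    END {
      total = eus * period
      if (total <= 0) { print "no EU data" > "/dev/stderr"; exit 1 }
      printf "EU Array Active:  %.1f%%\n", 100 * active  / total
      printf "EU Array Stalled: %.1f%%\n", 100 * stalled / total
      printf "EU Array Idle:    %.1f%%\n", 100 * idle    / total
    }'
}

# Two EUs observed for a 100-cycle sampling period.
printf '60 30 10\n40 40 20\n' | eu_array_metrics 100
```

Because every EU is in exactly one of the three states in each cycle, the three percentages add up to 100% (here 50.0%, 35.0%, and 15.0%).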

Configure and Run Analysis

By default, the GPU Compute/Media Hotspots analysis enables the Analyze GPU usage option so that you can explore GPU busyness over time and understand whether your application is CPU or GPU bound. With this configuration, if the Timeline view in the Graphics window shows that the GPU is busy most of the time, with only small idle gaps between busy intervals, and the GPU software queue rarely drops to zero, you can conclude that your application is GPU bound. If the gaps between busy intervals are large and the CPU is busy during these gaps, your application is CPU bound. Such clear-cut situations are rare, however, and you usually need a detailed analysis to understand all dependencies. For example, an application may be mistakenly considered GPU bound when GPU engine usage is serialized (for example, when the GPU engines responsible for video processing and for rendering are loaded in turns). In this case, the ineffective scheduling on the GPU results from the application code running on the CPU.
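The CPU-bound versus GPU-bound reasoning above can be written as a rough heuristic over timeline samples. This sketch assumes a hypothetical export with one line per time interval containing two flags, GPU busy (1 or 0) and CPU busy (1 or 0), and uses arbitrary thresholds (GPU busy in at least 90% of intervals; CPU busy in at least half of the GPU-idle intervals). VTune Amplifier does not produce this classification itself, and, as noted above, serialized engine usage can make such a simple test misleading. The function name classify_boundness is hypothetical.

```shell
#!/bin/sh
# classify_boundness < samples
# Input (hypothetical): one line per interval: "<gpu_busy> <cpu_busy>" (1 or 0).
# Thresholds are illustrative, not values recommended by VTune Amplifier.
classify_boundness() {
  awk '
    NF == 2 {
      n++
      if ($1 == 1) gpu_busy++
      else { gpu_idle++; if ($2 == 1) cpu_during_gap++ }
    }
    END {
      if (n == 0) { print "no samples"; exit 1 }
      if (gpu_busy / n >= 0.9)
        print "likely GPU bound"
      else if (cpu_during_gap / gpu_idle >= 0.5)   # gpu_idle > 0 on this branch
        print "likely CPU bound"
      else
        print "unclear: inspect the timeline in detail"
    }'
}

# GPU idle in 2 of 5 intervals, and the CPU is busy in both of those gaps.
printf '1 0\n1 0\n0 1\n0 1\n1 1\n' | classify_boundness
```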

Note

On Windows systems, to monitor general GPU usage over time, run the VTune Amplifier as an Administrator.

When the GPU is intensely busy over time, you can look deeper to understand whether it is used effectively and whether there is room for improvement. The additional configuration options of the GPU Compute/Media Hotspots analysis make such an analysis possible.

To configure and run the GPU Compute/Media Hotspots analysis:

Prerequisites: Create a project and specify an analysis system and target.

  1. Click the Configure Analysis toolbar button (available in both the standalone GUI and the Visual Studio IDE).

    The Configure Analysis window opens.

  2. From the HOW pane, click the Browse button and select Platform Analysis > GPU Compute/Media Hotspots.

    The GPU Compute/Media Hotspots analysis is pre-configured to collect GPU usage data, analyze GPU task scheduling, and identify whether your application is CPU or GPU bound.

  3. Configure the following GPU analysis options:

    • Use the GPU sampling interval, ms field to specify the interval (in milliseconds) between GPU samples for GPU hardware metrics collection. By default, the VTune Amplifier uses a 1 ms interval for the hardware event-based sampling collection and 1000 ms for the user-mode sampling and tracing collection.

    • Use the Analyze Processor Graphics hardware events option to monitor the Render and GPGPU engine usage (Intel Graphics only), identify which parts of the engine are loaded, and correlate GPU and CPU data. This option requires root/administrative privileges and is available for Windows*, Linux* (see the prerequisites above) and Android* targets.

      VTune Amplifier provides platform-specific presets of the hardware metrics. All presets collect data about execution units (EUs) activity: EU Array Active, EU Array Stalled, EU Array Idle, Computing Threads Started, and Core Frequency.

      • Overview event set also includes metrics that track general GPU memory accesses such as Memory Read/Write Bandwidth, GPU L3 Misses, Sampler Busy, Sampler Is Bottleneck, and GPU Memory Texture Read Bandwidth. These metrics can be useful for both graphics and compute-intensive applications.

      • Compute Basic (with global/local memory accesses) event group also includes metrics that distinguish accessing different types of data on a GPU: Untyped Memory Read/Write Bandwidth, Typed Memory Read/Write Transactions, SLM Read/Write Bandwidth, Render/GPGPU Command Streamer Loaded, and GPU EU Array Usage. These metrics are useful for compute-intensive workloads on the GPU.

      • Compute Extended event group includes metrics targeted only for GPU analysis on Intel processors code named Broadwell and later. On other systems, this preset is not available.

      • Full Compute event group is a combination of the Overview and Compute Basic event sets.

    • Use the Trace OpenCL and Intel Media SDK programs option to explore execution time for runtimes, monitor performance of each program per GPU metrics and identify hotspots.

      For OpenCL applications, you can identify the hottest OpenCL kernels, explore the computing queue in the Graphics window (which shows how kernels are submitted for execution on the GPU), correlate this data with the hardware metrics, and identify the GPU architecture block where a performance issue for a particular kernel was detected.

      For Intel Media SDK programs, you can explore the execution of Intel Media SDK tasks on the timeline and correlate this data with the GPU usage at each moment in time.

      Support limitations:

      • OpenCL kernels analysis is possible for Windows and Linux targets running on Intel Graphics

      • Intel Media SDK program analysis is possible for Linux targets running on Intel Graphics

      • Only Launch Application or Attach to Process target types are supported

      Note

      In Attach to Process mode, if you attach to a process after its computing queue has already been created, VTune Amplifier does not display data for the OpenCL kernels in this queue.

  4. Optionally, select the Collect stacks option to analyze performance and parallelism per execution path.

  5. Click Start to run the analysis.

To run the GPU Compute/Media Hotspots analysis from the command line, enter:

$ amplxe-cl -collect gpu-hotspots [-knob <knob_name=knob_option>] -- <target> [target_options]
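The command line above can be wrapped in a small script. This sketch uses only the -collect, -result-dir, -report, and -target-pid options of amplxe-cl and deliberately sets no -knob values; the knobs available for this analysis type can be listed with amplxe-cl -help collect gpu-hotspots. The function names, the AMPLXE_CL override (which lets you dry-run the commands with echo), the result directory names, and the target ./my_app are hypothetical placeholders.

```shell
#!/bin/sh
# Hypothetical wrappers around amplxe-cl for GPU Compute/Media Hotspots.
# Set AMPLXE_CL=echo to print the commands instead of running them.

# run_gpu_hotspots RESULT_DIR TARGET [TARGET_OPTIONS...]
# Launch Application mode: start the target under the collector, then
# print a summary report of the collected result.
run_gpu_hotspots() {
  amplxe=${AMPLXE_CL:-amplxe-cl}
  result_dir=$1
  shift
  "$amplxe" -collect gpu-hotspots -result-dir "$result_dir" -- "$@" || return
  "$amplxe" -report summary -result-dir "$result_dir"
}

# attach_gpu_hotspots RESULT_DIR PID
# Attach to Process mode: attach before the process creates its OpenCL
# computing queue, otherwise kernels in that queue are not displayed.
attach_gpu_hotspots() {
  amplxe=${AMPLXE_CL:-amplxe-cl}
  "$amplxe" -collect gpu-hotspots -result-dir "$1" -target-pid "$2"
}
```

For example, AMPLXE_CL=echo run_gpu_hotspots r001gh ./my_app prints the two commands without running them; drop the AMPLXE_CL prefix to run the actual collection.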

Note

  • Windows only: If the VTune Amplifier cannot find the required version of the GPU driver, add the directory containing igdmd32.dll or igdmd64.dll to the PATH environment variable. This directory is typically located under %SystemRoot%, for example: C:\Windows\System32\DriverStore\FileRepository\igdlh64.inf_amd64_5b7983371eebfcc4.

  • You can generate the command line for this configuration using the Command Line... button at the bottom of the Configure Analysis window.

View Data

VTune Amplifier runs the analysis and opens the data in the GPU Compute/Media Hotspots viewpoint providing various platform data in the following windows:

  • Summary window displays overall and per-engine GPU usage, percentage of time the EUs were stalled or idle with potential reasons for this, and the hottest GPU computing tasks.

  • Graphics window displays CPU and GPU usage data per thread and provides an extended list of GPU hardware metrics that help analyze accesses to different types of GPU memory. For a description of a GPU metric, hover over the column name in the grid, or right-click it and select the What's This Column? context menu option.

  • Platform window displays over-time data as GPU usage on a software queue, CPU time usage, OpenCL kernels data, and GPU performance per the selected group of GPU hardware metrics, DRAM Bandwidth, and Core Frequency.

  • Bottom-up window displays hotspot GPU computing tasks in the bottom-up tree, GPU metrics, and, if collected, call stacks.

For more complete information about compiler optimizations, see our Optimization Notice.