Intel® VTune™ Amplifier GPU In-kernel Profiling feature with OpenCL™ sample code

By Jong Il Park, Published: 09/14/2017, Last Updated: 09/14/2017

This article introduces the GPU In-kernel Profiling feature in Intel® VTune™ Amplifier, using one of the OpenCL™ samples from the Intel® SDK for OpenCL™ Applications. It gives a brief introduction to OpenCL and the Intel SDK for OpenCL Applications, then walks through the process of profiling an OpenCL application with VTune Amplifier's GPU profiling features.

What is OpenCL?

OpenCL is the open standard for parallel programming of heterogeneous systems: 

OpenCL™ (Open Computing Language) is the open, royalty-free standard for cross-platform, parallel programming of diverse processors found in personal computers, servers, mobile devices and embedded platforms. OpenCL greatly improves the speed and responsiveness of a wide spectrum of applications in numerous market categories including gaming and entertainment titles, scientific and medical software, professional creative tools, vision processing, and neural network training and inferencing.

It is a framework for cross-platform parallel programming across devices such as CPUs, GPUs, and FPGAs. When you are developing software for heterogeneous systems and have a workload better suited to a device other than the CPU, such as a GPU used for general-purpose computing (GPGPU), OpenCL is a natural choice.

  • A framework for developing data-parallel programs for multiple kinds of devices.
  • A C-like kernel language for heterogeneous systems.
  • An OpenCL kernel is the basic function that is executed over an array of work items.
  • A set of work items forms a work group, which is executed on one compute unit.
  • OpenCL device source code is compiled at run time, and the host code (running on the CPU) triggers the compilation and launches the kernels.
  • Kernels are compiled using Clang / LLVM, which can generate SPIR (Standard Portable Intermediate Representation).
  • The goal is to make the best use of all the available resources (CPU, GPU, FPGA, etc.) in a single program.
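The relationship between work items and work groups can be illustrated without a GPU. The following is a minimal sketch in plain Python, not OpenCL code: the helper names `enumerate_work_items`, `run_kernel`, and `vector_add` are hypothetical and exist only for this illustration. It runs a "kernel" once per work item in a one-dimensional range and shows how each global ID splits into a group ID and a local ID, which is what `get_global_id(0)`, `get_group_id(0)`, and `get_local_id(0)` report inside a real kernel.

```python
def enumerate_work_items(global_size, local_size):
    """Yield (global_id, group_id, local_id) for a 1-D range of work items.

    Models the values that get_global_id(0), get_group_id(0) and
    get_local_id(0) would return inside an OpenCL kernel.
    """
    # OpenCL 1.x requires the global size to be a multiple of the
    # work group size; this sketch enforces the same rule.
    if local_size <= 0 or global_size % local_size != 0:
        raise ValueError("global_size must be a positive multiple of local_size")
    for global_id in range(global_size):
        group_id, local_id = divmod(global_id, local_size)
        yield global_id, group_id, local_id


def run_kernel(kernel, global_size, local_size, *args):
    """Sequential stand-in for a device: call `kernel` once per work item."""
    for global_id, _group_id, _local_id in enumerate_work_items(global_size, local_size):
        kernel(global_id, *args)


def vector_add(gid, a, b, out):
    # Kernel body: each work item handles exactly one array element.
    out[gid] = a[gid] + b[gid]


a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
b = [10.0] * 8
out = [0.0] * 8

# 8 work items split into work groups of 4 -> 2 work groups.
run_kernel(vector_add, 8, 4, a, b, out)
items = list(enumerate_work_items(8, 4))
```

On a real device the work items of a group run concurrently on one compute unit; the sequential loop here only models the indexing, not the parallelism.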

Intel® SDK for OpenCL™ Applications 

Intel is a strong supporter of OpenCL™ technology. The Intel® SDK for OpenCL™ Applications is a comprehensive development environment for developing and optimizing OpenCL™ applications on Intel® platforms, and part of an increasingly rich portfolio of Intel tools for heterogeneous programming. The SDK supports offloading compute-intensive parallel workloads to Intel® Graphics Technology using an advanced OpenCL™ kernel compiler, runtime debugger and code performance analyzer. The SDK and driver/runtime packages are installed separately.

[Figure: block diagram of the Intel® SDK for OpenCL™ Applications]

Getting Started

  1. Download the free Intel® SDK for OpenCL™ Applications.
  2. Install the SDK at C:\Intel\OpenCL and use its Microsoft Visual Studio* plug-in: Intel® Code Builder for OpenCL™ API.
  3. You may wish to read this reference.
  4. Download the General Matrix Multiply sample from the Code Sample repository.
  5. Unpack the downloaded sample archive and open the solution file in Microsoft Visual Studio*.

Profiling with VTune™ Amplifier

The first step of OpenCL profiling is a VTune™ Amplifier GPU Hotspots analysis. Once you identify the hottest OpenCL kernel, you need to investigate this kernel with GPU in-kernel profiling. GPU In-kernel Profiling analyzes GPU kernel execution per code line to find any performance issues which may be caused by inefficient kernel code algorithms or incorrect work item configuration.

The key metrics of GPU In-kernel Profiling are:

  • Estimated GPU Cycles: The average number of GPU cycles per kernel instance.
  • GPU Instructions Executed per Instance: The average number of GPU instructions executed per kernel instance.
  • GPU Instructions Executed per Thread: The average number of GPU instructions executed by one thread per kernel instance.

VTune™ Amplifier GPU In-kernel Profiling with GEMM OpenCL Sample

The General Matrix Multiply (GEMM) sample demonstrates how to efficiently utilize an OpenCL* device to perform a general matrix multiply operation on two dense square matrices. General Matrix Multiply is a subroutine that performs matrix multiplication:

C := alpha*A*B + beta*C

where A, B and C are dense matrices and alpha and beta are floating point scalar coefficients. The sample supports single-precision and double-precision data types for matrix elements (as well as alpha and beta constants).
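As a point of reference for what the kernel computes, here is a minimal, unoptimized sketch of the same operation in plain Python. It is not the sample's implementation; the function name `gemm` and the 2x2 test matrices are chosen only for illustration, and square matrices stored as lists of rows are assumed.

```python
def gemm(alpha, a, b, beta, c):
    """Return alpha*A*B + beta*C for square matrices given as lists of rows."""
    n = len(a)
    result = [[0.0] * n for _ in range(n)]
    for i in range(n):          # row of A and C
        for j in range(n):      # column of B and C
            acc = 0.0
            for k in range(n):  # dot product of row i of A and column j of B
                acc += a[i][k] * b[k][j]
            result[i][j] = alpha * acc + beta * c[i][j]
    return result


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
c = [[1.0, 1.0], [1.0, 1.0]]

# A*B = [[19, 22], [43, 50]]; scaled by alpha = 2 and offset by 0.5*C.
updated = gemm(2.0, a, b, 0.5, c)
```

Each element of the result is independent of the others, which is what makes GEMM a good fit for one-work-item-per-output-tile execution on a GPU.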

Using VTune™ Amplifier GPU In-kernel Profiling to Compare Results of Different Work Group Sizes

These parameters indicate that we want 2 instances, device 0 (GPU), and a work group for Matrix A of size 8.

We can see that most of the GPU cycles are spent on lines 133 and 134 of the gemm_nn kernel function.

This time we will change TILE_GROUP_M to 16, to give Matrix A a work group size of 16.
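To see what changes when the work group size grows, it helps to count how many work groups a given range of work items is split into. The sketch below rests on two assumptions that are not taken from the sample's source: a hypothetical global size of 256 work items along the M dimension, and TILE_GROUP_M being used directly as the work group size in that dimension.

```python
def work_groups_along_dimension(global_size, group_size):
    """Number of work groups needed to cover `global_size` work items."""
    if group_size <= 0 or global_size % group_size != 0:
        raise ValueError("global_size must be a positive multiple of group_size")
    return global_size // group_size


# Hypothetical global size along M; the real value depends on the
# matrix size and tile sizes the sample is run with.
GLOBAL_SIZE_M = 256

groups_for_8 = work_groups_along_dimension(GLOBAL_SIZE_M, 8)    # TILE_GROUP_M = 8
groups_for_16 = work_groups_along_dimension(GLOBAL_SIZE_M, 16)  # TILE_GROUP_M = 16
```

Doubling the work group size halves the number of work groups along that dimension, so each compute unit receives fewer but larger groups. Whether that helps depends on the kernel and the hardware, which is why the two configurations are measured rather than predicted.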

Results Comparison:

Work Group Size             8          16
Elapsed Time                29.583s    27.252s
Kernel Function GPU Time    12.202s    9.558s
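The size of the improvement is easier to judge as a percentage. The short calculation below uses only the four timings reported above; the helper name `percent_reduction` is introduced here for illustration.

```python
def percent_reduction(before, after):
    """Relative reduction from `before` to `after`, in percent."""
    return (before - after) / before * 100.0


# Timings in seconds for work group sizes 8 and 16, from the comparison above.
elapsed_reduction = percent_reduction(29.583, 27.252)
kernel_reduction = percent_reduction(12.202, 9.558)

print(f"Elapsed time reduced by {elapsed_reduction:.1f}%")
print(f"Kernel GPU time reduced by {kernel_reduction:.1f}%")
```

With these measurements, moving from a work group size of 8 to 16 cuts elapsed time by roughly 7.9% and kernel function GPU time by roughly 21.7%.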

Again, GPU In-kernel Profiling analyzes the GPU kernel execution per code line to find any performance issues which may be caused by inefficient kernel code algorithms or incorrect work item configuration.


VTune Amplifier provides GPU analysis for Android systems only on processors with Intel® HD Graphics and Intel® Iris® Graphics. GPU In-kernel Profiling is available on Windows* and Linux* OS, and has limited support on Yocto* OS (it needs the Intel GFX driver).

