This article introduces the GPU In-kernel Profiling feature in Intel® VTune™ Amplifier, using one of the OpenCL™ samples from the Intel® SDK for OpenCL™ Applications. It briefly introduces OpenCL and the Intel SDK for OpenCL Applications, then walks through profiling an OpenCL application with VTune Amplifier's GPU profiling features.
OpenCL is the open standard for parallel programming of heterogeneous systems:
OpenCL™ (Open Computing Language) is the open, royalty-free standard for cross-platform, parallel programming of diverse processors found in personal computers, servers, mobile devices and embedded platforms. OpenCL greatly improves the speed and responsiveness of a wide spectrum of applications in numerous market categories including gaming and entertainment titles, scientific and medical software, professional creative tools, vision processing, and neural network training and inferencing.
OpenCL is a framework for cross-platform parallel programming across devices such as CPUs, GPUs, and FPGAs. When you are developing software for heterogeneous systems and have a workload better suited to a device other than the CPU, such as a GPGPU workload, OpenCL is a good fit.
Intel is a strong supporter of OpenCL™ technology. The Intel® SDK for OpenCL™ Applications is a comprehensive development environment for developing and optimizing OpenCL™ applications on Intel® platforms, and part of an increasingly rich portfolio of Intel tools for heterogeneous programming. The SDK supports offloading compute-intensive parallel workloads to Intel® Graphics Technology using an advanced OpenCL™ kernel compiler, runtime debugger and code performance analyzer. The SDK and driver/runtime packages are installed separately.
The first step of OpenCL profiling is a VTune™ Amplifier GPU Hotspots analysis. Once you identify the hottest OpenCL kernel, you need to investigate this kernel with GPU in-kernel profiling. GPU In-kernel Profiling analyzes GPU kernel execution per code line to find any performance issues which may be caused by inefficient kernel code algorithms or incorrect work item configuration.
The key metrics of GPU In-kernel Profiling are:
The General Matrix Multiply (GEMM) sample demonstrates how to efficiently utilize an OpenCL* device to perform a general matrix multiply operation on two dense square matrices. General Matrix Multiply is a subroutine that performs matrix multiplication:
C := alpha*A*B + beta*C
where A, B and C are dense matrices and alpha and beta are floating point scalar coefficients. The sample supports single-precision and double-precision data types for matrix elements (as well as alpha and beta constants).
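To make the formula concrete, here is a minimal CPU reference sketch in C++. It is not the sample's OpenCL kernel: the function name `gemm_reference` and the row-major storage layout are assumptions made for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reference GEMM on the CPU: C := alpha*A*B + beta*C for dense, square,
// row-major n x n matrices. Illustrative only; this is a hypothetical
// helper, not code taken from the GEMM sample.
template <typename T>
void gemm_reference(std::size_t n, T alpha, const std::vector<T>& A,
                    const std::vector<T>& B, T beta, std::vector<T>& C) {
    // All three matrices must hold exactly n*n elements.
    assert(A.size() == n * n && B.size() == n * n && C.size() == n * n);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            // Dot product of row i of A with column j of B.
            T acc = T(0);
            for (std::size_t k = 0; k < n; ++k) {
                acc += A[i * n + k] * B[k * n + j];
            }
            // Scale the product and blend in the previous value of C.
            C[i * n + j] = alpha * acc + beta * C[i * n + j];
        }
    }
}
```

Instantiating the template with `float` or `double` mirrors the single-precision and double-precision variants the sample supports. A reference like this is mainly useful for checking the output of a tuned GPU kernel on small inputs.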
These parameters indicate that we want 2 instances, device 0 (GPU), and a work group for Matrix A of size 8.
We can see that most GPU cycles are spent on lines 133 and 134 of the gemm_nn kernel function.
This time we will change TILE_GROUP_M to 16, to give Matrix A a work group size of 16.
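As a rough illustration of what changing the work group size means for the launch configuration, the sketch below computes how many work-groups a single NDRange dimension is split into. The helper name `work_groups_per_dim` and the global size of 512 work-items used in the note below are hypothetical; how the sample derives its global and local sizes from TILE_GROUP_M is not shown in this article and is assumed here.

```cpp
#include <cassert>
#include <cstddef>

// Number of work-groups along one dimension of an NDRange.
// OpenCL 1.x requires the global size to be a multiple of the local
// (work-group) size, which the assertion below checks.
// Hypothetical helper for illustration; not part of the GEMM sample.
std::size_t work_groups_per_dim(std::size_t global_size,
                                std::size_t local_size) {
    assert(local_size > 0 && global_size % local_size == 0);
    return global_size / local_size;
}
```

With an assumed global size of 512 work-items in that dimension, a work group size of 8 yields 64 work-groups, while a size of 16 yields 32 larger ones. Fewer, larger groups can change how well the hardware is occupied, which is the kind of work item configuration effect that in-kernel profiling helps expose.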
| Work Group Size | 8 | 16 |
|---|---|---|
| Kernel Function GPU Time | 12.202 s | 9.558 s |
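Taking the two measured kernel times at face value, the improvement from the larger work group size works out as follows (a simple ratio of the reported numbers, not an additional measurement):

```latex
\[
\text{speedup} = \frac{12.202\ \text{s}}{9.558\ \text{s}} \approx 1.28,
\qquad
\text{time reduction} = \frac{12.202 - 9.558}{12.202} \approx 21.7\%
\]
```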
Again, GPU In-kernel Profiling analyzes the GPU kernel execution per code line to find any performance issues which may be caused by inefficient kernel code algorithms or incorrect work item configuration.
VTune Amplifier provides GPU analysis for Android systems only on processors with Intel® HD Graphics and Intel® Iris® Graphics. GPU In-kernel Profiling is available on Windows* and Linux* OS, and has limited support on Yocto* OS (it needs the Intel GFX driver).