Intel® VTune™ Profiler Functionality on AWS* Instances

Published: 09/09/2019   Last Updated: 04/10/2020

Introduction

Intel® VTune™ Profiler is a performance profiling tool that delivers software and hardware performance analysis through its graphical and command-line interfaces. It collects three general types of data:

  1. Software (user-mode hotspots and threading) - these collections are generally software-based and do not rely on availability of hardware events
  2. Hardware (event-based hotspots and threading, microarchitectural analysis, and HPC characteristics) - these collections are hardware-based and require the availability of some hardware events
  3. Memory (memory access and bandwidth analysis) - this collection is hardware-based and requires the availability of events that occur outside of the CPU (uncore events)
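The three categories above can be sketched as a small lookup that builds `vtune` command lines. This is an illustration under stated assumptions, not an official mapping: the analysis type names (`hotspots`, `threading`, `uarch-exploration`, `hpc-performance`, `memory-access`) and the `sampling-mode` knob are taken from the `vtune` command-line interface as commonly documented and should be checked with `vtune -help collect` on the installed version. The grouping, the helper name `build_commands`, and the result directory naming are hypothetical.

```python
# Illustrative grouping of VTune Profiler analysis types by data category.
# Assumption: these analysis names match the installed vtune version;
# verify with `vtune -help collect` before relying on them.
ANALYSES_BY_CATEGORY = {
    "software": [
        ["hotspots", "-knob", "sampling-mode=sw"],  # user-mode sampling
        ["threading"],
    ],
    "hardware": [
        ["hotspots", "-knob", "sampling-mode=hw"],  # event-based sampling
        ["uarch-exploration"],
        ["hpc-performance"],
    ],
    "memory": [
        ["memory-access"],  # relies on uncore events
    ],
}


def build_commands(category, app_args, result_prefix="r"):
    """Return one vtune argument list per analysis in the given category.

    `result_prefix` is a hypothetical naming scheme for result directories.
    """
    commands = []
    for index, analysis in enumerate(ANALYSES_BY_CATEGORY[category]):
        result_dir = f"{result_prefix}_{category}_{index}"
        commands.append(
            ["vtune", "-collect", *analysis,
             "-result-dir", result_dir, "--", *app_args]
        )
    return commands


if __name__ == "__main__":
    # Print the commands instead of running them; ./my_app is a placeholder.
    for command in build_commands("hardware", ["./my_app"]):
        print(" ".join(command))
```

Printing the commands rather than executing them keeps the sketch safe to run on a machine without VTune Profiler installed.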

Amazon Web Services* (AWS*) provides a large variety of instance types and sizes in its Elastic Compute Cloud* (EC2*) service. Some VTune Profiler collection types are unavailable on certain instances because the hypervisor does not expose the necessary hardware counters.

Instances Tested

VTune Profiler Functionality by Instance Type

Instance      VTune Profiler Collections Supported   Application Performance Snapshot Supported?
c5.xlarge     Software only                          No
c5.9xlarge    Software, Hardware                     Yes
c5.12xlarge   Software, Hardware                     Yes
c5.18xlarge   Software, Hardware                     Yes
c5.24xlarge   Software, Hardware                     Yes
c5.metal      All                                    Yes
m5.4xlarge    Software only                          No
m5.8xlarge    Software only                          No
m5.12xlarge   Software, Hardware                     Yes
m5.16xlarge   Software only                          No
m5.24xlarge   Software, Hardware                     Yes
m5.metal      All                                    Yes
r5.8xlarge    Software only                          No
r5.12xlarge   Software, Hardware                     Yes
r5.16xlarge   Software only                          No
r5.24xlarge   Software, Hardware                     Yes
r5.metal      All                                    Yes

In this table, "All" means Software, Hardware, and Memory collections.
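For scripting, for example to choose an analysis type before launching a job, the table can be encoded as a lookup. The sketch below copies the tested results from the table; the function names are hypothetical, "All" is read as software, hardware, and memory collections, and instances that were not tested return `None` instead of a guess.

```python
# Tested results copied from the table: instance -> (collections, APS supported).
SOFTWARE = ("software",)
SOFT_HARD = ("software", "hardware")
ALL = ("software", "hardware", "memory")

TESTED_INSTANCES = {
    "c5.xlarge": (SOFTWARE, False),
    "c5.9xlarge": (SOFT_HARD, True),
    "c5.12xlarge": (SOFT_HARD, True),
    "c5.18xlarge": (SOFT_HARD, True),
    "c5.24xlarge": (SOFT_HARD, True),
    "c5.metal": (ALL, True),
    "m5.4xlarge": (SOFTWARE, False),
    "m5.8xlarge": (SOFTWARE, False),
    "m5.12xlarge": (SOFT_HARD, True),
    "m5.16xlarge": (SOFTWARE, False),
    "m5.24xlarge": (SOFT_HARD, True),
    "m5.metal": (ALL, True),
    "r5.8xlarge": (SOFTWARE, False),
    "r5.12xlarge": (SOFT_HARD, True),
    "r5.16xlarge": (SOFTWARE, False),
    "r5.24xlarge": (SOFT_HARD, True),
    "r5.metal": (ALL, True),
}


def supported_collections(instance_type):
    """Return the tested collection categories, or None if untested."""
    entry = TESTED_INSTANCES.get(instance_type.lower())
    return entry[0] if entry else None


def aps_supported(instance_type):
    """Return True/False for tested instances, or None if untested."""
    entry = TESTED_INSTANCES.get(instance_type.lower())
    return entry[1] if entry else None


if __name__ == "__main__":
    for name in ("m5.16xlarge", "m5.24xlarge", "m5.metal"):
        print(name, supported_collections(name), aps_supported(name))
```

Returning `None` for untested instance types avoids extrapolating beyond the instances covered here.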

Instance Description

The instances tested include C5, M5, and R5 instances of various sizes. All use Intel® Xeon® Scalable Processors (code names Skylake and Cascade Lake). C5 instances are compute optimized, aimed at compute-intensive workloads where cost-effective performance matters. R5 instances are memory optimized, aimed at workloads that process large data sets in memory. M5 instances are general purpose, offering a balance of compute, memory, and network resources.

Performance Monitoring Unit (PMU)

The PMU is on-chip hardware that counts microarchitectural events such as cache misses, cache hits, and elapsed cycles. These counts show how the operating system or an application performs on the processor. Profiling tools on Linux* expose two main types of events: hardware and software. Hardware events, which come from the PMU, include instructions, CPU cycles, and cache references. Software events, which are counted by the kernel rather than the PMU, include context switches and page faults.

VTune Profiler has two ways of collecting these events on Linux*:

  • Linux Perf* tool - an interface that provides access to the PMU and its features. Perf also provides modes such as event-based sampling (EBS), which records a sample each time a threshold number of events is reached. Perf is already installed on the default kernel.
  • VTune Profiler's sep driver - provided as part of the VTune Profiler package and installed if PMU access is detected. If VTune Profiler is unable to use the sep driver, it falls back to collecting with perf. The sep driver is only supported on metal instances at this time.
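Before choosing a collection type, it can help to see which PMUs the Linux kernel reports on an instance. The sketch below is a heuristic and not how VTune Profiler itself detects PMU access. It assumes the kernel lists the core PMU as `cpu` and uncore units with names beginning with `uncore_` under `/sys/bus/event_source/devices`, and that `/proc/sys/kernel/perf_event_paranoid` holds the perf access level. The function name `pmu_visibility` is hypothetical.

```python
from pathlib import Path


def pmu_visibility(
    sysfs_devices="/sys/bus/event_source/devices",
    paranoid_file="/proc/sys/kernel/perf_event_paranoid",
):
    """Summarize which perf event sources the kernel exposes.

    Assumption: a registered core PMU appears as a `cpu` entry and uncore
    units appear as `uncore_*` entries in the sysfs directory.
    """
    devices_dir = Path(sysfs_devices)
    names = []
    if devices_dir.is_dir():
        names = sorted(entry.name for entry in devices_dir.iterdir())

    paranoid = None
    paranoid_path = Path(paranoid_file)
    if paranoid_path.is_file():
        try:
            paranoid = int(paranoid_path.read_text().strip())
        except ValueError:
            paranoid = None  # unexpected content; leave unknown

    return {
        "core_pmu": "cpu" in names,
        "uncore_pmus": [name for name in names if name.startswith("uncore_")],
        "perf_event_paranoid": paranoid,
    }


if __name__ == "__main__":
    for key, value in pmu_visibility().items():
        print(f"{key}: {value}")
```

Under these assumptions, an instance limited to software collections would be expected to report no core PMU, and only an instance with uncore access would list `uncore_*` entries. Both paths are parameters so the function can be pointed at a copy of the directory tree for testing.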

Instances without Full PMU Support

VTune Profiler analysis types such as the Additional Insights on Hotspot Analysis, Microarchitecture Exploration, and HPC Performance Characterization require access to PMU events in order to provide hardware data such as instructions retired and cycle counts. The PMU events accessible on AWS* instances depend largely on the instance size. The instances tested run on two-socket Intel Xeon Scalable Processor systems. Only instance sizes that use one or both complete sockets allow PMU access, presumably because partial use of a socket results in shared CPU resources. Of the larger instances tested, the m5.16xlarge and r5.16xlarge instances do not support PMU events because they consume one complete socket and a portion of the second, so the hardware analyses cannot run on them.
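The whole-socket rule described above can be written as a short check. This is a sketch of the observed pattern, not an AWS or Intel API. The function name is hypothetical, and the per-socket vCPU counts in the example (48 and 36) are illustrative assumptions that are not stated in this article.

```python
def uses_whole_sockets(vcpus, vcpus_per_socket, sockets=2):
    """Return True if the instance size fills one or more complete sockets.

    Mirrors the observation that PMU access was available only when an
    instance used one or both complete sockets of a two-socket host.
    """
    if vcpus <= 0 or vcpus_per_socket <= 0 or sockets <= 0:
        raise ValueError("vcpus, vcpus_per_socket and sockets must be positive")
    whole, remainder = divmod(vcpus, vcpus_per_socket)
    return remainder == 0 and 1 <= whole <= sockets


if __name__ == "__main__":
    # Assumed host layout: 48 vCPUs per socket (illustrative only).
    for size, vcpus in (("8xlarge", 32), ("12xlarge", 48),
                        ("16xlarge", 64), ("24xlarge", 96)):
        print(size, uses_whole_sockets(vcpus, vcpus_per_socket=48))
```

With the assumed 48 vCPUs per socket, a 64-vCPU size fills one socket and part of the second, so the check returns `False`, which matches the 16xlarge results in the table.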

Intel VTune Profiler - Application Performance Snapshot

Application Performance Snapshot (APS) is a utility packaged with VTune Profiler for Linux*. It gives a quick view of MPI and OpenMP imbalances, efficiency of memory access, floating point unit (FPU), I/O, and memory data in your application. After analyzing this data, it suggests ways to perform additional analysis with VTune Profiler.

APS has the same limitations as the VTune Profiler hardware analysis types. It can only run when PMU events are accessible.

Intel VTune Profiler - Platform Profiler

The VTune Profiler Platform Profiler utility is also packaged with VTune Profiler. It profiles at the system level to help identify hardware configuration issues in areas such as storage layout, memory and disk I/O, CPU frequency, cycles per instruction (CPI), and power consumption.

Platform Profiler is limited to use on metal instances only.

Metal versus Non-metal Instances

Some instance types have a metal offering that is the same size as the largest non-metal instance. For example, c5.24xlarge has the same number of vCPUs as c5.metal and appears to use the same hardware. The main difference is that the 24xlarge instance still runs under a hypervisor, which prevents full access to the PMU, including the uncore events used in memory access analysis. As a result, VTune Profiler is still limited on the largest non-metal instance and fully functional on the metal equivalent.
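One way to tell from inside a running instance whether a hypervisor is present is to look for the `hypervisor` CPU flag, which Linux on x86 reports in `/proc/cpuinfo` when the processor indicates it is running as a guest. The sketch below is a heuristic under that assumption. The function names are hypothetical, and the result says nothing on its own about which PMU events are exposed.

```python
def runs_under_hypervisor(cpuinfo_text):
    """Return True if the first `flags` line lists the `hypervisor` flag."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return "hypervisor" in value.split()
    return False  # no flags line found; treat as unknown/not virtualized


def read_cpuinfo(path="/proc/cpuinfo"):
    """Read the CPU description exposed by the Linux kernel."""
    with open(path, encoding="utf-8") as handle:
        return handle.read()


if __name__ == "__main__":
    try:
        print("hypervisor present:", runs_under_hypervisor(read_cpuinfo()))
    except OSError:
        print("/proc/cpuinfo is not available on this system")
```

A metal instance would be expected to report no hypervisor flag and a same-size non-metal instance to report one, although this was not part of the testing described here.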


 
