Platform Level Profiling
Understand the platform-level metrics provided by the Input and Output analysis of Intel® VTune™ Profiler.
The Input and Output analysis type provides the following metrics for platform-level analysis:
- PCIe Bandwidth
- DRAM Bandwidth
- Intel® Ultra Path Interconnect (Intel® UPI) Utilization
Use these metrics to:
- Analyze PCIe traffic
- Analyze efficiency of Intel® Data Direct I/O technology (Intel® DDIO) utilization
- Monitor DRAM bandwidth consumption
- Identify I/O performance issues potentially caused by inefficient remote socket accesses
- Identify sources of accesses resulting in Outbound PCIe traffic
To analyze these kinds of performance issues, run the Input and Output analysis with the following options enabled:

Analyze PCIe Traffic
Start your investigation with the Summary window that displays total Inbound and Outbound PCIe traffic:
- Inbound PCIe Bandwidth is induced by PCIe devices that write to and read from the system memory. These metrics are only available for server platforms based on the Intel® microarchitecture code named Sandy Bridge EP and later.
- Inbound PCIe Read — the PCIe device reads from the platform memory.
- Inbound PCIe Write — the PCIe device writes to the platform memory.
- Outbound PCIe Bandwidth is induced by core transactions targeting the memory or registers of the PCIe device. Typically, the core accesses the device memory through the Memory-Mapped I/O (MMIO) address space. These metrics are only available for server platforms based on the Intel® microarchitecture code named Broadwell EP and later.
- Outbound PCIe Read — the core reads from the registers of the device.
- Outbound PCIe Write — the core writes to the registers of the device.
Starting with server platforms based on the Intel® microarchitecture code named Skylake, Inbound and Outbound PCIe Bandwidth metrics can be collected per device. To get per-device metric attribution, load the sampling driver or run VTune Profiler as root.
You can analyze the Inbound and Outbound PCIe Bandwidth over time on a per-device basis using the timeline in the Bottom-up or the Platform tabs:

Analyze Efficiency of Intel® Data Direct I/O Utilization
To understand whether your application utilizes Intel® DDIO efficiently, explore the L3 Hit/Miss Ratios for Inbound PCIe requests.
L3 Hit/Miss metrics are available for 1st and 2nd generation Intel® Xeon® Scalable processors and require the sampling driver to be loaded.
For a detailed explanation of Intel® DDIO utilization efficiency, see the Effective Utilization of Intel® Data Direct I/O Technology Cookbook recipe.
The values of these metrics are available in the Summary tab:

You can also get a per-device breakdown of Inbound and Outbound Traffic and of Inbound request L3 hits and misses using the Bottom-up pane with the Package/M2PCIe grouping:

Analyze MMIO Access
Outbound PCIe traffic visible in the PCIe Traffic Summary section of the Summary tab is caused by cores reading from and writing to the memory and/or registers of PCIe devices.
Typically, cores access PCIe device memory through the Memory-Mapped I/O (MMIO) address space. Each load or store operation targeting the MMIO address space that a PCIe device is mapped to causes outbound PCIe read or write transactions respectively. Such loads and stores are quite expensive, since they are affected by the PCIe device access latency. Therefore, such accesses should be minimized to achieve high performance.
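To make the cost concrete, here is a minimal user-space sketch of what MMIO access looks like: a device BAR exposed through sysfs is mapped into the process, and every load or store through the resulting pointer becomes an outbound PCIe read or write. The device address, register offsets, and mapping size are hypothetical placeholders, not values from this documentation.

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical BAR0 resource file and register offsets; substitute the
 * bus:device.function and register layout of your own device. */
#define BAR0_PATH    "/sys/bus/pci/devices/0000:3b:00.0/resource0"
#define REG_DOORBELL 0x10
#define REG_STATUS   0x14
#define BAR0_SIZE    4096

int main(void)
{
    int fd = open(BAR0_PATH, O_RDWR | O_SYNC);
    if (fd < 0) { perror("open"); return 1; }

    /* Map the device BAR into the process address space (MMIO). */
    volatile uint8_t *bar = mmap(NULL, BAR0_SIZE, PROT_READ | PROT_WRITE,
                                 MAP_SHARED, fd, 0);
    if (bar == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

    /* Each store below is an outbound PCIe write (MMIO Write). */
    *(volatile uint32_t *)(bar + REG_DOORBELL) = 1;

    /* Each load below is an outbound PCIe read (MMIO Read) and stalls the
     * core for the full device access latency, so keep such reads off the
     * hot path. */
    uint32_t status = *(volatile uint32_t *)(bar + REG_STATUS);
    printf("status = 0x%x\n", status);

    munmap((void *)bar, BAR0_SIZE);
    close(fd);
    return 0;
}
```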
Use the MMIO Access section to locate functions performing MMIO Reads and MMIO Writes that target specific PCIe devices.

Use the Bottom-up pane to locate sources of memory-mapped PCIe device accesses. Explore the call stacks and drill down to the source and assembly view.

Double-click the function name to drill into the source code or assembly view and locate the code responsible for MMIO reads and writes at the source line level:

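As a sketch of the kind of change such a drill-down often motivates: a hot loop that polls an MMIO status register issues one outbound PCIe read per iteration, while polling a completion flag that the device writes into host memory by DMA keeps the loop in cache (and, with Intel® DDIO, typically in L3). The register semantics and flag layout below are assumptions for illustration only.

```c
#include <stdint.h>
#include <stdbool.h>

/* Expensive pattern often visible in the source view: every iteration
 * issues an outbound PCIe read to the device status register. */
bool wait_done_mmio(volatile uint32_t *status_reg)
{
    while ((*status_reg & 0x1) == 0)
        ;                       /* one MMIO read per iteration */
    return true;
}

/* Cheaper alternative, assuming the device DMA-writes a completion flag
 * into host memory: the loop reads cached host memory, and the inbound
 * PCIe write that updates the flag can be served through DDIO/L3. */
bool wait_done_host(volatile uint32_t *completion_flag)
{
    while (*completion_flag == 0)
        ;                       /* cached read, no outbound PCIe traffic */
    return true;
}
```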
MMIO access data is collected when the Analyze PCIe Bandwidth check box is selected. However, there are some limitations:
- This feature is only available starting with server platforms based on the Intel® microarchitecture code named Skylake.
- Only Attach to Process and Launch Application collection modes are supported.
Analyze Memory and Cross-Socket Bandwidth
A non-optimal application topology can induce additional DRAM and Intel® QuickPath Interconnect (Intel® QPI) or Intel® Ultra Path Interconnect (Intel® UPI) cross-socket traffic, which can limit performance.
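A common source of such traffic is processing I/O on a socket other than the one the device is attached to. The sketch below, which uses libnuma and a placeholder network interface name, reads the device's NUMA node from sysfs and keeps the processing thread and its buffers on that node so that inbound DMA and the subsequent processing stay off the UPI links.

```c
#include <numa.h>      /* link with -lnuma */
#include <stdio.h>

/* Placeholder interface; substitute the device you are profiling. */
#define NIC_NUMA_NODE_PATH "/sys/class/net/eth0/device/numa_node"
#define RX_BUFFER_SIZE     (1 << 20)

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not available on this system\n");
        return 1;
    }

    /* Find out which socket the NIC is attached to. */
    int node = 0;
    FILE *f = fopen(NIC_NUMA_NODE_PATH, "r");
    if (f) {
        if (fscanf(f, "%d", &node) != 1 || node < 0)
            node = 0;
        fclose(f);
    }

    /* Run the processing thread on the NIC-local node and allocate the
     * packet buffers there, so inbound PCIe writes and the CPU reads that
     * follow them do not cross the UPI links. */
    numa_run_on_node(node);
    void *rx_buffer = numa_alloc_onnode(RX_BUFFER_SIZE, node);
    if (!rx_buffer) {
        fprintf(stderr, "numa_alloc_onnode failed\n");
        return 1;
    }

    /* ... process I/O using rx_buffer ... */

    numa_free(rx_buffer, RX_BUFFER_SIZE);
    return 0;
}
```

After such a change, you can verify in the Platform tab that UPI bandwidth drops while Inbound PCIe and local DRAM bandwidth remain on the device-local socket.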
Use the Platform tab to correlate Inbound PCIe Traffic with DRAM and cross-socket interconnect bandwidth consumption:

VTune Profiler provides a per-channel breakdown of DRAM bandwidth.
Two metrics are available for UPI traffic:
- UPI Utilization Outgoing – a ratio metric that shows UPI utilization in terms of transmit.
- UPI Bandwidth – shows detailed bandwidth information with a breakdown by data/non-data.
You can get a breakdown of UPI metrics by UPI links. See the specifications of your processor to determine the number of UPI links that are enabled on each socket of your processor.
UPI link names reveal the topology of your system by showing which sockets and UPI controllers they are connected to.
Below is an example of a result collected on a four-socket server powered by Intel® processors based on the microarchitecture code named Skylake. The data reveals a significant UPI traffic imbalance, with bandwidth much higher on the links connected to socket 3:
