Platform Level Profiling

Understand the platform-level metrics provided by the Input and Output analysis of Intel® VTune™ Profiler.
The Input and Output analysis type provides the following metrics for platform-level analysis:
  • PCIe Bandwidth
  • DRAM Bandwidth
  • Intel® Ultra Path Interconnect (Intel® UPI) Utilization
Use these metrics to analyze PCIe traffic, Intel® Data Direct I/O (Intel® DDIO) utilization efficiency, MMIO accesses, and memory and cross-socket interconnect bandwidth consumption. To analyze these kinds of performance issues, run the Input and Output analysis with the corresponding analysis options enabled.
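For example, when working from the command line, you can start the analysis as shown below; a minimal sketch, assuming a Linux shell and a target application named ./my_app (the available analysis options depend on your VTune Profiler version, so list them first):

    # List the options (knobs) that the Input and Output analysis
    # accepts in your VTune Profiler version:
    vtune -help collect io

    # Launch the application and collect Input and Output data:
    vtune -collect io -result-dir io_result -- ./my_app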

Analyze PCIe Traffic

Start your investigation with the Summary window, which displays the total Inbound and Outbound PCIe traffic:
  • Inbound PCIe Bandwidth is induced by PCIe devices that read from and write to the system memory. These metrics are only available for server platforms based on the Intel® microarchitecture code named Sandy Bridge EP and later.
    • Inbound PCIe Read — the PCIe device reads from the platform memory.
    • Inbound PCIe Write — the PCIe device writes to the platform memory.
  • Outbound PCIe Bandwidth is induced by core transactions targeting the memory or registers of the PCIe device. Typically, the core accesses the device memory through the Memory-Mapped I/O (MMIO) address space. These metrics are only available for server platforms based on the Intel® microarchitecture code named Broadwell EP and later.
    • Outbound PCIe Read — the core reads from the registers of the device.
    • Outbound PCIe Write — the core writes to the registers of the device.
Starting with server platforms based on the Intel® microarchitecture code named Skylake, the Inbound and Outbound PCIe Bandwidth metrics can be collected per device. To get per-device metric attribution, load the sampling driver or run VTune Profiler as root.
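For example, a minimal command-line sketch of a system-wide collection run with root privileges (the 30-second duration and the io_per_device result directory name are arbitrary placeholders):

    # Attach-less, system-wide collection run as root so that PCIe
    # bandwidth can be attributed to individual devices:
    sudo vtune -collect io -duration 30 -result-dir io_per_device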
You can analyze the Inbound and Outbound PCIe Bandwidth over time on a per-device basis using the timeline in the Bottom-up or the Platform tabs.

Analyze Efficiency of Intel® Data Direct I/O Utilization

To understand whether your application utilizes Intel® DDIO efficiently, explore the L3 Hit/Miss Ratios for Inbound PCIe requests.
The L3 Hit/Miss metrics are available for 1st and 2nd generation Intel® Xeon® Scalable processors and require the sampling driver to be loaded.
For a detailed explanation of Intel® DDIO utilization efficiency, see the Effective Utilization of Intel® Data Direct I/O Technology recipe in the VTune Profiler Cookbook.
The values of these metrics are available in the Summary tab.
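If you work from the command line, you can print the same summary data from a collected result; a minimal sketch, assuming a result directory named io_result from an earlier collection:

    # Print the Summary report for an existing result; the
    # platform-level I/O metrics appear alongside the other data:
    vtune -report summary -result-dir io_result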
You can also get a per-device breakdown for Inbound and Outbound Traffic and for Inbound request L3 hits and misses using the Bottom-up pane with the Package/M2PCIe grouping.

Analyze MMIO Access

Outbound PCIe traffic visible in the PCIe Traffic Summary section of the Summary tab is caused by cores reading from and writing to the memory and/or registers of PCIe devices.
Typically, cores access PCIe device memory through the Memory-Mapped I/O (MMIO) address space. Each load or store operation that targets the MMIO address space a PCIe device is mapped to causes an outbound PCIe read or write transaction, respectively. Such loads and stores are expensive because they incur the PCIe device access latency, so minimize these accesses to achieve high performance.
Use the MMIO Access section to locate functions performing MMIO Reads and MMIO Writes that target specific PCIe devices.
Use the Bottom-up pane to locate the sources of memory-mapped PCIe device accesses. Explore the call stacks, and double-click a function name to drill down to the source or assembly view and find the code responsible for MMIO reads and writes at the source line level.
MMIO access data is collected when the Analyze PCIe Bandwidth check box is selected. However, there are some limitations:
  • This feature is only available starting with server platforms based on the Intel® microarchitecture code named Skylake.
  • Only the Attach to Process and Launch Application collection modes are supported (see the sketch after this list).
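For reference, a minimal command-line sketch of both supported modes (./my_app and the my_app process name are placeholders):

    # Launch Application mode:
    vtune -collect io -result-dir mmio_launch -- ./my_app

    # Attach to Process mode, using the PID of a running process:
    vtune -collect io -result-dir mmio_attach -target-pid $(pidof my_app)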

Analyze Memory and Cross-Socket Bandwidth

Non-optimal application topology can induce DRAM and Intel® QuickPath Interconnect (Intel® QPI) or Intel® Ultra Path Interconnect (Intel® UPI) cross-socket traffic, which can limit performance.
Use the Platform tab to correlate Inbound PCIe Traffic with DRAM and cross-socket interconnect bandwidth consumption.
VTune Profiler provides a per-channel breakdown of DRAM bandwidth.
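To correlate these numbers outside of the GUI, you can export the summary report; a minimal sketch, assuming a result directory named io_result (option spellings may vary between VTune Profiler versions):

    # Export the summary report to CSV so that the PCIe, DRAM, and
    # UPI bandwidth numbers can be examined or plotted offline:
    vtune -report summary -result-dir io_result -format csv \
          -csv-delimiter comma -report-output io_summary.csv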
Two metrics are available for UPI traffic:
  • UPI Utilization Outgoing – a ratio metric that shows UPI utilization in terms of transmit.
  • UPI Bandwidth – shows detailed bandwidth information with a breakdown by data and non-data traffic.
You can get a breakdown of the UPI metrics by UPI link. See the specifications of your processor to determine the number of UPI links enabled on each socket.
UPI link names reveal the topology of your system by showing which sockets and UPI controllers they are connected to.
Below is an example of a result collected on a four-socket server powered by Intel® processors with the microarchitecture code named Skylake. The data reveals a significant UPI traffic imbalance, with bandwidth being much higher on the links connected to socket 3.
