
IO Issues: High Latency and Low PCIe* Bandwidth

This recipe uses the Intel® VTune™ Amplifier Disk IO analysis on a sample IO-bound application and changes the application affinity with respect to a PCIe* device to reduce latency and increase read access bandwidth.
Content expert: Roman Sudarikov
Disk IO analysis was renamed to Input and Output analysis starting with Intel VTune Amplifier 2019.

Ingredients

This section lists the hardware and software tools used for the performance analysis scenario.
  • Application: hdparm, which performs sequential 128K read access for 3 seconds. The application is available at https://sourceforge.net/projects/hdparm.
  • Performance analysis tools:
    • Intel VTune Amplifier 2018: Disk Input and Output analysis
    • For VTune Profiler downloads and product support, visit https://software.intel.com/en-us/vtune.
    • All the Cookbook recipes are scalable and can be applied to Intel VTune Amplifier 2018 and higher. Slight version-specific configuration changes are possible.
    • Intel® VTune™ Amplifier has been renamed to Intel® VTune™ Profiler starting with its version for Intel® oneAPI Base Toolkit (Beta). You can still use a standalone version of VTune Profiler, or its versions integrated into Intel Parallel Studio XE or Intel System Studio.
  • Operating system: Red Hat Enterprise Linux* Server 7.2
  • CPU: Intel microarchitecture code name Skylake
  • IO device specification: Intel® Solid State Drive Data Center Family for PCIe* P3500/P3600/P3700 Series

Run Disk Input and Output Analysis

For IO bound applications, it is recommended to start with the Disk Input and Output analysis:
  1. Click the New Project button on the toolbar and specify a name for the new project, for example: hdparm.
  2. In the Analysis Target window, select the local host target system type for the host-based analysis.
  3. Select the Launch Application target type and specify an application for analysis on the right pane.
  4. Click the Choose Analysis button on the right, select Platform Analysis > Disk Input and Output, and click Start.
    VTune Amplifier launches the application, collects data, and finalizes the data collection result, resolving symbol information that is required for successful source analysis.
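The same collection can also be started from the command line, which is convenient for scripted or remote runs. The sketch below is a minimal, hypothetical Python wrapper: it assumes the amplxe-cl command-line tool from VTune Amplifier 2018 is on PATH and that the analysis type is named disk-io; check amplxe-cl -help collect on your installation for the exact name. The hdparm invocation and device path are placeholders.

    # Minimal sketch: start the Disk Input and Output collection from a script.
    # Assumptions: amplxe-cl (VTune Amplifier 2018) is on PATH and the analysis
    # type is named "disk-io"; verify with: amplxe-cl -help collect
    import subprocess

    def collect_disk_io(app_cmd, result_dir="r_hdparm_diskio"):
        """Run the Disk Input and Output analysis for the given application command."""
        cmd = ["amplxe-cl", "-collect", "disk-io", "-result-dir", result_dir, "--"] + app_cmd
        subprocess.run(cmd, check=True)

    if __name__ == "__main__":
        # hdparm -t performs timed sequential device reads for about 3 seconds
        # (requires root). The device path is a placeholder for your SSD.
        collect_disk_io(["hdparm", "-t", "/dev/nvme0n1"])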

Analyze Bandwidth and Latency Metrics

Start your analysis with the Summary view that provides high-level statistics on the application execution. Focus on the I/O Wait Time metric, which is a primary indicator of I/O efficiency:
The I/O Wait Time metric shows that the hdparm application was waiting on I/O for almost 30% of the Elapsed time.
Select the read Disk IO operation type on the histogram to analyze the read access time distribution:
Unsteady flow typically signals a performance degradation. This is also confirmed by the read access time, which is three orders of magnitude greater than the 20 usec declared in the device specification.
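To put that gap in perspective, here is a tiny illustration (not output from the tool) of how far an access time in the tens of milliseconds is from the roughly 20 usec the specification declares; the observed value used below is a hypothetical one from the range shown in the histogram.

    # Illustration only: compare an observed read access time with the device spec.
    SPEC_READ_LATENCY_US = 20.0       # declared read latency, microseconds

    def latency_ratio(observed_us):
        """How many times the observed access time exceeds the specified latency."""
        return observed_us / SPEC_READ_LATENCY_US

    observed_us = 20_000.0            # hypothetical 20 ms access time, in microseconds
    print(f"{latency_ratio(observed_us):.0f}x the specified latency")   # ~1000x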
Switch to the Bottom-up window and apply the Storage Device/Partition grouping level. Focus on the Timeline data:
The I/O Operations and Data Transfers sections of the Timeline view show a high number of I/O Waits and an unsteady data flow.
The PCIe Bandwidth section shows that the read bandwidth of the device, which is local to package_0, is only about 65% of what the device specification claims.
Change the Timeline grouping to Package / Core / H/W Context to explore your application affinity:
You see that the application is running on package_1, although the device is local to package_0. This could be the reason for the high latency and lower than expected bandwidth.
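You can cross-check this observation outside of VTune by comparing the NUMA node that sysfs reports for the PCIe device with the CPUs the application is allowed to run on. The sketch below is a minimal example that assumes a standard Linux sysfs layout; the PCI address is a placeholder to be replaced with the one lspci reports for your SSD.

    # Minimal sketch: compare a PCIe device's local NUMA node with process affinity.
    # The PCI address below is a placeholder; find yours with lspci.
    import os

    PCI_DEVICE = "/sys/bus/pci/devices/0000:5e:00.0"   # hypothetical NVMe SSD address

    def device_numa_node(pci_path=PCI_DEVICE):
        """NUMA node (package) the PCIe device is attached to, or -1 if unknown."""
        with open(os.path.join(pci_path, "numa_node")) as f:
            return int(f.read().strip())

    def device_local_cpus(pci_path=PCI_DEVICE):
        """CPU list local to the device, as reported by sysfs (for example, '0-13,28-41')."""
        with open(os.path.join(pci_path, "local_cpulist")) as f:
            return f.read().strip()

    def process_cpus(pid=0):
        """CPUs the process is currently allowed to run on (0 means the current process)."""
        return sorted(os.sched_getaffinity(pid))

    if __name__ == "__main__":
        print("Device NUMA node:", device_numa_node())
        print("Device-local CPUs:", device_local_cpus())
        print("Process affinity:", process_cpus())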

Change Application Affinity and Re-run the Analysis

To resolve the detected IO issues while keeping the workload itself and the device placement intact, change the application affinity so that it runs on the package local to the device, and re-run the Disk Input and Output analysis.
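From a shell, you could place taskset -c <cpulist> or numactl --cpunodebind=<node> in front of the application command. The sketch below shows the same idea in Python with os.sched_setaffinity; the CPU set is a placeholder and should be replaced with the device-local CPU list read from sysfs (see the previous sketch).

    # Minimal sketch: pin the workload to CPUs local to the device, then launch it.
    # The CPU set below is a placeholder; use the device's local_cpulist from sysfs.
    import os
    import subprocess

    DEVICE_LOCAL_CPUS = set(range(0, 14))   # hypothetical: cores of package_0

    def run_pinned(app_cmd, cpus=DEVICE_LOCAL_CPUS):
        """Restrict this process (and its children) to the given CPUs and run the app."""
        os.sched_setaffinity(0, cpus)       # child processes inherit the affinity mask
        subprocess.run(app_cmd, check=True)

    if __name__ == "__main__":
        run_pinned(["hdparm", "-t", "/dev/nvme0n1"])   # placeholder device path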
The new result shows that the application is waiting on I/O operations only about 2% of the Elapsed time:
The histogram no longer shows the slow read access time distribution; all IO operations are now executed in a sub-millisecond range:
The Timeline view now displays smooth data flows for IO operations and IO Data Transfers, which confirms that affinity optimization reduced the latency:
The change also increased the PCIe bandwidth to about 93% of what the device specification claims.

Take-Aways

Learn some key take-aways from IO performance analysis for PCIe bandwidth-bound applications:
  • Determine IO Unit (IOU) Affinity for PCIe devices.
  • Distribute applications to IO Units appropriately.
  • Learn performance capabilities of your device.
  • Set reasonable performance targets.
  • Run Disk Input and Output analysis to debug IO solutions with lower than expected bandwidth.
