IO Issues: High Latency and Low PCIe* Bandwidth

This recipe uses the Disk Input and Output analysis in Intel® VTune™ Amplifier to profile a sample IO-bound application, then changes the application's affinity relative to a PCIe device to increase read bandwidth and reduce latency.

Content experts: Roman Sudarikov

  1. INGREDIENTS

  2. DIRECTIONS:

    1. Run Disk Input and Output analysis

    2. Analyze bandwidth and latency metrics

    3. Change application affinity and re-run the analysis

    4. Learn take-aways

Note

Disk IO analysis was renamed to Input and Output analysis starting with VTune Amplifier 2019.

Ingredients

This section lists the hardware and software tools used for the performance analysis scenario.

  • Application: hdparm, which performs sequential 128K read accesses for 3 seconds. The application is available at https://sourceforge.net/projects/hdparm.

  • Performance analysis tools:

    • Intel® VTune™ Amplifier 2018: Disk Input and Output analysis

    Note

    • For trial VTune Amplifier downloads and product support, visit https://software.intel.com/en-us/vtune.

    • All the Cookbook recipes are scalable and can be applied to VTune Amplifier 2018 and higher. Slight version-specific configuration changes are possible.

  • Operating system: Red Hat* Enterprise Server 7.2

  • CPU: Intel microarchitecture code name Skylake

  • IO device specification: Intel® Solid State Drive Data Center Family for PCIe* P3500/P3600/P3700 Series

Run Disk Input and Output Analysis

For IO-bound applications, start with the Disk Input and Output analysis:

  1. Click the New Project button on the toolbar and specify a name for the new project, for example: hdparm.

  2. In the Analysis Target window, select the local host target system type for the host-based analysis.

  3. Select the Launch Application target type and specify an application for analysis on the right pane.

  4. Click the Choose Analysis button on the right, select Platform Analysis > Disk Input and Output and click Start.

    VTune Amplifier launches the application, collects data, and finalizes the collection result, resolving the symbol information required for source analysis.

Analyze Bandwidth and Latency Metrics

Start your analysis with the Summary view that provides high-level statistics on the application execution. Focus on the I/O Wait Time metric, which is a primary indicator of I/O efficiency:

The I/O Wait Time metric shows that the hdparm application spent almost 30% of the Elapsed time waiting on I/O.

Select the read Disk IO operation type on the histogram to analyze the read access time distribution:

An unsteady flow typically signals performance degradation. The read access time confirms this: it is 3 orders of magnitude greater than the value the device specification declares (20 µs).
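To put the gap in concrete terms, the short sketch below converts "3 orders of magnitude above a 20 µs specification" into an absolute figure. It uses only the two numbers quoted above. The function name is illustrative, and the result is an order-of-magnitude estimate, not a measured value.

```python
# Read latency declared by the device specification, as quoted in the recipe.
SPEC_READ_LATENCY_S = 20e-6  # 20 microseconds


def implied_latency_s(spec_latency_s: float, orders_of_magnitude: int) -> float:
    """Latency implied by being N orders of magnitude above the spec value."""
    return spec_latency_s * 10 ** orders_of_magnitude


observed = implied_latency_s(SPEC_READ_LATENCY_S, 3)
print(f"roughly {observed * 1e3:.0f} ms per read access")
```

Under these assumptions, read accesses take on the order of 20 ms, which is why they show up in the slow bins of the histogram.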

Switch to the Bottom-up window and apply the Storage Device/Partition grouping level. Focus on the Timeline data:

The I/O Operations and Data Transfers sections of the Timeline view show a high number of IO waits and an unsteady data flow.

The PCIe Bandwidth section shows that the read bandwidth of the device, which is local to package_0, is only about 65% of what the device specification claims.

Change the Timeline grouping to Package / Core / H/W Context to explore your application affinity:

You see that the application is running on package_1, although the device is local to package_0. This could be the reason for the high latency and lower-than-expected bandwidth.
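The same locality check can be sketched outside VTune Amplifier on Linux. The example below is a minimal sketch and not part of the recipe. It assumes the kernel exposes a `local_cpulist` file for the PCIe device under sysfs, and it compares that list with the CPUs the current process is allowed to run on. The PCI address is a hypothetical placeholder and the helper names are illustrative.

```python
import os
from pathlib import Path
from typing import Set


def parse_cpulist(text: str) -> Set[int]:
    """Parse a Linux cpulist string such as '0-3,8-11' into a set of CPU ids."""
    cpus: Set[int] = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-")
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return cpus


def runs_local_to_device(process_cpus: Set[int], device_cpus: Set[int]) -> bool:
    """True if every CPU the process may run on is local to the device."""
    return bool(process_cpus) and process_cpus <= device_cpus


if __name__ == "__main__":
    # Hypothetical PCI address; substitute the address of your own device.
    cpulist_file = Path("/sys/bus/pci/devices/0000:5e:00.0/local_cpulist")
    if cpulist_file.exists():
        device_cpus = parse_cpulist(cpulist_file.read_text())
        print(runs_local_to_device(os.sched_getaffinity(0), device_cpus))
```

A result of False corresponds to the situation shown in the Timeline: the process is allowed to run on CPUs that are remote to the device.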

Change Application Affinity and Re-run the Analysis

To resolve the detected IO issues while keeping the workload and the device placement intact, change the application affinity so that the application runs on the package local to the device (package_0), and rerun the Disk Input and Output analysis.
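The recipe does not name the tool used to change the affinity. One possible approach on Linux, sketched below as an assumption, is to read the device's NUMA node from sysfs and launch the workload under `numactl`. The sketch assumes that each package is exposed as one NUMA node. The PCI address, the `/dev/nvme0n1` device node, and the `hdparm -t` invocation are hypothetical placeholders for your own setup.

```python
import shlex
from pathlib import Path
from typing import List


def device_numa_node(pci_device_dir: Path) -> int:
    """Return the NUMA node sysfs reports for a PCI device (-1 if unknown)."""
    return int((pci_device_dir / "numa_node").read_text().strip())


def numactl_command(node: int, workload: List[str]) -> List[str]:
    """Prefix the workload with numactl so CPUs and memory come from one node."""
    if node < 0:
        # No locality information available: run the workload unpinned.
        return list(workload)
    return ["numactl", f"--cpunodebind={node}", f"--membind={node}", *workload]


if __name__ == "__main__":
    # Hypothetical PCI address and block device; substitute your own.
    pci_dir = Path("/sys/bus/pci/devices/0000:5e:00.0")
    if pci_dir.is_dir():
        cmd = numactl_command(device_numa_node(pci_dir),
                              ["hdparm", "-t", "/dev/nvme0n1"])
        print(shlex.join(cmd))
```

The printed command line can then be specified as the application to launch in the analysis target configuration, so that the re-run measures the pinned workload.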

The new result shows that the application is waiting on I/O operations only about 2% of the Elapsed time:

The histogram no longer shows a wide read access time distribution. All IO operations complete in the sub-millisecond range:

The Timeline view now displays smooth data flows for IO operations and IO Data Transfers, which confirms that affinity optimization reduced the latency:

The change also increased the PCIe bandwidth to about 93% of what the device specification claims.
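As a quick check of what the affinity change bought, the sketch below compares the approximate before and after figures quoted in this recipe (I/O wait of almost 30% versus about 2% of Elapsed time, and bandwidth of about 65% versus about 93% of the specification). The inputs are rounded readings from the text, so treat the results as estimates.

```python
def ratio(numerator: float, denominator: float) -> float:
    """Plain ratio of two measurements expressed in the same unit."""
    return numerator / denominator


# Share of the specified PCIe read bandwidth, after vs. before the change.
bandwidth_gain = ratio(93.0, 65.0)

# Share of Elapsed time spent in I/O wait, before vs. after the change.
wait_reduction = ratio(30.0, 2.0)

print(f"bandwidth: {bandwidth_gain:.2f}x, I/O wait: {wait_reduction:.0f}x lower")
```

With these inputs, read bandwidth improves by roughly 1.4x and the share of time spent waiting on I/O drops by roughly 15x.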

Take-Aways

Key take-aways from IO performance analysis for PCIe bandwidth-bound applications:

  • Determine IO Unit (IOU) Affinity for PCIe devices.

  • Distribute applications to IO Units appropriately.

  • Learn performance capabilities of your device.

  • Set reasonable performance targets.

  • Run Disk Input and Output analysis to debug IO solutions with lower-than-expected bandwidth.
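For the first take-away, the placement of PCIe devices can be surveyed before any profiling. The sketch below is one possible approach on Linux and not part of the recipe. It assumes that sysfs provides a `numa_node` file for each PCI device and groups device addresses by that value. The function name is illustrative.

```python
from collections import defaultdict
from pathlib import Path
from typing import Dict, List


def devices_by_numa_node(pci_root: Path) -> Dict[int, List[str]]:
    """Group PCI device addresses by the NUMA node sysfs reports for them."""
    groups: Dict[int, List[str]] = defaultdict(list)
    for dev in sorted(pci_root.iterdir()):
        node_file = dev / "numa_node"
        if node_file.is_file():
            groups[int(node_file.read_text().strip())].append(dev.name)
    return dict(groups)


if __name__ == "__main__":
    root = Path("/sys/bus/pci/devices")
    if root.is_dir():
        for node, addresses in sorted(devices_by_numa_node(root).items()):
            print(f"node {node}: {len(addresses)} device(s)")
```

A node value of -1 means the platform did not report locality for that device, in which case the affinity has to be determined by other means.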
