How to use Disk I/O analysis in Intel® VTune™ Amplifier for systems

Published: 09/12/2016   Last Updated: 09/11/2016

Introduction

Intel® VTune™ Amplifier 2017 introduces disk input and output analysis, a feature for analyzing disk-related performance issues based on device utilization, request latency, and bandwidth to the device. It provides a consistent view of the storage subsystem combined with hardware events such as device queue utilization and I/O transfer rate, along with an easy way to match user-level source code with the I/O packets executed by the hardware.

Overview

To access VTune Amplifier’s disk I/O feature, select the “Disk Input and Output” analysis type under “Platform Analysis” in the Analysis Type tab.

This article illustrates disk I/O analysis with a simple file copy example that reads an input file (about 1 GB in size), computes a checksum, and writes the data to an output file. Here is a snippet of the code:

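    /* Open the input and output files; give up if either cannot be opened. */
    infile = fopen(infilename, "rb");
    if (infile == NULL) {
        fprintf(stderr, "%s can't be opened.\n", infilename);
        return 0;
    }
    outfile = fopen(outfilename, "wb");
    if (outfile == NULL) {
        fprintf(stderr, "%s can't be opened.\n", outfilename);
        fclose(infile);
        return 0;
    }
    srand((unsigned)time(NULL));
    MD5_Init(&context);
    /* Copy the file chunk by chunk, updating the checksum as we go. */
    while ((bytes_read = fread(buffer, 1, buffer_size, infile)) != 0) {
        MD5_Update(&context, buffer, bytes_read);
        fwrite(buffer, 1, bytes_read, outfile);
        /* Pick a random size (1% to 100% of MAX_BUFFER_SIZE) for the next read. */
        buffer_size = ((((rand() % 100) + 1) / 100.0) * MAX_BUFFER_SIZE);
    }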

You can run the Disk I/O collection either from the GUI or from the command line, as shown below.

amplxe-cl -collect disk-io target-appl.exe

Once the collection finishes, the Summary window opens.

The Summary window indicates whether your application is I/O bound or CPU bound. The disk input and output histogram plots the read, write, and flush operations of the file copy application by duration in seconds, grouped as fast, good, or slow. You can drag the triangles below the fast, good, and slow indicators to adjust the thresholds to your needs. In this run, up to 56 write operations qualify as slow. The same information is available for the other I/O operations, such as reads and flushes.
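To make the fast/good/slow grouping concrete, here is a minimal sketch of how operation durations could be binned against two thresholds. It is an illustration only, not VTune Amplifier's implementation: the type names, the function name classify_durations, and the rule that a duration equal to a threshold falls into the lower bin are all assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical thresholds in seconds, standing in for the triangles
 * that can be dragged in the histogram. */
typedef struct {
    double fast_max;  /* durations <= fast_max count as fast */
    double good_max;  /* durations in (fast_max, good_max] count as good */
} io_thresholds;

/* Number of operations that fell into each bin. */
typedef struct {
    size_t fast;
    size_t good;
    size_t slow;
} io_histogram;

/* Bin each I/O operation duration (in seconds) as fast, good, or slow. */
io_histogram classify_durations(const double *durations, size_t count,
                                io_thresholds t)
{
    io_histogram h = {0, 0, 0};
    size_t i;

    assert(t.fast_max <= t.good_max);
    for (i = 0; i < count; i++) {
        if (durations[i] <= t.fast_max)
            h.fast++;
        else if (durations[i] <= t.good_max)
            h.good++;
        else
            h.slow++;
    }
    return h;
}
```

Moving a threshold changes only which bin an operation lands in, which is why adjusting the triangles can change the number of operations reported as slow without re-running the collection.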

For further analysis of your application, switch to the Bottom-up tab. Its top panel lists the top hotspot functions along with a breakdown of the I/O operations for each function, as shown in the highlighted box. The top hotspot, the filecopy function, has 2 slow reads that can probably be optimized. Double-clicking the function opens the source code window, where you can narrow down the lines of code to optimize.

The grouping can also be changed to storage device/partition. This gives a breakdown by disk and helps identify utilization or latency issues when there are multiple disks.

To get a better understanding, switch to the Platform tab.

With the thread checkbox enabled, this tab provides a timeline of CPU time spent, context switches, I/O APIs, and the slow tasks for your application. As can be seen, I/O wait times make up a significant part of the timeline, indicating that the application is mostly I/O bound. If you hover over the timeline, a pop-up shows the operation, its duration, and the reason for the wait at that point. The highlighted box shows the 2 slow reads that were part of the top hotspot in the Bottom-up pane. You can filter in by selection for a chosen time range, and all the other view windows are updated for that range.

The I/O queue depth indicates the number of I/O requests submitted to the storage device. This gives an idea of the disk utilization over the lifespan of the application. Enabling the slow spike shows exactly where the slow packets are executed.

The I/O data transfer shows the number of bytes read from and written to the disk, indicating points of high disk bandwidth utilization.
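As a rough mental model of queue depth, the sketch below counts how many requests are outstanding at a given instant, given the submit and completion time of each request. This is an assumed, simplified definition for illustration; the article does not state how VTune Amplifier computes the metric, and the io_request type and queue_depth_at function are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record of one I/O request: when it was submitted to the
 * device and when the device completed it (both in seconds). */
typedef struct {
    double submit;
    double complete;
} io_request;

/* Count the requests outstanding at time t, that is, requests already
 * submitted but not yet completed. */
size_t queue_depth_at(const io_request *reqs, size_t count, double t)
{
    size_t depth = 0;
    size_t i;

    for (i = 0; i < count; i++) {
        assert(reqs[i].submit <= reqs[i].complete);
        if (reqs[i].submit <= t && t < reqs[i].complete)
            depth++;
    }
    return depth;
}
```

Under this model, a depth that stays near zero suggests the device is often idle, while a persistently high depth suggests requests are queuing faster than the device can serve them.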

Conclusion

With all this information, you can identify issues such as imbalance between compute and I/O, high latency, and poor utilization, and then apply appropriate optimizations. Options for overcoming I/O issues include changing the application logic to run compute threads in parallel with I/O, changing the size of I/O operations, or using faster storage.
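As one example of changing the size of I/O operations, the sketch below copies a file using a single fixed buffer size for every read and write, in place of the random sizes used in the sample application. It is a minimal illustration under stated assumptions: copy_fixed_buffer is a hypothetical helper, a simple byte sum stands in for the MD5 checksum to keep the example free of external libraries, and whether a given buffer size actually helps depends on the workload and the storage device, so it should be confirmed by re-running the analysis.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Copy in_path to out_path using one fixed-size buffer for every read
 * and write. Returns the number of bytes copied, or -1 on error.
 * *checksum receives a simple byte sum (a stand-in for MD5). */
long copy_fixed_buffer(const char *in_path, const char *out_path,
                       size_t buffer_size, unsigned long *checksum)
{
    FILE *in, *out;
    unsigned char *buffer;
    size_t bytes_read, i;
    long total = 0;
    int ok = 1;

    assert(buffer_size > 0 && checksum != NULL);
    *checksum = 0;

    in = fopen(in_path, "rb");
    if (in == NULL)
        return -1;
    out = fopen(out_path, "wb");
    if (out == NULL) {
        fclose(in);
        return -1;
    }
    buffer = malloc(buffer_size);
    if (buffer == NULL) {
        fclose(in);
        fclose(out);
        return -1;
    }

    /* Every request has the same size, except possibly the last one. */
    while ((bytes_read = fread(buffer, 1, buffer_size, in)) > 0) {
        for (i = 0; i < bytes_read; i++)
            *checksum += buffer[i];
        if (fwrite(buffer, 1, bytes_read, out) != bytes_read) {
            ok = 0;
            break;
        }
        total += (long)bytes_read;
    }
    if (ferror(in))
        ok = 0;

    free(buffer);
    fclose(in);
    if (fclose(out) != 0)  /* fclose flushes buffered data; check it too */
        ok = 0;
    return ok ? total : -1;
}
```

A call such as copy_fixed_buffer(infilename, outfilename, MAX_BUFFER_SIZE, &sum) could then be profiled with several buffer sizes to compare the number of slow operations reported in the histogram.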

Product and Performance Information

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804