Detecting Disk I/O-bound Applications in Server Systems

In my previous blog, “Detecting CPU-bound Applications in Server Systems”, I discussed how to detect a CPU-bound application. I continue the performance analysis and debugging discussion here in my second blog, “Detecting Disk I/O Bound Applications in Server Systems”. I will show which common Linux* utilities can be used to detect I/O-bound applications, and then cover the technologies available to increase the overall I/O performance of servers.

Analyzing Linux* servers for I/O bound applications

I conducted this test on a machine running Red Hat Enterprise Linux* 6.3, but you should be able to obtain all the utilities mentioned here on other Linux* distributions as well. When intensive disk I/O applications run, they may consume almost all of the available disk I/O bandwidth, leaving other disk-dependent applications contending for the same resource. Intensive disk I/O applications can therefore slow down the whole system. Also, because storage speed cannot keep up with CPU speed, an intensive disk I/O application spends most of its time waiting for I/O to complete rather than doing useful work. As a system administrator, you need to identify the applications that consume too much disk I/O and take the proper action.

In this experiment, I will share the results of running a disk I/O-bound application and show how you can detect such an application using simple Linux commands. First, I downloaded the file system benchmark tool iozone (http://www.iozone.org) and installed it on my system equipped with a Hard Disk Drive (HDD). This tool can perform many different combinations of read/write operations. I use the following command to run 4 iozone processes in order to generate intensive I/O activity on my hard disk drive, with a record size of 1 MB and a file size of 1 GB (option -l sets the lower limit on the number of processes, -u the upper limit, -r the record size, -s the file size, and -F the file names):

#iozone -l 4 -u 4 -r 1m -s 1G -F ./f1 ./f2 ./f3 ./f4

While the tool is running, I issue the command top, which shows 14.9% in I/O wait time (%wa); that is, the CPU spends 14.9% of its time waiting for disk I/O while this application performs its intensive I/O operations. Looking further down the list, you can see that the iozone processes are the ones consuming the most CPU cycles:

#top
top - 14:30:50 up 52 days,  6:02,  7 users,  load average: 5.13, 3.03, 1.25
Tasks: 766 total,   1 running, 765 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.0%us,  0.7%sy,  0.0%ni, 84.3%id, 14.9%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  49375504k total, 23847304k used, 25528200k free, 10796936k buffers
Swap: 68157432k total,        0k used, 68157432k free, 11075696k cached
   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND          
124854 root      20   0 47228  18m  112 S  4.1  0.0   0:04.00 iozone           
124856 root      20   0 47228  18m  112 S  4.1  0.0   0:03.93 iozone           
124767 root      20   0     0    0    0 D  3.5  0.0   0:02.06 flush-8:0        
124853 root      20   0 47228  18m  112 S  3.5  0.0   0:03.90 iozone           
124855 root      20   0 47228  18m  112 S  3.5  0.0   0:03.85 iozone           
124851 root      20   0 15628 1824  964 S  0.6  0.0   0:00.57 top              
124876 root      20   0 15628 1788  940 R  0.6  0.0   0:00.04 top              
  3452 root      20   0     0    0    0 S  0.3  0.0   1:41.65 kondemand/16     
     1 root      20   0 19396 1564 1256 S  0.0  0.0   0:03.29 init             
     2 root      20   0     0    0    0 S  0.0  0.0   0:00.00 kthreadd         
     3 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/0      
     4 root      20   0     0    0    0 S  0.0  0.0   0:00.22 ksoftirqd/0      
     5 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/0      
     6 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 watchdog/0       
     7 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/1      
     8 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/1      
     9 root      20   0     0    0    0 S  0.0  0.0   0:00.00 ksoftirqd/1      
    10 root      RT   0     0    0    0 S  0.0  0.0   0:00.08 watchdog/1       
    11 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/2      
    12 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/2      
    13 root      20   0     0    0    0 S  0.0  0.0   0:00.01 ksoftirqd/2      
    14 root      RT   0     0    0    0 S  0.0  0.0   0:00.01 watchdog/2       
    15 root      RT   0     0    0    0 S  0.0  0.0   0:00.88 migration/3      
   <output truncated>
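The %wa figure that top reports is derived from the per-CPU counters in /proc/stat. As a minimal sketch, the same percentage can be computed by sampling /proc/stat twice (field positions follow the standard Linux layout of the aggregate "cpu" line):

```shell
#!/bin/sh
# Sketch: derive the %wa figure top shows directly from /proc/stat.
# Aggregate "cpu" line fields: user nice system idle iowait irq softirq steal
sample() {
  awk '/^cpu /{ print $6, $2+$3+$4+$5+$6+$7+$8+$9 }' /proc/stat
}
set -- $(sample); iow1=$1; tot1=$2
sleep 1
set -- $(sample); iow2=$1; tot2=$2
# Fraction of total CPU time spent in iowait during the interval
awk -v i="$((iow2 - iow1))" -v t="$((tot2 - tot1))" \
    'BEGIN { printf "%.1f%% iowait\n", (t > 0) ? 100 * i / t : 0 }'
```

On an idle system this prints a value near 0.0%; while the iozone run above is active it tracks the %wa column that top displays.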

Another useful tool is vmstat, which displays information about memory, I/O, and CPU. I/O wait time (wa) is also shown here:

#vmstat 5
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo    in   cs us sy id wa st
 0  5      0 18501280 21129352 1326140    0    0    462    92    0    0  1  0 97  1  0 
 1  5      0 18500596 21129352 1326144    0    0      0 109030 400 7593  0  0 92  7  0 
 2  5      0 18501016 21129352 1326144    0    0      0  94111 334 6350  0  0 92  8  0 
 1  5      0 18498644 21129352 1326144    0    0      0 125976 402 7314  0  0 93  7  0 

To take a closer look at which CPUs are running the offending application, I use the command mpstat with the option -P ALL, which shows information for each individual core. In this experiment, under the %iowait column, the cores running the application (CPUs 8, 9, 12, 17, 24, and 25) show a high percentage of I/O wait. The following command refreshes the information for all cores in the system every 5 seconds:

#mpstat -P ALL 5
Linux 2.6.32-279.el6.x86_64 (knightscorner4)  01/13/2014  _x86_64_ (32 CPU)
03:30:57 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
03:31:02 PM  all    3.24    0.00    0.63    9.10    0.00    0.03    0.00    0.00   87.01
03:31:02 PM    0  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
03:31:02 PM    1    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
03:31:02 PM    2    3.40    0.00    0.60    0.00    0.00    0.00    0.00    0.00   96.00
03:31:02 PM    3    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
03:31:02 PM    4    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
03:31:02 PM    5    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
03:31:02 PM    6    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
03:31:02 PM    7    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
03:31:02 PM    8    0.21    0.00    3.51   92.16    0.00    0.82    0.00    0.00    3.30
03:31:02 PM    9    0.00    0.00    1.40   41.28    0.00    0.00    0.00    0.00   57.31
03:31:02 PM   10    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
03:31:02 PM   11    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
03:31:02 PM   12    0.00    0.00    7.57   20.12    0.00    0.00    0.00    0.00   72.31
03:31:02 PM   13    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
03:31:02 PM   14    0.00    0.00    0.00    4.21    0.00    0.00    0.00    0.00   95.79
03:31:02 PM   15    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
03:31:02 PM   16    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
03:31:02 PM   17    0.20    0.00    1.99   34.66    0.00    0.00    0.00    0.00   63.15
03:31:02 PM   18    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
03:31:02 PM   19    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
03:31:02 PM   20    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
03:31:02 PM   21    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
03:31:02 PM   22    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
03:31:02 PM   23    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
03:31:02 PM   24    0.00    0.00    3.61   61.24    0.00    0.00    0.00    0.00   35.14
03:31:02 PM   25    0.00    0.00    0.80   39.76    0.00    0.00    0.00    0.00   59.44
03:31:02 PM   26    0.00    0.00    0.20    0.00    0.00    0.00    0.00    0.00   99.80
03:31:02 PM   27    0.20    0.00    0.40    0.00    0.00    0.00    0.00    0.00   99.40
03:31:02 PM   28    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
03:31:02 PM   29    0.00    0.00    0.20    0.00    0.00    0.00    0.00    0.00   99.80
03:31:02 PM   30    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
03:31:02 PM   31    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
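To cross-check mpstat's per-core view against where the offending processes are actually running, procps can report the CPU each task was last scheduled on (the PSR column). A quick sketch, using the iozone processes from this experiment as the target:

```shell
# Show the CPU (PSR) each task was last scheduled on, plus its CPU usage.
# The [i]ozone bracket trick keeps grep from matching its own command line;
# the fallback message covers the case where no iozone tasks are running.
ps -eLo pid,tid,psr,pcpu,comm | grep '[i]ozone' \
    || echo "no iozone processes running"
```

The PSR values should line up with the cores that mpstat flags with high %iowait.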

To track down I/O usage by current processes or threads, we can use the command iotop. The first line displays the total read I/O bandwidth and the total write I/O bandwidth. The IO column indicates the percentage of time each process spent waiting on disk I/O:

#iotop
Total DISK READ: 0.00 B/s | Total DISK WRITE: 134.91 M/s
TID PRIO   USER DISK READ DISK WRITE  SWAPIN  IO    COMMAND
33110 be/4 root  0.00 B/s  0.00 B/s  0.00 % 82.57 % iozone -l 8 -u 8 -r 1m -s 1G -F ./f1 ./f2 ./f3 ./f4
33114 be/4 root  0.00 B/s  0.00 B/s  0.00 % 82.55 % iozone -l 8 -u 8 -r 1m -s 1G -F ./f1 ./f2 ./f3 ./f4
33112 be/4 root  0.00 B/s  0.00 B/s  0.00 % 82.54 % iozone -l 8 -u 8 -r 1m -s 1G -F ./f1 ./f2 ./f3 ./f4
12024 be/4 root  0.00 B/s 24.95 K/s  0.00 % 78.37 % [flush-8:16]
33116 be/4 root  0.00 B/s  0.00 B/s  0.00 % 76.44 % iozone -l 8 -u 8 -r 1m -s 1G -F ./f1 ./f2 ./f3 ./f4
 1265 be/3 root  0.00 B/s  0.00 B/s  0.00 % 28.22 % [jbd2/sdb2-8]
33105 be/4 root  0.00 B/s 21.20 K/s  0.00 %  0.00 % python /usr/bin/iotop -b -d 3
 4096 be/4 root  0.00 B/s  0.00 B/s  0.00 %  0.00 % [ext4-dio-unwrit]
    1 be/4 root  0.00 B/s  0.00 B/s  0.00 %  0.00 % init
    2 be/4 root  0.00 B/s  0.00 B/s  0.00 %  0.00 % [kthreadd]
    3 rt/4 root  0.00 B/s  0.00 B/s  0.00 %  0.00 % [migration/0]
    4 be/4 root  0.00 B/s  0.00 B/s  0.00 %  0.00 % [ksoftirqd/0]
    5 rt/4 root  0.00 B/s  0.00 B/s  0.00 %  0.00 % [migration/0]
    6 rt/4 root  0.00 B/s  0.00 B/s  0.00 %  0.00 % [watchdog/0]
    7 rt/4 root  0.00 B/s  0.00 B/s  0.00 %  0.00 % [migration/1]
    8 rt/4 root  0.00 B/s  0.00 B/s  0.00 %  0.00 % [migration/1]
    9 be/4 root  0.00 B/s  0.00 B/s  0.00 %  0.00 % [ksoftirqd/1]
   10 rt/4 root  0.00 B/s  0.00 B/s  0.00 %  0.00 % [watchdog/1]
   11 rt/4 root  0.00 B/s  0.00 B/s  0.00 %  0.00 % [migration/2]
   12 rt/4 root  0.00 B/s  0.00 B/s  0.00 %  0.00 % [migration/2]
   13 be/4 root  0.00 B/s  0.00 B/s  0.00 %  0.00 % [ksoftirqd/2]
   14 rt/4 root  0.00 B/s  0.00 B/s  0.00 %  0.00 % [watchdog/2]
   15 rt/4 root  0.00 B/s  0.00 B/s  0.00 %  0.00 % [migration/3]
   16 rt/4 root  0.00 B/s  0.00 B/s  0.00 %  0.00 % [migration/3]
   17 be/4 root  0.00 B/s  0.00 B/s  0.00 %  0.00 % [ksoftirqd/3]
   18 rt/4 root  0.00 B/s  0.00 B/s  0.00 %  0.00 % [watchdog/3]
   19 rt/4 root  0.00 B/s  0.00 B/s  0.00 %  0.00 % [migration/4]
   <output truncated>

With these high I/O percentages, we can easily identify which application is disk I/O bound. From the list above, we see that the iozone threads are the ones performing intensive I/O activity on the system.
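Under the hood, iotop gets these numbers from the kernel's per-task I/O accounting. If iotop is not available, the raw counters can be read directly from /proc (a minimal sketch; root may be required to read other users' processes):

```shell
# Per-process I/O counters as exposed by the kernel in /proc/<pid>/io.
# read_bytes/write_bytes count traffic that actually hit the storage layer
# (what iotop's bandwidth columns show); rchar/wchar also include bytes
# served from the page cache.
pid=$$                      # example: inspect this shell itself
cat /proc/"$pid"/io
```

Sampling these counters twice and dividing the delta by the interval gives a per-process bandwidth figure, which is essentially what iotop does.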

Technologies to help improve I/O throughput on servers

Solid-State Drives (SSD)

Hard drives have inherent I/O performance limitations because they are mechanical devices. When a process requests an I/O transaction such as a read operation, the drive must move its head to the right track (seek time), wait for the disk block to rotate under the head (rotational latency), and only then read the content and return it to the requesting process.
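To see why this matters, here is a rough back-of-the-envelope estimate; the drive figures are illustrative assumptions for a typical desktop HDD, not measurements from the system above:

```shell
# Rough random-access IOPS ceiling for a mechanical drive.
# Assumed figures: 7200 RPM spindle, 8.5 ms average seek time.
awk 'BEGIN {
  rpm     = 7200
  seek_ms = 8.5
  rot_ms  = (60000 / rpm) / 2        # avg rotational latency: half a spin
  svc_ms  = seek_ms + rot_ms         # ~12.7 ms per random access
  printf "~%.0f random IOPS\n", 1000 / svc_ms
}'
```

At roughly 12.7 ms per access this works out to on the order of 80 random IOPS per spindle, which is why even a handful of I/O-hungry processes can saturate a single hard disk.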

When many I/O intensive applications run on a server equipped with hard drives, overall system performance may suffer because HDD storage may not be able to keep up with the demand placed on it by an ever-faster system. If your application turnaround time frequently suffers from I/O latency, you may consider upgrading your server with Solid-State Drives (SSD). An SSD is a data storage device that uses semiconductor memory to store data. SSDs are faster, smaller, consume less power, and are more resistant to shock than traditional HDDs.

SSDs also yield better performance than traditional HDDs: boot time and data access are much faster, latency is lower, and reliability is higher. For more information on Intel® SSDs, please refer to http://www.intel.com/content/www/us/en/solid-state-drives/solid-state-drives-ssd.html?iid=subhdr+products_flash and http://www.intel.com/content/dam/www/public/us/en/documents/best-practices/accelerating-data-center-workloads-with-ssd.pdf

To compare HDD with SSD performance, I configure a system where a traditional HDD (Western Digital WD1002FAEX, SATA 6 Gb/s, 3.5”, 1.0 TB) is mounted as /dev/sdb and an SSD (Intel® SSD DC S3700 Series, SATA 6 Gb/s, 2.5", 800 GB) is mounted as /dev/sda. I can retrieve information about the HDD and SSD using the command hdparm. For example, to query the hard disk on /dev/sdb, I issue the command:

#hdparm -it /dev/sdb
/dev/sdb:
 Model=WDC, FwRev=05.01D05, SerialNo=WD-WMATR0328882
 Config={ HardSect NotMFM HdSw>15uSec SpinMotCtl Fixed DTR>5Mbs FmtGapReq }
 RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=50
 BuffType=unknown, BuffSize=unknown, MaxMultSect=16, MultSect=16
 CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=1953525168
 IORDY=on/off, tPIO={min:120,w/IORDY:120}, tDMA={min:120,rec:120}
 PIO modes:  pio0 pio3 pio4
 DMA modes:  mdma0 mdma1 mdma2
 UDMA modes: udma0 udma1 udma2 udma3 udma4 udma5 *udma6
 AdvancedPM=no WriteCache=enabled
 Drive conforms to: Unspecified:  ATA/ATAPI-1,2,3,4,5,6,7
 * signifies the current active mode
 Timing buffered disk reads:  338 MB in  3.01 seconds = 112.21 MB/sec

Similarly, I can get information on the SSD mounted on /dev/sda:

#hdparm -it /dev/sda
/dev/sda:
 Model=INTEL, FwRev=5DV10265, SerialNo=BTTV3354018C800JGN
 Config={ Fixed }
 RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=0
 BuffType=unknown, BuffSize=unknown, MaxMultSect=1, MultSect=off
 CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=1562824368
 IORDY=on/off, tPIO={min:120,w/IORDY:120}, tDMA={min:120,rec:120}
 PIO modes:  pio0 pio3 pio4
 DMA modes:  mdma0 mdma1 mdma2
 UDMA modes: udma0 udma1 udma2 udma3 udma4 udma5 *udma6
 AdvancedPM=no WriteCache=enabled
 Drive conforms to: unknown:  ATA/ATAPI-2,3,4,5,6,7
 * signifies the current active mode
 Timing buffered disk reads:  622 MB in  3.00 seconds = 207.08 MB/sec

The hdparm command performs a quick test and shows that the buffered read rate of the SSD (207.08 MB/sec) is much faster than that of the HDD (112.21 MB/sec) for this particular test.

Upgrading all HDD storage to SSDs in a large system can be very costly. Depending on your needs, you can preserve a large HDD install base and use Intel® Cache Acceleration Software to improve the I/O performance of your servers.

Intel® Cache Acceleration Software (Intel® CAS)

Intel® Cache Acceleration Software takes advantage of SSD responsiveness to intelligently cache “hot” data, i.e., data that is used frequently, on the SSD, while less often used “cold” data is served from the HDD. For applications that consistently access specific data on disk, this approach delivers higher performance at a fraction of the cost of fully upgrading the storage to SSDs.

To illustrate this point, on the same system used in my previous example, I install an HDD, an SSD, and Intel CAS. Note that Intel CAS keeps the data on the HDD and uses the SSD only to cache hot data. I perform a random read test using the fio tool (see http://linux.die.net/man/1/fio) on the HDD, on the SSD, and with Intel CAS, and compare their performance. The 32 queue-depth result is obtained by setting --iodepth=8 and --numjobs=4, and the 64 queue-depth result by setting --iodepth=8 and --numjobs=8. The performance results are shown in the following graphs.
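For reference, a fio job description along these lines reproduces the shape of this test; the block size, working-set size, runtime, and target directory are illustrative assumptions, not the exact values used above:

```ini
; fio job sketch: random reads, queue depth = iodepth x numjobs
; set numjobs=4 for the 32 queue-depth case, numjobs=8 for 64
[randread]
rw=randread
bs=4k
size=1G
ioengine=libaio
direct=1
iodepth=8
numjobs=4
runtime=60
time_based
group_reporting
directory=/mnt/testdisk
```

Pointing directory at the HDD mount, the SSD mount, and the CAS-accelerated volume in turn yields the three data series being compared.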

The data shows that I/O bandwidth when using Intel CAS is comparable to the SSD (170 MB/sec vs. 183 MB/sec in the 32 queue-depth test, and 170 MB/sec vs. 182 MB/sec in the 64 queue-depth test). Latency when using Intel CAS is also comparable to the SSD (0.72 msec vs. 0.66 msec at queue depth 32, and 1.40 msec vs. 1.36 msec at queue depth 64). For random reads, then, Intel CAS performance is close to that of the SSD and much better than the HDD. This implies that if your system already has many HDDs installed, instead of upgrading them all to SSDs you can use Intel CAS (with one SSD to cache hot data) and obtain results comparable to SSDs for certain types of applications.

In conclusion, the commands above allow us to detect I/O-bound applications. To increase I/O performance, we can replace HDDs with SSDs. In addition, for some applications, instead of upgrading the whole system to SSDs, we can use Intel CAS to get comparable performance.
