Power Analysis of Disk I/O Methodologies

by Karthik Krishnan and Jun De Vega


Introduction

This whitepaper examines power characteristics during sequential and random reads and with native command queuing, analyzes file fragmentation and disk thrashing, and offers power-optimization guidelines.

Mobile users work with battery power, and extending the battery life has always been desirable to increase productivity and enhance user experience. Various applications constantly interact with the disk for read/write operations, and it is imperative that the software behaves in an optimal way to conserve battery power.

In this whitepaper, we analyze the power characteristics of the disk during sequential and random reads and with native command queuing, and provide an analysis of file fragmentation and disk thrashing. The paper also provides guidelines on optimizing power during disk I/O in various usage models, along with the power impact of each. The sample code provided with this whitepaper consists of an in-house multithreaded prototype that performs bitmap-to-JPEG conversion, and APIs to detect a fragmented file and defragment it, as well as code that exercises Native Command Queuing through asynchronous I/O.


Examining methodologies to optimize power

Disk Characteristics

First, let’s take a look at some of the characteristics of a hard drive. A hard drive is organized into sectors, which are the most basic addressable blocks of storage. Multiple sectors are grouped into clusters, which the operating system uses to store and manage files. The performance of a hard disk is related to factors such as RPM, seek time, rotational latency, and sustainable transfer rate. RPM is the rate at which the drive spins and directly impacts performance. Seek time is the time taken to move the read/write head from its current track to the target track, and rotational latency is the additional time spent waiting for the target sector to rotate under the head. Furthermore, the actual throughput of the system also depends on the physical location of the data on the drive. Since the angular velocity of the disk is constant, more data can be read from the outer tracks of the disk than from the inner tracks in a single rotation.

Every time an application places a read request, the disk may first have to be spun up, and the read/write head must be positioned at the appropriate sector. The data is then read, optionally placed in the OS file-system cache, and copied to the application buffer.

The following shows the relative time (in milliseconds) involved in these operations, based on the theoretical specification of the SATA drive used. While the actual numbers will vary between different drives, this gives a relative idea on the times taken.

(Platform Specs: Intel® Core™ Duo 2.0GHz, Jamison Canyon* CRB, 2x512MB DDR2, 40GB SATA 5400RPM-2.5’’ mobile, Windows* XP-SP2)

Seek Time, Avg (ms)    Latency Time, Avg (ms)    Spindle Start Time (ms)
12                     5.56                      4000

 

It is clear that the spin-up time is significantly longer than the seek/latency time. Let’s look at the corresponding power consumed by the disk during these operations:

As evidenced in the chart above, the average power during spin-up is significantly greater than the power during actual data transfer (i.e., a read/write operation). Applications performing disk I/O should take these facts into consideration to optimize disk power and performance. In the following sections, we explore several methodologies, using sample code, to optimize overall performance and power.

Windows* APIs

The Win32* interface provides various APIs for I/O operations, but all read/write calls are eventually routed through the ReadFile()/WriteFile() APIs in kernel32.dll. Applications can perform I/O in either synchronous or asynchronous mode. In addition, Win32 provides various flags that give control over file-system buffering and pass hints to the operating system about the type of I/O (e.g., sequential vs. random).
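As an illustration (a minimal sketch, not part of the whitepaper's sample code), the snippet below opens a file with one of these access-pattern hints; the helper name and the specific flag choice are ours:

#include <windows.h>

// Minimal sketch: opening a file with hints that control file-system
// buffering and read-ahead behavior.
HANDLE OpenForSequentialRead(const char *szPath)
{
    // FILE_FLAG_SEQUENTIAL_SCAN hints that the cache manager should read
    // ahead aggressively; FILE_FLAG_RANDOM_ACCESS hints the opposite, and
    // FILE_FLAG_NO_BUFFERING bypasses the file-system cache entirely
    // (requiring sector-aligned buffers and offsets).
    return CreateFileA(szPath,
                       GENERIC_READ,
                       FILE_SHARE_READ,
                       NULL,
                       OPEN_EXISTING,
                       FILE_ATTRIBUTE_NORMAL | FILE_FLAG_SEQUENTIAL_SCAN,
                       NULL);
}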

In synchronous mode, the application waits until the read/write request is complete and then continues processing. In asynchronous (overlapped) mode, the application submits the read/write request and continues processing while the I/O operation happens in the background; once the background operation is complete, the operating system notifies the application of I/O completion. Overlapped I/O can be implemented in various ways, such as signaled file handles, signaled event objects, Asynchronous Procedure Calls (APCs), and I/O completion ports.
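The sketch below illustrates the signaled-event-object variant (an illustrative helper, not the whitepaper's sample code; it assumes the file handle was opened with FILE_FLAG_OVERLAPPED):

#include <windows.h>

// Minimal sketch of overlapped I/O completed via a signaled event object.
// Error handling is abbreviated; hFile is assumed to have been opened with
// FILE_FLAG_OVERLAPPED.
BOOL ReadAsyncWithEvent(HANDLE hFile, void *buf, DWORD bytes, DWORD *bytesRead)
{
    OVERLAPPED ov = {0};
    ov.hEvent = CreateEvent(NULL, TRUE, FALSE, NULL);   // manual-reset event

    // ReadFile returns immediately; ERROR_IO_PENDING means the request was
    // queued and will complete in the background.
    if (!ReadFile(hFile, buf, bytes, NULL, &ov) &&
        GetLastError() != ERROR_IO_PENDING)
    {
        CloseHandle(ov.hEvent);
        return FALSE;
    }

    // The application is free to do other work here before waiting.
    WaitForSingleObject(ov.hEvent, INFINITE);
    BOOL ok = GetOverlappedResult(hFile, &ov, bytesRead, FALSE);
    CloseHandle(ov.hEvent);
    return ok;
}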

Impact of Block Size on Sequential Reads

Let us examine the impact of read block size on sequential reads, in terms of overall performance and power. For this analysis, we created a large file (around 1GB) and read the entire file using blocks of various sizes. As with all the disk I/O experiments in this paper, we rebooted the system between runs to eliminate file-system cache interference. The following shows the code flow for the read experiment:

HANDLE hFile = CreateFile(szLargeFile, GENERIC_READ, 0, NULL, OPEN_EXISTING,
                          FILE_ATTRIBUTE_NORMAL | FILE_FLAG_SEQUENTIAL_SCAN, NULL);

unsigned int filesize = GetFileSize(hFile, (LPDWORD)0);

totalread = 0;
nrem = filesize;                                        // bytes remaining to be read

for (totalread = 0; totalread < filesize; totalread += chunksize)
{
    if (nrem >= chunksize)
    {
        ReadFileChunk(hFile, pAlignedBuf, chunksize);   // full block
    }
    else
    {
        ReadFileChunk(hFile, pAlignedBuf, nrem);        // final partial block
    }
    nrem -= chunksize;
}
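
ReadFileChunk() above is a thin wrapper that is not shown in the original sample; a plausible implementation (our assumption) is simply:

// Hypothetical implementation of the ReadFileChunk() helper used above:
// a thin wrapper around a synchronous ReadFile() call.
static BOOL ReadFileChunk(HANDLE hFile, void *pBuf, DWORD nBytes)
{
    DWORD nRead = 0;
    // Blocks until nBytes (or the remainder of the file) is copied into pBuf.
    return ReadFile(hFile, pBuf, nBytes, &nRead, NULL) && nRead > 0;
}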

 

The following shows the impact of various block sizes on both sequential I/O bandwidth and CPU utilization:

The bandwidth increases significantly as we change from smaller block sizes to larger ones, and the corresponding processor utilization decreases as we move to larger block reads. The following table shows the relative comparison of the total energy consumed for reading the large file with various block sizes (energy was calculated over the duration of the total read time using the NetDAQ* methodology mentioned before):

Block Size   CPU Energy (mWh)   Disk Energy (mWh)
1 B          9658.77            1056.0
8 B          1336.18            192.32
1 KB         24.93              13.76
4 KB         24.05              13.56
8 KB         23.27              13.23
16 KB        22.46              12.98
32 KB        22.49              12.85
64 KB        22.50              12.96

 

Clearly there is a significant difference between 1-byte reads and large block reads, but the difference is far less pronounced among the larger block sizes. This is because the energy numbers are expressed in mWh and the workload completes in a short time with larger blocks. In general, it is clear that larger blocks are better for both power and performance.


Recommendation for sequential reads

It is recommended that applications use 8KB transfers for the best performance. This may mean buffering more data or consolidating multiple reads for processing in certain applications such as media playback. It also reduces the frequent spin-ups that can occur when data is read in small chunks; each spin-up takes significantly longer and consumes considerably more energy than the read itself. As evidenced in the chart above, the transfer rate may not show a noticeable difference beyond a certain threshold (8KB in this case); therefore, choosing a huge block size (in MBs) is not recommended.

Case Study: Multimedia Playback

A typical MP3 player application reads many small portions of the file from the hard disk (~2KB for the 4MB test file) when playing an MP3 music file. The hard disk is kept busy doing lots of small reads, which keeps the hard-disk average power above 1.25W (see below).

Following the recommendation to buffer more data or consolidate multiple reads for a more power-efficient hard disk, the MP3 file was copied (buffered) to a RAMDisk (memory). As seen in the chart below, the initial buffering of the MP3 file to memory kept the hard disk busy, but once the data is fully buffered and read from memory, the hard disk stays idle and its average power drops significantly, to 0.8W.
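A minimal sketch of the buffering idea is shown below (the helper name, block size, and malloc-based buffer are our assumptions; the whitepaper's measurements used a RAMDisk rather than this helper): the whole media file is read once in large blocks so that playback can proceed from memory while the disk idles.

#include <windows.h>
#include <stdlib.h>

// Sketch: read the whole (small) media file into memory in large blocks once,
// then let playback decode from the in-memory buffer so the disk can idle.
unsigned char *BufferWholeFile(const char *szPath, DWORD *pSize)
{
    HANDLE hFile = CreateFileA(szPath, GENERIC_READ, FILE_SHARE_READ, NULL,
                               OPEN_EXISTING,
                               FILE_ATTRIBUTE_NORMAL | FILE_FLAG_SEQUENTIAL_SCAN,
                               NULL);
    if (hFile == INVALID_HANDLE_VALUE) return NULL;

    DWORD size = GetFileSize(hFile, NULL);
    unsigned char *buf = (unsigned char *)malloc(size);
    DWORD total = 0, nRead = 0, block = 64 * 1024;

    // Read in 64KB blocks instead of the ~2KB reads the player would issue.
    while (total < size &&
           ReadFile(hFile, buf + total,
                    (size - total < block) ? (size - total) : block,
                    &nRead, NULL) && nRead > 0)
    {
        total += nRead;
    }

    CloseHandle(hFile);
    *pSize = total;
    return buf;   // caller decodes from memory and free()s when done
}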

Impact of File Fragmentation

The term “file fragmentation” refers to the case in which the contents of a file are not located in contiguous clusters on disk. When a file is created for the first time, the operating system can usually allocate contiguous clusters of storage, but as more content is added by incrementally appending data, the file may end up with fragmented clusters because neighboring clusters may no longer be available.

The following code shows an example of this. Two files are appended with data in 64KB blocks, one after the other. Since both files are appended in alternating calls and the operating system allocates clusters in order, both files end up fragmented on 64KB boundaries.

// Appending to the two files alternately causes their clusters to interleave,
// fragmenting both files on 64KB boundaries.
for (int i = 0; i < MAX_SIZE; i += 64*1024)
{
    AppendFile("Fragmented_File0", 64*1024, ...);
    AppendFile("Fragmented_File1", 64*1024, ...);
}

void AppendFile(char *szFilename, ...)
{
    ...
    // Open (or create) the file for appending.
    HANDLE hFile = CreateFile(szFilename,
                              FILE_APPEND_DATA,
                              FILE_SHARE_READ,
                              NULL, OPEN_ALWAYS,
                              FILE_ATTRIBUTE_NORMAL,
                              NULL);

    // Move to the end of the file and append the block.
    (void)SetFilePointer(hFile, 0, NULL, FILE_END);
    WriteFile(hFile, ...);
    CloseHandle(hFile);
    ...
}

 

File fragmentation causes serious problems both for performance and for the user experience. Let us first look at the performance impact.

Sequentially reading a fragmented file will take much longer than reading a defragmented file. This is due to the seek time and rotational latency penalty incurred while gathering data from non-contiguous clusters. This latency is greatly minimized in a defragmented file since the data is in contiguous clusters.

Below is an example of reading a 256MB file that was initially fragmented and later defragmented. Reading the fragmented file took more than twice as long and consumed significantly more total energy for the same task.


Recommendations to avoid fragmentation cost

Fragmentation caused when extending files can be avoided by pre-allocating large sequential files when they are created. The idea is to estimate the size of the file in advance and let the file system pre-allocate the largest possible contiguous run of clusters, so that the content of the file is held in contiguous locations. If you are using the .NET* Framework, the SetLength() method can be used as shown below.

FileStream fs = new FileStream(filename,FileMode.OpenOrCreate);
fs.SetLength(256*1024);
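
A native Win32 equivalent (a sketch under the assumption that the application can estimate the final size up front) is to extend the file to its expected length and then write into it sequentially:

#include <windows.h>

// Sketch: pre-allocate a file to its expected final size so the file system
// can reserve a contiguous run of clusters. If the estimate was too large,
// the file can be trimmed later with another SetEndOfFile() call.
BOOL PreallocateFile(HANDLE hFile, LONGLONG expectedBytes)
{
    LARGE_INTEGER size;
    size.QuadPart = expectedBytes;

    // Move the file pointer to the expected end and set the end-of-file marker.
    if (!SetFilePointerEx(hFile, size, NULL, FILE_BEGIN))
        return FALSE;
    return SetEndOfFile(hFile);
}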

 

The one-time cost of fixing a fragmented file is far outweighed by the cumulative energy penalty of accessing it repeatedly in its fragmented state. One user-level option is to run background defragmenters periodically. Microsoft Windows* also provides a few undocumented interfaces to detect whether a given file is fragmented and even to defragment it. NtFsControlFile() is the key function that can be used to determine the number of clusters in a given file, obtain the logical cluster numbers (LCNs) needed to locate and allocate large free runs, and move a fragmented file. Developers are advised to use these APIs at their own risk, since they are undocumented by the operating system.
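For illustration, similar extent information can also be obtained through DeviceIoControl() with FSCTL_GET_RETRIEVAL_POINTERS. The sketch below is our own, with simplified buffer sizing and error handling; it counts the extents of a file, where an extent count greater than one indicates fragmentation. A robust version would loop while the call reports ERROR_MORE_DATA.

#include <windows.h>
#include <winioctl.h>
#include <stdlib.h>

// Sketch: count the extents of a file via FSCTL_GET_RETRIEVAL_POINTERS.
// More than one extent means the file is fragmented.
DWORD CountExtents(const char *szPath)
{
    HANDLE hFile = CreateFileA(szPath, GENERIC_READ,
                               FILE_SHARE_READ | FILE_SHARE_WRITE, NULL,
                               OPEN_EXISTING, 0, NULL);
    if (hFile == INVALID_HANDLE_VALUE) return 0;

    STARTING_VCN_INPUT_BUFFER in = {0};       // start from virtual cluster 0
    DWORD bufSize = 64 * 1024;
    RETRIEVAL_POINTERS_BUFFER *out = (RETRIEVAL_POINTERS_BUFFER *)malloc(bufSize);
    DWORD bytes = 0, extents = 0;

    if (DeviceIoControl(hFile, FSCTL_GET_RETRIEVAL_POINTERS,
                        &in, sizeof(in), out, bufSize, &bytes, NULL))
    {
        extents = out->ExtentCount;
    }

    free(out);
    CloseHandle(hFile);
    return extents;
}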

Impact of Native Command Queuing on Random Reads

Native command queuing (NCQ) refers to a command-reordering technique that can often yield significant performance and power improvements. While the experiments were conducted on a SATA NCQ drive, similar benefits are expected with other reordering protocols as well. NCQ is often compared to an elevator algorithm: the drive reorders the queued commands into an optimal order to reduce seek and rotational latencies, using the current location of the read/write head to determine the most efficient path for executing the commands, much as an elevator services floor requests. Without reordering, the disk honors commands in the order received and may end up making suboptimal rotations that decrease performance and increase power consumption. An in-depth whitepaper on NCQ is available here.

To take advantage of NCQ, the read operations have to be queued. Synchronous I/O cannot take advantage of NCQ, since each I/O call blocks and the read/write operations reach the drive one at a time, in the order issued.

Let us look at simulating the effect of NCQ. The following code shows a snippet of choosing a random file and random offset:

int chunksize = 64*1024;

// Synchronous version: each read is issued and completed in order.
for (int i = 0; i < nFiles; ++i)
{
    HANDLE hFile = CreateRandomFile();
    ReadFile(hFile, ...);                     // blocking I/O
}

HANDLE CreateRandomFile()
{
    ...
    // Pick a random file and a random offset that leaves room for a full chunk.
    int n = GetRandomNumber(0, file_max - 1);
    HANDLE hFile = CreateFile(szFilename[n], GENERIC_READ, 0,
                              NULL, OPEN_EXISTING,
                              FILE_ATTRIBUTE_NORMAL, NULL);
    int filesize = GetFileSize(hFile, 0);
    int offset = GetRandomNumber(0, filesize - chunksize);
    SetFilePointer(hFile, offset, NULL, FILE_BEGIN);
    ...
    return hFile;
}

 

For our analysis, we chose a random set of files and read 64KB starting from a random offset within each file. The code above shows the case where synchronous I/O is used. In the sample below, the same read requests are queued up front, and the application then waits on an I/O completion port until all the read requests are complete:

int chunksize = 64*1024;

HANDLE gPort;

// Asynchronous version: queue all the reads first...
for (int i = 0; i < nFiles; ++i)
{
    HANDLE hFile = CreateRandomFile();
    QueueRead(hFile, gPort, ...);             // queued (asynchronous) I/O
}

// ...then wait on the I/O completion port until every read has completed.
while (nRead < nFiles)
{
    GetQueuedCompletionStatus(gPort, ...);
    ...
    ...
    ++nRead;
}
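
QueueRead() and the creation of gPort are not shown in the sample above; a plausible sketch (our assumption, and one that requires the file to have been opened with FILE_FLAG_OVERLAPPED) associates the handle with the completion port and issues an overlapped read at the chosen offset:

#include <windows.h>
#include <stdlib.h>

// gPort would be created once, for example:
//     gPort = CreateIoCompletionPort(INVALID_HANDLE_VALUE, NULL, 0, 0);

// Sketch of a QueueRead() helper: associate the file handle with the
// completion port and queue an overlapped read at the given offset.
BOOL QueueRead(HANDLE hFile, HANDLE gPort, DWORD offset, void *buf, DWORD bytes)
{
    if (CreateIoCompletionPort(hFile, gPort, (ULONG_PTR)hFile, 0) == NULL)
        return FALSE;

    OVERLAPPED *ov = (OVERLAPPED *)calloc(1, sizeof(OVERLAPPED));
    ov->Offset = offset;                      // read starts at the random offset

    // ERROR_IO_PENDING means the read was queued and will be reported
    // through the completion port by GetQueuedCompletionStatus().
    if (!ReadFile(hFile, buf, bytes, NULL, ov) &&
        GetLastError() != ERROR_IO_PENDING)
    {
        free(ov);
        return FALSE;
    }
    return TRUE;
}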

 

Note that in both cases the same set of data is read, since the random number generators are seeded the same way. The following shows that when NCQ was utilized, the total time was reduced by ~15%, with a similar reduction in total energy for the task.

Recommendation on Utilizing NCQ

Applications that perform random I/O or I/O operations across multiple files should use asynchronous I/O to take advantage of NCQ. Queue up all the read requests and use events or callbacks to determine when the read requests are complete. This method can potentially give a ~15% platform power saving and a ~2x performance improvement.


Disk I/O in Multithreaded Code

Let’s look at the effect of I/O on multithreaded code, and the best strategy for I/O when multiple threads require data from the disk. For this analysis, the authors developed an in-house bitmap-to-JPEG converter based on the IJG library that converts a set of BMPs to JPEGs. The following shows the serial version of the application. The ‘read’ box refers to reading the bitmap file in 64KB blocks (the same block size is used when writing the .JPG files out). Read/write time contributes ~41% and BMP-to-JPEG conversion contributes ~59% of the total run time. The output JPEG files are quite small compared to the input files (the BMP files were ~10MB each); hence, most of the read/write time is spent in reads.


One way to multithread on a dual-core machine is to spawn two threads: one thread takes ownership of the odd-numbered files and the other takes ownership of the even-numbered files, as shown below:

Another way to multithread would be to perform the buffer read/write in a single thread, and then handle the computationally intensive convert operation in multiple threads:

The final methodology uses queued I/O. An I/O completion port is created, all the read requests are queued on it, and multiple threads wait on the completion port for reads to complete; each thread then performs the conversion, while the output files are buffered and finally written out from a single thread (a sketch of the worker loop follows). This method load-balances the system and makes it easy to scale the number of threads.
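A minimal sketch of that worker loop is shown below (the thread structure, the ConvertAndHandOff() name, and the shutdown convention are illustrative assumptions, not the paper's actual sample code):

#include <windows.h>

extern HANDLE gPort;                          // shared I/O completion port

// Sketch: each worker blocks on the shared completion port; whichever thread
// dequeues a finished bitmap read performs the CPU-intensive conversion.
DWORD WINAPI ConversionWorker(LPVOID arg)
{
    DWORD bytes;
    ULONG_PTR key;
    OVERLAPPED *ov;

    for (;;)
    {
        // Wait for any queued bitmap read to complete.
        if (!GetQueuedCompletionStatus(gPort, &bytes, &key, &ov, INFINITE))
            break;
        if (ov == NULL)                       // NULL packet used as a shutdown signal
            break;

        // Convert the buffered BMP to JPEG in memory; the resulting JPEG is
        // handed to a single writer thread so only one thread touches the disk.
        // ConvertAndHandOff(key, bytes);     // hypothetical application hook
    }
    return 0;
}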

The following shows the performance of these multithreaded methodologies:

The multithreaded implementations with buffered I/O and queued I/O scale well (~1.52x-1.56x), whereas performance does not scale when the I/O operations are also performed in both threads. This poor performance is due to disk thrashing by the competing threads. When both threads are running and each tries to read 64KB chunks of data simultaneously, the hard drive incurs a severe seek/rotational penalty. Since requests from both threads must be serviced, the read head has to be repositioned after every buffer-block read. This causes suboptimal disk rotations, increases overall read time, and masks the performance benefit of multithreading. The energy chart below shows that queuing up the I/O provides the best power as well as performance:


Conclusion

If you have multiple threads competing simultaneously for disk I/O, one option is to queue the I/O calls and utilize NCQ; reordering can help order the requests from multiple threads optimally and improve performance. If multiple threads competing for the disk cause significant disk thrashing, consolidate all the read/write operations in a single thread; this reduces read/write head thrashing and also reduces frequent disk spin-ups.



About the Authors

Karthik Krishnan is an applications engineer in Intel's Software and Solutions Group. He joined Intel in 2001 and has been working with various software vendors to optimize their products on Intel® mobile and desktop platforms. Prior to joining Intel, he worked for Fluent Inc. as a software developer focused on parallel programming.

Jun De Vega is an Application Engineer in Intel’s Software and Solutions Group, working on application tuning and optimization for Intel® architecture. He supports enabling of ISV applications on Intel® Mobile and Desktop Platforms.

 

