Figure 1: Filesystem choice significantly influences performance
We have compared the performance of Windows* and Linux*-based CIFS* (Samba*) servers for digital media applications and found that the ext3*-based Linux server’s throughput was up to 53% lower than the Windows server’s, even though both ran on identical hardware (Figure 1). An XFS*-based Linux server had roughly the same performance as the Windows server. Our investigation shows that the difference lies in how the filesystems allocate blocks and handle sparse files. In particular, the Windows client assumes that the CIFS file server uses NTFS*, a filesystem that assumes files will be data-full (not sparse). This contradicts a fundamental assumption of ext3, that files are sparse, and leads to file fragmentation and degraded performance on ext3. Further, we have seen this behavior with a broad range of media applications, including iTunes*.
The test system consists of two PCs directly connected over Gigabit Ethernet. A client PC running Windows XP SP2* connects to a server PC running either Windows XP SP2 or Openfiler* 2.2 (Linux 2.6, based on rPath Linux). The client maps a dedicated data disk on the server as a network drive using CIFS and generates repeatable playback and recording of high-definition video streams. Our internally developed workload generator allows us to run the media streams unbounded: we record or replay the media as fast as the client is able and are not limited to the original media frame rate. This allows us to stress the server with an aggressive workload. High-definition video streams offer well-understood behavior: they are largely sequential, transfer large blocks of data, and are read-only (playback) or write-only (record).
Playback of these HD streams differs significantly depending on the server OS and filesystem. Playing back single or multiple simultaneous HD videos results in up to 53% lower data throughput when the server uses Openfiler/ext3 rather than Windows/NTFS. We have captured network traces, filesystem traces, and SATA traces to deeply characterize this performance difference. As would be expected, the network traces show little difference in the client-generated requests: the client is identical in both cases and it will generate the same requests regardless of the server. Server side traces of the filesystem and SATA interface reveal a significant difference in the operation of the two filesystems. In particular, Openfiler/ext3 exhibits a broad range of transfer sizes (Figure 2) and a significantly longer tail of service times than Windows/NTFS (Figure 3). These two observed trends point towards some kind of fragmentation on the Openfiler/ext3 disk (varied transfer sizes and long disk seeks) while the Windows/NTFS disk is largely contiguous and unfragmented (few transfer sizes and short, cached disk operations).
Figure 2: Openfiler uses a broad range of SATA sizes
Figure 3: WinXP frequently hits the SATA drive cache
Observing the recording of the HD streams sheds some light on why the Openfiler/ext3 disk becomes fragmented and the Windows/NTFS disk does not. On the Windows client network interface, we can observe a number of small (one-byte) writes to high offsets within the file at a long stride (128K bytes). These writes are generated by the client filesystem: they do not show up in application traces but do show up in network traces. These small writes to increasingly large offsets turn out to be bogus data which are used to provide hints to the server to pre-allocate data blocks on the disk. The data at these locations are eventually overwritten by large data writes of the actual HD recording. The behavior is illustrated conceptually in Figure 4 where the initial red blocks represent the pre-allocate writes and the later blue blocks represent the actual data written to the media file. If the server OS uses a filesystem like NTFS, these pre-allocate writes are used to allocate contiguous blocks on the disk. NTFS assumes that all files will be filled with data, so the small write to a high offset forces the filesystem to allocate a set of contiguous blocks up to that offset. When the valid data is written to NTFS, the blocks are already allocated, simplifying and speeding the data writes.
Figure 4: Pre-allocate writes are later overwritten with real data
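The write pattern described above can be modeled in a few lines of Python. This is only a sketch of what the network traces suggest, not a reproduction of the Windows client's logic: the 128 KB stride and the 15 MB look-ahead come from the observations in this article, while the 64 KB data chunk size, the exact hint offsets (multiples of the stride), and the way hints interleave with data are assumptions made for illustration.

```python
from typing import Iterator, Tuple

STRIDE = 128 * 1024            # hint stride reported in the text
LOOKAHEAD = 15 * 1024 * 1024   # hints were seen up to 15 MB ahead of the data
DATA_CHUNK = 64 * 1024         # ASSUMED size of each real data write


def recording_writes(total_bytes: int) -> Iterator[Tuple[int, int, str]]:
    """Yield (offset, length, kind) for each write of a modeled recording."""
    next_hint = STRIDE
    data_off = 0
    while data_off < total_bytes:
        # One-byte hints run ahead of the data, never past the end of the file.
        while next_hint <= data_off + LOOKAHEAD and next_hint < total_bytes:
            yield (next_hint, 1, "hint")
            next_hint += STRIDE
        length = min(DATA_CHUNK, total_bytes - data_off)
        yield (data_off, length, "data")
        data_off += length


writes = list(recording_writes(32 * 1024 * 1024))
# Index of the first real data write = number of hints that precede it.
hints_before_data = next(i for i, w in enumerate(writes) if w[2] == "data")
print(hints_before_data, "one-byte hints precede the first data write")
```

In this model the server receives a long run of one-byte writes to increasing offsets before any payload arrives, which is the situation the filesystem has to cope with.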
Most Linux filesystems make the opposite assumption: that files are sparsely populated with data. This results in poor behavior in response to the pre-allocate writes. The filesystem must assume that the data within the pre-allocate writes is valid: there is nothing in these writes to differentiate them from valid data. The filesystem (ext3) will therefore allocate blocks as they are written, in this case for one byte at a time at a 128 KB stride. In practice, we have observed pre-allocate writes to offsets 15 MB into the file before a single valid data write occurs. When the application on the client begins writing valid data, the file is already likely to be fragmented because of these pre-allocate writes. This fragmentation turns what should be a best-case workload for most modern hard disks (large, sequential reads) into a mess of low-performance disk seeks. Anecdotally, we have observed that one 3.5 GB video file used in our testing was fragmented into 49,986 extents. A simple copy to a new location defragmented this file to 28 extents (26 extents is ideal).
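A toy model helps show why allocate-on-write interacts badly with the hints. The sketch below is deliberately simplistic and is not ext3's allocator: it assumes a single file, 4 KB blocks, and a disk that hands out the next free physical block whenever a logical block is first written. The function name `count_extents` and the 1 MB file size are made up for the example.

```python
from typing import Iterable

BLOCK = 4096                      # assumed filesystem block size
STRIDE_BLOCKS = (128 * 1024) // BLOCK   # 128 KB stride = 32 blocks


def count_extents(write_order: Iterable[int]) -> int:
    """Count extents when each logical block gets the next free physical
    block at the moment it is first written (toy allocate-on-write)."""
    physical = {}
    next_free = 0
    for logical in write_order:
        if logical not in physical:
            physical[logical] = next_free
            next_free += 1
    extents = 0
    prev = None
    for logical in sorted(physical):
        # A new extent starts whenever logical or physical contiguity breaks.
        if prev is None or logical != prev[0] + 1 or physical[logical] != prev[1] + 1:
            extents += 1
        prev = (logical, physical[logical])
    return extents


file_blocks = 256                                   # a 1 MB file
data = list(range(file_blocks))                     # sequential payload
hints = list(range(STRIDE_BLOCKS, file_blocks, STRIDE_BLOCKS))  # blocks hit by hints

sequential_only = count_extents(data)
hints_then_data = count_extents(hints + data)
print(sequential_only, hints_then_data)
```

In this model a purely sequential write yields one extent, while writing the hint blocks first splits the same file into many. A real allocator is far more sophisticated, so the numbers are illustrative only, but the direction of the effect matches the extent counts reported above.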
This behavior is not limited to our test case: a simple test using iTunes to rip a CD to a music library on a NAS device shows the same one-byte pre-allocate writes. We expect to see this behavior any time a Windows client is appending to a file of unknown length. Additionally, the Windows client behavior does not change in response to the Samba-advertised "fstype" parameter: the client issues pre-allocate writes whether the server announces itself as "NTFS" or "Samba." We tried to force the Linux server to allocate data-full files with the Samba “strict allocate” flag but observed no change in ext3 performance. (UPDATE 9/12/07: newer versions of Samba, 3.0.20 or later, zero-fill files when “strict allocate” is set. This produces the desired NTFS-like behavior, which works well with Windows clients and brings media serving performance on par with XFS.)
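For reference, the “strict allocate” option discussed above is set per share in smb.conf. The fragment below is a minimal sketch: the share name `media` and the path `/mnt/data` are hypothetical placeholders, and the effect depends on the Samba version as noted in the update.

```ini
# Hypothetical share definition; only "strict allocate" is the point here.
[media]
    path = /mnt/data
    read only = no
    strict allocate = yes
```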
We have seen that media files recorded from a Windows client and streamed/appended to an ext3-based Linux file server show lower delivered throughput (when played back unbounded) as compared to media recorded to an NTFS-based Windows file server. We attribute this performance delta to fragmentation within ext3 as a result of pre-allocate writes issued by the Windows client. When using XFS as a base Linux filesystem, we see similar performance to Windows (likely from XFS’ increased allocation size and delayed allocation). Additionally, if we defragment the media files on ext3 by copying them to a new location on the disk, playback performance improves to expected levels.
Ideally, the client would not assume the filesystem behavior on the server and would not issue these pre-allocate writes. This would allow the server to perform optimizations appropriate to the underlying filesystem. Modifying the Windows CIFS client, however, is not realistic. Modifying the server to discard the one-byte writes, or to delay them until they are overwritten by the real data, may reduce file fragmentation, but these solutions create data integrity issues. Another possibility would be to change ext3 so that it assumes data-full files.
More realistically, the server could recognize the small, high-offset writes and use them to implement block pre-allocation on the disk. When the server identifies writes such as those observed here as pre-allocations, it would translate them into a pre-allocate mechanism specific to the underlying filesystem. If this were implemented at the Samba or VFS layer, the pre-allocate writes could translate into a larger number of smaller-stride one-byte writes (for example, one per 4 KB disk block) to allow ext3 to properly allocate contiguous blocks on the disk. If the underlying filesystem has a specific pre-allocation mechanism (as has been proposed for ext4), the one-byte writes could be translated directly into this filesystem-native pre-allocation.
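The translation proposed here might look roughly like the following sketch. It is not Samba or VFS code: the function names, the detection heuristic (a one-byte write beyond the current end of file), and the 4 KB block size are all assumptions chosen for illustration.

```python
import os
from typing import List

BLOCK_SIZE = 4096  # ASSUMED filesystem block size


def is_preallocate_hint(offset: int, length: int, file_size: int) -> bool:
    """Heuristic (assumption): a one-byte write landing past the current EOF."""
    return length == 1 and offset > file_size


def fill_offsets(hint_offset: int, file_size: int,
                 block_size: int = BLOCK_SIZE) -> List[int]:
    """Offsets to touch so every block between EOF and the hint is written:
    one per block that lies wholly past EOF, then the hint offset itself."""
    first_block = -(-file_size // block_size)   # ceiling division
    hint_block = hint_offset // block_size
    offsets = [b * block_size for b in range(first_block, hint_block)]
    offsets.append(hint_offset)
    return offsets


def apply_hint(fd: int, hint_offset: int, hint_byte: bytes) -> None:
    """Replace one long-stride hint with block-stride one-byte writes."""
    file_size = os.fstat(fd).st_size
    for off in fill_offsets(hint_offset, file_size)[:-1]:
        os.pwrite(fd, b"\0", off)      # zero filler, beyond the current EOF
    os.pwrite(fd, hint_byte, hint_offset)  # the client's byte, unchanged


plan = fill_offsets(hint_offset=128 * 1024, file_size=0)
print(len(plan), plan[:3], plan[-1])
```

Because the filler bytes are zeros written only beyond the current end of file, a reader sees the same contents it would have seen from a hole, so this variant avoids the data integrity concern raised for discarding or delaying the hints. Where a native mechanism is available, a single pre-allocation call (for example `os.posix_fallocate` in Python on POSIX systems) could stand in for the filler loop; whether that produces contiguous blocks depends on the filesystem.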
Client PC:
- Intel® Pentium® 4 processor 3.6 GHz
- Intel® D915GEV desktop board, 800 MHz system bus
- Integrated Marvell* Yukon* 1GbE
- 2x512 MB DDR2-533 4-4-4-12
- 1x Hitachi* Deskstar* 250 GB, 7,200 RPM, SATA (NTFS)
- Windows XP SP2
Server PC:
- Intel® Pentium® D processor 3.2 GHz 2x1 MB L2
- Intel® 955 chipset, 800 MHz system bus
- Integrated Intel® PRO/1000 Gigabit Server Adapter
- 2x512 MB DDR2-667 5-5-5-15
- WD* Raptor*, 74 GB, 10,000 RPM, SATA (OS drive, NTFS or ext3)
- Hitachi Deskstar 250 GB, 7,200 RPM, SATA (data drive, NTFS, ext3, or XFS)
- Windows XP SP2 or Openfiler 2.2 (Linux kernel 2.6.19, Samba 3.0.10)
*Other names and brands may be claimed as the property of others.