Performance comparison of the Cluster File Systems at the Intel CRT-DC

Executive summary

The Intel Customer Response Team Data Center (CRT-DC) located in DuPont, Washington operates a benchmarking data center with more than 450 compute nodes. The cluster, known as Endeavour, is rebuilt on a regular basis with the latest hardware and has been listed in the Top500 since 2006 (best position #68). To satisfy the storage needs, four cluster file systems are currently in use: two are commercial solutions from Panasas and DDN, and two have been built in house from commercial off-the-shelf (COTS) hardware. This paper compares the performance of these systems in multi-node tests using up to 64 clients.
 

Acknowledgments

The author wants to thank Andrey Naraikin for his repeated and valuable input as well as Bob Hayes and Mallick Arigapudi for their help in collecting data.
 

Comments on performance tests done

As both the file systems and the compute nodes are in heavy day-to-day use, we were limited in time and resource usage. Therefore not all tests could be repeated as often as we would have liked, and some tests could not be done at all. Comments and feedback are very much appreciated.

All tests were done on an otherwise quiet cluster. As I/O and MPI communication go over the same backbones, real-life performance will be impacted whenever I/O and communication interfere. In day-to-day operation there will also be contention between users running multiple jobs at the same time. This is especially true for the 1GE network (blocking factor about 1:2) and the InfiniBand OpenSM resource.

The tests used were:
 

  1. writing/reading large files with dd, using /dev/zero as data source and sink: this test measures throughput only but is considered very reliable even with high node counts, as we could ensure all clients started the test at the same time.
  2. bonnie: standard file system test (http://code.google.com/p/bonnie-64/) available on all Unix* platforms. Tests both block and char based I/O.
  3. Iozone: standard file system test (www.iozone.org) available on all Unix platforms. Runs a series of subtests even more complete than bonnie. It also includes a cluster mode, although that mode only runs char-based I/O.

Background on Lustre

Lustre is a high performance Cluster File System. In contrast to the more widely used SMB (Server Message Block, also known as Common Internet File System aka CIFS - the standard Windows* network file system) or NFS (Network File System - standard on UNIX systems) protocols, Lustre differentiates between storage servers and administrative systems (responsible for Metadata such as file names). This separation makes it far easier to scale both the bandwidth and storage capacity in a file system, as opposed to keeping all information on a single system. The basic layout of the Lustre systems used at the CRT-DC can be summarized as follows:
 

All Metadata information (for the Lustre experts: both the MDT and the MDS reside on the same machine) is stored on a server referred to as the MDT. We found the load and memory consumption on this system to be very low. In addition, a number of servers provide the Object Storage Targets (OSTs) that store the actual data, using either hard drives local to the OSTs or drives attached via iSCSI.

When an application wants to write a file, it informs the MDT. The user can choose how many OSTs should be used for each and every file. Based on system- and user-dependent configuration, the MDT decides where to place the data. It then informs the Lustre client software which OSTs to use. From that point on, communication runs mostly between the compute node and the OSTs; the MDT is therefore NOT subject to heavy load. As communication between compute nodes and the OSTs is done via InfiniBand, the data exchange has both low latency and high bandwidth.
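
This separation is directly visible from any client. As a minimal illustration (the mount point is a placeholder, not necessarily a path used at the CRT-DC):

    # list the MDT and all OSTs of a mounted Lustre file system,
    # together with their individual usage
    lfs df -h /mnt/lfs3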

A note on striping: some applications, especially those with only one thread doing I/O, will benefit from striping the information over several OSTs. On the other hand, applications writing to multiple files in parallel benefit from minimizing collisions, and for them striping might not be beneficial at all.

There is no general rule as to which striping strategy is best - in each case users should find an optimal solution (see below for some tests on this subject).
 

Background on Panasas ActiveScale

The Panasas ActiveScale Storage Cluster* is essentially a NAS (Network Attached Storage) system that combines reliability and scalability by splitting all information over multiple devices. The basic unit is a shelf holding up to 11 blades; multiple shelves automatically combine into a cluster providing a unified interface to the user.

The blades come in 2 flavors: DirectorBlades keep control of the cluster and store Metadata information (filenames, permissions, ...). The actual data is stored on the StorageBlades.
 

A file operation from any client starts with a request to a DirectorBlade. In a typical storage cluster more than one director exists, both to prevent bottlenecks during parallel access and to maintain redundancy should any single blade fail.

Each file is then divided into 64KB chunks and distributed across the available StorageBlades. From this point most of the communication takes place directly between the client and the StorageBlades, parallelizing the request and preventing overload of the DirectorBlade.

The Panasas Storage Cluster automatically stores files in a redundant way. Once a blade failure is detected, the blade is removed from the cluster, data redundancy is automatically rebuilt by recreating the information on different disks and the administrator is informed.

Because of these capabilities, the Panasas system at CRT-DC is used as safe data storage.
Clients are connected only via a 1GE connection, while each Panasas shelf has a dedicated 10GE connection to the core Ethernet switch. A single client is therefore limited by the network to a maximum of about 120 MByte/s.
 

Description of Hard- and Software

Panasas storage system
56TB capacity, standard long term storage system

7 Panasas shelves, each with:
1 director blade "DirectorBlade 100a"
10 storage blades HybridBlade 1032-4GB
1 10GE link to Ethernet backbone
Firmware: 3.5.1a

Description of the file systems used

CRT-DC currently employs 4 cluster file systems.

  • Panasas: 56TB storage, connected to the clients via Gb Ethernet; 7 shelves, each with 1 "DirectorBlade 100a" and 10 HybridBlade 1032-4GB; labeled panfs in all tests
  • DDN-based Lustre lfs3: 35TB storage, connected via QDR InfiniBand; 2 DDN S2A 9900 controllers with 160 SAS drives; labeled lfs3 in all tests
  • Homegrown Lustre lfs4: 26TB storage, connected via QDR InfiniBand; based on 8 Intel® SSR212MC2BR server systems, each using 12 SAS disks; labeled lfs4 in all tests
  • Homegrown Lustre lfs5: 3TB storage, connected via InfiniBand; based on 8 Supermicro SC213A-R900LPB server systems, each using 16 Intel SSDSA2SH032G1GN SSDs; labeled lfs5 in all tests

lfs3 details

35TB capacity, standard high performance cluster file system
1 MDT (Meta Data Server)
Intel SR1560SF server
Motherboard Intel S5400SF
Processors 2 x Xeon 2.8GHz E5462
Memory 16 x 2GB PC2-5300 667 MHz M395T5750EZ4-CE66
Bios S5400.86B.06.00.0032.070620091931
OS hard drive 1 x Seagate ST3250620NS
MDT 2 x Seagate ST3500320NS RAID0
IB adapter 1 MHJH29-XTC, CA type: MT26428 Dual 4X QDR 40Gb/s
8 OSS (Storage Server)
Intel SR1560SF server
Motherboard Intel S5400SF
Processors 2 x Xeon 2.8GHz E5462
Memory 16 x 1GB PC2-5300 667 MHz M395T2953CZ4-CE601
OS hard drive 1 x Seagate ST3250620NS
OST (Targets) 2 LUNs from DDN backend mounted via SRP/IB
IB adapter 1 MHJH29-XTC, CA type: MT26428 Dual 4X QDR 40Gb/s
Backend DDN storage
RAID Controller 2 x DDN S2A 9900
16 LUNs, each consisting of 10 Hitachi 15K300 SAS drives
Redhat 5.2 base system
Kernel 2.6.18-53.1.14.el5
Lustre: 1.6.5.1
OFED: 1.3.1

lfs4

26TB capacity, used for projects
1 MDT (Meta Data Server)
Intel SR1560SF server
Motherboard Intel S5400SF
Processors 2 x Xeon 2.8GHz E5462
Memory 16 x 2GB PC2-5300 667 MHz EBE21FD4AHWN-6E-E
Bios S5400.86B.06.00.0032.070620091931
OS hard drive 1 x Seagate ST9100821AS
MDT 2 x Seagate ST3500320NS Software RAID0 using mdadm
IB adapter 1 MHJH29-XTC, CA type: MT26428 Dual 4X QDR 40Gb/s
8 OSS (Storage Server)
Intel SSR212MC2BR server
Motherboard Intel S5000PSL
Processors 2 x Xeon 2.8GHz E5440
Memory 8 x 2GB PC2-5300 667 MHz Smart SG2567FBD12852HCDC
Bios S5000.86B.10.00.0094.101320081858 Release Date: 10/13/2008
OS hard drive 1 x ST9100821AS
OST (Targets) 12 x ST3300655SS in 4 x 3-disk (826GB) RAID0 sets using mdadm
IB adapter 1 MHJH29-XTC, CA type: MT26428 Dual 4X QDR 40Gb/s
Redhat 5.2 base system
Kernel 2.6.18-92.1.10.el5
Lustre: 1.6.5.1
OFED: 1.3.1

lfs5

3 TB capacity, used for projects; due to its small size a strict one-user-at-a-time policy applies
1 MDT (Meta Data Server)
Intel SR1560SF server
Motherboard Intel S5400SF
Processors 2 x Xeon 2.8GHz E5462
Memory 16 x 2GB PC2-5300 667 MHz M395T5750EZ4-CE66
Bios S5400.86B.06.00.0032.070620091931
Bios Release Date: 07/06/2009
OS hard drive 1 x Seagate ST3250620NS
MDT 2 x Seagate ST3500320NS RAID0
IB adapter 1 MHJH29-XTC, CA type: MT26428 Dual 4X QDR 40Gb/s
8 OSS (Storage Server)
Supermicro SC213A-R900LPB
Motherboard Supermicro X8DTN+ Bios Release Date: 03/05/2009
Processors 2 x Intel® Xeon® Processor X5560 (Nehalem)
Memory 12 x 2GB DDR3 1333 MHz
OS hard drive 1 x ST3750640NS
OST (Targets) 16 x Intel SSDSA2SH032G1GN
Raid Controller 2 x AOC-USASLP-S8i SuperMicro RAID controller
IB adapter 1 x MHJH29-XTC , CA type: MT26428 Dual 4X QDR 40Gb/s
Redhat 5.4 base system
Kernel 2.6.18-128.7.1.el5_lustre_1.8.1.1 (source as provided by SUN; in essence it is a RH 5.3 kernel modified by Lustre 1.8.1.1 patches; configured and compiled according to CRT-DC standards)
Lustre: 1.8.1.1
Installed Lustre rpms:
lustre-modules-1.8.1.1-2.6.18_128.7.1.el5_lustre_1.8.1.1
lustre-1.8.1.1-2.6.18_128.7.1.el5_lustre_1.8.1.1
lustre-ldiskfs-3.0.9-2.6.18_128.7.1.el5_lustre_1.8.1.1
lustre-tests-1.8.1.1-2.6.18_128.7.1.el5_lustre_1.8.1.1
e2fsprogs-libs-1.39-23.el5
e2fsprogs-1.41.6.sun1-0redhat
OFED: 1.4.2

Clients

Intel SR1600UR 1U Server with Intel S5520UR main board
Intel® Xeon® processor X5670, B0 or B1 step (code named Westmere); 2.93 GHz, 6.4 GT/s QPI, 1333 MHz memory, 95 W; 32KB L1/256KB L2/12MB L3 cache
24 GB memory organized as 6*4GB 1333MHz Reg ECC DDR3
BIOS: Rev 17.18, Ver S5500.86B.01.00.0045.112320091052, 23 Nov 2009
BIOS Configuration: default except Turbo Enabled, EIST Enabled, SMT disabled
QDR Infiniband MHQH29-XTC memfree, PCI-Express x8 Gen 2, Firmware 2.7.0
Harddisk: Seagate Cheetah ST3400755SS, 400 GB SAS HDD 10kRPM
Redhat 5.2 base system
Kernel 2.6.18-128.1.14.el5 (Redhat 5.3 kernel), manually patched to include security updates
OFED 1.3.1
Lustre 1.6.4.3 patchless client

Influence of Stripesize and Stripecount

Lustre allows a user - without relying on root privileges - to specify how many OSTs are used for any given file via the parameter Stripecount. On the SSD Lustre system we deployed 32 targets, each consisting of 4 SSD devices. The second parameter that can easily be adjusted is the Stripesize; one might compare the Stripesize to the blocksize of a local file system.
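
A minimal sketch of how a user sets and inspects these parameters from a client node, using the lfs utility syntax of the Lustre 1.6/1.8 releases deployed here (the directory name is a placeholder):

    # all files created below this directory get a Stripesize of 1MB
    # spread over 4 OSTs (-s Stripesize, -c Stripecount)
    lfs setstripe -s 1m -c 4 /mnt/lfs5/myproject

    # show the layout actually in effect for a directory or file
    lfs getstripe /mnt/lfs5/myproject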
 

We ran a series of bonnie tests on a single node, varying the Stripesize from 64k to 4M and the number of OSTs involved from 1 to 32. A Stripesize of 0 means default behavior and is identical to a Stripesize of 1MB. As expected, the performance varies and yields different optimal values for the 3 Lustre systems.

Overall, we found that Stripesize only has a significant effect if it is set to 64k. Using 512k or anything higher gives consistent results - differences from changing the Stripesize are of the same order as run-to-run variations. One conclusion from these measurements was to set the default value for Stripesize to 0.

On lfs3 both read and write performance increased until the Stripecount reached 3 to 4; after 4 a steady decline was observed. lfs4 showed a similar behavior, but needed a count of 12 to reach its peak. The main difference between the two systems is the speed of a single OST. As more OSTs are necessary, the overhead of managing 12 vs. 4 targets is most likely the reason for lfs4 being somewhat slower than lfs3.

lfs5 (the SSD solution) also reaches its maximum write speed at 4 targets. In contrast, its read speed already peaks with a single OST.

From these measurements we found it reasonable to set the default values to:
 

lfs3: default Stripecount 4, default Stripesize 0
lfs4: default Stripecount 12, default Stripesize 0
lfs5: default Stripecount 4, default Stripesize 0

Note: Panasas does not offer any comparable parameter.
 

Multi-Node, single thread tests

dd writing/reading a 50GB file on 1 to 64 nodes

We measured the performance, started from a head node, on 1 to 64 clients running at the same time. On each node a separate directory was created, the correct Lustre parameters (Stripesize 1MB; Stripecount 4 on lfs3/lfs5, 12 on lfs4) were enforced, and dd was executed.
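
A minimal sketch of how such a run can be driven from the head node, assuming passwordless ssh and a $NODES list of client host names (the actual CRT-DC launch scripts are not reproduced here; the dd commands themselves are listed below):

    # prepare one directory per node and enforce the striping parameters
    for n in $NODES; do
        mkdir -p /mnt/lfs3/ddtest/$n
        lfs setstripe -s 1m -c 4 /mnt/lfs3/ddtest/$n
    done

    # start the write pass on all nodes at (nearly) the same time and wait;
    # the read pass (second dd command below) is launched the same way
    for n in $NODES; do
        ssh $n "dd if=/dev/zero of=/mnt/lfs3/ddtest/$n/XX1 bs=1024000 count=50000" &
    done
    wait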
 

  • Command started on all nodes in parallel:
    dd if=/dev/zero of=$DIR/XX1 bs=1024000 count=50000
    dd of=/dev/zero if=$DIR/XX1 bs=1024000 count=50000
  • Results:
    Averaging over 3 repeats, and aggregating the read and write speeds reported by ALL nodes in the test, we got the following results:
     

  • Aggregated peak values in GB/s were measured as:
    file system   write   read
    lfs3           5.7     3.6
    lfs4           4.6     5.0
    lfs5          13.3    17.7
    panfs          2.8     2.8
  • On a single target (only one node doing I/O) all 3 Lustre systems give comparable results; here we are clearly limited by the client. Only when 8 or more clients are used does the SSD-based lfs5 show its tremendous scaling capacity. It is interesting to see that lfs4 beats the commercial lfs3 solution in read scalability, probably because lfs4 has twice the number of OSTs (32 vs. 16). The high number of OSTs might also be the reason that the aggregated read performance of lfs4/lfs5 exceeds their write performance at high node counts.
  • All systems show a peak in performance. If more nodes are added than the system can sustain, both per-node and aggregated values break down considerably. Using more than 64 nodes we would expect this behavior to worsen.
     



Bonnie tests on 1 to 64 nodes

We measured the performance, started from a head node, on 1 to 64 clients running at the same time. On each node a separate directory was created, the correct Lustre parameters (Stripesize 1MB; Stripecount 4 on lfs3/lfs5, 12 on lfs4) were enforced, and bonnie was executed.
 

  • Command used (writing 16 volumes at 2GB each for 32GB total):
    bonnie -d $DIR -s 2000 -v 16 -y
  • With high node counts we found that the results of both the "writing intelligently" and "reading intelligently" subtests vary by a larger amount than can be statistically explained. A typical result:
     

    In this case most nodes gave results around 400 MB/s, but some nodes reached values as high as 1300 MB/s. A likely explanation is interference between the tests running independently on the different nodes.

    On some nodes the first part of the test, which uses character-driven I/O, finishes first. As there is no synchronization between the processes, the next subtest, doing fread/fwrite I/O, gets a head start. That subtest also creates far more load and pushes the other jobs even further back (contention suddenly increases considerably).

    We did not have time to prove this theory. In any case, the result is a grossly overestimated aggregate bandwidth. To compensate, the author decided to remove all questionable data points: starting from the lowest 4 measurements, subsequently higher points were added and new values for average and standard deviation were computed; as long as all points remained within the range of average +/- standard deviation, the next higher data point was accepted (a sketch of this filter is given after the charts below).

    Even so the values for 64 nodes are likely exaggerated.
  • Aggregated peak values in GB/s were measured as:
     
    file system   write   read
    lfs3           6.7     4.0
    lfs4           4.6     4.8
    lfs5          11.5    21.8
    panfs          2.9     2.6

    Averaging over 3 repeats, and aggregating the read and write speeds reported by ALL nodes in the test, we got the following results:
     

    aggregated bonnie perf in MB/s, averaged over repeats



    per node bonnie perf in MB/s, averaged over repeats
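
The outlier filter described above can be interpreted along the lines of the following sketch (one per-node bandwidth value per line in results.txt; the file name and the exact acceptance rule are assumptions, not the original analysis script):

    sort -n results.txt | awk '
        { v[NR] = $1 }
        END {
            n = (NR < 4) ? NR : 4         # always keep the four lowest points
            while (n < NR) {
                m = n + 1                 # tentatively add the next higher point
                s = 0; for (i = 1; i <= m; i++) s += v[i]
                avg = s / m
                q = 0; for (i = 1; i <= m; i++) q += (v[i] - avg) ^ 2
                sd = sqrt(q / m)
                ok = 1                    # accept only if all points stay within avg +/- sd
                for (i = 1; i <= m; i++)
                    if (v[i] < avg - sd || v[i] > avg + sd) ok = 0
                if (!ok) break
                n = m
            }
            for (i = 1; i <= n; i++) print v[i]   # points kept for aggregation
        }'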





Iozone benchmarking on 1 to 64 nodes

We measured the performance, started from a head node, on 1 to 64 clients running at the same time. On each node a separate directory was created, the correct Lustre parameters (Stripesize 1MB; Stripecount 4 on lfs3/lfs5, 12 on lfs4) were enforced, and Iozone was executed.
 

  • Command used for Lustre files systems:
    iozone -s 25000000 -r 1024 -I -c -o
  • Command used for the Panasas file system (-I option caused Iozone to crash):
    iozone -s 25000000 -r 1024 -c -o
  • Time constraints forced us to limit the file size to 25G. This was at least in part compensated for by the -I switch, which forces Iozone to bypass file system caching.
  • The block-oriented subtests (fread, freread, fwrite, frewrite) proved to be even more influenced by mutual interference than with bonnie (note: the other subtests did not exhibit this randomness).
     

    The methodology described in the previous chapter (see "Bonnie tests on 1 to 64 nodes") did not give convincing results, so the author chose an even more drastic approach: the aggregated result was calculated by multiplying the lowest "per node" value by the number of nodes. Even so it is very likely that these results exaggerate the system's capabilities. Corrected results are marked in RED in the table. Even then the results are hard to believe and do not match the dd and bonnie tests; they have been kept purely for the sake of completeness.
  • Due to time constraints only one measurement could be taken, and no data for 64 nodes on Panasas was generated. Aggregating the read and write speeds reported by ALL nodes in the test, we got the following results. Results in RED are questionable:
     

    aggregated iozone perf in MB/s


    per node iozone perf in MB/s

Multi threaded single node tests

The tests described so far in this publication had one process per node doing I/O. Real-life I/O-intensive applications often utilize multiple I/O streams. As each client system is equipped with 12 physical cores, test runs with 1 to 12 processes or threads per node in parallel were included.

dd writing/reading a 50GB file with 1 to 12 processes

We measured the performance, started from a head node, with up to 12 dd processes running at the same time. For each process a separate directory was created, the correct Lustre parameters (Stripesize 1MB; Stripecount 4 on lfs3/lfs5) were enforced, and dd was executed.
 

  • Command started in parallel:
    dd if=/dev/zero of=$DIR/XX1 bs=1024000 count=50000
    dd of=/dev/zero if=$DIR/XX1 bs=1024000 count=50000
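
A minimal sketch of the per-node launch pattern for P parallel dd processes (directory names are placeholders); the shell's wait builtin is assumed here to act as the barrier between the write and the read pass:

    P=12                                  # number of parallel dd processes
    for i in $(seq 1 $P); do
        mkdir -p /mnt/lfs5/mproc/$i
        lfs setstripe -s 1m -c 4 /mnt/lfs5/mproc/$i
        dd if=/dev/zero of=/mnt/lfs5/mproc/$i/XX1 bs=1024000 count=50000 &
    done
    wait                                  # all writers must finish before the read pass
    for i in $(seq 1 $P); do
        dd of=/dev/zero if=/mnt/lfs5/mproc/$i/XX1 bs=1024000 count=50000 &
    done
    wait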

Bonnie tests with 1 to 12 processes

We measured the performance, started from a head node, with up to 12 bonnie processes running at the same time. For each process a separate directory was created, the correct Lustre parameters (Stripesize 1MB; Stripecount 4 on lfs3/lfs5) were enforced, and bonnie was executed.
 

  • Command started in parallel (writing 16 volumes at 2GB each for 32GB total):
    bonnie -d $DIR -s 2000 -v 16 -y

Iozone tests with 1 to 12 processes

Iozone tests using the same methodology again gave very unsatisfying results for the block-based subtests. We therefore applied a different approach, using Iozone's throughput testing mode. This mode includes synchronization mechanisms that prevent any subtest from starting before all other threads have completed the previous step.
Unfortunately throughput testing does not provide any results for the fread/fwrite calls. The Lustre parameters were set on the parent directory (Stripesize 1MB; Stripecount 4 on lfs3/lfs5) and Iozone was executed.

  • Command used (XX denoting number of threads):
    iozone -s 25000000 -r 1024 -I -c -o -l XX -u XX -t XX

Results in MB/s: aggregated over all threads, interesting values marked red

  • The heavy-duty dd tests showed peak performance with 4 threads. Write performance reached around 3GB/s, read performance up to 1.7GB/s. The bonnie tests confirmed the write results, but read was lower, while Iozone gave exactly the opposite result. One has to note, though, that Iozone's write tests do not use the fwrite call and are therefore radically different from bonnie and dd.
  • bonnie char based tests gave VERY similar results for all 3 Lustre FS.
  • Iozone tests gave mixed results. While lfs3 dominated write tests, lfs5 pulled ahead in read tests peaking at 3.2GB/s.
  • In most cases a single I/O process will drive the Panasas FS to the limit defined by the 1GE interface.
  • In many cases performance peaked at 4 parallel processes. This could point to a limit in the Lustre transport layer, but we currently lack time to investigate it further. This aspect will need to be clarified in a future project.
  • No process pinning was used. As this may influence performance, investigations are planned if and when time allows.

Note: we found the maximum bandwidth available over InfiniBand to be 3.3GB per second.
 

Multi threaded multi node tests

To complete our analyses we also tested the impact of multiple nodes executing multiple parallel threads. lfs4 was not available at that time, as it was in use for a different project, and even a single-threaded benchmark can saturate the 1GE interface used for Panasas. These tests were therefore only done on lfs3 and lfs5. Each client is equipped with 12 physical cores; therefore runs with 12 processes or threads in parallel per node were completed. The number of nodes was varied from 1 to 64.
 

dd writing/reading 12 files of 5GB each in parallel on 1-64 nodes

We measured the performance, started from a head node, with 12 dd processes running on every node at the same time. For each process a separate directory was created, the correct Lustre parameters (Stripesize 1MB; Stripecount 4 on lfs3/lfs5) were enforced, and dd was executed.
 

  • Command started in parallel:
    dd if=/dev/zero of=$DIR/XX1 bs=1024000 count=5000
    # BARRIER - wait until all processes on this node have finished
    dd of=/dev/zero if=$DIR/XX1 bs=1024000 count=5000

Bonnie tests with 12 instances in parallel on 1-64 nodes each

We measured the performance, started from a head node, with 12 bonnie processes running on every node at the same time. For each process a separate directory was created, the correct Lustre parameters (Stripesize 1MB; Stripecount 4 on lfs3/lfs5) were enforced, and bonnie was executed.
 

  • Command started in parallel:
    bonnie -d $DIR -s 200 -v 16 -y

Iozone tests with 12 threads/client on 1-64 nodes

Iozone tests using the same methodology again gave very unsatisfying results for the block-based subtests. We therefore applied a different approach, using Iozone's throughput testing mode. This mode includes synchronization mechanisms that prevent any subtest from starting before all other threads have completed the previous step.

Unfortunately throughput testing does not provide any results for the fread/fwrite calls. The test is also limited to 256 threads and therefore could not be run for 32 and 64 nodes. With 12 and 16 nodes it did not terminate properly, although the collected data were usable. From the various output fields, "Children see throughput" was chosen as the single representative result. The Lustre parameters were set on the parent directory (Stripesize 1MB; Stripecount 4 on lfs3/lfs5) and Iozone was executed.
 

  • Command used (XX denoting number of threads):
    iozone -s 2500000 -r 1024 -I -c -o -l XX -u XX -t XX

Results in MB/s: aggregated over all processes
 

  • Overall this test confirmed the results gained from the multi-node, single-thread tests. The individual measurements (that is, with up to 768 bonnie or dd binaries running in parallel) varied far more due to the missing synchronization.
  • The following table compares maximum performance for 2 tests, listing both the result and how many nodes were used.
     

Using 12 processes/node, the peak values for the buffered I/O tests were lower and were reached at smaller node counts.

These scaling results - specifically the sharp drops after reaching a peak - are not matched by reports from engineers running real-life applications. We assume this is caused both by the access patterns and by the test method used.

The dd and bonnie results shown both reflect streaming I/O on huge files using buffered I/O. In many cases the char-based Iozone tests might be closer to real applications. Those results showed a far smoother scaling, but do not reach the aggregated bandwidth possible on these systems (probably caused by limitations of the I/O method, the MDT, and the number of nodes in use).

Due to the variance in the results we used the lowest data point in each data set, averaged it over usually 3 measurements, and multiplied it by the number of streams in the subtest to get the aggregated value. This approach is likely to under-estimate the capabilities of the systems, an effect even more pronounced for the high thread count results.
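
A sketch of that aggregation for a single data set, assuming the per-stream results of the three repeats are stored one value (MB/s) per line in run1.txt, run2.txt and run3.txt (the file names are placeholders):

    STREAMS=$(wc -l < run1.txt)           # number of parallel streams in this subtest
    # take the lowest value of each repeat, average those, multiply by the stream count
    for f in run1.txt run2.txt run3.txt; do
        sort -n $f | head -1
    done | awk -v s=$STREAMS '{ sum += $1; n++ } END { print sum / n * s }'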
 

Analyses

  • Single-node performance is mostly limited by node capabilities and the network. Neither the homegrown nor the commercial solutions were stressed by a single thread/application. In the case of Panasas, the I/O is limited by the 1GE network bandwidth. On all Lustre systems (QDR InfiniBand) a single-threaded dd reaches 100% CPU time, making CPU and on-board memory the limiting factors.
  • All file systems scale to 4 parallel clients without much impact on the single application; Panasas and lfs5 scale well up to 16 clients. Nevertheless our users report serious performance fluctuations from time to time. Although we cannot provide accurate data, these problems appear to coincide with several I/O-intensive benchmarks running in parallel. Running the same I/O-intensive application at multiple core counts can also lead to discrepancies. This is also demonstrated by the variations shown in the Iozone benchmark on 64 nodes.

    If the file system has an overwhelming influence on a specific project or benchmark, one should request dedicated access to either lfs4 or lfs5. In any case I/O and inter-process communication (like MPI) share the same InfiniBand interconnect. To achieve the optimal results used in HPCC and SPEC MPI submissions it was even considered necessary to dedicate the cluster completely to these projects.
  • The Lustre systems offer every user the option to tune the CFS behavior to their application via Stripesize and Stripecount. Although we found Stripecount to have a huge impact, one should carefully tune these parameters to the application, as some settings proved to be counter-intuitive.

    An example: for lfs3 and lfs5 we found the optimal Stripecount to be 4. This value was measured on a single node. Running dd/bonnie tests on 32 and 64 clients in parallel, we also collected results with a Stripecount of 1. The results were almost identical, although one would expect less contention and overall more performance in the latter case.
  • Overall the performance of the commercial lfs3 and the homegrown lfs4 solutions is very similar. lfs3 has a number of advantages in handling and fault tolerance, and proved to be a reliable workhorse. These factors should not be underestimated in any TCO analysis.
  • The SSD-based lfs5 showed enormous scaling capabilities. Its aggregated bandwidth exceeded the disk-based solutions by a factor of 2 (write) and 3 (read). It also showed tremendous advantages in a specific single-node test - the Iozone "backward read": with 1500 MB/s vs. 90 MB/s it beat the next best solution by more than a factor of 16.

    Speed comes at a price - with a size of only 3.7 TB it has to be used strictly on a "per project" basis.
  • Single-node multi threaded tests showed very similar characteristics for lfs3 and lfs5. Depending on the test program used, at 4 I/O threads both file systems were capped by the InfiniBand interconnect at around 3GB/s.
  • Running 12 parallel I/O threads on every node did not change the picture dramatically. On a small number of nodes lfs3 and lfs5 behave similarly, while lfs5 demonstrates much better scaling and a higher aggregate peak bandwidth.
  • Even though CRT-DC is a special case in being a pure BENCHMARKING datacenter, the author finds it unlikely one can fulfill all needs in a cluster using a single CFS. Typical scenarios include:
    safe long term storage (currently unavailable in CRT-DC)
    safe mid term storage (Panasas)
    fast CFS for standard use (lfs3)
    secondary CFS to prevent downtime in case primary systems fail (lfs4 + NFS based archive systems)
    separate CFS dedicated to projects (lfs4,lfs5)
    highly scaling CFS (lfs5)
  • Single-node file system tests exhibited problems when run in parallel on many nodes. While the simple dd tests performed well enough, bonnie and Iozone were not really adequate to measure file I/O capabilities aggregated on a complete cluster.
  • The author assumes interference between processes caused by the lack of internal barriers to be the main reason and would welcome any suggestions for improvements.
  • Iozone can be run in a ClusterMode. This mode sets barriers but does not measure fread/fwrite calls. The author discussed this problem with Don Capps (maintainer of Iozone). A future version of Iozone might contain this feature.
  • These tests were done on an otherwise quiet cluster, preventing communication-intensive jobs from interfering with the I/O operations. In real-life situations users do not have this luxury, and although the IB fabric is non-blocking, bottlenecks like the IB subnet manager can lead to performance degradation.

    The results of running multiple Iozone and bonnie tests clearly showed that interference between jobs does happen, but it is not possible to know in advance which of the competing jobs will run at speed. On a busy cluster I/O-intensive jobs might very well give completely different results on subsequent runs.

    Budget constraints did not allow separating I/O from communication traffic via two distinct InfiniBand backbones. With increasing core density per node, scaling multi-node processes becomes increasingly difficult. Current plans therefore call for a separation of these networks with the next upgrade of the cluster.

Summary

The Intel Customer Response Team Data Center (CRT-DC) located in DuPont, Washington compared high-performance cluster file systems built from off-the-shelf servers with commercial solutions.
 

  • With current technology it is very easy to create a Cluster File System that can provide more bandwidth over QDR InfiniBand than a single 2-processor client, as typically found in modern clusters, can use.
  • Performance-wise, homegrown solutions based on off-the-shelf Intel servers can easily compete with commercial solutions. In the overall TCO calculation, soft factors like support and uptime therefore become important parameters.
  • Benchmarking a cluster file system is a very complex problem. Currently the author does not know of any single software package that could do this task easily. Iozone in throughput (or cluster) mode comes close, but is missing vital subtests.
  • A solution based on Intel SSDs can outperform far more expensive solutions based on standard HDDs. The challenge here shifts to finding software able to use such a system to full advantage.

This project also showed several scaling limitations that are not mirrored by real-life applications. At the time of writing, neither the location of the observed peaks (both in aggregated and per-node bandwidth) nor the sharp performance drops when adding more I/O threads can be explained to our complete satisfaction.

In part this might be caused by the test method employed. Recently the author was made aware of IOR (http://sourceforge.net/projects/ior-sio). Extending Iozone's cluster mode by adding support for buffered I/O would be another option, but both experiments will have to wait for a future project.
 

About the Author

Michael Hebenstreit grew up in Austria, where he studied physics and received a PhD. In 2005 he joined Intel, focusing first on customer relationships. He is now a Senior Cluster Architect running Intel's HPC benchmarking datacenter.

Appendix A: full sized charts