Exploring Clustered Parallel File Systems and Object Storage

by Michael Ewan


This paper discusses recent research and testing of clustered, parallel file systems and object storage technology. Also included is an overview of product announcements from HP, IBM and Panasas in these areas.

The leading data access protocol for batch computing is currently Network File System (NFS), but even with bigger, faster, more expensive Network Attached Storage (NAS) hardware available, batch processing seems to have an insatiable appetite for I/O operations per second. The national labs have seen this NFS bottleneck in their high-performance computing (HPC) clusters and have abandoned NFS in favor of Lustre* at Pacific Northwest National Lab (PNNL) and Lawrence Livermore National Lab (LLNL). LLNL and Los Alamos National Lab (LANL) have adopted Panasas hardware-based object storage. This paper details the research and testing of clustered, parallel file systems with applications to batch pool HPC methodologies and discusses object storage technology. It also discusses recent product announcements from HP, IBM, and Panasas in these areas.

During the research, the following file systems were investigated:

  • Global File System* (GFS*) from Sistina (now Red Hat)
  • General Parallel File System* (GPFS*) from IBM
  • iSCSI aggregation* from Terrascale
  • Parallel Virtual File System* (PVFS*) and PVFS2* from Clemson University
  • Lustre from Cluster File Systems


Out of this investigation, a lab test of PVFS2 and Lustre based on scalability and access criteria was performed. This paper details those selection criteria and test results, plus features of each file system explored.

Two interesting emerging technologies are object storage devices and iSCSI. The Lustre parallel file system and Panasas both use object storage devices (called targets in Lustre). This paper describes object storage devices as well as storage aggregation using PVFS and iSCSI. What if by using either object storage targets in Lustre or storage aggregation, one could make use of all the excess storage in each compute node, simultaneously and in parallel? In a normal e-commerce (EC) environment, the available (free) storage would be on the order of 30 to 100 terabytes, dependent on disk size and number of compute nodes.

Why the Need for High-Performance Parallel File Systems?

As ubiquitous as NFS is, it is still a high-overhead protocol. Even with the newly available NFS V4 and its ability to combine operations into one request, benchmarks have shown that the performance is still substantially less than that available via other protocols. NFS is the current industry standard for NAS, sharable storage on UNIX* and Linux* servers. Its lack of scalability and coherency can be limitations for some high-bandwidth, I/O intensive applications, causing an imposing I/O bottleneck. However, if scalability and coherency are not issues, then NFS is a viable solution. The main problem with NFS is the single point of access to a server; any particular file is still only available on one server via one network interface. We can make bigger and faster servers with bigger and faster network pipes, but we still cannot grow to the volume of data that is necessary to support I/O intensive applications on a 1000-node cluster. Parallel access to multiple servers is the current solution to the problem of growing throughput and overall storage performance. Two approaches to the problem are parallel virtual file systems and object storage devices.

What is a Parallel File System?

In general, a parallel file system is one in which data blocks are striped, in parallel, across multiple storage devices on multiple storage servers. This is similar to network link aggregation in which the I/O is spread across several network connections in parallel, each packet taking a different link path from the previous. Parallel file systems place data blocks from files on more than one server and more than one storage device. As the available servers and available storage devices are increased, throughput can easily be doubled or tripled, given enough network connections on the clients. As a client application requests data I/O, each sequential block request potentially can be going to an entirely different server or storage device. In essence, there is a linear increase in performance up to the total capacity of the network. PVFS is one product that provides this kind of clustered parallel access to data; another product is Lustre. These products provide performance and capacity scalability using different protocols. Lustre adds the concept of object storage devices to provide another layer of abstraction with the capability of later providing media redundancy on a file-by-file basis. See the Web links in the References section for more information on these software applications.
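
The block placement described above can be sketched in a few lines. This is a minimal illustration of round-robin striping, not the layout algorithm of PVFS or Lustre; the names `StripeLayout`, `locate`, and `split_request`, as well as the 64 KB stripe size and the four servers, are assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class StripeLayout:
    """Hypothetical round-robin layout: stripe k lives on server k % server_count."""
    stripe_size: int   # bytes per stripe unit (assumed value, not from the paper)
    server_count: int  # number of storage servers holding the file

    def locate(self, offset: int) -> Tuple[int, int]:
        """Map a file offset to (server index, offset inside that server's object)."""
        if offset < 0:
            raise ValueError("offset must be non-negative")
        stripe_index = offset // self.stripe_size
        server = stripe_index % self.server_count
        # Number of complete rounds already written to this server.
        local_stripe = stripe_index // self.server_count
        return server, local_stripe * self.stripe_size + offset % self.stripe_size


def split_request(layout: StripeLayout, offset: int, length: int) -> List[Tuple[int, int, int]]:
    """Split a byte range into (server, local_offset, length) pieces, one per stripe touched."""
    pieces = []
    end = offset + length
    while offset < end:
        server, local = layout.locate(offset)
        room = layout.stripe_size - offset % layout.stripe_size
        chunk = min(room, end - offset)
        pieces.append((server, local, chunk))
        offset += chunk
    return pieces


layout = StripeLayout(stripe_size=64 * 1024, server_count=4)
# A 256 KB sequential read starting at offset 0 touches each of the four servers once.
pieces = split_request(layout, 0, 256 * 1024)
servers_touched = [server for server, _, _ in pieces]
```

Because consecutive stripes land on different servers, the pieces of one large request can be issued concurrently, which is where the near-linear throughput scaling comes from.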

Clustered versus Parallel File Systems

Clustered file systems generally fall into the category of shared storage across multiple servers. Red Hat GFS is one of these. The product really isn’t designed for performance but for brokering access to shared storage and providing many-to-one access to data. When combined with high-availability (HA) software, GFS provides for a very resilient server configuration that scales up to 255 nodes. This means shared access to a single storage node, not performance scaling by striping data. IBM GPFS also provides simultaneous shared access to storage from multiple nodes, and adds a virtualization layer that provides transparent access to multiple storage sources via the IBM SAN File System. Neither of these products was considered in this evaluation due to scalability and/or cost issues. OpenGFS and Enterprise Volume Management System (EVMS) on SuSE Linux Enterprise Server* 9 (SLES9*) could provide exciting possibilities for low-cost high-performance Linux file servers utilizing storage virtualization.

Object Storage Devices

Each file or directory can be thought of as an object: an object with attributes. Each attribute can be assigned a value such as file type, file location, data stripes or not, ownership, and permissions. An object storage device allows us to specify for each file where to store the blocks allocated to the file, via a metadata server and object storage targets. Extending the storage attribute further, we can specify not only how many targets to stripe onto, but also what level of redundancy we want. Some implementations allow us to specify RAID0, RAID1, and RAID5 on a per-file basis. Panasas has taken the concept of object storage devices and implemented it entirely in hardware. Using a lightweight client on Linux, Panasas is able to provide highly scalable multi-protocol file servers, and they have implemented per-file level RAID (0 and 1 currently).
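
To illustrate the idea of a file as an object carrying its own storage attributes, here is a minimal sketch. The class and field names are hypothetical and do not correspond to any vendor's interface. The capacity model is simplified and assumed: RAID1 is taken as a two-way mirror, and RAID5 as one parity unit per stripe across `stripe_count` targets.

```python
from dataclasses import dataclass
from enum import Enum


class Redundancy(Enum):
    RAID0 = "raid0"  # striping only, no redundancy
    RAID1 = "raid1"  # assumed two-way mirror
    RAID5 = "raid5"  # assumed single parity unit per stripe


@dataclass
class FileObject:
    """Hypothetical per-file attribute set, as kept by a metadata server."""
    path: str
    owner: str
    mode: int               # POSIX-style permission bits
    stripe_count: int       # number of storage targets the file is spread over
    redundancy: Redundancy  # redundancy level chosen for this file only

    def raw_bytes_needed(self, logical_bytes: int) -> int:
        """Raw capacity consumed across all targets under the simplified model."""
        if self.redundancy is Redundancy.RAID0:
            return logical_bytes
        if self.redundancy is Redundancy.RAID1:
            return 2 * logical_bytes
        if self.stripe_count < 3:
            raise ValueError("RAID5 needs at least three targets in this model")
        n = self.stripe_count
        # n - 1 data units plus 1 parity unit per stripe, rounded up.
        return -(-logical_bytes * n // (n - 1))


scratch = FileObject("/scratch/run1.out", "user1", 0o640, 4, Redundancy.RAID0)
results = FileObject("/results/run1.db", "user1", 0o640, 4, Redundancy.RAID5)
```

Two files in the same file system can therefore trade capacity for protection independently: `scratch` consumes exactly its logical size, while `results` pays a one-third overhead for parity on four targets.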

Figure 1. Object Storage Model

Figure 2. Data striping in Object Storage


Many Linux clusters use slow shared I/O protocols, such as Network File System (NFS), the current de facto standard for sharing files. The resulting slow I/O can limit the speed and throughput of the Linux cluster. Lustre provides significant advantages over other distributed file systems. It runs on commodity hardware and uses object-based disks for storage and metadata servers for file system metadata (inodes). This design provides a substantially more efficient division of labor between computing and storage resources. Replicated, failover metadata servers (MDSs) maintain a transactional record of high-level file and file system changes. One or many object storage targets (OSTs) are responsible for actual file system I/O and for interfacing with storage devices. File operations bypass the metadata server completely and fully utilize the parallel data paths to all OSTs in the cluster. This unique approach - separating metadata operations from data operations - results in significantly enhanced performance. This division of function leads to a truly scalable file system and more recoverability from failure conditions by providing the advantages of both journaling and distributed file systems.
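
The separation of the metadata path from the data path can be shown with a toy in-memory model. This is a sketch under stated assumptions, not Lustre's protocol or API: the classes `MetadataServer`, `ObjectStorageTarget`, and `Client` are hypothetical, the layout is plain round-robin, and the stripe size of 4 bytes is chosen only to keep the example small.

```python
from typing import Dict, Iterator, List, Tuple


class MetadataServer:
    """Toy MDS: hands out a file's layout and counts how often it is asked."""
    def __init__(self, ost_count: int, stripe_size: int) -> None:
        self.default_layout = (ost_count, stripe_size)
        self.files: Dict[str, Tuple[int, int]] = {}
        self.requests = 0

    def open(self, path: str) -> Tuple[int, int]:
        self.requests += 1
        return self.files.setdefault(path, self.default_layout)


class ObjectStorageTarget:
    """Toy OST: stores one byte array per file path."""
    def __init__(self) -> None:
        self.objects: Dict[str, bytearray] = {}
        self.requests = 0

    def write(self, path: str, local_offset: int, data: bytes) -> None:
        self.requests += 1
        obj = self.objects.setdefault(path, bytearray())
        shortfall = local_offset + len(data) - len(obj)
        if shortfall > 0:
            obj.extend(bytes(shortfall))  # grow the object with zero bytes
        obj[local_offset:local_offset + len(data)] = data

    def read(self, path: str, local_offset: int, length: int) -> bytes:
        self.requests += 1
        return bytes(self.objects[path][local_offset:local_offset + length])


class Client:
    """Asks the MDS once per open, then talks to the OSTs directly."""
    def __init__(self, mds: MetadataServer, osts: List[ObjectStorageTarget]) -> None:
        self.mds = mds
        self.osts = osts

    @staticmethod
    def _chunks(layout: Tuple[int, int], length: int) -> Iterator[Tuple[int, int, int]]:
        ost_count, stripe = layout
        offset = 0
        while offset < length:
            index, within = divmod(offset, stripe)
            size = min(stripe - within, length - offset)
            yield index % ost_count, (index // ost_count) * stripe + within, size
            offset += size

    def write(self, path: str, data: bytes) -> None:
        layout = self.mds.open(path)  # single metadata round trip
        pos = 0
        for ost, local, size in self._chunks(layout, len(data)):
            self.osts[ost].write(path, local, data[pos:pos + size])  # data path only
            pos += size

    def read(self, path: str, length: int) -> bytes:
        layout = self.mds.open(path)
        return b"".join(self.osts[ost].read(path, local, size)
                        for ost, local, size in self._chunks(layout, length))


mds = MetadataServer(ost_count=3, stripe_size=4)
osts = [ObjectStorageTarget() for _ in range(3)]
client = Client(mds, osts)
client.write("/f", b"abcdefghijklmnop")
mds_requests_after_write = mds.requests
ost_requests_after_write = [ost.requests for ost in osts]
round_trip = client.read("/f", 16)
```

In this model the metadata server sees one request per open regardless of file size, while the number of data requests grows with the amount of data and is spread across the targets.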

Lustre supports strong file and metadata locking semantics to maintain total coherency of the file systems even under a high volume of concurrent access. File locking is distributed across the storage targets (OSTs) that constitute the file system, with each OST handling locks for the objects that it stores. Lustre technology is designed to scale while maintaining resiliency. As servers are added to a typical cluster environment, failures become more likely due to the increasing number of physical components. Lustre’s support for resilient, redundant hardware provides protection from inevitable hardware failures through transparent failover and recovery. Lustre has not yet been ported to support UNIX and Windows operating systems. Lustre clients can and probably will be implemented on non-Linux platforms, but as of this writing, Lustre is available only on Linux.

Currently, one additional drawback to Lustre is that a Lustre client cannot be on a server that is providing OSTs. A solution is being worked on and may be available soon; in the meantime, this restriction limits the utility of Lustre for storage aggregation (see the discussion of Storage Aggregation below). Using Lustre, combined with a low-latency high-throughput cluster interconnect, you can achieve throughput numbers of well over 500 MB/sec, by striping data across hundreds of object storage targets.

Figure 3. Typical Lustre client/server configuration.[4]

Commercial Lustre

Lustre is an open, standards-based technology that is well funded and backed by the U.S. Department of Energy (DOE), the greater open source Linux community, Cluster File Systems, Inc. (Cluster FS), and Hewlett Packard (HP). Cluster FS provides commercial support for Lustre, and provides Lustre as an open source project. HP has taken Lustre, ProLiant* file servers running Linux, with HP StorageWorks* EVA disk arrays to provide a hardware/software product called HP Scalable File Server (SFS).

Lustre Performance

HP and PNNL have partnered on the design, installation, integration and support of one of the top 10 fastest computing clusters in the world. The HP Linux super cluster, with more than 1,800 Itanium® processors, is rated at more than 11 TFLOPS. PNNL has run Lustre for more than a year and currently sustains over 3.2 GB/s of bandwidth running production loads on a 53-terabyte Lustre-based file share. Individual Linux clients are able to write data to the parallel Lustre servers at more than 650 MB/s.


Parallel Virtual File System (PVFS) is an open source project from Clemson University that provides a lightweight server daemon to provide simultaneous access to storage devices from hundreds to thousands of clients. Each node in the cluster can be a server, a client, or both. At the time PVFS2 was installed and tested, there were no considerations in the product for redundancy, and Lustre provided more features and flexibility. Now that PVFS2 has progressed beyond version 1.0 and enterprise Linux (SLES9) has been deployed, PVFS2 should be considered for further testing and evaluation. Since storage servers can also be clients, PVFS2 supports striping data across all available storage devices in the cluster (storage aggregation, see below). PVFS2 is best suited for providing large, fast temporary scratch space.

Storage Aggregation

Rather than providing scalable performance by striping data across dedicated storage devices, storage aggregation provides scalable capacity by utilizing available storage blocks on each compute node. Each compute node runs a server daemon that provides access to free space on the local disks. Additional software runs on each client node that combines those available blocks into a virtual device and provides locking and concurrent access to the other compute nodes. Each compute node could potentially be a server of blocks and a client. Using storage aggregation on a 1000-node compute batch pool, 36 TB of free storage could potentially be gained for high-performance temporary space.
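
The capacity estimate above is simple multiplication. A minimal sketch, assuming every node contributes the same amount of free disk and using decimal units; the 36 GB per node figure is inferred from 36 TB spread over 1000 nodes and is not stated in the text:

```python
def aggregate_free_tb(node_count: int, free_gb_per_node: float) -> float:
    """Total free capacity in decimal terabytes (1 TB = 1000 GB)."""
    if node_count < 0 or free_gb_per_node < 0:
        raise ValueError("inputs must be non-negative")
    return node_count * free_gb_per_node / 1000


# Assumed per-node free space of 36 GB on a 1000-node batch pool.
pool_tb = aggregate_free_tb(1000, 36)
```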

Two products in this area are Terrascale TerraGrid* and Ibrix Fusion*. Both of these products deserve a closer look in the future. There are obvious issues of reliability, since the mean time between failures (MTBF) is divided by the number of nodes. TerraGrid solves this problem by using the Linux native meta-device driver to provide mirroring of aggregated devices. Another issue needs consideration in a scenario where the compute nodes are also serving storage blocks: how much of the compute resources are used serving blocks to other compute nodes?
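
The MTBF concern can be made concrete with a rough model. The sketch below assumes independent node failures at a constant rate and that losing any single node breaks an unmirrored aggregate; the 500,000-hour per-node MTBF is a hypothetical figure, not a measurement.

```python
def aggregate_mtbf_hours(node_mtbf_hours: float, node_count: int) -> float:
    """MTBF of an unmirrored aggregate in which any one node failure is fatal."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    return node_mtbf_hours / node_count


# Hypothetical per-node MTBF of 500,000 hours across a 1000-node pool.
pool_mtbf_hours = aggregate_mtbf_hours(500_000, 1000)
pool_mtbf_days = pool_mtbf_hours / 24
```

Under these assumptions the aggregate would see a failure roughly every three weeks, which is why mirroring of the aggregated devices matters.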

High-Performance Computing and Cluster Technologies

This is outside the scope of this paper, but other technologies to investigate are the following:


  • High-bandwidth, low-latency interconnects such as InfiniBand*, where sustained data rates of over 800 MB/sec can be obtained with data I/O intensive processes and computing.
  • Single System Image clusters in order to wring the most performance out of computing resources.


New Work in pNFS

The IETF NFS v4 working group has introduced a parallel NFS (pNFS) protocol extension derived from work by Panasas. Simply put, this protocol extension allows for object-storage-like access to parallel data sources using out-of-band metadata servers. See Gibson, IETF, and pNFS for details.


In summary, clustered, parallel file systems provide the highest performance and lowest overall cost for access to temporary design data storage in batch processing pools. Parallel cluster file systems remove our dependency on centralized monolithic NFS and very expensive file servers for delivering data to batch processing nodes. Parallel cluster file systems provide storage aggregation over thousands of compute nodes. Parallel file systems can take advantage of low-latency, high-bandwidth interconnects, thus relieving file access of TCP/IP overhead and latency of shared Ethernet networks.

There are drawbacks to most of the parallel file system offerings, specifically in media redundancy, so currently the best application for clustered parallel file systems would be for high-performance scratch storage on batch pools or tape-out where source data is copied and simulation results are written from thousands of cycles simultaneously.

Additional References & Resources