Last Modified On: October 23, 2008 3:27 PM PDT
by Michael Ewan
This paper discusses recent research and testing of clustered, parallel file systems and object storage technology. Also included is an overview of product announcements from HP, IBM and Panasas in these areas.
The leading data access protocol for batch computing is currently Network File System (NFS), but even with bigger, faster, more expensive Network Attached Storage (NAS) hardware available, batch processing seems to have an insatiable appetite for I/O operations per second. The national labs have seen this NFS bottleneck in their high-performance computing (HPC) clusters and have abandoned NFS in favor of Lustre* at Pacific Northwest National Lab (PNNL) and Lawrence Livermore National Lab (LLNL). LLNL and Los Alamos National Lab (LANL) have adopted Panasas hardware-based object storage. This paper details the research and testing of clustered, parallel file systems with applications to batch pool HPC methodologies and discusses object storage technology. It also discusses recent product announcements from HP, IBM, and Panasas in these areas.
During the research, the following file systems were investigated:
Out of this investigation, a lab test of PVFS2 and Lustre based on scalability and access criteria was performed. This paper details those selection criteria and test results, plus features of each file system explored.
Two interesting emerging technologies are object storage devices and iSCSI. The Lustre parallel file system and Panasas both use object storage devices (called targets in Lustre). This paper describes object storage devices as well as storage aggregation using PVFS and iSCSI. What if by using either object storage targets in Lustre or storage aggregation, one could make use of all the excess storage in each compute node, simultaneously and in parallel? In a normal e-commerce (EC) environment, the available (free) storage would be on the order of 30 to 100 terabytes, dependent on disk size and number of compute nodes.
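As a back-of-the-envelope illustration of the aggregation idea, the sketch below multiplies the number of compute nodes by the free disk space on each. The node count and per-node free space are hypothetical values chosen only to land in the 30 to 100 terabyte range mentioned above; they are not measurements from this paper.

```python
def aggregate_free_tb(num_nodes: int, free_gb_per_node: float) -> float:
    """Return the total free capacity across all nodes, in terabytes.

    Uses decimal units (1 TB = 1000 GB). Both arguments are illustrative
    assumptions, not figures taken from the paper.
    """
    if num_nodes < 0 or free_gb_per_node < 0:
        raise ValueError("node count and free space must be non-negative")
    return num_nodes * free_gb_per_node / 1000.0


if __name__ == "__main__":
    # Hypothetical pool: 1000 compute nodes with 30 GB or 100 GB free each.
    for free_gb in (30, 100):
        total = aggregate_free_tb(1000, free_gb)
        print(f"1000 nodes x {free_gb} GB free = {total} TB")
```

The point of the arithmetic is only that modest per-node leftovers add up quickly once every node contributes in parallel.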
As ubiquitous as NFS is, it is still a high-overhead protocol. Even with the newly available NFS V4 and its ability to combine operations into one request, benchmarks have shown that the performance is still substantially less than that available via other protocols. NFS is the current industry standard for NAS, sharable storage on UNIX* and Linux* servers. Its lack of scalability and coherency can be limitations for some high-bandwidth, I/O intensive applications, causing an imposing I/O bottleneck. However, if scalability and coherency are not issues, then NFS is a viable solution. The main problem with NFS is the single point of access to a server; any particular file is still only available on one server via one network interface. We can make bigger and faster servers with bigger and faster network pipes, but we still cannot grow to the volume of data that is necessary to support I/O intensive applications on a 1000-node cluster. Parallel access to multiple servers is the current solution to the problem of growing throughput and overall storage performance. Two approaches to the problem are parallel virtual file systems and object storage devices.
What is a Parallel File System?
In general, a parallel file system is one in which data blocks are striped, in parallel, across multiple storage devices on multiple storage servers. This is similar to network link aggregation in which the I/O is spread across several network connections in parallel, each packet taking a different link path from the previous. Parallel file systems place data blocks from files on more than one server and more than one storage device. As the available servers and available storage devices are increased, throughput can easily be doubled or tripled, given enough network connections on the clients. As a client application requests data I/O, each sequential block request potentially can be going to an entirely different server or storage device. In essence, there is a linear increase in performance up to the total capacity of the network. PVFS is one product that provides this kind of clustered parallel access to data; another product is Lustre. These products provide performance and capacity scalability using different unique protocols. Lustre adds the concept of object storage devices to provide another layer of abstraction with the capability of later providing media redundancy on a file-by-file basis. See the Web links in the References section for more information on these software applications.
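The striping described above can be sketched as a simple round-robin mapping from a file offset to a server. This is a minimal illustration under assumed parameters (a fixed stripe size and a fixed server count); it is not the actual placement algorithm of PVFS or Lustre, and the names `StripeLocation`, `locate`, and `servers_touched` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class StripeLocation:
    server: int         # index of the storage server holding the byte
    server_offset: int  # byte offset within that server's share of the file


def locate(file_offset: int, stripe_size: int, num_servers: int) -> StripeLocation:
    """Map a byte offset in the file to a server using round-robin striping."""
    if file_offset < 0 or stripe_size <= 0 or num_servers <= 0:
        raise ValueError("offset must be >= 0; stripe size and server count > 0")
    stripe_index = file_offset // stripe_size      # which stripe of the file
    server = stripe_index % num_servers            # stripes rotate over servers
    local_stripe = stripe_index // num_servers     # stripes already on that server
    within = file_offset % stripe_size             # position inside the stripe
    return StripeLocation(server, local_stripe * stripe_size + within)


def servers_touched(offset: int, length: int, stripe_size: int,
                    num_servers: int) -> List[int]:
    """List the server index for each stripe covered by a sequential request."""
    if length <= 0:
        raise ValueError("length must be positive")
    first = offset // stripe_size
    last = (offset + length - 1) // stripe_size
    return [s % num_servers for s in range(first, last + 1)]


if __name__ == "__main__":
    # A 256 KiB sequential read with 64 KiB stripes over 4 servers.
    print(servers_touched(0, 4 * 65536, 65536, 4))
    print(locate(5 * 65536 + 10, 65536, 4))
```

Because consecutive stripes land on different servers, one large sequential request fans out across all of them, which is where the aggregate throughput comes from.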
Clustered versus Parallel File Systems
Clustered file systems generally fall into the category of shared storage across multiple servers. Red Hat GFS is one of these. The product really isn’t designed for performance but for brokering access to shared storage and providing many-to-one access to data. When combined with high-availability (HA) software, GFS provides for a very resilient server configuration that scales up to 255 nodes. This means shared access to a single storage node, not performance scaling by striping data. IBM GPFS also provides simultaneous shared access to storage from multiple nodes, and adds a virtualization layer that provides transparent access to multiple storage sources via the IBM SAN File System. Neither of these products was considered in this evaluation due to scalability and/or cost issues. OpenGFS and Enterprise Volume Management System (EVMS) on SuSE Linux Enterprise Server* 9 (SLES9*) could provide exciting possibilities for low-cost high-performance Linux file servers utilizing storage virtualization.
Object Storage Devices
Each file or directory can be thought of as an object with attributes. Each attribute can be assigned a value such as file type, file location, data stripes or not, ownership, and permissions. An object storage device allows us to specify for each file where to store the blocks allocated to the file, via a metadata server and object storage targets. Extending the storage attribute further, we can specify not only how many targets to stripe onto, but also what level of redundancy we want. Some implementations allow us to specify RAID0, RAID1, and RAID5 on a per-file basis. Panasas has taken the concept of object storage devices and implemented it entirely in hardware. Using a lightweight client on Linux, Panasas is able to provide highly scalable multi-protocol file servers, and they have implemented per-file level RAID (0 and 1 currently).
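The idea of a file as an object whose attributes include its own storage layout can be sketched as follows. This is a toy model, not the Lustre or Panasas interface: the class names, the target labels such as "ost0", and the rule of picking the first available targets are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Tuple


class Redundancy(Enum):
    RAID0 = "raid0"
    RAID1 = "raid1"
    RAID5 = "raid5"


@dataclass
class FileObject:
    name: str
    owner: str
    mode: int               # permission bits, e.g. 0o640
    stripe_count: int       # how many targets to stripe across
    redundancy: Redundancy  # per-file redundancy level


class MetadataServer:
    """Toy metadata server: records which targets hold each file object."""

    def __init__(self, targets: List[str]) -> None:
        self.targets = list(targets)
        self.layouts: Dict[str, Tuple[FileObject, List[str]]] = {}

    def create(self, obj: FileObject) -> List[str]:
        if not 1 <= obj.stripe_count <= len(self.targets):
            raise ValueError("stripe_count must fit the available targets")
        # Assumption: simply take the first N targets; real systems balance load.
        chosen = self.targets[:obj.stripe_count]
        self.layouts[obj.name] = (obj, chosen)
        return chosen

    def lookup(self, name: str) -> Tuple[FileObject, List[str]]:
        return self.layouts[name]


if __name__ == "__main__":
    mds = MetadataServer(["ost0", "ost1", "ost2", "ost3"])
    scratch = FileObject("scratch.dat", "user", 0o640, 2, Redundancy.RAID0)
    print(mds.create(scratch))
```

The design point is that striping width and redundancy travel with the file, so two files on the same storage pool can be laid out differently.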
New Work in pNFS
The IETF NFS v4 working group has introduced a parallel NFS (pNFS) protocol extension derived from work by Panasas. Simply put, this protocol extension allows for object storage “like” access to parallel data sources using out-of-band metadata servers. See Gibson, IETF, and pNFS for details.
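The out-of-band metadata idea can be illustrated with a small in-memory simulation: the client asks a metadata server where the stripes live, then fetches them directly from the data servers in parallel. This is only a sketch of the control path and data path separation; it does not implement the pNFS wire protocol, and every class and function name here is hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple


class DataServer:
    """Holds stripes keyed by (file name, stripe index)."""

    def __init__(self) -> None:
        self.stripes: Dict[Tuple[str, int], bytes] = {}

    def write(self, name: str, index: int, data: bytes) -> None:
        self.stripes[(name, index)] = data

    def read(self, name: str, index: int) -> bytes:
        return self.stripes[(name, index)]


class MetadataServer:
    """Knows the layout of each file but never serves file data to clients."""

    def __init__(self, data_servers: List[DataServer], stripe_size: int) -> None:
        self.data_servers = data_servers
        self.stripe_size = stripe_size
        self.stripe_counts: Dict[str, int] = {}

    def seed_file(self, name: str, payload: bytes) -> None:
        # Setup helper for this toy only: spread the payload round-robin.
        size = self.stripe_size
        chunks = [payload[i:i + size] for i in range(0, len(payload), size)]
        for index, chunk in enumerate(chunks):
            self.data_servers[index % len(self.data_servers)].write(name, index, chunk)
        self.stripe_counts[name] = len(chunks)

    def get_layout(self, name: str) -> List[Tuple[int, DataServer]]:
        count = self.stripe_counts[name]
        n = len(self.data_servers)
        return [(index, self.data_servers[index % n]) for index in range(count)]


def parallel_read(mds: MetadataServer, name: str) -> bytes:
    layout = mds.get_layout(name)  # control path: one metadata request
    with ThreadPoolExecutor(max_workers=4) as pool:
        # Data path: stripes are fetched directly from the data servers.
        chunks = list(pool.map(lambda entry: entry[1].read(name, entry[0]), layout))
    return b"".join(chunks)


if __name__ == "__main__":
    servers = [DataServer() for _ in range(3)]
    mds = MetadataServer(servers, stripe_size=4)
    mds.seed_file("design.dat", b"abcdefghijklmnop")
    print(parallel_read(mds, "design.dat"))
```

Keeping the metadata server out of the data path is what lets read bandwidth grow with the number of data servers rather than with the capacity of a single head node.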
In summary, clustered, parallel file systems provide the highest performance and lowest overall cost for access to temporary design data storage in batch processing pools. Parallel cluster file systems remove our dependency on centralized, monolithic NFS and very expensive file servers for delivering data to batch processing nodes. Parallel cluster file systems provide storage aggregation over thousands of compute nodes. Parallel file systems can take advantage of low-latency, high-bandwidth interconnects, thus relieving file access of the TCP/IP overhead and latency of shared Ethernet networks.
There are drawbacks to most of the parallel file system offerings, specifically in media redundancy, so currently the best application for clustered parallel file systems would be for high-performance scratch storage on batch pools or tape-out where source data is copied and simulation results are written from thousands of cycles simultaneously.

Joshua Konkle
Great article; I would like to emphasize that pNFS can support three storage types in the current standard.
Blocks - using FC, FCoE, iSCSI
Objects - using T10 OSD, i.e. Panasas model
Files - NFSv4.1 files layout, what you expect with NFS but only better and parallel
pNFS is really an abstraction layer supporting a standard for parallel read/write layout operations. It specifies one metadata server + multiple data servers in the spec, but implementations can vary in how many metadata servers there are as long as they support the layout standard properly, which is supported by a client kernel from kernel.org into the Linux distributions.
Here is the link to the pNFS Standard-Draft; once copy edited (600 pages) it will be published, but this is the specification (including bad grammar/spelling) until copy edited.
http://www.ietf.org/internet-drafts/draft-ietf-nfsv4-minorversion1-29.txt
OSD is somewhat misunderstood due to the existing Internet/Enterprise Cloud and HPC object languages, Azure, HDF5, GAE, XAM.
Thanks for posting your article.
Joshua