DAOS Sets New Records with Intel Optane Persistent Memory

By Kelsey Rose Prantis

We are excited to announce a new world record for file system performance on the International Supercomputing 2020 IO500 list. Using only 30 storage servers equipped with 2nd Gen Intel® Xeon® Platinum processors and Intel® Optane™ persistent memory (PMem) running Distributed Asynchronous Object Storage (DAOS), our submission ranked #1, ahead of today’s best supercomputers. The IO500 ranks the top file systems in the world based on balanced industry benchmarks so the industry can better compare the performance of different systems and technologies. This performance is made possible by the unique hardware capabilities of Intel Optane PMem, which combines low-latency, byte-granular memory access with data persistence for small I/Os, and by the new DAOS software stack, built from the ground up to leverage these capabilities. This year, we were pleased to be joined by two customers, Texas Advanced Computing Center (TACC) and Argonne National Laboratory (ANL), who also contributed IO500 submissions using DAOS with Intel Optane PMem and secured positions #3 and #4 on the list, respectively.

A bar chart showing the relative overall score of the top 7 entries in the IO500 full list. The entrants include Intel/DAOS, AWS/WekaIO, TACC/DAOS, ANL/DAOS, NUDT/Lustre, KISTI/IME, Oracle/BeeGFS.

The IO500 has an additional 10-node challenge, where all entries must have exactly 10 clients, enabling a more direct comparison of file system efficiency and per-server performance. In the 10-node challenge, the three systems equipped with 2nd Gen Intel Xeon Scalable processors, Intel Optane PMem, and the DAOS file system took all of the top three rankings and set a new bar for storage performance, with the first-place submission scoring more than three times as high as the top competing non-DAOS system.

A bar chart showing the relative overall score of the top 6 entries in the IO500 10-node challenge. The entrants include Intel/DAOS, TACC/DAOS, ANL/DAOS, NVIDIA/Lustre, WekaIO, EPCC/GekkoFS

All-flash storage architectures are evolving towards hybrid, tiered models. For many designs today, that includes a combination of high-performance Intel Optane SSDs and high-capacity NAND-based SSDs. However, the block interfaces in the existing storage software stacks of all SSDs and HDDs present fundamental limitations to I/O performance that DAOS with Intel Optane PMem improves upon. Let’s take a quick look at what happens to our application’s data as it is converted to blocks. Consider the diagram below.

A flow chart showing the flow of data from structured data in the application to the layout of the data across blocks as it is serialized onto disk.

The diagram shows two kinds of applications. At the top is a traditional HPC modeling and simulation application, represented by the three-dimensional matrix to the left. This application may use a layer of middleware to do its I/O, such as HDF or MPI-IO. Below it is an AI or data analytics application, which may store its data in a semi-structured or unstructured manner, such as key:value pairs or perhaps a more domain-specific data structure that preserves the semantic meaning of the data. There are several middleware options for AI and analytics applications as well, such as Apache Spark and TensorFlow. In either case, the application or middleware must take all the data and convert it into a series of blocks. When the data does not perfectly align with the size of the block, we have a choice: either leave a portion of a block empty, which over time will waste a huge amount of capacity on the system, or combine it with another piece of data so that we can utilize the whole block.

For large I/Os, this is less of an issue, because the data fills most of its blocks completely. But for small I/Os, such as file metadata, we may have to combine several pieces of data into a single block. And if an application generates I/O that is unaligned to the blocks, as real applications do, we also end up with different pieces of data sharing blocks.
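To make the alignment arithmetic concrete, here is a minimal sketch in Python. The 4 KiB block size and the function names are assumptions chosen for illustration; they do not describe any particular file system.

```python
BLOCK_SIZE = 4096  # assumed block size in bytes, for illustration only


def blocks_touched(offset: int, length: int, block_size: int = BLOCK_SIZE) -> int:
    """Count the blocks that a write of `length` bytes at byte `offset` overlaps."""
    if length <= 0:
        return 0
    first = offset // block_size
    last = (offset + length - 1) // block_size
    return last - first + 1


def padding_waste(length: int, block_size: int = BLOCK_SIZE) -> int:
    """Bytes left empty if the data is given whole blocks to itself."""
    if length <= 0:
        return 0
    # Distance from `length` up to the next multiple of block_size.
    return -length % block_size
```

Under these assumptions, a 256-byte metadata record stored alone leaves 3,840 bytes of its block empty, while a 1 MiB write wastes nothing. A 200-byte write at offset 4000 is unaligned and overlaps two blocks, so it shares both with whatever else lives there.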

Now, when an application needs a lock on one piece of that data, it has to lock the entire block on which that data resides, or more than one block if the piece of data is split across multiple blocks. If an application, whether on the same client or a different one, needs to lock a different piece of data that resides in the same block, it has to wait for the first operation to complete, effectively serializing the two. Multiply this by millions of interactions over time, and you lose a lot of performance waiting for blocks to be free. The more data you have cohabiting in the same blocks, the bigger the problem this creates on your cluster.
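The contention described above can be sketched with a small Python example that compares block-granular and byte-granular conflict checks. The block size, the `(offset, length)` representation of a request, and the function names are all hypothetical, and lengths are assumed to be positive.

```python
BLOCK_SIZE = 4096  # assumed block size in bytes, for illustration only


def block_set(offset: int, length: int, block_size: int = BLOCK_SIZE) -> set:
    """Indices of every block overlapped by the byte range (length must be > 0)."""
    first = offset // block_size
    last = (offset + length - 1) // block_size
    return set(range(first, last + 1))


def conflicts_block_granular(a, b, block_size: int = BLOCK_SIZE) -> bool:
    """True if requests a and b, each (offset, length), need a common block lock."""
    return bool(block_set(*a, block_size) & block_set(*b, block_size))


def conflicts_byte_granular(a, b) -> bool:
    """True only if the two byte ranges actually overlap."""
    a_start, a_len = a
    b_start, b_len = b
    return a_start < b_start + b_len and b_start < a_start + a_len
```

With these definitions, two 100-byte updates at offsets 0 and 200 touch disjoint bytes, so the byte-granular check reports no conflict, yet both fall in block 0 and would be serialized by a block-granular lock.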

As data volumes continue to grow, this bottleneck will become an increasingly serious problem. With the continued rise of AI and data analytics, there will be more small and unaligned I/O on storage systems than ever, and at the same time data access time will be more critical than ever. Enter Intel Optane Persistent Memory, which combines low-latency, byte-addressable data access with persistence. For the first time in decades of storage technology development, storage systems do not have to be constrained by block-based I/O.

However, today’s distributed storage software is itself a bottleneck: it is optimized for storage media with latency measured in milliseconds and built around POSIX standards. The performance impact of these software bottlenecks is so dramatic that they leave the majority of the performance benefits Optane PMem offers on the table. These shortfalls cannot be addressed by tweaking existing solutions alone; instead, scale-out storage needs to be rebuilt from the ground up for new non-volatile memory technologies. We at Intel have been building this new open source storage stack, including the Persistent Memory Development Kit (PMDK), Storage Performance Development Kit (SPDK), and DAOS, and these IO500 results demonstrate just how powerful this combination can be.

A diagram depicting the DAOS architecture, alongside the architecture of a conventional storage system. For the conventional file system, all of the data and metadata are stored directly on SSDs or HDDs. For the DAOS architecture, metadata and small I/Os are stored on Intel Optane PMem, while the larger block I/Os are stored on the NVMe SSDs.

The DAOS architecture is built on two fundamental building blocks. All small I/O, including metadata, is stored directly on Intel Optane PMem. This is not a cache or buffer solution; the persistent memory is a first-class storage device. Larger, more block-friendly I/O is stored directly on the NVMe SSDs. This division, with small I/O on persistent memory and large I/O on NVMe SSDs, is what enables DAOS to set new records for bandwidth and IOPS at arbitrary alignment and size, with significantly fewer systems, while still offering attractive performance per dollar.
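As a rough illustration of this size-based placement, the toy Python model below routes each write to one of two in-memory dictionaries standing in for the two media. It is not DAOS code: the `TieredStore` class, its method names, and the 4 KiB cutoff are assumptions made for the example, and the article does not state the actual threshold.

```python
from dataclasses import dataclass, field

SMALL_IO_THRESHOLD = 4096  # hypothetical cutoff in bytes, not taken from the article


@dataclass
class TieredStore:
    """Toy model of size-based placement; dicts stand in for PMem and NVMe SSDs."""
    threshold: int = SMALL_IO_THRESHOLD
    pmem: dict = field(default_factory=dict)
    nvme: dict = field(default_factory=dict)

    def put(self, key: str, value: bytes) -> str:
        """Store the value on the tier chosen by its size and return the tier name."""
        # Drop any stale copy so a key lives on exactly one tier.
        self.pmem.pop(key, None)
        self.nvme.pop(key, None)
        if len(value) < self.threshold:
            self.pmem[key] = value
            return "pmem"
        self.nvme[key] = value
        return "nvme"

    def get(self, key: str) -> bytes:
        """Look the key up on either tier; the caller never names a tier."""
        if key in self.pmem:
            return self.pmem[key]
        return self.nvme[key]
```

The point of the sketch is that both tiers hold authoritative data: a small value placed in `pmem` is never copied down to `nvme` later, which is what distinguishes this layout from a cache in front of the SSDs.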

DAOS also breaks through other common industry bottlenecks, improving performance even further. DAOS runs entirely in user space, using the PMDK and SPDK open source libraries. This not only bypasses the performance limitations of I/O through the kernel, but also reduces jitter and simplifies administration. DAOS also breaks free from the performance limitations of the POSIX interface, relying on optimistic concurrency control, which performs better than the pessimistic concurrency control of POSIX. While applications can still use POSIX over DAOS, they are not limited to it.
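For readers unfamiliar with the term, the following Python sketch shows the general idea of optimistic concurrency control: a writer reads a value together with its version, computes an update without holding a lock, and commits only if the version is unchanged. This is the generic technique, not a description of how DAOS implements it, and the `VersionedCell` class and `optimistic_update` helper are hypothetical names.

```python
class VersionedCell:
    """A value plus a version counter that increases on every successful commit."""

    def __init__(self, value):
        self.value = value
        self.version = 0

    def read(self):
        """Return the current value and the version it was read at."""
        return self.value, self.version

    def try_commit(self, new_value, expected_version) -> bool:
        """Commit only if nobody else committed since `expected_version`.

        A real system must make this check-and-write atomic; in this
        single-threaded sketch it is trivially so.
        """
        if self.version != expected_version:
            return False  # conflict detected: the caller should re-read and retry
        self.value = new_value
        self.version += 1
        return True


def optimistic_update(cell, fn, max_retries: int = 10) -> bool:
    """Apply fn to the cell's value, retrying whenever a conflict is detected."""
    for _ in range(max_retries):
        value, version = cell.read()
        if cell.try_commit(fn(value), version):
            return True
    return False
```

A pessimistic scheme would instead take a lock before reading and hold it until the write finishes, so every writer pays the waiting cost even when no conflict occurs. The optimistic version only pays, through a retry, when a conflict actually happens.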

All of this is transparent to the end user: the application need not be aware of the tiering happening underneath, within the DAOS servers. DAOS provides applications with a selection of interface options, such as the traditional POSIX interface or a key:value interface. Additionally, many common middleware and application frameworks have been enabled to run on DAOS. Applications using MPI-IO, HDF5, and Apache Spark can enjoy the full performance benefits of a DAOS back end without having to be rewritten. This list will continue to grow over time, with upcoming support for SEGY and ROOT formats as well.

This ecosystem is being cultivated so that applications have a rich array of options for gaining the performance and other benefits of a DAOS solution with Intel Optane Persistent Memory, without significant changes to the applications themselves. This enables a wide variety of applications to benefit, including modeling and simulation, life sciences and genomics, electronic devices and automation, financial services, AI solutions, high performance data analytics, and enterprise solutions.

DAOS 1.0 was released in June and is targeted at partner integration and the DAOS Proof of Concept program. If you are interested in more information or in test-driving Intel Optane Persistent Memory with DAOS for yourself, please visit the resources below.

Resources:
