Prof. Dr.-Ing. André Brinkmann is a full professor in the computer science department of the Johannes Gutenberg University Mainz (JGU) and head of the data center ZDV (since 2011). He received his Ph.D. in electrical engineering from the University of Paderborn in 2004 and was an assistant professor at the University of Paderborn from 2008 to 2011. He has also served as the managing director of the Paderborn Centre for Parallel Computing (PC2). His research focuses on applying algorithm engineering techniques to storage systems, HPC, and cloud computing.
Lustre is a parallel file system that is used in many TOP500 HPC clusters. It has been designed to support a large number of parallel applications running concurrently on these clusters. The load of each application is spread over many Object Storage Targets (OSTs), which serve as backend storage devices. Nevertheless, when the stripes of different applications overlap on the same OSTs, the available bandwidth of a Lustre environment can drop significantly.
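The overlap effect can be illustrated with a minimal sketch. It assumes a simplified layout in which each file is striped over consecutive OST indices starting at a given first OST; the function names, file names, and numbers are hypothetical and chosen only for illustration.

```python
from collections import Counter


def stripe_targets(first_ost: int, stripe_count: int, num_osts: int) -> list[int]:
    """Return the OST indices of a file striped over consecutive OSTs.

    Simplified, hypothetical layout: stripes wrap around after the last OST.
    """
    return [(first_ost + i) % num_osts for i in range(stripe_count)]


def ost_load(files: dict[str, tuple[int, int]], num_osts: int) -> Counter:
    """Count how many files place a stripe on each OST."""
    load = Counter()
    for first_ost, stripe_count in files.values():
        load.update(stripe_targets(first_ost, stripe_count, num_osts))
    return load


# Three applications, each writing one file: (first OST, stripe count).
files = {"app_a": (0, 4), "app_b": (2, 4), "app_c": (3, 2)}
load = ost_load(files, num_osts=8)

# The OST shared by the most files is the likely bottleneck.
hottest = max(load, key=load.get)
```

In this example, OST 3 serves stripes of all three files while OSTs 6 and 7 stay idle, so the applications compete for one device even though the system as a whole has spare capacity.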
The Network Request Scheduler (NRS) was introduced in the Lustre mainline kernel in version 2.4.0 to enable quality of service (QoS) options, which had previously been considered mostly in networking. Standard QoS strategies include the token bucket strategy, in which an average bandwidth is assigned to each client. However, even though the NRS can enforce priorities between different applications, it is currently unable to optimize the overall bandwidth delivered by the file system.
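The token bucket strategy can be sketched as follows. This is a generic, minimal illustration of the technique, not the NRS implementation: the class name, the parameters, and the explicit `now` timestamps are assumptions made to keep the example deterministic.

```python
class TokenBucket:
    """Generic token bucket: tokens refill at `rate` per second up to `capacity`."""

    def __init__(self, rate: float, capacity: float, now: float = 0.0):
        self.rate = rate          # average requests per second granted to the client
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity    # the bucket starts full
        self.last = now           # time of the last refill

    def _refill(self, now: float) -> None:
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = max(self.last, now)

    def try_consume(self, now: float, cost: float = 1.0) -> bool:
        """Admit a request at time `now` if enough tokens are available."""
        self._refill(now)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# One client limited to 2 requests per second with a burst of 2.
bucket = TokenBucket(rate=2.0, capacity=2.0)
results = [bucket.try_consume(t) for t in (0.0, 0.0, 0.0, 0.5, 0.5)]
```

The third request at time 0.0 is rejected because the burst is exhausted. After half a second, one token has been refilled, so exactly one further request is admitted.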
The "Lustre QoS" project will incorporate additional information to improve the quality of the NRS and to optimize overall bandwidth delivery. The main idea is to include information about the striping targets of each client in the token bucket strategy to ensure that no individual OST is overloaded. Additional approaches include request reordering to fit the strategies of the object storage targets, and data layouts that minimize stripe overheads.
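One way to picture the main idea is an admission check that consults two budgets per request: the client's token budget and a capacity budget of the targeted OST. The sketch below is an assumption made for illustration only and does not describe the project's actual design; the function name, the per-round budgets, and the greedy in-order policy are all hypothetical.

```python
def admit(requests, client_tokens, ost_capacity):
    """Admit (client, ost) requests in arrival order while both budgets last.

    client_tokens: tokens each client may spend in this scheduling round.
    ost_capacity:  requests each OST can serve in this scheduling round.
    Both budgets and the greedy policy are hypothetical simplifications.
    """
    client_left = dict(client_tokens)
    ost_left = dict(ost_capacity)
    admitted, deferred = [], []
    for client, ost in requests:
        if client_left.get(client, 0) > 0 and ost_left.get(ost, 0) > 0:
            client_left[client] -= 1
            ost_left[ost] -= 1
            admitted.append((client, ost))
        else:
            # Either the client or the target OST has no budget left.
            deferred.append((client, ost))
    return admitted, deferred


requests = [("a", 0), ("b", 0), ("a", 1), ("b", 0), ("b", 1)]
admitted, deferred = admit(
    requests,
    client_tokens={"a": 2, "b": 3},
    ost_capacity={0: 2, 1: 2},
)
```

Here client "b" still holds tokens when its second request to OST 0 arrives, but that OST has already reached its capacity for the round, so the request is deferred while the request to OST 1 proceeds. A purely client-based token bucket would have admitted it and overloaded OST 0.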
The project will start with realistic simulations, which already include standard data layouts and access patterns of leadership-class HPC environments. The resulting architectural approaches will then be transferred into the Lustre NRS and monitoring source code, enabling higher overall storage bandwidth and fine-grained QoS. Intermediate results will be presented to the Lustre and HPC communities to collect continuous feedback throughout the design and implementation process.
L. Zeng, J. Kaiser, A. Brinkmann, Tsub-JGU, L. Xi, Q. Yingjin, S. Ihara, 5/31/2017, Providing QoS-mechanisms for Lustre through centralized control applying the TBF-NRS, Johannes Gutenberg University Mainz