Robert van de Geijn is a professor of computer science and a core member of the Institute for Computational Engineering and Sciences at the University of Texas at Austin, where he heads the Science of High-Performance Computing Group, which pursues foundational research in the field of linear algebra. The group’s focus on formal derivation of algorithms has led to the development of projects such as the libflame library, a modern, high-performance dense linear algebra library that targets both sequential and parallel architectures, and the BLIS framework, which enables the rapid creation of high-performance matrix operations on a variety of architectures. Prof. van de Geijn has published several books and more than 100 refereed publications.
Devin Matthews is the Arnold O. Beckman Postdoctoral Fellow in the Institute for Computational Engineering and Sciences at the University of Texas at Austin. His interests include high-accuracy quantum chemistry and tensor algorithms. He received his Ph.D. in Chemistry at UT Austin as a DOE Computational Science Graduate Fellow and received the Howes Scholar award for his work on massively-parallel quantum chemistry algorithms in the newly-developed Aquarius program.
The Science of High Performance Computing Group at UT Austin works on fundamental aspects of computer science, software, and algorithm development, as applied to high-performance computing in general and computational quantum chemistry in particular. Our group involves faculty members from computer science, chemistry, engineering, and statistics as well as researchers at the Texas Advanced Computing Center (TACC).
The modeling of chemical systems using quantum mechanics is essential to understanding the behavior and properties of these systems, such as reactivity, structure, catalytic and enzymatic activity, spectroscopic signatures, and bulk physical/mechanical properties. Detailed high-accuracy calculations of fundamental molecular systems (small molecules and clusters in the gas and solution phases) provide critical calibration for more approximate calculations as well as a quantitative and predictive tool for analyzing and explaining experimental data. These types of calculations often require techniques that go beyond standard methods such as the popular coupled cluster singles and doubles with perturbative triples model, CCSD(T), and high-performance parallel implementations of these more advanced techniques are generally not available. Optimized, scalable implementations of these methods that could take advantage of advanced technologies such as many-core Intel® Xeon Phi™ processors would drastically increase the applicability of very high-accuracy calculations.
Our group has recently developed high-performance software related to this goal in several areas. First, we developed the NCC module, which plugs into the CFOUR quantum chemistry suite, to perform high-accuracy calculations using the CCSDTQ method and various related approximations, such as CCSDT(Q), using a novel spin-adapted algorithm. CFOUR is an actively developed code with a broad user base that applies it to diverse problems in spectroscopy, kinetics, and thermodynamics. The NCC code achieves a significant performance improvement over existing implementations, but there is still considerable room for improvement in sequential performance and especially in multi-threaded performance and scalability. The performance issues in NCC stem mostly from two major sources:
- The lack of “native” tensor and extended matrix algorithms beyond the traditional BLAS interface
- Limited opportunities for parallelism due to small matrix/tensor sizes, relatively short loops for coarse-grained parallelism with poor load balancing, and a lack of hierarchical parallelism
Second, in the domain of dense linear algebra (DLA), meaning matrices as opposed to tensors in this context, our group has developed the BLIS framework, which uses a structured methodology to implement high-performance matrix operations using only straightforward C99 code and a single assembly-coded “micro-kernel”. Using an appropriate micro-kernel and cache blocking parameters, this implementation achieves very high performance on multi-core CPU architectures and on the Intel® Xeon Phi™ many-core architecture.
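As a rough illustration of the structure described above, the following C99 sketch layers cache-blocking loops around a single small kernel. It is not taken from the BLIS source: the block sizes `MR`, `NR`, `MC`, `KC`, `NC` and the function names `micro_kernel` and `gemm_blocked` are hypothetical, the micro-kernel is written in plain C rather than assembly, and the packing of blocks into contiguous buffers is omitted for brevity.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical block sizes; real values would be tuned per architecture. */
enum { MR = 4, NR = 4, MC = 8, KC = 8, NC = 8 };

static size_t min_sz(size_t x, size_t y) { return x < y ? x : y; }

/* Micro-kernel: updates an mr x nr tile of C with an (mr x k) * (k x nr)
 * product. All operands are row-major with explicit leading dimensions. */
static void micro_kernel(size_t mr, size_t nr, size_t k,
                         const double *a, size_t lda,
                         const double *b, size_t ldb,
                         double *c, size_t ldc)
{
    for (size_t p = 0; p < k; ++p)
        for (size_t i = 0; i < mr; ++i)
            for (size_t j = 0; j < nr; ++j)
                c[i * ldc + j] += a[i * lda + p] * b[p * ldb + j];
}

/* C (m x n) += A (m x k) * B (k x n), all row-major and contiguous.
 * The three outer loops block for cache; the two inner loops walk the
 * current block in register-sized tiles handed to the micro-kernel. */
void gemm_blocked(size_t m, size_t n, size_t k,
                  const double *A, const double *B, double *C)
{
    assert(A && B && C);
    for (size_t jc = 0; jc < n; jc += NC) {
        size_t nc = min_sz(NC, n - jc);
        for (size_t pc = 0; pc < k; pc += KC) {
            size_t kc = min_sz(KC, k - pc);
            for (size_t ic = 0; ic < m; ic += MC) {
                size_t mc = min_sz(MC, m - ic);
                for (size_t jr = 0; jr < nc; jr += NR) {
                    size_t nr = min_sz(NR, nc - jr);
                    for (size_t ir = 0; ir < mc; ir += MR) {
                        size_t mr = min_sz(MR, mc - ir);
                        micro_kernel(mr, nr, kc,
                                     &A[(ic + ir) * k + pc], k,
                                     &B[pc * n + jc + jr], n,
                                     &C[(ic + ir) * n + jc + jr], n);
                    }
                }
            }
        }
    }
}
```

In this arrangement only `micro_kernel` and the block-size constants would need to change when moving to a different architecture, which is the property the paragraph above attributes to the framework.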
We have proposed to optimize the NCC module for highly parallel architectures such as Intel® Xeon® and Intel® Xeon Phi™ processors. These optimizations come in two broad categories which directly address the performance issues encountered in NCC.
- Novel tensor algorithms which leverage the BLIS framework to increase scalability, consolidate computation, increase data reuse, reduce latency and memory movement, and enable both large and small tensor operations to be computed efficiently.
- A runtime system for extracting parallel execution of tensor operations from sequential program semantics which will allow for natural hierarchical parallelization and reduce synchronization.
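To make the first category more concrete, here is a minimal C sketch of one way a tensor contraction can be treated as a matrix multiplication without first copying the tensor into matrix layout. It is an illustration under stated assumptions, not the NCC or BLIS implementation: the function name `contract_akc_kb`, the specific contraction, and the use of precomputed offset tables are all hypothetical choices made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Computes C[a][b][c] += sum_k A[a][k][c] * B[k][b] for dense row-major
 * tensors. Logically this is a matrix product with rows indexed by the
 * pair (a, c), columns by b, and inner dimension k. The (a, c) rows are
 * not contiguous in memory, so their offsets are tabulated once instead
 * of transposing A and C into matrix form. */
void contract_akc_kb(size_t na, size_t nb, size_t nc, size_t nk,
                     const double *A, const double *B, double *C)
{
    size_t rows = na * nc;
    if (rows == 0 || nb == 0) return;

    size_t *a_off = malloc(rows * sizeof *a_off);
    size_t *c_off = malloc(rows * sizeof *c_off);
    assert(a_off && c_off);

    /* Offset of element (a, k = 0, c) in A and (a, b = 0, c) in C. */
    for (size_t a = 0; a < na; ++a) {
        for (size_t c = 0; c < nc; ++c) {
            a_off[a * nc + c] = a * nk * nc + c;
            c_off[a * nc + c] = a * nb * nc + c;
        }
    }

    const size_t a_kstride = nc; /* step in A when k increases by one */
    const size_t c_bstride = nc; /* step in C when b increases by one */

    /* Matrix-multiply loop nest over the logical (rows x nb x nk) problem. */
    for (size_t r = 0; r < rows; ++r) {
        for (size_t b = 0; b < nb; ++b) {
            double sum = 0.0;
            for (size_t k = 0; k < nk; ++k)
                sum += A[a_off[r] + k * a_kstride] * B[k * nb + b];
            C[c_off[r] + b * c_bstride] += sum;
        }
    }

    free(a_off);
    free(c_off);
}
```

The loop nest here is deliberately naive. The point of the sketch is only that the index mapping lives in small offset tables, so the same blocked matrix-multiplication structure could in principle be reused while avoiding the extra memory movement of an explicit transpose.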
- May 21, 2018, Learning from Optimizing Matrix-Matrix Multiplication, EduPar 2018
- Jianyu Huang, Robert A. van de Geijn, September 1, 2016, BLISlab: A Sandbox for Optimizing GEMM, Cornell University Library
- Jianyu Huang, Tyler M. Smith, Greg M. Henry, Robert A. van de Geijn, April 16, 2016, Implementing Strassen's Algorithm with BLIS, Cornell University Library
- Devin A. Matthews, July 1, 2016, High-Performance Tensor Contraction without BLAS, Cornell University Library
- Field G. Van Zee, Tyler M. Smith, October 18, 2016, Inducing complex matrix multiplication via the 3m and 4m methods, FLAME Working Note #81, University of Texas at Austin
- Jianyu Huang, Leslie Rice, Devin A. Matthews, Robert A. van de Geijn, November 3, 2016, Generating Families of Practical Fast Matrix Multiplication Algorithms, Cornell University Library
- Marat Dukhan, Richard Vuduc, Jason Riedy, March 1, 2016, Wanted: Floating-Point Add Round-off Error instruction, Cornell University Library