Published: 01/22/2016 Last updated: 12/04/2018

**Robert van de Geijn** is a professor of computer science and a core member of the Institute for Computational Engineering and Sciences, where he heads the Science of High-Performance Computing Group, which pursues foundational research in the field of linear algebra. The group’s focus on formal derivation of algorithms has led to the development of projects such as the libflame library, a modern, high-performance dense linear algebra library that targets both sequential and parallel architectures, and the BLIS framework, which enables the rapid creation of high-performance matrix operations on a variety of architectures. Prof. van de Geijn has published several books and more than 100 refereed publications.

**Devin Matthews** is the Arnold O. Beckman Postdoctoral Fellow in the Institute for Computational Engineering and Sciences at the University of Texas at Austin. His interests include high-accuracy quantum chemistry and tensor algorithms. He received his Ph.D. in Chemistry at UT Austin as a DOE Computational Science Graduate Fellow and received the Howes Scholar award for his work on massively-parallel quantum chemistry algorithms in the newly-developed Aquarius program.

The Science of High Performance Computing Group at UT Austin works on fundamental aspects of computer science, software, and algorithm development, as applied to high-performance computing in general and computational quantum chemistry in particular. Our group involves faculty members from computer science, chemistry, engineering, and statistics as well as researchers at the Texas Advanced Computing Center (TACC).

The modeling of chemical systems using quantum mechanics is essential to understanding the behavior and properties of these systems, such as reactivity, structure, catalytic and enzymatic activity, spectroscopic signatures, and bulk physical/mechanical properties. Detailed high-accuracy calculations of fundamental molecular systems—small molecules and clusters in the gas and solution phases—provide critical calibration for more approximate calculations as well as a quantitative and predictive tool for analyzing and explaining experimental data. These types of calculations often require techniques which go beyond standard methods such as the popular coupled cluster singles and doubles with perturbative triples model, CCSD(T), for which there are generally no high-performance parallel implementations available. Optimized, scalable implementations of these methods which could take advantage of advanced technologies such as many-core Intel® Xeon Phi™ processors would drastically increase the applicability of very high-accuracy calculations.

Our group has recently developed high-performance software related to this goal in several areas. First, we developed the NCC module, which plugs into the CFOUR quantum chemistry suite, to perform high-accuracy calculations using the CCSDTQ method and various related approximations such as CCSDT(Q) using a novel spin-adapted algorithm. CFOUR is an actively developed code with a broad user base, who apply it to diverse problems in spectroscopy, kinetics, and thermodynamics. This code achieves a significant performance improvement over existing implementations, but there is still significant room for improvement both in sequential performance and especially in multi-threaded performance and scalability. The performance issues in NCC stem mostly from two major sources:

- The lack of “native” tensor and extended matrix algorithms beyond the traditional BLAS interface
- Limited opportunities for parallelism due to small matrix/tensor sizes, relatively short loops for coarse-grained parallelism with poor load-leveling, and a lack of hierarchical parallelism

Second, in the domain of dense linear algebra (DLA), meaning matrices as opposed to tensors in this context, our group has developed the BLIS framework, which uses a structured methodology to implement high-performance matrix operations using only straightforward C99 code and a single assembly-coded “micro-kernel”. Using an appropriate micro-kernel and cache blocking parameters, this implementation achieves very high performance on multi-core CPU architectures and on the Intel® Xeon Phi™ many-core architecture.
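The idea of layering cache-blocked loops around one small micro-kernel can be illustrated with a toy sketch. This is not BLIS code: BLIS itself is written in C99 with an assembly micro-kernel, and it also packs blocks of the operands into contiguous buffers, which this sketch omits. The function names and the block sizes `NC`, `KC`, `MC`, `NR`, and `MR` below are assumptions chosen for illustration only; real values are tuned per architecture.

```python
# Toy sketch (assumed names and block sizes) of a blocked matrix multiply
# C += A @ B built from loops around a single micro-kernel.
# Matrices are plain lists of lists; operand packing is omitted.

NC, KC, MC = 8, 4, 6   # hypothetical cache-level block sizes
NR, MR = 2, 3          # hypothetical register-level micro-tile sizes


def micro_kernel(A, B, C, i0, j0, p0, mr, nr, kc):
    """Update the mr x nr tile of C at (i0, j0) using kc rank-1 updates."""
    for p in range(p0, p0 + kc):
        for i in range(i0, i0 + mr):
            a_ip = A[i][p]
            for j in range(j0, j0 + nr):
                C[i][j] += a_ip * B[p][j]


def blocked_gemm(A, B, C):
    """Compute C += A @ B in place with five loops around the micro-kernel."""
    m = len(A)
    k = len(B)
    n = len(C[0]) if C else 0
    for jc in range(0, n, NC):                         # column panels of B and C
        nc = min(NC, n - jc)
        for pc in range(0, k, KC):                     # slices of the shared dimension
            kc = min(KC, k - pc)
            for ic in range(0, m, MC):                 # row blocks of A and C
                mc = min(MC, m - ic)
                for jr in range(jc, jc + nc, NR):      # micro-tile columns
                    nr = min(NR, jc + nc - jr)
                    for ir in range(ic, ic + mc, MR):  # micro-tile rows
                        mr = min(MR, ic + mc - ir)
                        micro_kernel(A, B, C, ir, jr, pc, mr, nr, kc)
    return C


def naive_gemm(A, B, C):
    """Reference triple loop, used only to check the blocked version."""
    for i in range(len(A)):
        for p in range(len(B)):
            for j in range(len(C[0])):
                C[i][j] += A[i][p] * B[p][j]
    return C
```

In this arrangement only `micro_kernel` would need to be rewritten for a new architecture, while the surrounding loops stay the same apart from the block sizes. The `min(...)` calls handle matrix dimensions that are not multiples of the block sizes.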

We have proposed to optimize the NCC module for highly parallel architectures such as Intel® Xeon® and Intel® Xeon Phi™ processors. These optimizations come in two broad categories which directly address the performance issues encountered in NCC.

- Novel tensor algorithms which leverage the BLIS framework to increase scalability, consolidate computation, increase data reuse, reduce latency and memory movement, and enable both large and small tensor operations to be computed efficiently
- A runtime system for extracting parallel execution of tensor operations from sequential program semantics which will allow for natural hierarchical parallelization and reduce synchronization.

- May 21, 2018, *Learning from Optimizing Matrix-Matrix Multiplication*, EduPar 2018
- Jianyu Huang, Robert A. van de Geijn, September 1, 2016, *BLISlab: A Sandbox for Optimizing GEMM*, Cornell University Library
- Jianyu Huang, Tyler M. Smith, Greg M. Henry, Robert A. van de Geijn, April 16, 2016, *Implementing Strassen's Algorithm with BLIS*, Cornell University Library
- Devin A. Matthews, July 1, 2016, *High-Performance Tensor Contraction without BLAS*, Cornell University Library
- Field G. Van Zee, Tyler M. Smith, October 18, 2016, *Inducing complex matrix multiplication via the 3m and 4m methods*, FLAME Working Note #81, University of Texas at Austin
- Jianyu Huang, Leslie Rice, Devin A. Matthews, Robert A. van de Geijn, November 3, 2016, *Generating Families of Practical Fast Matrix Multiplication Algorithms*, Cornell University Library
- Marat Dukhan, Richard Vuduc, Jason Riedy, March 1, 2016, *Wanted: Floating-Point Add Round-off Error instruction*, Cornell University Library
