By Alexander Heinecke, Karthikeyan Vaidyanathan, Mikhail Smelyanskiy, Alexander Kobotov, Roman Dubtsov, Greg Henry, Aniruddha G Shet, George Chrysos, Pradeep Dubey
Dense linear algebra has traditionally been used to evaluate the performance and efficiency of new architectures. This trend has continued for the past half decade with the advent of multi-core processors and hardware accelerators.
In this paper we describe how several flavors of the Linpack benchmark are accelerated on Intel’s recently released Intel® Xeon Phi™ coprocessor in both native and hybrid configurations. Our native DGEMM implementation takes full advantage of the coprocessor's salient architectural features and utilizes close to 90% of its peak compute capability. Our native Linpack implementation, which runs entirely on the coprocessor, employs novel dynamic scheduling and achieves close to 80% efficiency, the highest published coprocessor efficiency. Like the native version, our single-node hybrid implementation of Linpack also achieves nearly 80% efficiency. Using dynamic scheduling and an enhanced look-ahead scheme, this implementation scales well to a 100-node cluster, on which it achieves over 76% efficiency while delivering a total performance of 107 TFLOPS.