This whitepaper reflects a collaborative effort between Thomas Jefferson National Accelerator Facility and Intel Corporation on optimizing Quantum Chromodynamics (QCD) code for the Intel(R) Xeon Phi(tm) coprocessor. This work was published at the International Supercomputing Conference (ISC'13) in June 2013.
The Intel Xeon Phi coprocessor from Intel Corporation features parallelism at the level of many x86-based cores, multiple threads per core, and vector processing units. Lattice Quantum Chromodynamics (LQCD) is currently the only known model-independent, non-perturbative computational method for calculations in the theory of the strong interactions, and is of importance in studies of nuclear and high-energy physics. In this contribution, we describe our experiences with optimizing a key LQCD kernel for the Xeon Phi architecture. On a single node, our Dslash kernel sustains around 280 GFLOPS, while our full solver sustains around 215 GFLOPS. Furthermore, we demonstrate a fully ’native’ multi-node LQCD implementation running entirely on Intel Xeon Phi coprocessor nodes with minimal involvement of the host CPU. Our multi-node implementation of the solver has been strong-scaled to 3.6 TFLOPS on 64 Intel Xeon Phi coprocessors.