Case Study: BerkeleyGW using Intel® Xeon Phi™ Processors

By Michael A Pearce

Published: 08/31/2016   Last Updated: 08/31/2016

BerkeleyGW is a materials science application for calculating the excited-state properties of materials, such as band gaps, band structures, absorption spectroscopy, photoemission spectroscopy and more. It requires as input the Kohn-Sham orbitals and energies from a DFT code such as Quantum ESPRESSO, PARATEC or PARSEC. Like such DFT codes, it depends heavily on FFTs, dense linear algebra and tensor-contraction-type operations similar in nature to those found in quantum chemistry applications.

The target science application for the Cori timeframe is to study realistic interfaces in organic photovoltaics. Such systems require 1000+ atoms and a considerable amount of vacuum, which contributes to the computational complexity. GW calculations generally scale as the number of atoms to the fourth power (with the vacuum space roughly counting as additional atoms). This is a problem 2-5 times bigger than what has been done in the past. Therefore, successfully completing these runs on Cori requires not only taking advantage of the compute capabilities of the Intel® Xeon Phi™ Processor architecture but also improving the scalability of the code in order to reach full-machine capability.
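To give a feel for what fourth-power scaling implies, here is a minimal sketch in C. It assumes that "2-5 times bigger" refers to the effective atom count (the text does not say which measure is meant) and that cost follows N^4 exactly with no prefactor changes. The function name gw_relative_cost is hypothetical and is not part of BerkeleyGW.

```c
#include <assert.h>

/* Hypothetical helper, not part of BerkeleyGW.
 * Returns how much more work a run needs when the effective system
 * size grows by `size_ratio`, under the assumption cost ~ N^4. */
double gw_relative_cost(double size_ratio)
{
    assert(size_ratio > 0.0);
    double squared = size_ratio * size_ratio;
    return squared * squared; /* size_ratio to the fourth power */
}
```

Under these assumptions, gw_relative_cost(2.0) gives 16 and gw_relative_cost(5.0) gives 625, so a modest increase in system size translates into one to nearly three orders of magnitude more work.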

Check out the entire paper: http://www.nersc.gov/users/computational-systems/cori/application-porting-and-performance/application-case-studies/berkeleygw-case-study/

Lessons Learned

1. Optimal performance for this code required restructuring it to enable better thread scaling, vectorization and improved data reuse.

2. Long loops are best for vectorization. In the limit of long loops, effects of loop peeling and remainders can be neglected.

3. There are many coding practices that prevent compiler auto-vectorization of code. The use of profilers and compiler reports can greatly aid in producing vectorizable code.

4. The absence of an L3 cache on Intel® Xeon Phi™ architectures makes data locality even more important than on traditional Intel® Xeon® architectures.

5. Optimization is a continuous process. The limiting factor in code performance may change between I/O and communication, memory bandwidth, latency and CPU clock speed as you continue to optimize.
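Lesson 2 can be made concrete with a small model of how a vectorized loop is split. The sketch below is a simplified illustration, not the compiler's actual code generation: it assumes a scalar peel loop runs until the data is aligned, full vector chunks follow, and a scalar remainder loop finishes the trip count. The names loop_split, split_loop and vector_fraction are hypothetical, and real compilers may instead handle peel and remainder iterations with masked vector instructions.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified model of how a loop of n iterations is divided. */
typedef struct {
    size_t peel;      /* scalar iterations before the first aligned element */
    size_t body;      /* iterations executed in full vector chunks */
    size_t remainder; /* scalar iterations after the last full chunk */
} loop_split;

/* width:    vector lanes (for example 8 doubles in a 512-bit vector)
 * misalign: offset, in elements, of the first element from an
 *           aligned boundary (0 means already aligned) */
loop_split split_loop(size_t n, size_t width, size_t misalign)
{
    assert(width > 0 && misalign < width);
    loop_split s;
    s.peel = (misalign == 0) ? 0 : width - misalign;
    if (s.peel > n) {
        s.peel = n; /* loop too short to reach alignment */
    }
    s.remainder = (n - s.peel) % width;
    s.body = n - s.peel - s.remainder;
    return s;
}

/* Fraction of all iterations that run in the vectorized main body. */
double vector_fraction(size_t n, size_t width, size_t misalign)
{
    if (n == 0) {
        return 0.0;
    }
    loop_split s = split_loop(n, width, misalign);
    return (double)s.body / (double)n;
}
```

In this model, with 8 lanes and a misalignment of 3 elements, a 20-iteration loop spends only 8 of its 20 iterations in the vector body (5 peeled, 7 in the remainder), while a 10000-iteration loop spends 9992 of 10000 there. That is the sense in which peel and remainder effects vanish for long loops.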

 
