BerkeleyGW is a materials science application for calculating the excited-state properties of materials, such as band gaps, band structures, absorption spectroscopy, photoemission spectroscopy and more. It requires as input the Kohn-Sham orbitals and energies from a DFT code such as Quantum ESPRESSO, PARATEC or PARSEC. Like such DFT codes, it depends heavily on FFTs, dense linear algebra and tensor-contraction operations similar in nature to those found in quantum chemistry applications.
The target science application for the Cori timeframe is the study of realistic interfaces in organic photovoltaics. Such systems require 1000+ atoms and a considerable amount of vacuum, which adds to the computational complexity. GW calculations generally scale as the fourth power of the number of atoms (with the vacuum space counting roughly as additional atoms). This is a problem 2-5 times larger than has been done in the past. Therefore, successfully completing these runs on Cori requires not only taking advantage of the compute capabilities of the Intel® Xeon Phi™ Processor architecture but also improving the scalability of the code in order to reach full-machine capability.
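As a rough illustration of what the quartic scaling implies, the relation can be written out as below. Here N stands for an effective atom count that includes the vacuum contribution, and k is a growth factor; both symbols are notation introduced for this sketch rather than taken from BerkeleyGW, and the relation assumes the fourth-power behavior holds across the whole range.

```latex
T(N) \propto N^{4}
\quad\Longrightarrow\quad
\frac{T(kN)}{T(N)} = k^{4},
\qquad \text{e.g. } k = 2 \;\Rightarrow\; \frac{T(2N)}{T(N)} = 16 .
```

Under that assumption, even a doubling of the effective system size multiplies the cost by 16, which is why scalability matters as much as single-node performance.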
1. Achieving good performance for this code required restructuring it to improve thread scaling, vectorization and data reuse.
2. Long loops are best for vectorization. In the limit of long loops, effects of loop peeling and remainders can be neglected.
3. There are many coding practices that prevent compiler auto-vectorization of code. The use of profilers and compiler reports can greatly aid in producing vectorizable code.
4. The absence of an L3 cache on Intel® Xeon Phi™ architectures makes data locality even more important than on traditional Intel® Xeon® architectures.
5. Optimization is a continuous process. The limiting factor in code performance may shift among I/O or communication, memory bandwidth, latency and CPU clock speed as you continue to optimize.
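The points about long loops and data locality can be illustrated with a minimal C sketch. It is not BerkeleyGW code: the function names, the row-major array layout and the weighted-sum kernel are all hypothetical, chosen only to contrast a short, strided inner loop with a long, unit-stride one that computes the same result. The `#pragma omp simd` hint is standard OpenMP, but whether a given compiler actually vectorizes the loop should be checked with its vectorization report.

```c
#include <stddef.h>

/* Hypothetical kernel: total = sum over i, j of w[i] * a[i][j], where a is
   stored row-major with n_short rows and n_long columns. */

/* Short, strided inner loop: the trip count is only n_short, and successive
   iterations touch elements n_long apart, so SIMD lanes are hard to fill
   and each cache line is used for a single element before moving on. */
double weighted_sum_short_inner(const double *a, const double *w,
                                size_t n_short, size_t n_long)
{
    double total = 0.0;
    for (size_t j = 0; j < n_long; ++j)
        for (size_t i = 0; i < n_short; ++i)
            total += w[i] * a[i * n_long + j];
    return total;
}

/* Long, unit-stride inner loop: the trip count is n_long and the accesses
   are contiguous, so peel and remainder iterations become a small fraction
   of the work and every loaded cache line is fully used. */
double weighted_sum_long_inner(const double *a, const double *w,
                               size_t n_short, size_t n_long)
{
    double total = 0.0;
    for (size_t i = 0; i < n_short; ++i) {
        const double *row = a + i * n_long;
        double row_sum = 0.0;
        /* Hint only: ignored unless OpenMP SIMD support is enabled. */
        #pragma omp simd reduction(+:row_sum)
        for (size_t j = 0; j < n_long; ++j)
            row_sum += row[j];
        total += w[i] * row_sum;
    }
    return total;
}
```

With GCC or Clang, the pragma takes effect when compiling with `-fopenmp-simd` (or `-fopenmp`). GCC's `-fopt-info-vec` and the Intel compilers' `-qopt-report` are examples of the compiler reports mentioned above. Note that the two versions add the terms in a different order, so for general floating-point input their results may differ in the last bits.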