LAMMPS is an open-source software package for classical molecular dynamics simulation. It supports many energy models and simulation options, and this versatility has made it a popular choice. It was first developed at Sandia National Laboratories to exploit large-scale parallel computation. Multicore processors are far more common now than when LAMMPS was first developed 20 years ago, which makes LAMMPS a natural candidate for optimization on modern hardware. The book (Intel® Xeon Phi™ Processor High Performance Programming, 2nd Edition – Knights Landing Edition) describes, among other things, how to optimize LAMMPS to take advantage of the Intel® Xeon Phi™ Processor x200 (codenamed Knights Landing) as well as recent generations of multicore Intel® Xeon® processors.
The LAMMPS code now exceeds half a million lines. To keep it manageable, the code is organized into packages around a core codebase with limited functionality. Each package is installed separately, as needed for a given simulation.
At its core, LAMMPS is parallelized using a spatial decomposition with the Message Passing Interface (MPI). Packages provide additional hybrid parallelization options that combine MPI with shared-memory parallelization through OpenMP* or CUDA*/OpenCL. As the number of cores sharing a memory subsystem has increased, more developers have exploited hybrid parallelism with the MPI+X programming model (where X represents shared-memory parallelization using OpenMP, POSIX threads, and so on). LAMMPS supports hybrid parallelism with OpenMP for most important routines, and there is active development to improve shared-memory performance. At the time of writing, however, running one MPI task per core typically performs best.
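The idea behind a spatial decomposition can be illustrated with a small sketch. This is not LAMMPS code: the function names `owner_rank` and `decompose`, the rank numbering, and the assumption of an orthogonal box with coordinates in the range 0 to the box length are all hypothetical choices made for illustration. The sketch splits the simulation box into a regular grid of subdomains, one per MPI rank, and assigns each atom to the rank whose subdomain contains it.

```python
from collections import defaultdict


def owner_rank(position, box_lengths, proc_grid):
    """Return the (hypothetical) rank owning an atom at `position`.

    Assumes an orthogonal box with coordinates in [0, length] along
    each axis and a row-major numbering of the processor grid.
    """
    # Grid cell index along each axis; clamp so that an atom sitting
    # exactly on the upper box face still lands in the last cell.
    ix, iy, iz = (
        min(int(coord / length * n), n - 1)
        for coord, length, n in zip(position, box_lengths, proc_grid)
    )
    _, py, pz = proc_grid
    return (ix * py + iy) * pz + iz


def decompose(positions, box_lengths, proc_grid):
    """Group atom indices by the rank whose subdomain contains them."""
    owned = defaultdict(list)
    for atom_id, pos in enumerate(positions):
        owned[owner_rank(pos, box_lengths, proc_grid)].append(atom_id)
    return dict(owned)


box = (10.0, 10.0, 10.0)   # box edge lengths
grid = (2, 2, 1)           # 2 x 2 x 1 subdomains, i.e. 4 ranks
atoms = [(1.0, 1.0, 5.0), (6.0, 1.0, 5.0), (1.0, 6.0, 5.0), (9.9, 9.9, 0.1)]
domains = decompose(atoms, box, grid)
```

In a real MPI code each rank would hold only its own atoms and exchange information about atoms near subdomain boundaries with neighboring ranks; the sketch shows only the ownership mapping.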
Several workloads were evaluated for performance: protein, water, and liquid crystal benchmarks, as well as production simulations studying molecular alignment in organic solar cells and the thermodynamic and transport properties of complex hydrocarbons. The baseline for comparison was the best performance obtainable in LAMMPS without the Intel package.
Several optimizations to LAMMPS routines produced significant performance improvements on both Intel Xeon processors and Knights Landing processors. On Knights Landing, optimizations that support vectorization and improve data layout made simulations much faster than with the unoptimized code. The same optimizations also improved performance on Intel Xeon processors. Simulations now run faster on Knights Landing than the best that could be achieved a year ago on Haswell processors.
The optimizations described in this book chapter are available in LAMMPS as an optional package. This approach gives scientists access to improved performance now, while still allowing the developers to experiment with code modernization strategies, improve future performance, and converge on the models and algorithms that perform best and can eventually be adopted as the default in LAMMPS. Developers should maximize the overlap of internode communication with computation and avoid collective synchronization. These optimizations do not necessarily require changing the programming model, libraries, or directives used for parallelization, but they do call for careful attention to synchronization, data sharing, and communication between software processes and threads.
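The advice to overlap communication with computation can be sketched in a few lines. This is a generic illustration, not the approach taken in LAMMPS or in the book chapter: the names `exchange_halo`, `compute`, and `step` are hypothetical, and the "network" is simulated with a background thread and a short sleep. A real MPI code would typically post non-blocking sends and receives instead. The pattern is the same in either case: start the exchange, do the work that does not depend on incoming data, and wait only when that data is actually needed.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def exchange_halo(boundary_values):
    """Stand-in for an internode exchange (hypothetical).

    Sleeps to mimic network latency and echoes the data back as if
    it had arrived from a neighboring process.
    """
    time.sleep(0.01)
    return list(boundary_values)


def compute(values):
    """Placeholder for per-atom work: a sum of squares."""
    return sum(v * v for v in values)


def step(interior, boundary):
    """Run one step with communication overlapped with computation."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        # 1. Start the exchange without waiting for it to finish.
        pending = pool.submit(exchange_halo, boundary)
        # 2. Do the work that does not need the incoming data.
        interior_result = compute(interior)
        # 3. Block only at the point where the halo data is required.
        halo = pending.result()
    return interior_result + compute(halo)


total = step([1.0, 2.0, 3.0], [4.0, 5.0])
```

Because the wait happens only in the third step, and only for the one neighbor involved, no global barrier is required, which is in the spirit of avoiding collective synchronization.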