This code recipe describes how to get, build, and use the LAMMPS* code for best performance on Intel® Xeon® and Intel® Xeon Phi™ processors, along with some performance comparisons.
Large-scale Atomic/Molecular Massively Parallel Simulator (LAMMPS) is a classical molecular dynamics code. LAMMPS can be used to simulate solid-state materials (metals, semiconductors), soft matter (biomolecules, polymers), and coarse-grained or mesoscopic systems. LAMMPS can be used to model atoms, or, more generically, as a parallel particle simulator at the atomic, meso, or continuum scale.
LAMMPS runs on single processors or in parallel using message-passing techniques with a spatial-decomposition of the simulation domain. The code is designed to be easy to modify or extend with new functionality.
LAMMPS is distributed as open source code under the terms of the GNU General Public License. The current version can be downloaded at http://lammps.sandia.gov/download.html. Links to older F90/F77 versions are also included. Periodic releases are also available on SourceForge*.
LAMMPS is distributed by Sandia National Laboratories, a U.S. Department of Energy laboratory. The main authors of LAMMPS are listed on the LAMMPS site along with contact information and other contributors. Find out more about LAMMPS at http://lammps.sandia.gov.
Code Support for Intel® Xeon® Processors and Intel® Xeon Phi™ Processors
LAMMPS for both Intel® Xeon® processors and Intel® Xeon Phi™ processors is available as part of current LAMMPS downloads. It includes support for simulation of soft matter, biomolecules, and materials.
This document gives instructions for running LAMMPS version 13 May 2016 or later. The recommended Intel® C++ Compiler version is 16.0.2 or later. Compiler versions 14 and later support the Intel® Advanced Vector Extensions 512 (Intel® AVX-512) instructions. Intel® MPI Library version 220.127.116.11 or later is also recommended.
One can download LAMMPS from the official site in a web browser or by using git:
$ git clone https://github.com/lammps/lammps.git lammps
Building LAMMPS for Intel Xeon processors and Intel Xeon Phi processors is similar to a normal LAMMPS build. Example makefiles are included with the LAMMPS package for building with optimized routines for both Intel Xeon processors (src/MAKE/OPTIONS/Makefile.intel_cpu_intelmpi) and Intel Xeon Phi processors (src/MAKE/OPTIONS/Makefile.knl). The principal difference between the two makefiles is that the Intel AVX-512 compiler flag is turned on for the Intel Xeon Phi processor build.1
An example build includes the following commands:
$ cd <install-dir>/lammps/src/
$ make yes-asphere yes-class2 yes-kspace yes-manybody yes-misc yes-molecule
$ make yes-mpiio yes-opt yes-replica yes-rigid
$ make yes-user-omp yes-user-intel
$ source /opt/intel/parallel_studio_xe_2016.2.062/psxevars.sh
# if you want to edit the Makefile options
$ emacs ./MAKE/OPTIONS/Makefile.knl
$ emacs ./MAKE/OPTIONS/Makefile.intel_cpu_intelmpi
# Binaries named lmp_knl and lmp_intel_cpu_intelmpi in build directory
$ make knl -j 14; make intel_cpu_intelmpi -j 14
# Test directory with benchmarks
$ cd ./USER-INTEL/TEST/
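Choosing between the two make targets can be scripted. The following is a minimal sketch, not part of the LAMMPS distribution: the function name `choose_lammps_target` is hypothetical, and it assumes that the `avx512er` CPU flag reported by Linux in /proc/cpuinfo is a usable heuristic for detecting an Intel Xeon Phi processor. It only prints the make command it would suggest.

```shell
#!/bin/bash
# Pick a LAMMPS make target from a string of CPU feature flags.
# Assumption: "avx512er" is used here as a heuristic marker for
# Intel Xeon Phi processors; any other CPU gets the Intel Xeon target.
choose_lammps_target() {
    local flags="$1"
    case " $flags " in
        *" avx512er "*) echo "knl" ;;
        *)              echo "intel_cpu_intelmpi" ;;
    esac
}

# Read the flags of the first core; empty on systems without /proc/cpuinfo.
cpu_flags=$(grep -m1 '^flags' /proc/cpuinfo 2>/dev/null | cut -d: -f2)
target=$(choose_lammps_target "$cpu_flags")

# Dry run: print the suggested build command instead of executing it.
echo "make $target -j 14"
```

Because the build host and the run host may differ, it is safer to treat the printed command as a suggestion and to confirm the target against the hardware that will actually run the binary.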
Previous versions of Intel® Xeon Phi™ (code-named Knights Corner) ran as coprocessors. To build LAMMPS with support for offload of computations to Intel® Xeon Phi™ coprocessors (versions 1 & 2), additional compile flags are required. An example makefile that supports offload is included (src/MAKE/OPTIONS/Makefile.intel_coprocessor).
To automatically take advantage of any available optimizations for Intel® processors, one can add the following arguments to the LAMMPS argument list:
-pk intel 0 -sf intel
Example scripts for running several benchmarks included with LAMMPS are shown below.
OFFLOAD: In an offload environment, the coprocessors can be enabled by changing the '0' to the number of coprocessors per compute node. For example, -pk intel 2 -sf intel will use two coprocessors per node. Further, we recommend enabling hyperthreading for best performance on CPUs and coprocessors. If using offload with hyperthreading disabled, it might be beneficial to leave a core free for threads that handle the offload.
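The relationship between the coprocessor count and the argument list can be sketched as a small helper. The function name `intel_pkg_args` is hypothetical and used only for illustration; the arguments it prints are the ones given in this recipe.

```shell
#!/bin/bash
# Build the Intel package arguments for a given number of coprocessors
# per node. With no argument, 0 is used, which means no offload.
intel_pkg_args() {
    local ncop="${1:-0}"
    case "$ncop" in
        *[!0-9]*)
            echo "coprocessor count must be a non-negative integer" >&2
            return 1
            ;;
    esac
    echo "-pk intel $ncop -sf intel"
}

intel_pkg_args      # no offload
intel_pkg_args 2    # two coprocessors per node
```

The output can be appended to a LAMMPS command line, for example by assigning it to the LMP_ARGS variable used in the benchmark scripts.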
Intel Xeon processor (i.e. run_xeon_512.sh)
#!/bin/bash
export MPI="mpirun" #run environment
export LMP_BIN=../../lmp_intel_cpu_intelmpi #executable
export LMP_CORES=36 #number of cores (mpi ranks)
export LMP_ARGS="-pk intel 0 -sf intel -screen none" #arg list
export LMP_THREAD_LIST="2" #OMP Threads per process
export LMP_ROOT=../../../
export LOG_DIR=.
export KMP_BLOCKTIME=0
source /opt/intel/parallel_studio_xe_2016.2.062/psxevars.sh
export I_MPI_PIN_DOMAIN=core #pin each MPI process to a core
export I_MPI_FABRICS=shm #communication fabric = shared mem
export WORKLOADS="lj rhodo lc sw water" #specify which problems to run
$MPI -np $LMP_CORES $LMP_BIN -in in.lc_generate_restart -log none $LMP_ARGS
export OMP_NUM_THREADS=$LMP_THREAD_LIST
for threads in $LMP_THREAD_LIST
do
    export OMP_NUM_THREADS=$threads
    for workload in $WORKLOADS
    do
        export LOGFILE=$LOG_DIR/$workload.$LMP_CORES"c"$threads"t".log
        echo "Running $LOGFILE"
        $MPI -np $LMP_CORES $LMP_BIN -in in.intel.$workload -log $LOGFILE $LMP_ARGS
    done
done
Intel Xeon Phi processor (i.e. run_knl_512.sh)
#!/bin/bash
MPI="numactl -m 1 mpirun" # Select fast memory and mpi environment
export LMP_BIN=../../lmp_knl # Executable
export LMP_CORES=68 # for Intel Xeon Phi processor 7250
#export LMP_CORES=64 # for Intel Xeon Phi processor 7210
export LMP_ARGS="-pk intel 0 -sf intel -screen none" #Runtime arguments
export LMP_THREAD_LIST="2 4" #OMP threads per process
export LMP_ROOT=../../../
export LOG_DIR=.
export KMP_BLOCKTIME=0
source /opt/intel/parallel_studio_xe_2016.2.062/psxevars.sh
export I_MPI_PIN_DOMAIN=core #each MPI process pinned to a core
export I_MPI_FABRICS=shm #communication fabric=shared mem
export WORKLOADS="lj rhodo lc sw water" #select the workloads
$MPI -np $LMP_CORES $LMP_BIN -in in.lc_generate_restart -log none $LMP_ARGS
for threads in $LMP_THREAD_LIST
do
    export OMP_NUM_THREADS=$threads #one thread count per pass, not the whole list
    for workload in $WORKLOADS
    do
        export LOGFILE=$LOG_DIR/$workload.$LMP_CORES"c"$threads"t".log
        echo "Running $LOGFILE"
        $MPI -np $LMP_CORES $LMP_BIN -in in.intel.$workload -log $LOGFILE $LMP_ARGS
    done
done
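After the scripts finish, the log files they write can be summarized with a short helper. This is a sketch under one assumption: that each LAMMPS log contains a timing line of the form "Loop time of 12.34 on 36 procs for 100 steps with 512000 atoms". The function name `summarize_logs` is hypothetical. If a log contains several such lines, only the last one is reported.

```shell
#!/bin/bash
# Print "<logfile> <loop time in seconds>" for each LAMMPS log given.
# Assumption: the log contains a line that starts with "Loop time of"
# and carries the elapsed seconds in its fourth field.
summarize_logs() {
    local log
    for log in "$@"; do
        # Skip arguments that are not regular files (e.g. an unmatched glob).
        [ -f "$log" ] || continue
        awk -v name="$log" '
            /^Loop time of/ { t = $4 }
            END { if (t != "") print name, t }
        ' "$log"
    done
}

# Summarize every log in the current directory, matching LOG_DIR=. above.
summarize_logs ./*.log
```

Lower loop times mean better performance for the same workload, so the output can be compared directly between the Intel Xeon processor and Intel Xeon Phi processor runs.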
Usage Model and Optimizations
Previous versions of the Intel® Xeon Phi™ coprocessor (code-named Knights Corner) required the development of an offload package to run key LAMMPS computational code on the coprocessor. The current Intel Xeon Phi processor imposes no such restriction, and, as such, the logic for offloading this work to a coprocessor has been disabled. The Intel Xeon processor and Intel Xeon Phi processor runs execute the same code path in these tests.
The Intel optimization package adds support for single-, mixed-, and double-precision calculations on both processors, along with vectorization (Intel® Advanced Vector Extensions on the Intel Xeon processor and Intel AVX-512 on the Intel Xeon Phi processor). The optimizations include an improved data layout for vectorization and cache efficiency; code modifications that enable vectorization or improve vector performance; multiple precision modes, including mixed-precision calculations that maintain good molecular dynamics energy conservation while using the vector units efficiently; use of the new Intel AVX-512 conflict detection instructions for faster performance on supported processors; and efficient use of OpenMP* with Intel® Hyper-Threading Technology.
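The precision modes can be compared by generating one command line per mode. This sketch assumes that the LAMMPS `package intel` command accepts a `mode` keyword with the values single, mixed, and double; that keyword is not described in this recipe, so check the package command documentation for your LAMMPS version. The function name `intel_mode_args` is hypothetical, and the script only prints the commands rather than running them.

```shell
#!/bin/bash
# Build Intel package arguments with an explicit precision mode.
# Assumption: "mode" is a keyword of the LAMMPS "package intel" command
# accepting single, mixed, or double; it is not taken from this recipe.
intel_mode_args() {
    local mode="$1"
    case "$mode" in
        single|mixed|double)
            echo "-pk intel 0 mode $mode -sf intel"
            ;;
        *)
            echo "unknown precision mode: $mode" >&2
            return 1
            ;;
    esac
}

# Dry run: print one benchmark command per precision mode.
for mode in single mixed double; do
    echo "mpirun -np 36 ../../lmp_intel_cpu_intelmpi -in in.intel.lj $(intel_mode_args "$mode")"
done
```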
The Intel AVX-512 instructions on the Intel Xeon Phi processor provide additional performance potential over the Intel® Advanced Vector Extensions 2 (Intel® AVX2) instructions on the Intel Xeon processor. The SIMD width of the vector registers is doubled from 256 bits to 512 bits, potentially doubling the vector length. In addition, there are now two vector ports per core. Further, the register file has been increased from 16 256-bit registers to 32 512-bit registers. Intel AVX-512 also offers full support for masking vector calculations with fault suppression, yielding better performance for non-ideal vectorization such as branchy code and outer-loop vectorization. There is hardware support for exp2, invsqrt, and reciprocal with higher precision, allowing fast, inlined sequences for transcendentals that can replace math library calls. Additionally, new conflict detection hardware allows vector instructions to handle updates to the same location in memory from multiple SIMD data lanes.
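The effect of the wider registers on vector length is simple arithmetic: the number of SIMD lanes is the register width divided by the element width. The sketch below works this out for 32-bit single-precision and 64-bit double-precision elements; the function name `simd_lanes` is hypothetical.

```shell
#!/bin/bash
# Number of SIMD lanes = register width in bits / element width in bits.
simd_lanes() {
    echo $(( $1 / $2 ))
}

for width in 256 512; do
    sp=$(simd_lanes "$width" 32)   # single precision, 32-bit elements
    dp=$(simd_lanes "$width" 64)   # double precision, 64-bit elements
    echo "$width-bit registers: $sp single-precision lanes, $dp double-precision lanes"
done
```

Going from 256-bit to 512-bit registers raises the lane count from 8 to 16 in single precision and from 4 to 8 in double precision, which is the doubling of vector length mentioned above.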
Performance tests using the Intel package are illustrated below, comparing an Intel Xeon processor with two versions of the Intel Xeon Phi processor on five standard workloads: Atomic Fluid, Protein, Liquid Crystal, Silicon, and CG Water. In all cases, turbo mode is turned on.
Testing Platform Configurations
The following hardware was used for the above recipe and performance testing.
|Processor||Intel® Xeon® Processor E5-2697 v4||Intel® Xeon Phi™ Processor 7210 and Intel® Xeon Phi™ Processor 7250|
|Stepping||1 (B0)||1 (B0)|
|Sockets / TDP||2S / 290W||1S / 215W|
|Frequency / Cores / Threads||2.3 GHz / 36 / 72||7210: 1.3 GHz / 64 / 256; 7250: 1.4 GHz / 68 / 272|
|DDR4||8x16 GB 2400 MHz||7210: 6x16 GB 2133 MHz; 7250: 6x16 GB 2400 MHz|
|MCDRAM||N/A||16 GB Flat|
|Operating System||Red Hat Enterprise Linux* 6.7 (3.10.0-229.20.1)||Red Hat Enterprise Linux* 6.7 (3.10.0-229.20.1)|
LAMMPS Build Configurations
|LAMMPS Version||13 May 2016|
|Intel® Compiler Version||16.0.2|
|Intel® MPI Library Version||18.104.22.168|
|LAMMPS Makefiles Used||Makefile.intel_cpu_intelmpi and Makefile.knl|
|Optimization Flags||-xHost -O2 -fp-model fast=2 -no-prec-div -qoverride-limits|
1 Previous versions of Intel® Xeon Phi™ coprocessors (code-named Knights Corner) required separate compiles as the coprocessor used a different instruction set. A single compile can now produce code that executes on both Intel® Xeon® processors and Intel® Xeon Phi™ processors.