Recipe: LAMMPS* for Intel® Xeon Phi™ Processors

Purpose

This code recipe describes how to get, build, and run LAMMPS* with best performance on Intel® Xeon® and Intel® Xeon Phi™ processors, along with some performance comparisons.

Introduction

Large-scale Atomic/Molecular Massively Parallel Simulator (LAMMPS) is a classical molecular dynamics code. LAMMPS can be used to simulate solid-state materials (metals, semiconductors), soft matter (biomolecules, polymers), and coarse-grained or mesoscopic systems. LAMMPS can be used to model atoms, or, more generically, as a parallel particle simulator at the atomic, meso, or continuum scale.

LAMMPS runs on single processors or in parallel using message-passing techniques with a spatial-decomposition of the simulation domain. The code is designed to be easy to modify or extend with new functionality.

LAMMPS is distributed as open source code under the terms of the GNU General Public License. The current version can be downloaded at http://lammps.sandia.gov/download.html. Links are also included to older F90/F77 versions. Periodic releases are also available on SourceForge*.

LAMMPS is distributed by Sandia National Laboratories, a U.S. Department of Energy laboratory. The main authors of LAMMPS are listed on the LAMMPS site along with contact information and other contributors. Find out more about LAMMPS at http://lammps.sandia.gov.

Code Support for Intel® Xeon® Processors and Intel® Xeon Phi™ Processors

LAMMPS for both Intel® Xeon® processors and Intel® Xeon Phi™ processors is available as part of current LAMMPS downloads. It includes support for simulation of soft matter, biomolecules, and materials.

Software Requirements

This document describes instructions for running LAMMPS version 13 May 2016 or later. Intel® C++ Compiler version 16.0.2 or later is recommended; compiler versions 14 and later support the Intel® Advanced Vector Extensions 512 (Intel® AVX-512) instructions. Intel® MPI Library version 5.1.2.150 or later is also recommended.

One can download LAMMPS from the official site in a web browser or by using git:

$ git clone https://github.com/lammps/lammps.git lammps

Build Directions

Building LAMMPS for Intel Xeon processors and Intel Xeon Phi processors is similar to a normal LAMMPS build. Example makefiles are included with the LAMMPS package for building with optimized routines for both Intel Xeon processors (src/MAKE/OPTIONS/Makefile.intel_cpu_intelmpi) and Intel Xeon Phi processors (src/MAKE/OPTIONS/Makefile.knl). The principal difference between the two makefiles is that the Intel AVX-512 compiler flag is turned on for the Intel Xeon Phi processor build.

Previous versions of Intel® Xeon Phi™ coprocessors (code-named Knights Corner) required separate compiles as the coprocessor used a different instruction set. A single compile can now produce code that executes on both Intel Xeon processors and Intel Xeon Phi processors.
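For illustration, the principal difference shows up in the compiler flags of the two example makefiles. The flag lines below are a sketch (exact flags vary between LAMMPS versions; check the makefiles shipped with your download):

```
# Makefile.intel_cpu_intelmpi (Intel Xeon processor) -- illustrative
CCFLAGS = -qopenmp -O2 -xHost ...

# Makefile.knl (Intel Xeon Phi processor) -- illustrative
CCFLAGS = -qopenmp -O2 -xMIC-AVX512 ...
```

The -xMIC-AVX512 option instructs the Intel compiler to generate Intel AVX-512 instructions for the Intel Xeon Phi processor, while -xHost targets the instruction set of the build machine.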

An example build includes the following commands:

$ cd <install-dir>/lammps/src/
$ make yes-asphere yes-class2 yes-kspace yes-manybody yes-misc yes-molecule
$ make yes-mpiio yes-opt yes-replica yes-rigid
$ make yes-user-omp yes-user-intel
$ source /opt/intel/parallel_studio_xe_2016.2.062/psxevars.sh

# if you want to edit the Makefile options
$ emacs ./MAKE/OPTIONS/Makefile.knl
$ emacs ./MAKE/OPTIONS/Makefile.intel_cpu_intelmpi

# Binaries named lmp_knl and lmp_intel_cpu_intelmpi in build directory
$ make knl -j 14; make intel_cpu_intelmpi -j 14 
# Test directory with benchmarks
$ cd ./USER-INTEL/TEST/

Previous versions of Intel® Xeon Phi™ products (code-named Knights Corner) ran as coprocessors. To build LAMMPS with support for offloading computations to Intel® Xeon Phi™ coprocessors, additional compile flags are required. An example makefile that supports offload is included (src/MAKE/OPTIONS/Makefile.intel_coprocessor).

Run Directions

To automatically take advantage of any available optimizations for Intel® processors, one can add the following arguments to the LAMMPS argument list:

-pk intel 0 -sf intel
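The -pk intel command also accepts optional keywords; for example, the mode keyword selects the precision (single, mixed, or double) and omp sets the OpenMP threads per rank. A hypothetical invocation (the binary name and input file are assumed from the build steps above):

```
# Run the Atomic Fluid benchmark in mixed precision with 2 OpenMP
# threads per MPI rank; "omp" and "mode" are optional keywords of
# the intel package.
mpirun -np 36 ./lmp_intel_cpu_intelmpi -in in.intel.lj \
    -pk intel 0 omp 2 mode mixed -sf intel -screen none
```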

Example scripts for running several benchmarks included with LAMMPS are shown below.[1]


[1]OFFLOAD: In an offload environment, the coprocessors can be enabled by changing the ‘0’ to the number of coprocessors per compute node. For example, -pk intel 2 -sf intel will use two coprocessors per node. Further, we recommend enabling hyperthreading for best performance on CPUs and coprocessors. If using offload with hyperthreading disabled, it might be beneficial to leave a core free for the threads that handle the offload.

Intel Xeon processor (e.g., run_xeon_512.sh)

#!/bin/bash
export MPI="mpirun"					#run environment
export LMP_BIN=../../lmp_intel_cpu_intelmpi		#executable
export LMP_CORES=36					#number of cores (mpi ranks)
export LMP_ARGS="-pk intel 0 -sf intel -screen none"	#arg list
export LMP_THREAD_LIST="2"				#OMP Threads per process
export LMP_ROOT=../../../
export LOG_DIR=.
export KMP_BLOCKTIME=0

source /opt/intel/parallel_studio_xe_2016.2.062/psxevars.sh
export I_MPI_PIN_DOMAIN=core				#pin each MPI process to a core
export I_MPI_FABRICS=shm				#communication fabric = shared mem
export WORKLOADS="lj rhodo lc sw water"			#specify which problems to run

$MPI -np $LMP_CORES $LMP_BIN -in in.lc_generate_restart -log none $LMP_ARGS
for threads in $LMP_THREAD_LIST
do
  export OMP_NUM_THREADS=$threads
  for workload in $WORKLOADS
  do
    export LOGFILE=$LOG_DIR/$workload.$LMP_CORES"c"$threads"t".log
    echo "Running $LOGFILE"
    $MPI -np $LMP_CORES $LMP_BIN -in in.intel.$workload -log $LOGFILE $LMP_ARGS
  done
done
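The nested loops above write one log file per workload/thread-count combination. As a standalone sketch (with small example values substituted for the variables), the naming scheme works like this:

```shell
#!/bin/bash
# Standalone sketch of the log-file naming used in the scripts above.
LMP_CORES=36
LMP_THREAD_LIST="2 4"
WORKLOADS="lj rhodo"
for threads in $LMP_THREAD_LIST
do
  for workload in $WORKLOADS
  do
    echo "${workload}.${LMP_CORES}c${threads}t.log"
  done
done
```

This produces lj.36c2t.log, rhodo.36c2t.log, lj.36c4t.log, and rhodo.36c4t.log, so each run's core and thread configuration is recoverable from the log name.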

Intel Xeon Phi processor (e.g., run_knl_512.sh)

#!/bin/bash
MPI="numactl -m 1 mpirun"           			# Select fast memory and mpi environment
export LMP_BIN=../../lmp_knl				# Executable
export LMP_CORES=68					# for Intel Xeon Phi processor 7250
#export LMP_CORES=64					# for Intel Xeon Phi processor 7210

export LMP_ARGS="-pk intel 0 -sf intel -screen none"	#Runtime arguments
export LMP_THREAD_LIST="2 4"				#OMP threads per process
export LMP_ROOT=../../../
export LOG_DIR=.
export KMP_BLOCKTIME=0

source /opt/intel/parallel_studio_xe_2016.2.062/psxevars.sh
export I_MPI_PIN_DOMAIN=core				#each MPI process pinned to a core
export I_MPI_FABRICS=shm				#communication fabric=shared mem
export WORKLOADS="lj rhodo lc sw water"			#select the workloads

$MPI -np $LMP_CORES $LMP_BIN -in in.lc_generate_restart -log none $LMP_ARGS
for threads in $LMP_THREAD_LIST
do
  export OMP_NUM_THREADS=$threads
  for workload in $WORKLOADS
  do
    export LOGFILE=$LOG_DIR/$workload.$LMP_CORES"c"$threads"t".log
    echo "Running $LOGFILE"
    $MPI -np $LMP_CORES $LMP_BIN -in in.intel.$workload -log $LOGFILE $LMP_ARGS
  done
done
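On Intel Xeon Phi processors booted in flat mode, MCDRAM is exposed as a separate NUMA node (commonly node 1), which is why the script above prefixes mpirun with numactl. A quick way to confirm the layout before running (node numbering may differ on your system):

```
# List NUMA nodes; in flat mode the ~16 GB MCDRAM node has no CPUs assigned.
numactl -H

# Prefer MCDRAM (assumed to be node 1 here) for all allocations:
numactl -m 1 mpirun -np 68 ../../lmp_knl -in in.intel.lj -pk intel 0 -sf intel
```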


Usage Model and Optimizations

Previous versions of the Intel® Xeon Phi™ coprocessor (code-named Knights Corner) required the development of an offload package to run key LAMMPS computational code on the coprocessor. The current Intel Xeon Phi processor imposes no such restriction, and, as such, the logic for offloading this work to a coprocessor has been disabled. The Intel Xeon processor and Intel Xeon Phi processor runs execute the same code path in these tests.

The Intel optimization package adds support for single-, mixed-, and double-precision calculations on both processors, and for vectorization (Intel® Advanced Vector Extensions on the Intel Xeon processor and Intel AVX-512 on the Intel Xeon Phi processor). These optimizations include improved data layout for vectorization and cache efficiency; code modifications that enable vectorization or improve vector performance; support for multiple precision modes, including mixed-precision calculations that maintain good molecular dynamics energy conservation while making efficient use of the vector units; use of the new Intel AVX-512 conflict detection instructions for faster performance on supported processors; and efficient use of OpenMP* with Intel® Hyper-Threading Technology.

The Intel AVX-512 instructions on the Intel Xeon Phi processor provide additional performance potential over the Intel® Advanced Vector Extensions (AVX2) instructions on the Intel Xeon processor. The SIMD width for the vector registers is doubled to 512 bits from 256, potentially doubling the vector length. In addition, there are now two vector ports per core. Further, the register file has been increased to 32 512-bit registers from 16 256-bit registers. Intel AVX-512 also offers full support for masking vector calculations with fault suppression, yielding better performance for non-ideal vectorization such as branchy code and outer-loop vectorization. There is hardware support for exp2, invsqrt, and reciprocal with higher precision allowing for fast, inlined sequences for transcendentals that can replace math library calls. Additionally, new conflict detection hardware allows vector instructions to handle updates to the same location in memory from multiple SIMD data lanes.

Performance Testing

Performance tests using the Intel package compare an Intel Xeon processor against two versions of the Intel Xeon Phi processor on five standard workloads: Atomic Fluid, Protein, Liquid Crystal, Silicon, and CG Water. In all cases, turbo mode is turned on.

Testing Platform Configurations

The following hardware was used for the above recipe and performance testing.

Processor                     Intel® Xeon® Processor E5-2697 v4    Intel® Xeon Phi™ Processor 7210 / 7250
Stepping                      1 (B0)                               1 (B0)
Sockets / TDP                 2S / 290W                            1S / 215W
Frequency / Cores / Threads   2.3 GHz / 36 / 72                    7210: 1.3 GHz / 64 / 256
                                                                   7250: 1.4 GHz / 68 / 272
Memory                        DDR4 8x16 GB 2400 MHz                7210: DDR4 6x16 GB 2133 MHz
                                                                   7250: DDR4 6x16 GB 2400 MHz
MCDRAM                        N/A                                  16 GB Flat
Cluster/Snoop Mode            Home                                 Quadrant
Turbo                         On                                   On
BIOS                          SE5C610.86B.01.01.0015.012820160943  GVPRCRB1.86B.0009.D10.1602121844
Operating System              Red Hat Enterprise Linux* 6.7        Red Hat Enterprise Linux* 6.7
                              (3.10.0-229.20.1)                    (3.10.0-229.20.1)

LAMMPS Build Configurations

LAMMPS Version               13 May 2016
Git Hash                     154eb1f886fde
Intel® Compiler Version      16.0.2
Intel® MPI Library Version   5.1.2.150
LAMMPS Makefiles Used        Makefile.intel_cpu_intelmpi and Makefile.knl
Optimization Flags           -xHost -O2 -fp-model fast=2 -no-prec-div -qoverride-limits


For more complete information about compiler optimizations, see our Optimization Notice.