LAMMPS* for Intel® Xeon Phi™ Coprocessor

Purpose

This code recipe describes how to get, build, and use the LAMMPS* code for the Intel® Xeon Phi™ coprocessor.

Introduction

Large-scale Atomic/Molecular Massively Parallel Simulator (LAMMPS*) is a classical molecular dynamics code. LAMMPS has potentials for solid-state materials (metals, semiconductors), soft matter (biomolecules, polymers), and coarse-grained or mesoscopic systems. LAMMPS can be used to model atoms, or, more generically, as a parallel particle simulator at the atomic, meso, or continuum scale.

LAMMPS runs on single processors or in parallel using message-passing techniques with a spatial decomposition of the simulation domain. The code is designed to be easy to modify or extend with new functionality.

LAMMPS is distributed as open source code under the terms of the GNU General Public License (GPL). The current version can be downloaded at http://lammps.sandia.gov/download.html. Links to older F90/F77 versions are also included. Periodic releases are also available on SourceForge*.

LAMMPS is distributed by Sandia National Laboratories, a U.S. Department of Energy laboratory. The main authors of LAMMPS are listed on the LAMMPS site along with contact info and other contributors. Find out more about LAMMPS at http://lammps.sandia.gov.

Code Support for Intel® Xeon Phi™ coprocessor

LAMMPS* support for the Intel® Xeon Phi™ coprocessor is available as part of current LAMMPS downloads. It includes support for potentials that allow simulation of soft matter, biomolecules, and materials.

Software Requirements

These instructions apply to LAMMPS version 22Dec14 or later. The recommended Intel® C++ compiler version is 14.0.1.106. Versions 15.0.1.133 and later are also supported. Intel® MPI* version 5.0.2.044 or later is recommended.

Build Directions

Building LAMMPS for the Intel® Xeon Phi™ coprocessor is similar to a normal LAMMPS build. Example makefiles are included for building LAMMPS with optimized routines (src/MAKE/OPTIONS/Makefile.intel_cpu), with optimized routines and support for Xeon Phi™ offload (src/MAKE/OPTIONS/Makefile.intel_phi), and for building on the Stampede and Beacon clusters (src/MAKE/MACHINES/Makefile.stampede and src/MAKE/MACHINES/Makefile.beacon). An example build includes the following commands:

             > source /opt/intel/compiler/2013_sp1.1.106/bin/iccvars.sh intel64
             > source /opt/intel/impi/5.0.2.044/bin64/mpivars.sh
             > cd src
             > make yes-asphere yes-class2 yes-kspace yes-manybody yes-misc yes-molecule 
             > make yes-mpiio yes-opt yes-replica yes-rigid
             > make yes-user-omp yes-user-intel
             > make intel_phi
             > echo "LAMMPS executable is src/lmp_intel_phi"

Run Directions

There are two approaches for running LAMMPS with an Intel® Xeon Phi™ coprocessor. In the first approach, the input script is edited as illustrated below. For short benchmarks, a warm-up run should be included so that the load balancer can find appropriate settings before starting the actual timing.

  1. Edit your run script, as you would with other packages (OPT, GPU, USER-OMP). See the figure below.
    [Figure: run script]
  2. Run LAMMPS as you would normally. The modified code handles the offloading to the coprocessor. See the figure below.
    [Figure: run LAMMPS]
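
The input-script edit in step 1 can be sketched as follows. This is a minimal sketch, not the script from the original figure: it assumes that the 'package' and 'suffix' input-script commands mirror the '-pk' and '-sf' command-line switches described in this recipe, and add_intel_header is a hypothetical helper name.

```shell
# Hypothetical helper: prepend Intel package settings to a LAMMPS input script.
# Reads the original script on stdin and writes the edited script to stdout.
add_intel_header() {
  # One coprocessor per node, mixed precision, automatic load balancing.
  printf '%s\n' 'package intel 1 mode mixed balance -1'
  # Apply the "intel" suffix to styles that have an optimized variant.
  printf '%s\n' 'suffix intel'
  # Pass the original script through unchanged.
  cat
}

# Example usage (file names are placeholders):
#   add_intel_header < in.rhodo > in.rhodo.intel
```

Placing the settings at the top of the script keeps them ahead of the commands that define the simulation box.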

Alternatively, you can run without modifying the input script by using command-line options. In the simplest case, the switch ‘-sf intel’ is added to the LAMMPS command line. This runs the simulation in mixed precision and offloads to the coprocessor (if available), with the work for the coprocessor balanced automatically. To use non-default options, add the package switch, ‘-pk intel’, followed by any options supported in the input script. For example:

            > mpirun -n 24 ./lmp_intel_phi -in in.rhodo -v x 2 -v y 2 -v z 2 -sf intel -pk intel 1 mode mixed balance -1 -pk omp 0
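
As a rough sketch, both command-line variants (defaults only, or defaults plus package options) can be generated by a small shell helper. The function name lammps_intel_cmd is hypothetical; the executable name, rank count, and switches are taken from the example above. The helper only prints the command (a dry run), so it can be inspected before launching.

```shell
# Hypothetical helper: print (do not run) a LAMMPS command line that
# enables the Intel package through the -sf switch.
#   $1 = number of MPI ranks, $2 = input script, rest = extra switches
lammps_intel_cmd() {
  nranks=$1
  input=$2
  shift 2
  cmd="mpirun -n $nranks ./lmp_intel_phi -in $input -sf intel"
  # Append any non-default package options, e.g. "-pk intel 1 mode mixed".
  if [ "$#" -gt 0 ]; then
    cmd="$cmd $*"
  fi
  printf '%s\n' "$cmd"
}

# Defaults only: mixed precision with automatic load balancing.
lammps_intel_cmd 24 in.rhodo
# Explicit package options, as in the example above.
lammps_intel_cmd 24 in.rhodo -pk intel 1 mode mixed balance -1 -pk omp 0
```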

It is recommended to run with Intel® Hyper-Threading Technology enabled to allow for fast context switching with offload threads. If you cannot change this setting, leaving one core per node free can give better performance in many cases. LAMMPS will generate a warning if it detects that leaving one or more cores free for offload might improve performance.

LAMMPS simulates the time evolution of the input system of atoms or other particles, as specified in the input script, and writes output data, including atom positions, thermodynamic quantities, and other computed statistics.

Sample scripts and expected results/performance can be checked using the files in the ‘src/USER-INTEL/TEST’ directory provided with LAMMPS.

Optimizations and Usage Model

A LAMMPS load balancer offloads part of the neighbor-list and non-bond force calculations to the Intel® Xeon Phi™ coprocessor so that they run concurrently with calculations on the CPU. This is achieved by using offload directives to run calculations well suited for many-core chips on both the CPU and the coprocessor. In this model, the same C++ routine is run twice, once with an offload flag, to support concurrent calculations.

The dynamic load balancing allows for concurrent 1) data transfer between host and coprocessor, 2) calculations of neighbor-list, non-bond, bond, and long-range terms, and 3) some MPI* communications. It continuously updates the fraction of offloaded work to minimize idle times. A standard LAMMPS “fix” object manages concurrency and synchronization.

The Intel® package adds support for single, mixed, and double-precision calculations on both CPU and coprocessor, and vectorization (AVX on CPU / 512-bit vectorization on Phi™). This can provide significant speedups for the routines on the CPU, too.
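
To compare the precision modes, one command line per mode can be generated. This is a sketch under assumptions: precision_sweep is a hypothetical helper, the rank count and executable are carried over from the run example, and the 'mode' keyword is assumed to accept the values single, mixed, and double named above.

```shell
# Hypothetical helper: print one dry-run command line per precision mode.
#   $1 = input script
precision_sweep() {
  input=$1
  for mode in single mixed double; do
    printf '%s\n' "mpirun -n 24 ./lmp_intel_phi -in $input -sf intel -pk intel 1 mode $mode"
  done
}

precision_sweep in.rhodo
```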

Performance Testing [1, 2]

The advantages of using the Intel® package are illustrated below by comparing the baseline MPI/OpenMP* routines in LAMMPS with the optimized routines running on the CPU only or on the CPU with offload to the coprocessor. Results are provided for the Rhodopsin* benchmark distributed with LAMMPS, scaled to 512,000 atoms, and for a liquid crystal benchmark. See the ‘README’ file in the ‘src/USER-INTEL/TEST’ directory for instructions on how to reproduce the results shown.

The Rhodopsin benchmark simulates the movement of a protein in the retina that plays an important role in the perception of light. The protein is simulated in a solvated lipid bilayer using the CHARMM* force field with Particle-Particle Particle-Mesh long-range electrostatics and SHAKE* constraints. The simulation is performed at a temperature of 300 K and a pressure of 1 atm. Results on a single node and on 32 nodes of the Endeavor cluster (configuration below) are shown, demonstrating a speedup of up to 1.78X when using the LAMMPS Intel package with a single Intel® Xeon Phi™ coprocessor on each node.


Figure: Rhodopsin protein benchmark with atoms in initial configuration

For more complicated potential energy models, such as those used in the simulation of materials, the potential speedups can be much higher. As an example, we show results from a benchmark included with LAMMPS that simulates liquid crystal molecules with the Gay-Berne potential. In this simulation, 512K biaxial ellipsoidal liquid crystal mesogens with a 2:1.5:1 aspect ratio and mass of 1.5 (reduced units) are simulated in the microcanonical ensemble, following an initial equilibration in the isothermal-isobaric ensemble to reach a reduced temperature of 2.4 and pressure of 8.0. The cutoff is 4.0 with a skin of 0.8 (reduced units). Compared to the LAMMPS baseline, the speedup can be over 5X when using the LAMMPS Intel package with a single Intel® Xeon Phi™ coprocessor on each node. The results on a single node and on 32 nodes of the Endeavor cluster (configuration below) are shown below.
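
When reproducing such comparisons, a speedup can be computed from the loop times that LAMMPS reports. The following is a minimal sketch, assuming that speedup is taken as the ratio of baseline to optimized loop time and that each log contains the usual "Loop time of ..." summary line; the helper names loop_time and speedup and the log file names are hypothetical.

```shell
# Extract the loop time (seconds) from LAMMPS log text on stdin.
# Assumes the log contains a line of the form:
#   Loop time of <seconds> on <N> procs for <M> steps with <K> atoms
# If several runs are logged (for example a warm-up run followed by the
# timed run), the time of the last run is reported.
loop_time() {
  awk '/^Loop time of/ { t = $4 } END { if (t != "") print t }'
}

# Speedup = baseline time / optimized time, printed with two decimals.
speedup() {
  awk -v base="$1" -v opt="$2" 'BEGIN { printf "%.2f\n", base / opt }'
}

# Example usage (log file names are placeholders):
#   base=$(loop_time < log.baseline)
#   opt=$(loop_time < log.intel)
#   speedup "$base" "$opt"
```

Reading only the last "Loop time" line means a warm-up run in the same log does not affect the reported time.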

Testing Platform Configurations [3]

The following hardware was used for the above recipe and performance testing.

 Endeavor Cluster Configuration:

  • 2-socket/24 cores:
  • Processor: Intel® Xeon® processor E5-2697 V2 @ 2.70GHz (12 cores) with Intel® Hyper-Threading Technology [4]
  • Network: InfiniBand* Architecture Fourteen Data Rate (FDR)
  • Operating System: Red Hat Enterprise Linux* 2.6.32-358.el6.x86_64.crt1 #4 SMP Fri May 17 15:33:33 MDT 2013 x86_64 x86_64 x86_64 GNU/Linux
  • Memory: 64GB
  • Coprocessor: 2X Intel® Xeon Phi™ coprocessor 7120P: 61 cores @ 1.238 GHz, 4-way Intel® Hyper-Threading Technology, Memory: 15872 MB
  • Intel® Many-core Platform Software Stack Version 2.1.6720-19
  • Intel® Compiler 2013 SP1.1.106 (icc version 14.0.1)
  • Compile flags: -O3 -xAVX -fno-alias -ansi-alias -restrict -DLAMMPS_MEMALIGN=64 -override-limits -offload-option,mic,compiler,"-fp-model fast=2 -mGLOB_default_function_attrs=\"gather_scatter_loop_unroll=4\""

DISCLAIMERS

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

A "Mission Critical Application" is any application in which failure of the Intel Product could result, directly or indirectly, in personal injury or death. SHOULD YOU PURCHASE OR USE INTEL'S PRODUCTS FOR ANY SUCH MISSION CRITICAL APPLICATION, YOU SHALL INDEMNIFY AND HOLD INTEL AND ITS SUBSIDIARIES, SUBCONTRACTORS AND AFFILIATES, AND THE DIRECTORS, OFFICERS, AND EMPLOYEES OF EACH, HARMLESS AGAINST ALL CLAIMS COSTS, DAMAGES, AND EXPENSES AND REASONABLE ATTORNEYS' FEES ARISING OUT OF, DIRECTLY OR INDIRECTLY, ANY CLAIM OF PRODUCT LIABILITY, PERSONAL INJURY, OR DEATH ARISING IN ANY WAY OUT OF SUCH MISSION CRITICAL APPLICATION, WHETHER OR NOT INTEL OR ITS SUBCONTRACTOR WAS NEGLIGENT IN THE DESIGN, MANUFACTURE, OR WARNING OF THE INTEL PRODUCT OR ANY OF ITS PARTS.

Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined". Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.

The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.

Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to: http://www.intel.com/design/literature.htm

  1. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
  2. Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel.

    Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

    Notice revision #20110804

  3. For more information go to http://www.intel.com/performance
  4. Available on select Intel® processors. Requires an Intel® HT Technology-enabled system. Consult your PC manufacturer. Performance will vary depending on the specific hardware and software used. For more information including details on which processors support HT Technology, visit http://www.intel.com/info/hyperthreading.

Intel, the Intel logo, Xeon and Xeon Phi are trademarks of Intel Corporation in the US and/or other countries.

*Other names and brands may be claimed as the property of others.

Copyright © 2014 Intel Corporation. All rights reserved.

For more complete information about compiler optimizations, see our Optimization Notice.