Recipe: Building NAMD on Intel® Xeon® and Intel® Xeon Phi™ Processors

Purpose

This recipe gives step-by-step instructions for getting, building, and running NAMD (Scalable Molecular Dynamics) on the Intel® Xeon Phi™ processor and Intel® Xeon® E5 processors for better performance.

Introduction

NAMD is a parallel molecular dynamics code designed for high-performance simulation of large biomolecule systems. Based on Charm++ parallel objects, NAMD scales to hundreds of cores for typical simulations and beyond 500,000 cores for the largest simulations. NAMD uses the popular molecular graphics program VMD for simulation setup and trajectory analysis, but is also file-compatible with AMBER, CHARMM, and X-PLOR.

NAMD is distributed free of charge with source code. You can build NAMD yourself or download binaries for a wide variety of platforms. Details on building for the Intel® Xeon Phi™ processor and Intel® Xeon® E5 processors follow below; learn more about NAMD at http://www.ks.uiuc.edu/Research/namd/

Building and running NAMD on Intel® Xeon® Processor E5-2697 v4 (BDW) and Intel® Xeon Phi™ Processor 7250 (KNL)

Download the Code:

  1. Download the latest “Source Code” of NAMD from this site: http://www.ks.uiuc.edu/Development/Download/download.cgi?PackageName=NAMD
  2. Download Charm++ version 6.7.1.
  3. Download fftw3 (http://www.fftw.org/download.html)
    • Version 3.3.4 is used in this run
  4. Download the apoa1 and stmv workloads from here: http://www.ks.uiuc.edu/Research/namd/utilities/

Build the Binaries:

  1. Recommended steps to build fftw3:
    • cd <path>/fftw-3.3.4
    • ./configure --prefix=$base/fftw3 --enable-single --disable-fortran CC=icc
    • Use -xMIC-AVX512 for KNL or -xCORE-AVX2 for BDW in the CFLAGS below
    • make CFLAGS="-O3 -xMIC-AVX512 -fp-model fast=2 -no-prec-div -qoverride-limits" clean install
  2. Build multicore version of charm++:
    • cd <path>/charm-6.7.1
    • ./build charm++ multicore-linux64 iccstatic --with-production "-O3 -ip"
  3. Build BDW:
    • Modify the Linux-x86_64-icc.arch to look like the following (note -xCORE-AVX2 and no KNL defines, unlike the KNL build below):
      NAMD_ARCH = Linux-x86_64
      CHARMARCH = multicore-linux64-iccstatic
      FLOATOPTS = -ip -xCORE-AVX2 -O3 -g -fp-model fast=2 -no-prec-div -qoverride-limits -DNAMD_DISABLE_SSE
      CXX = icpc -std=c++11
      CXXOPTS = -static-intel -O2 $(FLOATOPTS)
      CXXNOALIASOPTS = -O3 -fno-alias $(FLOATOPTS) -qopt-report-phase=loop,vec -qopt-report=4
      CXXCOLVAROPTS = -O2 -ip
      CC = icc
      COPTS = -static-intel -O2 $(FLOATOPTS)
    • ./config Linux-x86_64-icc --charm-base <charm_path> --charm-arch multicore-linux64-iccstatic --with-fftw3 --fftw-prefix <fftw_path> --without-tcl --charm-opts --verbose
    • gmake -j
  4. Build KNL:
    • Modify the arch/Linux-KNL-icc.arch to look like the following:
      NAMD_ARCH = Linux-KNL
      CHARMARCH = multicore-linux64-iccstatic
      FLOATOPTS = -ip -xMIC-AVX512 -O3 -g -fp-model fast=2 -no-prec-div -qoverride-limits -DNAMD_DISABLE_SSE
      CXX = icpc -std=c++11 -DNAMD_KNL
      CXXOPTS = -static-intel -O2 $(FLOATOPTS)
      CXXNOALIASOPTS = -O3 -fno-alias $(FLOATOPTS) -qopt-report-phase=loop,vec -qopt-report=4
      CXXCOLVAROPTS = -O2 -ip
      CC = icc
      COPTS = -static-intel -O2 $(FLOATOPTS)
    • ./config Linux-KNL-icc --charm-base <charm_path> --charm-arch multicore-linux64-iccstatic --with-fftw3 --fftw-prefix <fftw_path> --without-tcl --charm-opts --verbose
    • gmake -j
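The BDW and KNL builds above differ only in the SIMD compiler flag (and the KNL-specific defines). A minimal POSIX sh sketch of that mapping, used below to compose the fftw3 CFLAGS; the helper names are illustrative, not part of the NAMD or fftw trees:

```shell
#!/bin/sh
# Sketch: map a build target to the Intel compiler SIMD flag
# named in the fftw3 step, and compose the fftw3 CFLAGS string.

target_simd_flag() {
    case "$1" in
        knl) echo "-xMIC-AVX512" ;;
        bdw) echo "-xCORE-AVX2" ;;
        *)   echo "unknown target: $1" >&2; return 1 ;;
    esac
}

fftw_cflags() {
    # CFLAGS for the 'make ... clean install' step above.
    echo "-O3 $(target_simd_flag "$1") -fp-model fast=2 -no-prec-div -qoverride-limits"
}

fftw_cflags knl
fftw_cflags bdw
```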

Other system setup:

  1. Change the kernel settings for KNL to “nmi_watchdog=0 rcu_nocbs=2-271 nohz_full=2-271”. One way to change the settings (this could differ from system to system):
    • First save your original grub.cfg to be safe
        cp /boot/grub2/grub.cfg /boot/grub2/grub.cfg.ORIG
    • In “/etc/default/grub”, append the following to “GRUB_CMDLINE_LINUX”:
        nmi_watchdog=0 rcu_nocbs=2-271 nohz_full=2-271
    • Save your new configuration
        grub2-mkconfig -o /boot/grub2/grub.cfg
    • Reboot the system. After logging in, verify the settings with 'cat /proc/cmdline'
  2. Change the following lines in the *.namd file for both workloads:
         numsteps           1000
         outputtiming       20
         outputenergies     600
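The *.namd edits in step 2 can be scripted. A small sed sketch, assuming each keyword already appears exactly once in the workload's .namd file; set_namd_param is an illustrative helper, not a NAMD tool:

```shell
#!/bin/sh
# Sketch: set a NAMD config keyword to a new value in place.
# Assumes the keyword already appears at the start of a line.
set_namd_param() {  # set_namd_param <file> <key> <value>
    sed -i "s/^\($2\)[[:space:]].*/\1            $3/" "$1"
}

# Example usage (workload path is a placeholder):
# set_namd_param apoa1/apoa1.namd numsteps 1000
# set_namd_param apoa1/apoa1.namd outputtiming 20
# set_namd_param apoa1/apoa1.namd outputenergies 600
```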

Run NAMD:

  1. Run on BDW (ppn = 72):
    $BIN +p $ppn apoa1/apoa1.namd +pemap 0-($ppn-1)
  2. Run on KNL (ppn = 136, MCDRAM in flat mode; cache mode gives similar performance):
    numactl -m 1 $BIN +p $ppn apoa1/apoa1.namd +pemap 0-($ppn-1)

Example: numactl -m 1 /NAMD_2.11_Source/Linux-KNL-icc/namd2 +p 136 apoa1/apoa1.namd +pemap 0-135
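The `0-($ppn-1)` placeholder in the commands above must be expanded to a literal range before launch, as in the example. A sketch of that expansion; pemap_range is an illustrative helper, and $BIN is the binary-path placeholder from the text:

```shell
#!/bin/sh
# Sketch: expand the "0-($ppn-1)" placeholder into the literal
# +pemap range NAMD expects on the command line.
pemap_range() {
    echo "0-$(( $1 - 1 ))"
}

ppn=136
# Prints the KNL launch line with the range filled in:
echo "numactl -m 1 \$BIN +p $ppn apoa1/apoa1.namd +pemap $(pemap_range $ppn)"
```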

Performance results reported in Intel Salesforce repository (ns/day; higher is better):

Workload    2S BDW 18c 2.3 GHz (ns/day)    KNL bin1 (ns/day)    KNL vs. 2S BDW (speedup)
stmv        0.45                           0.55                 1.22x
apoa1       5.5                            6.18                 1.12x
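The speedup column is simply the KNL ns/day divided by the 2S BDW ns/day. A quick awk check of the two rows:

```shell
#!/bin/sh
# Quick check of the speedup column: KNL ns/day over BDW ns/day,
# rounded to two decimals.
speedup() { awk -v bdw="$1" -v knl="$2" 'BEGIN { printf "%.2fx\n", knl / bdw }'; }

speedup 0.45 0.55   # stmv  → 1.22x
speedup 5.5  6.18   # apoa1 → 1.12x
```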

Systems configuration:

Processor                      Intel® Xeon® Processor E5-2697 v4 (BDW)    Intel® Xeon Phi™ Processor 7250 (KNL)
Stepping                       1 (B0)                                     1 (B0) Bin1
Sockets / TDP                  2S / 290W                                  1S / 215W
Frequency / Cores / Threads    2.3 GHz / 36 / 72                          1.4 GHz / 68 / 272
DDR4                           8x16 GB 2400 MHz (128 GB)                  6x16 GB 2400 MHz
MCDRAM                         N/A                                        16 GB Flat
Cluster/Snoop Mode/Mem Mode    Home                                       Quadrant / flat
Turbo                          On                                         On
BIOS                           GRRFSDP1.86B0271.R00.1510301446            GVPRCRB1.86B.0010.R02.1608040407
Compiler                       ICC-2017.0.098                             ICC-2017.0.098
Operating System               Red Hat Enterprise Linux* 7.2              Red Hat Enterprise Linux* 7.2
                               (3.10.0-327.el7.x86_64)                    (3.10.0-327.22.2.el7.xppsl_1.4.1.3272.x86_64)

Building and running NAMD for Cluster on Intel® Xeon® Processor E5-2697 v4 (BDW) and Intel® Xeon Phi™ Processor 7250 (KNL)

Build the Binaries:

  1. Set Intel tools for compilation:
    I_MPI_CC=icc;I_MPI_CXX=icpc;I_MPI_F90=ifort;I_MPI_F77=ifort
    export I_MPI_CC I_MPI_CXX I_MPI_F90 I_MPI_F77
    CC=icc;CXX=icpc;F90=ifort;F77=ifort
    export CC CXX F90 F77
    export I_MPI_LINK=opt_mt
  2. Recommended steps to build fftw3:
    • cd <path>/fftw-3.3.4
    • ./configure --prefix=$base/fftw3 --enable-single --disable-fortran CC=icc
    • Use -xMIC-AVX512 for KNL or -xCORE-AVX2 for BDW in the CFLAGS below
    • make CFLAGS="-O3 -xMIC-AVX512 -fp-model fast=2 -no-prec-div -qoverride-limits" clean install
  3. Recommended steps to build multicore version of charm++:
    • cd <path>/charm-6.7.1
    • chmod -R 777 *
    • source /opt/intel/compiler/<version>/compilervars.sh intel64
    • source /opt/intel/impi/<version>/bin/mpivars.sh
    •  ./build charm++ mpi-linux-x86_64 smp mpicxx ifort --with-production $base_charm_opts -DCMK_OPTIMIZE -DMPICH_IGNORE_CXX_SEEK
  4. Build on KNL:
    • ./config Linux-KNL-icc --charm-base <fullPath>/charm-6.7.1 --charm-arch mpi-linux-x86_64-ifort-smp-mpicxx --with-fftw3 --fftw-prefix <fullPath>/fftw3 --without-tcl --charm-opts --verbose
    • cd Linux-KNL-icc
    • gmake -j
  5. Build on BDW:
    • ./config Linux-KNL-icc --charm-base $FULLPATH/charm-6.7.1 --charm-arch mpi-linux-x86_64-ifort-smp-mpicxx --with-fftw3 --fftw-prefix $FULLPATH/fftw3 --without-tcl --charm-opts -verbose
    •  cd Linux-KNL-icc
    • make clean
    • gmake -j
  6. Build memopt NAMD binaries:
    • Same as the BDW/KNL builds above, but add the extra option --with-memopt to the config command.
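The cluster builds in steps 4-6 share one config invocation, differing only by the presence of --with-memopt. A sketch that composes the command line; namd_config_cmd is an illustrative helper, and $FULLPATH is the placeholder from the text:

```shell
#!/bin/sh
# Sketch: compose the ./config invocation used by the cluster
# builds above. Pass "memopt" to get the memory-optimized
# variant of step 6.
namd_config_cmd() {
    cmd="./config Linux-KNL-icc --charm-base \$FULLPATH/charm-6.7.1"
    cmd="$cmd --charm-arch mpi-linux-x86_64-ifort-smp-mpicxx"
    cmd="$cmd --with-fftw3 --fftw-prefix \$FULLPATH/fftw3"
    cmd="$cmd --without-tcl --charm-opts --verbose"
    [ "$1" = "memopt" ] && cmd="$cmd --with-memopt"
    echo "$cmd"
}

namd_config_cmd
namd_config_cmd memopt
```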

Run the Binaries (note: “hosts” is the file that contains the host names to run on):

  1. BDW run on single node:
    export I_MPI_PROVIDER=psm2
    export I_MPI_FALLBACK=no
    export I_MPI_FABRICS=tmi
    
    source /opt/intel/compiler/<version>/compilervars.sh intel64
    source /opt/intel/impi/<version>/intel64/bin/mpivars.sh
    
    NTASKS_PER_NODE=1
    export MPPEXEC="time -p mpiexec.hydra -perhost $NTASKS_PER_NODE -f ./hosts "
    $MPPEXEC -n $node $BINPATH/$BINNAME +ppn 71 $FULLPATH/$WORKLOAD +pemap 1-71 +commap 0

    Example:
    $MPPEXEC -n 1 $FULLPATH/namd2 +ppn 71 $FULLPATH/stmv/stmv.namd +pemap 1-71 +commap 0
     
  2. KNL Run on single node:
    export I_MPI_PROVIDER=psm2
    export I_MPI_FALLBACK=0
    export I_MPI_FABRICS=tmi
    export PSM2_IDENTIFY=1
    export PSM2_RCVTHREAD=0
    export TMI_PSM2_TEST_POLL=1
    
    NTASKS_PER_NODE=1
    export MPPEXEC="mpiexec.hydra -perhost $NTASKS_PER_NODE -f ./hosts "
    numactl -m 1 $MPPEXEC $BINPATH/$BINNAME +ppn 135 $FULLPATH/$WORKLOAD +pemap 1-135 +commap 0

    Example:
    numactl -m 1 $MPPEXEC $FULLPATH/namd2 +ppn 135 $FULLPATH/stmv/stmv.namd +pemap 1-135 +commap 0
     
  3. KNL Run on multi-node (node = number of nodes to run on):
    export MPPEXEC="mpiexec.hydra -perhost 1 -f ./hosts "
    numactl -m 1 $MPPEXEC -n $node numactl -m 1 $BINPATH/$BINNAME +ppn 134 $FULLPATH/$WORKLOAD +pemap 0-66+68 +commap 67

    Example:
    numactl -m 1 $MPPEXEC -n 8 numactl -m 1 $FULLPATH/namd2 +ppn 134 $FULLPATH/stmv/stmv.namd +pemap 0-66+68 +commap 67
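The `+pemap 0-66+68` in the multi-node runs uses NAMD's `low-high+offset` affinity notation: each core c in 0-66 contributes hardware thread c and its hyperthread partner c+68, giving the 134 worker threads matched by `+ppn 134` and leaving core 67 free for `+commap 67`. A sketch that expands the notation (expand_pemap is an illustrative helper; the interleaved c, c+offset order follows NAMD's documented pemap syntax):

```shell
#!/bin/sh
# Sketch: expand NAMD's "+pemap low-high+offset" notation to the
# hardware-thread list it denotes (c, c+offset for each core c).
expand_pemap() {
    low=${1%%-*}; rest=${1#*-}
    high=${rest%%+*}; off=${rest#*+}
    c=$low
    while [ "$c" -le "$high" ]; do
        echo "$c"; echo "$((c + off))"
        c=$((c + 1))
    done
}

# 67 cores x 2 hardware threads = 134 entries; core 67 is absent.
expand_pemap 0-66+68 | wc -l
```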

Remark:

For better scaling on multi-node runs, increase the number of communication threads (1, 2, 4, 8, 13, or 17). An example command:

export MPPEXEC="mpiexec.hydra -perhost 17 -f ./hosts "
numactl -m 1 $MPPEXEC -n $(($node*17)) numactl -m 1 $BINPATH/$BINNAME +ppn 7  $FULLPATH/$WORKLOAD +pemap 0-67,68-135:4.3 +commap 71-135:4 > ${WKL}_cluster_commapN/${WKL}.$node.$

One usage example:

nodes="16 8 4 2 1"
for node in ${nodes}
do
  export MPPEXEC="mpiexec.hydra -perhost 17 -f ./hosts "
  numactl -m 1 $MPPEXEC -n $(($node*17)) numactl -m 1 $FullPath/namd2 +ppn 8 $WorkloadPath/$WKL/$WKL.namd +pemap 0-67+68 +commap 71-135:4 > $ResultFile.$node.$BINNAME.68c2t.commap_8th_from2cx4t
done
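The `+commap 71-135:4` map above uses NAMD's `low-high:stride` notation, placing one communication thread on every fourth hardware thread from 71 to 135, which yields the 17 comm threads matched by `-perhost 17`. An expander sketch, assuming that stride interpretation (expand_stride_map is an illustrative helper):

```shell
#!/bin/sh
# Sketch: expand NAMD's "low-high:stride" map notation, as used
# in "+commap 71-135:4", to the list of comm-thread locations.
expand_stride_map() {
    low=${1%%-*}; rest=${1#*-}
    high=${rest%%:*}; step=${rest#*:}
    seq "$low" "$step" "$high"
}

# One comm thread per MPI rank: 17 entries for -perhost 17.
expand_stride_map 71-135:4 | wc -l
```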

Best performance results reported on a cluster of up to 128 Intel® Xeon Phi™ nodes (ns/day; higher is better):

Workload\nodes (2HT)    1        2        4        8        16
stmv (ns/day)           0.55     1.05     1.86     3.31     5.31

Workload\nodes (2HT)    8        16       32       64       128
stmv.28M (ns/day)       0.152    0.310    0.596    1.03     1.91
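From the first table, stmv improves from 0.55 ns/day on one node to 5.31 ns/day on 16. A quick awk check of the implied speedup-per-node (parallel efficiency); the helper name is illustrative:

```shell
#!/bin/sh
# Quick check of stmv scaling: ns/day at N nodes divided by
# (ns/day at 1 node * N), i.e. the parallel efficiency.
efficiency() {  # efficiency <ns/day @ 1 node> <ns/day @ N nodes> <N>
    awk -v one="$1" -v val="$2" -v n="$3" \
        'BEGIN { printf "%.2f\n", val / (one * n) }'
}

efficiency 0.55 1.05 2    # → 0.95 at 2 nodes
efficiency 0.55 5.31 16   # → 0.60 at 16 nodes
```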

For more complete information about compiler optimizations, see our Optimization Notice.