Recipe: Build and Run NAMD on Intel® Xeon® Processors for Cluster

Published: 11/07/2017, Last Updated: 01/31/2020

For single-node runs, refer to the recipe: Building NAMD on Intel® Xeon® processors

Purpose

This recipe describes a step-by-step process to get, build, and run NAMD (nanoscale molecular dynamics) code on Intel® Xeon® processors to achieve better performance.

Introduction

NAMD is a parallel molecular dynamics code designed for high-performance simulation of large biomolecule systems. Based on Charm++ parallel objects, NAMD scales to hundreds of cores for typical simulations and beyond 500,000 cores for the largest simulations. NAMD uses the popular molecular graphics program VMD for simulation setup and trajectory analysis, but is also file-compatible with AMBER, CHARMM, and X-PLOR.

NAMD is distributed free of charge with source code. You can build NAMD yourself or download binaries for a wide variety of platforms. Below are the details of how to build and run NAMD on the Intel Xeon processor E5 family and Intel Xeon Scalable processors. You can learn more on the NAMD website.

Build and Run NAMD on an Intel Xeon Processor Cluster

Intel® Xeon® Processor E5-2697 v4 (BDW), Intel® Xeon® Gold 6148 Processor (SKX), or Intel® Xeon® Platinum 8260L Processor (CLX) for a Cluster over TCP

Download the code

  1. Download the latest NAMD source code
  2. Download Charm++ 6.8.2 version
    • You can get Charm++ from the "Source Code" section of the NAMD "Version Nightly Build" download page.
    • Or download it separately from Charmplusplus
  3. Download fftw3

    Version 3.3.8 is used in this run.

  4. Download Tcl for NAMD version 2.13 or later
  5. Download the apoa1 and stmv workloads
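
As a reference, one possible working layout after downloading and unpacking everything is sketched below. The directory and archive names are illustrative only (substitute the versions you actually downloaded); the fftw URL is the standard upstream location.

mkdir -p ~/namd_build && cd ~/namd_build
tar xzf NAMD_2.13_Source.tar.gz          # NAMD source (download from the NAMD site requires registration)
tar xzf charm-6.8.2.tar.gz               # Charm++ 6.8.2
wget http://www.fftw.org/fftw-3.3.8.tar.gz && tar xzf fftw-3.3.8.tar.gz
tar xzf tcl<version>-src.tar.gz          # Tcl sources
tar xzf apoa1.tar.gz                     # benchmark workloads
tar xzf stmv.tar.gz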

Build the binaries

Note: Use -xCORE-AVX512 for SKX or CLX, and -xCORE-AVX2 for BDW

  1. Set the environment for compilation:
    I_MPI_CC=icc;I_MPI_CXX=icpc;I_MPI_F90=ifort;I_MPI_F77=ifort
    export I_MPI_CC I_MPI_CXX I_MPI_F90 I_MPI_F77
    source /opt/intel/compiler/<version>/compilervars.sh intel64
    
  2. Build fftw3:
    cd <fftw_root_path>
    ./configure --prefix=<fftw_install_path> --enable-single --disable-fortran CC=icc
     make CFLAGS="-O3 -xCORE-AVX512 -fp-model fast=2 -no-prec-div -qoverride-limits" clean install
    
  3. Build Tcl:
    cd <tcl_root_path>/unix
    ./configure --prefix=<tcl_install_path> --disable-shared CC=icc
    make
    make install
    
      
  4. Build multi-node version of Charm++:
    • cd <charm_root_path>
      base_charm_opts="-O3 -ip -g -xCORE-AVX512" 
      unset I_MPI_LINK
      unset I_MPI_CC I_MPI_CXX I_MPI_F90 I_MPI_F77
      ./build charm++ mpi-linux-x86_64 smp ifort mpicxx --with-production $base_charm_opts -DCMK_OPTIMIZE -DMPICH_IGNORE_CXX_SEEK 
      
  5. Build NAMD:
    • Modify arch/Linux-x86_64-icc.arch to look like the following (select one of the FLOATOPTS options depending on the CPU type):
      NAMD_ARCH = Linux-x86_64
      CHARMARCH = mpi-linux-x86_64-ifort-smp-mpicxx
      
      # For SKX or CLX
      FLOATOPTS = -ip -xCORE-AVX512  -O3 -g -fp-model fast=2 -no-prec-div -qoverride-limits -DNAMD_DISABLE_SSE -qopenmp-simd -qopt-zmm-usage=high
      
      # For BDW
      FLOATOPTS = -ip -xCORE-AVX2  -O3 -g -fp-model fast=2 -no-prec-div -qoverride-limits -DNAMD_DISABLE_SSE
      
      CXX = icpc -std=c++11 -DNAMD_KNL
      CXXOPTS = -static-intel -O2 $(FLOATOPTS)
      CXXNOALIASOPTS = -O3 -fno-alias $(FLOATOPTS) -qopt-report-phase=loop,vec -qopt-report=4
      CXXCOLVAROPTS = -O2 -ip
      CC = icc
      COPTS = -static-intel -O2 $(FLOATOPTS)
       
  • Modify arch/Linux-x86_64.tcl to look like the following (adjust -ltcl8.7 to match the Tcl version you actually installed):
TCLDIR=<tcl_install_dir>
TCLINCL=-I$(TCLDIR)/include
TCLLIB=-L$(TCLDIR)/lib -ltcl8.7 -ldl -lpthread -lz
TCLFLAGS=-DNAMD_TCL
TCL=$(TCLINCL) $(TCLFLAGS)
  • Apply these performance patches when building with a compiler newer than icc 2016 update 4:
sed -i -e '1151 i #pragma omp simd simdlen(16)' <namdSource>/src/ComputeNonbondedBase.h
sed -i -e '1171 i #pragma omp ordered simd monotonic(hu:1)' <namdSource>/src/ComputeNonbondedBase.h
sed -i -e '1522 i #pragma omp simd simdlen(16)' <namdSource>/src/ComputeNonbondedBase.h
sed -i -e '1537 i #pragma omp ordered simd monotonic(plin:1, pli:1)' <namdSource>/src/ComputeNonbondedBase.h

sed -i -e 's|simd assert|omp simd|g' <namdSource>/src/ComputeNonbondedBase2.h
sed -i -e 's|simd assert|omp simd|g' <namdSource>/src/ComputeNonbondedBase2KNL.h
sed -i -e 's|simd|omp simd|g' <namdSource>/src/ComputeNonbondedMICKernelBase2_scalar.h
sed -i -e 's|simd|omp simd|g' <namdSource>/src/ComputeNonbondedMICKernelBase.h
sed -i -e 's|simd assert|omp simd|g' <namdSource>/src/Patch.C
sed -i -e 's|simd assert|omp simd|g' <namdSource>/src/Settle.C
  • Compile NAMD
I_MPI_CC=icc; I_MPI_CXX=icpc; I_MPI_F90=ifort; I_MPI_F77=ifort
CC=icc; CXX=icpc; F90=ifort; F77=ifort
export CC CXX F90 F77
export I_MPI_LINK=opt_mt
export I_MPI_CC I_MPI_CXX I_MPI_F90 I_MPI_F77

./config Linux-x86_64-icc --charm-base <charm_root_path> --charm-arch mpi-linux-x86_64-ifort-smp-mpicxx --with-fftw3 --fftw-prefix <fftw_install_path>  --tcl-prefix <tcl_install_path>  --charm-opts -verbose
cd Linux-x86_64-icc
make clean
gmake -j 
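
Before moving on to the full workloads, a quick smoke test of the freshly built binary can save time. This is a minimal sketch, assuming the Intel MPI runtime environment is loaded (as in the Run section below); src/alanin is the small test system shipped with the NAMD source.

cd ..                                            # back to the NAMD source root so the relative paths in src/alanin resolve
mpiexec.hydra -n 1 ./Linux-x86_64-icc/namd2 +ppn 2 src/alanin
# A successful run prints energy/timing output for a short simulation and ends with an "End of program" message.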

Other setup

Change the following lines in the *.namd file for both workloads (stmv and apoa1):

numsteps: 1000
outputtiming: 20
outputenergies: 600
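
One hedged way to apply these edits is with sed, as sketched below. GNU sed is assumed, the stmv.namd/apoa1.namd names and paths are illustrative, and the keywords are assumed to already be present in the stock configs (append them manually if they are not).

for f in <workload_path>/stmv/stmv.namd <workload_path>/apoa1/apoa1.namd; do
    sed -i -e 's/^numsteps[[:space:]].*/numsteps 1000/I' \
           -e 's/^outputtiming[[:space:]].*/outputtiming 20/I' \
           -e 's/^outputenergies[[:space:]].*/outputenergies 600/I' "$f"    # NAMD keywords are case-insensitive
done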

Run the binaries

  • Set the environment for launching:
source /opt/intel/compiler/<version>/compilervars.sh intel64
source /opt/intel/impi/<version>/intel64/bin/mpivars.sh
specify the host names to run on in a "hosts" file
export MPIEXEC="mpiexec.hydra -hostfile ./hosts"
export PSM2_SHAREDCONTEXTS=0 (only if you use PSM2 < 10.2.85)
  • If you want to use $N CLX 8260L nodes (dual-socket, 24 cores per socket, 48 cores per node) with $PERHOST communication ranks per node, use the following command line:
 $MPIEXEC -perhost $PERHOST -n $total_comm_ranks ./namd2 +ppn $PPN <workload_path> +pemap $PEMAP +commap $COMMAP

where

total_comm_ranks = $PERHOST * $N
PPN = ($NCORES_PER_NODE - $PERHOST) / $PERHOST
STEP = $NCORES_PER_NODE / $PERHOST
COMMAP[0] = $PPN
COMMAP[i] = COMMAP[i-1] + $STEP;    i = 1…$PERHOST-1.         
           
So, finally we have COMMAP = "COMMAP[0],COMMAP[1],COMMAP[2], ... ,COMMAP[$PERHOST-1]"
PEMAP = list of all cores we don't use for COMMAP:
PEMAP = "0-(COMMAP[0] - 1),(COMMAP[0] + 1)-(COMMAP[1] - 1),(COMMAP[1] + 1)-(COMMAP[2] - 1), ... ,(COMMAP[$PERHOST-2] + 1)-(COMMAP[$PERHOST-1] - 1)"
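
The arithmetic above is easy to script. Here is a minimal bash sketch (variable names follow the formulas above; the three input values are just the example configuration below and are not part of the recipe):

N=2; PERHOST=8; NCORES_PER_NODE=48               # nodes, comm ranks per node, physical cores per node
total_comm_ranks=$(( PERHOST * N ))
PPN=$(( (NCORES_PER_NODE - PERHOST) / PERHOST ))
STEP=$(( NCORES_PER_NODE / PERHOST ))
COMMAP=""; PEMAP=""; prev=0
for (( i=0; i<PERHOST; i++ )); do
    c=$(( PPN + i * STEP ))                      # COMMAP[i]
    COMMAP="${COMMAP:+$COMMAP,}$c"
    PEMAP="${PEMAP:+$PEMAP,}$prev-$(( c - 1 ))"  # compute cores between successive comm cores
    prev=$(( c + 1 ))
done
echo "total_comm_ranks=$total_comm_ranks PPN=$PPN"
echo "COMMAP=$COMMAP"
echo "PEMAP=$PEMAP"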

Example

We have 2 CLX 8260L nodes (dual-socket, 24 cores per socket, 48 cores per node) and want to run with 8 communication ranks per node.

total_comm_ranks = 16
PPN = 5
STEP = 6
COMMAP = "5,11,17,23,29,35,41,47"
PEMAP = "0-4,6-10,12-16,18-22,24-28,30-34,36-40,42-46"

The final run line is:

$MPIEXEC -perhost 8 -n 16 ./namd2 +ppn 5 <workload_path> +pemap 0-4,6-10,12-16,18-22,24-28,30-34,36-40,42-46 +commap 5,11,17,23,29,35,41,47

Important

If you want to use Hyper-Threading (HT), double the +ppn value (ppn = 2 x ppn) and append "+$NCORES_PER_NODE" to each PEMAP range.

Adjusting the previous example to run using HT:

$MPIEXEC -perhost 8 -n 16 ./namd2 +ppn 10 <workload_path> +pemap 0-4+48,6-10+48,12-16+48,18-22+48,24-28+48,30-34+48,36-40+48,42-46+48 +commap 5,11,17,23,29,35,41,47
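
If the PEMAP and COMMAP values come from the sketch above, the HT variants can be derived the same way (a hedged helper, reusing those variables):

PPN_HT=$(( 2 * PPN ))
PEMAP_HT="$(echo "$PEMAP" | sed "s/,/+$NCORES_PER_NODE,/g")+$NCORES_PER_NODE"
echo "+ppn $PPN_HT +pemap $PEMAP_HT +commap $COMMAP"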

Systems configuration

Intel® Xeon® Processor E5-2697 v4 (BDW):

  • Stepping: 1 (B0)
  • Sockets / TDP: 2S / 290W
  • Frequency / Cores / Threads: 2.3 GHz / 36 / 72
  • DDR4: 8x16GB 2400 MHz (128GB)
  • Cluster/Snoop Mode/Mem Mode: Home
  • Turbo: On
  • BIOS: GRRFSDP1.86B0271.R00.1510301446
  • Compiler: ICC-2017.0.098
  • Operating System: Red Hat Enterprise Linux* 7.2 (3.10.0-327.el7.x86_64)

Intel® Xeon® Gold 6148 Processor (SKX):

  • Stepping: 1 (B0)
  • Sockets / TDP: 2S / 300W
  • Frequency / Cores / Threads: 2.4 GHz / 40 / 80
  • DDR4: 12x16GB 2666 MHz (192GB)
  • Cluster/Snoop Mode/Mem Mode: Home
  • Turbo: On
  • Compiler: ICC-2019.4.243
  • Operating System: Red Hat Enterprise Linux* 7.3 (3.10.0-862.11.6.el7.x86_64)

About the Authors

Alexander Bobyr is a CRT application engineer at the INNL lab at Intel, supporting and providing feedback for HPC deals and SW tools. He serves as a technical expert and representative for SPEC HPG. Alexander has a Bachelor's degree in Intelligent Systems and a Master's degree in Artificial Intelligence from the Moscow Power Engineering Institute, Russia.

Mikhail Shiryaev is a Software Development Engineer in Software and Services Group (SSG) at Intel. He is part of the Cluster Tools team working on the development of Intel MPI and Intel MLSL libraries. His major interests are high performance computing, distributed systems and distributed deep learning. Mikhail received his Master’s degree and his Bachelor’s degree in Software Engineering from Lobachevsky State University of Nizhny Novgorod, Russia.

Smahane Douyeb is currently working as a Software Apps Engineer in the Software and Services Group (SSG) at Intel. Part of her job is to run and validate recipes and benchmarks for various HPC platforms for competitive testing purposes. She also works on HPC Python apps optimization on some Intel platforms. She received her Bachelor's degree in Software Engineering from the Oregon Institute of Technology. She is very passionate about growing and learning to achieve her dream of becoming a Principal Engineer.

Product and Performance Information

1

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804