Recipe: Build NAMD on Intel® Xeon® and Intel® Xeon Phi™ Processors for Multi-node Runs

For single-node runs, refer to the recipe: Building NAMD on Intel® Xeon® and Intel® Xeon Phi™ processors

Purpose

This recipe describes a step-by-step process to get, build, and run NAMD (nanoscale molecular dynamics) code on the Intel® Xeon Phi™ and Intel® Xeon® processors to achieve better performance.

Introduction

NAMD is a parallel molecular dynamics code designed for high-performance simulation of large biomolecule systems. Based on Charm++ parallel objects, NAMD scales to hundreds of cores for typical simulations and beyond 500,000 cores for the largest simulations. NAMD uses the popular molecular graphics program VMD for simulation setup and trajectory analysis, but is also file-compatible with AMBER, CHARMM, and X-PLOR.

NAMD is distributed free of charge with source code. You can build NAMD yourself or download binaries for a wide variety of platforms. Below are the details of how to build NAMD for the Intel Xeon Phi processor and the Intel Xeon processor E5 family. You can learn more on the NAMD website.

Build and Run NAMD on a Cluster of Intel Xeon and Intel Xeon Phi Processors

Intel® Xeon® Processor E5-2697 v4, Intel® Xeon Phi™ processor 7250, and Intel® Xeon® Gold 6148 processor

Download the code

  1. Download the latest NAMD source code
  2. Download OpenFabrics Interfaces (OFI). NAMD uses Charm++/OFI for multi-node runs.
    • You can use the installed OFI library, which comes with the IFS package, or download and build it manually.
    • To check the version of the installed OFI, use the "fi_info --version" command (OFI 1.4.2 was used here).
    • The OFI library can be downloaded from GitHub*.
  3. Download Charm++ with OFI support:

    From the Charmplusplus website.

  4. Download the fftw3 library

    Version 3.3.4 is used in this run.

  5. Download the apoa1 and stmv workloads
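
For reference, the downloads above can be fetched roughly as follows. This is only a sketch: the URLs and version numbers are assumptions, so check each project's download page for the release that matches your cluster. NAMD itself is obtained through registration on the NAMD website, and the apoa1/stmv benchmark inputs are typically fetched from the NAMD utilities page.

    fi_info --version                                 # version of the preinstalled OFI/libfabric
    git clone https://github.com/ofiwg/libfabric.git  # only if you build OFI manually (assumed URL)
    git clone https://github.com/UIUC-PPL/charm.git   # Charm++ sources (assumed URL)
    wget http://www.fftw.org/fftw-3.3.4.tar.gz        # fftw3 3.3.4 sources (assumed URL)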

Build the binaries

  1. Set the environment for compilation:
    CC=icc; CXX=icpc; F90=ifort; F77=ifort
    export CC CXX F90 F77
    source /opt/intel/compiler/<version>/compilervars.sh intel64
    
  2. Build the OFI library (you can skip this step if you want to use the installed OFI library):
    1. cd <libfabric_root_path>
    2. ./autogen.sh
    3. ./configure --prefix=<libfabric_install_path> --enable-psm2
    4. make clean && make -j12 all && make install
    5. The custom OFI build can then be used via LD_PRELOAD or LD_LIBRARY_PATH:

    export LD_LIBRARY_PATH=<libfabric_install_path>/lib:${LD_LIBRARY_PATH}
    mpiexec.hydra …

    or
    LD_PRELOAD=<libfabric_install_path>/lib/libfabric.so mpiexec.hydra …

  3. Build fftw3:
    1. cd <fftw_root_path>
    2. ./configure --prefix=<fftw_install_path> --enable-single --disable-fortran CC=icc
      Use -xCORE-AVX512 for SKX, -xMIC-AVX512 for KNL, and -xCORE-AVX2 for BDW in the CFLAGS below.
    3. make CFLAGS="-O3 -xMIC-AVX512 -fp-model fast=2 -no-prec-div -qoverride-limits" clean install
  4. Build multi-node version of Charm++:
    1. cd <charm_root_path>
    2. Select the CMK_TIMER_USE_RDTSC timer and disable the other timers in the file src/arch/ofi-linux-x86_64/conv-mach.h:
      #define CMK_TIMER_USE_RTC                                  0
      #define CMK_TIMER_USE_RDTSC                                1
      #define CMK_TIMER_USE_GETRUSAGE                            0
      #define CMK_TIMER_USE_SPECIAL                              0
      #define CMK_TIMER_USE_TIMES                                0
      #define CMK_TIMER_USE_BLUEGENEL                            0
      
      Undefine CMK_TIMER_USE_GETRUSAGE and CMK_TIMER_USE_SPECIAL in the file src/arch/ofi-linux-x86_64/conv-mach-smp.h:
      #undef CMK_TIMER_USE_GETRUSAGE
      #undef CMK_TIMER_USE_SPECIAL
      #define CMK_TIMER_USE_GETRUSAGE                            0
      #define CMK_TIMER_USE_SPECIAL                              0
      
    3. ./build charm++ ofi-linux-x86_64 icc smp --basedir <libfabric_root_path> --with-production "-O3 -ip" -DCMK_OPTIMIZE
  5. Build NAMD:
    1. Modify the arch/Linux-x86_64-icc.arch file to look like the following (select one of the FLOATOPTS lines depending on the CPU type):
      NAMD_ARCH = Linux-x86_64
      CHARMARCH = multicore-linux64-iccstatic
      
      # For KNL
      FLOATOPTS = -ip -xMIC-AVX512  -O3 -g -fp-model fast=2 -no-prec-div -qoverride-limits -DNAMD_DISABLE_SSE
      
      # For SKX
      FLOATOPTS = -ip -xCORE-AVX512  -O3 -g -fp-model fast=2 -no-prec-div -qoverride-limits -DNAMD_DISABLE_SSE
      
      # For BDW
      FLOATOPTS = -ip -xCORE-AVX2  -O3 -g -fp-model fast=2 -no-prec-div -qoverride-limits -DNAMD_DISABLE_SSE
      
      CXX = icpc -std=c++11 -DNAMD_KNL
      CXXOPTS = -static-intel -O2 $(FLOATOPTS)
      CXXNOALIASOPTS = -O3 -fno-alias $(FLOATOPTS) -qopt-report-phase=loop,vec -qopt-report=4
      CXXCOLVAROPTS = -O2 -ip
      CC = icc
      COPTS = -static-intel -O2 $(FLOATOPTS)
      
    2. Compile NAMD
      1. ./config Linux-x86_64-icc --charm-base <charm_root_path> --charm-arch ofi-linux-x86_64-smp-icc --with-fftw3 --fftw-prefix <fftw_install_path> --without-tcl --charm-opts -verbose
      2. cd Linux-x86_64-icc
      3. make clean && gmake -j
  6. Build memopt NAMD binaries:

    Same as the BDW/KNL build, but pass the extra option "--with-memopt" to config, as in the sketch below.
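
    For illustration, a sketch of the memopt config step (the placeholder paths are the same as above; the rest of the build is unchanged):

    ./config Linux-x86_64-icc --with-memopt --charm-base <charm_root_path> --charm-arch ofi-linux-x86_64-smp-icc --with-fftw3 --fftw-prefix <fftw_install_path> --without-tcl --charm-opts -verbose
    cd Linux-x86_64-icc
    make clean && gmake -j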

Other setup

Change the following lines in the *.namd file for both the stmv and apoa1 workloads:

numsteps: 1000
outputtiming: 20
outputenergies: 600

Run the binaries

  1. Set the environment for launching:
    1. source /opt/intel/compiler/<version>/compilervars.sh intel64
    2. source /opt/intel/impi/<version>/intel64/bin/mpivars.sh
    3. specify the host names to run on in a "hosts" file (see the example after this list)
    4. export MPIEXEC="mpiexec.hydra -hostfile ./hosts"
    5. export PSM2_SHAREDCONTEXTS=0 (if you use PSM2 < 10.2.85)
  2. Launch the task (for example with N nodes, with 1 process per node and PPN cores):
    1. $MPIEXEC -n N -ppn 1 ./namd2 +ppn (PPN-1) <workload_path> +pemap 1-(PPN-1) +commap 0
      
      For example, for BDW (PPN=72):
      $MPIEXEC -n 8 -ppn 1 ./namd2 +ppn 71 <workload_path> +pemap 1-71 +commap 0
      
      For example, for KNL (PPN=68, without hyper-threads):
      $MPIEXEC -n 8 -ppn 1 ./namd2 +ppn 67 <workload_path> +pemap 1-67 +commap 0
      
      For example, for KNL (with 2 hyper-threads per core):
      $MPIEXEC -n 8 -ppn 1 ./namd2 +ppn 134 <workload_path> +pemap 0-66+68 +commap 67
      
    2. For KNL with MCDRAM in flat mode:
      $MPIEXEC -n N -ppn 1 numactl -p 1 ./namd2 +ppn (PPN-1) <workload_path> +pemap 1-(PPN-1) +commap 0
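
As an illustration of the "hosts" file mentioned in step 1, it is just a plain-text list of the node names to run on, one per line (the names below are placeholders for your own nodes):

    node01
    node02
    node03
    node04

With this file in place, the $MPIEXEC variable exported above resolves to mpiexec.hydra -hostfile ./hosts.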

Remarks

To achieve better scaling across multiple nodes, increase the number of communication threads by running more processes per node (for example 1, 2, 4, 8, 13, or 17). For example, the following command runs on N KNL nodes with 17 processes per node and 8 threads per process (7 worker threads and 1 communication thread). The map +pemap 0-67,68-135:4.3 places worker threads on cores 0-67 plus three of every four hyper-thread slots in the range 68-135, and +commap 71-135:4 places the 17 communication threads on every fourth slot starting at 71:

$MPIEXEC -n $(($N*17)) -ppn 17 numactl -p 1 ./namd2 +ppn 7 <workload_path> +pemap 0-67,68-135:4.3 +commap 71-135:4
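
For convenience, the command above can be wrapped in a small launch script. This is only a sketch; the default node count and the workload path are assumptions:

    #!/bin/bash
    # Launch NAMD on N KNL nodes: 17 processes per node, 7 worker threads
    # plus 1 communication thread per process, allocating from MCDRAM
    # (flat mode) via numactl -p 1.
    N=${1:-8}                        # number of nodes (assumed default: 8)
    WORKLOAD=${2:-./stmv/stmv.namd}  # assumed workload path
    MPIEXEC="mpiexec.hydra -hostfile ./hosts"
    $MPIEXEC -n $(($N*17)) -ppn 17 numactl -p 1 ./namd2 +ppn 7 "$WORKLOAD" \
        +pemap 0-67,68-135:4.3 +commap 71-135:4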

Basic Charm++/OFI knobs (pass these as additional NAMD command-line parameters)

  • +ofi_eager_maxsize: (default: 65536) Threshold between the buffered and RMA paths.
  • +ofi_cq_entries_count: (default: 8) Maximum number of entries to read from the completion queue with each call to fi_cq_read().
  • +ofi_use_inject: (default: 1) Whether to use buffered (inject) sends.
  • +ofi_num_recvs: (default: 8) Number of pre-posted receive buffers.
  • +ofi_runtime_tcp: (default: off) During the initialization phase, the OFI EP names need to be exchanged among all nodes.
    By default, the exchange is done with both PMI and OFI. If this flag is set, the exchange is done with PMI only.

For example:

$MPIEXEC -n 2 -ppn 1 ./namd2 +ppn 1 <workload_path> +ofi_eager_maxsize 32768 +ofi_num_recvs 16

Best performance results reported on a cluster of up to 128 Intel® Xeon Phi™ processor nodes (ns/day; higher is better)

Workload / Nodes (2 HT)     1       2       4       8       16
stmv (ns/day)               0.55    1.05    1.86    3.31    5.31

Workload / Nodes (2 HT)     8       16      32      64      128
stmv.28M (ns/day)           0.152   0.310   0.596   1.03    1.91

About the Authors

Alexander Bobyr is a CRT application engineer at the INNL lab at Intel, supporting and providing feedback for HPC deals and software tools. He serves as a technical expert and representative for SPEC HPG. Alexander has a Bachelor's degree in Intelligent Systems and a Master's degree in Artificial Intelligence from the Moscow Power Engineering Institute, Russia.

Mikhail Shiryaev is a Software Development Engineer in the Software and Services Group (SSG) at Intel. He is part of the Cluster Tools team, working on the development of the Intel MPI and Intel MLSL libraries. His major interests are high-performance computing, distributed systems, and distributed deep learning. Mikhail received his Master's degree and his Bachelor's degree in Software Engineering from Lobachevsky State University of Nizhny Novgorod, Russia.

Smahane Douyeb is currently working as a Software Apps Engineer in the Software and Services Group (SSG) at Intel. Part of her job is to run and validate recipes and benchmarks for various HPC platforms for competitive testing purposes. She also works on HPC Python application optimization on some Intel platforms. She received her Bachelor's degree in Software Engineering from the Oregon Institute of Technology. She is very passionate about growing and learning to achieve her dream of becoming a Principal Engineer.

Comments

Hi

I also ran into this problem.

I think CmiBarrierZero should be implemented in charm++/src/arch/ofi/machine.c.

Maybe the commit hasn't been merged to the release branch yet.

Hello,

I tried to follow the instructions using charm-6.8.2, namd-2.12, but the build of namd2 fails at the link step with:

.rootdir/charm-6.8.2/ofi-linux-x86_64-slurmpmi-smp-icc/bin/../lib/libconv-core.a(convcore.o): In function `CmiTimerInit':
convcore.c:(.text+0x7126): undefined reference to `CmiBarrierZero'

CmiBarrierZero is referenced at line 1105 of convcore.c, which is within an #if CMK_TIMER_USE_RDTSC block, so it needs to be defined somewhere, but I can't find it.

- Martin
