For single-node runs, refer to the recipe: Building NAMD on Intel® Xeon® and Intel® Xeon Phi™ processors
This recipe describes a step-by-step process to get, build, and run the NAMD (Nanoscale Molecular Dynamics) code on Intel® Xeon Phi™ and Intel® Xeon® processors to achieve better performance.
NAMD is a parallel molecular dynamics code designed for high-performance simulation of large biomolecule systems. Based on Charm++ parallel objects, NAMD scales to hundreds of cores for typical simulations and beyond 500,000 cores for the largest simulations. NAMD uses the popular molecular graphics program VMD for simulation setup and trajectory analysis, but is also file-compatible with AMBER, CHARMM, and X-PLOR.
NAMD is distributed free of charge with source code. You can build NAMD yourself or download binaries for a wide variety of platforms. Below are the details for how to build NAMD on the Intel Xeon Phi processor and the Intel Xeon processor E5 family. You can learn more about NAMD on the NAMD website.
Build and Run NAMD on an Intel® Xeon® Processor Cluster
Intel® Xeon® Processor E5-2697 v4, Intel® Xeon Phi™ processor 7250, and Intel® Xeon® Gold 6148 processor
Download the code
- Download the latest NAMD source code
- Download OpenFabrics Interfaces (OFI). NAMD uses Charm++/OFI for multi-node runs.
- You can use the already-installed OFI library that comes with the IFS package, or download and build it manually.
- To check the version of the installed OFI, use the "fi_info --version" command (OFI 1.4.2 was used here).
- The OFI library can be downloaded from GitHub*.
- Download Charm++ with OFI support:
- Download fftw3 (version 3.3.4 is used in this run)
- Download the apoa1 and stmv workloads
Build the binaries
- Set the environment for compilation:
CC=icc; CXX=icpc; F90=ifort; F77=ifort
export CC CXX F90 F77
source /opt/intel/compiler/<version>/compilervars.sh intel64
- Build the OFI library (you can skip this step if you want to use the installed OFI library):
./configure --prefix=<libfabric_install_path> --enable-psm2
make clean && make -j12 all && make install
- A custom OFI build can then be used via LD_PRELOAD or LD_LIBRARY_PATH:
LD_PRELOAD=<libfabric_install_path>/lib/libfabric.so mpiexec.hydra …
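Alternatively, instead of preloading the library for a single launch, the custom libfabric directory can be put on the loader path for the whole session. A minimal sketch (the install path is the same placeholder as above; this is a configuration fragment, not a complete launch command):

```shell
# Alternative to LD_PRELOAD: prepend the custom libfabric to the loader
# search path so every subsequent launch in this shell picks it up.
export LD_LIBRARY_PATH=<libfabric_install_path>/lib:$LD_LIBRARY_PATH
mpiexec.hydra …
```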
- Build fftw3:
./configure --prefix=<fftw_install_path> --enable-single --disable-fortran CC=icc
Use -xCORE-AVX512 for SKX, -xMIC-AVX512 for KNL, and -xCORE-AVX2 for BDW.
make CFLAGS="-O3 -xMIC-AVX512 -fp-model fast=2 -no-prec-div -qoverride-limits" clean install
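The CPU-to-flag mapping in the note above can be captured in a small helper so the same build line works for all three targets. A sketch (`isa_flag` is a hypothetical name, not part of the recipe):

```shell
# Sketch: map the target CPU family to the matching Intel compiler ISA flag
# (SKX = Skylake-SP, KNL = Knights Landing, BDW = Broadwell).
isa_flag() {
  case "$1" in
    SKX) printf '%s\n' "-xCORE-AVX512" ;;
    KNL) printf '%s\n' "-xMIC-AVX512" ;;
    BDW) printf '%s\n' "-xCORE-AVX2" ;;
    *)   printf 'unknown CPU type: %s\n' "$1" >&2; return 1 ;;
  esac
}

isa_flag KNL   # -> -xMIC-AVX512
```

It could then be spliced into the build, e.g. `make CFLAGS="-O3 $(isa_flag KNL) -fp-model fast=2 -no-prec-div -qoverride-limits" clean install`.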
- Build multi-node version of Charm++:
Set the CMK_TIMER_USE_RDTSC timer and unset the other timers in file src/arch/ofi-linux-x86_64/conv-mach.h:

#define CMK_TIMER_USE_RTC 0
#define CMK_TIMER_USE_RDTSC 1
#define CMK_TIMER_USE_GETRUSAGE 0
#define CMK_TIMER_USE_SPECIAL 0
#define CMK_TIMER_USE_TIMES 0
#define CMK_TIMER_USE_BLUEGENEL 0

Unset CMK_TIMER_USE_GETRUSAGE in file src/arch/ofi-linux-x86_64/conv-mach-smp.h:

#undef CMK_TIMER_USE_GETRUSAGE
#undef CMK_TIMER_USE_SPECIAL
#define CMK_TIMER_USE_GETRUSAGE 0
#define CMK_TIMER_USE_SPECIAL 0
./build charm++ ofi-linux-x86_64 icc smp --basedir <libfabric_root_path> --with-production "-O3 -ip" -DCMK_OPTIMIZE
- Build NAMD:
- Modify the arch/Linux-x86_64-icc file to look like the following (select one of the FLOATOPTS options depending on the CPU type):
NAMD_ARCH = Linux-x86_64
CHARMARCH = multicore-linux64-iccstatic

# For KNL
FLOATOPTS = -ip -xMIC-AVX512 -O3 -g -fp-model fast=2 -no-prec-div -qoverride-limits -DNAMD_DISABLE_SSE
# For SKX
FLOATOPTS = -ip -xCORE-AVX512 -O3 -g -fp-model fast=2 -no-prec-div -qoverride-limits -DNAMD_DISABLE_SSE
# For BDW
FLOATOPTS = -ip -xCORE-AVX2 -O3 -g -fp-model fast=2 -no-prec-div -qoverride-limits -DNAMD_DISABLE_SSE

CXX = icpc -std=c++11 -DNAMD_KNL
CXXOPTS = -static-intel -O2 $(FLOATOPTS)
CXXNOALIASOPTS = -O3 -fno-alias $(FLOATOPTS) -qopt-report-phase=loop,vec -qopt-report=4
CXXCOLVAROPTS = -O2 -ip
CC = icc
COPTS = -static-intel -O2 $(FLOATOPTS)
- Compile NAMD:
./config Linux-x86_64-icc --charm-base <charm_root_path> --charm-arch ofi-linux-x86_64-smp-icc --with-fftw3 --fftw-prefix <fftw_install_path> --without-tcl --charm-opts -verbose
make clean && gmake -j
- Build memopt NAMD binaries:
Build as for BDW/KNL, but add the extra option "--with-memopt" to the config step.
Change the following lines in the *.namd file for both the stmv and apoa1 workloads:
Run the binaries
- Set the environment for launching:
source /opt/intel/compiler/<version>/compilervars.sh intel64
Specify the host names to run on in the "hosts" file.
export MPPEXEC="mpiexec.hydra -hostfile ./hosts"
export PSM2_SHAREDCONTEXTS=0 (if you use PSM2 < 10.2.85)
- Launch the task (for example, with N nodes, 1 process per node, and PPN cores per node):
$MPPEXEC -n N -ppn 1 ./namd2 +ppn (PPN-1) <workload_path> +pemap 1-(PPN-1) +commap 0

For example, for BDW (PPN=72):
$MPPEXEC -n 8 -ppn 1 ./namd2 +ppn 71 <workload_path> +pemap 1-71 +commap 0

For example, for KNL (PPN=68, without hyper-threads):
$MPPEXEC -n 8 -ppn 1 ./namd2 +ppn 67 <workload_path> +pemap 1-67 +commap 0

For example, for KNL (with 2 hyper-threads per core):
$MPPEXEC -n 8 -ppn 1 ./namd2 +ppn 134 <workload_path> +pemap 0-66+68 +commap 67
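The +ppn/+pemap arithmetic above can be wrapped in a small helper that derives the affinity arguments from PPN. A sketch under the same assumptions as the examples (1 process per node, core 0 reserved for the communication thread; `namd_affinity_args` is a hypothetical name):

```shell
# Hypothetical helper: derive the NAMD affinity arguments from the number of
# cores per node (PPN). Core 0 runs the communication thread; cores
# 1..PPN-1 run the worker threads, matching the launch examples above.
namd_affinity_args() {
  local ppn=$1
  local workers=$((ppn - 1))
  echo "+ppn ${workers} +pemap 1-${workers} +commap 0"
}

namd_affinity_args 72   # -> +ppn 71 +pemap 1-71 +commap 0
```

A BDW launch would then read `$MPPEXEC -n 8 -ppn 1 ./namd2 $(namd_affinity_args 72) <workload_path>`.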
- For KNL with MCDRAM in flat mode:
$MPPEXEC -n N -ppn 1 numactl -p 1 ./namd2 +ppn (PPN-1) <workload_path> +pemap 1-(PPN-1) +commap 0
To achieve better scaling on multi-node runs, increase the number of communication threads (1, 2, 4, 8, 13, 17). For example, the following command runs on N KNL nodes with 17 processes per node and 8 threads per process (7 worker threads and 1 communication thread):
$MPPEXEC -n $(($N*17)) -ppn 17 numactl -p 1 ./namd2 +ppn 7 <workload_path> +pemap 0-67,68-135:4.3 +commap 71-135:4
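The +pemap/+commap strings above use the Charm++ range syntax "L-U:S.R": start at core L, stop at U, and take R consecutive cores every S (S and R default to 1). As a sanity check before launching, a small expander can print the explicit core list a map resolves to. A sketch (`expand_pemap` is a hypothetical name and handles one range, not comma-separated lists):

```shell
# Sketch: expand a single Charm++ +pemap range "L-U:S.R" into an explicit,
# comma-separated core list (stride S and run R are optional, defaulting to 1).
expand_pemap() {
  local spec=$1 out="" rest=""
  local range=${spec%%:*}
  [ "$spec" != "$range" ] && rest=${spec#*:}
  local L=${range%-*} U=${range#*-}
  local S=${rest%%.*} R=1
  [ -z "$S" ] && S=1
  case "$rest" in *.*) R=${rest#*.} ;; esac
  local base i
  for ((base = L; base <= U; base += S)); do
    # emit a run of R consecutive cores starting at each stride point
    for ((i = 0; i < R && base + i <= U; i++)); do
      out="${out:+$out,}$((base + i))"
    done
  done
  echo "$out"
}

expand_pemap 0-3        # -> 0,1,2,3
expand_pemap 68-79:4.3  # -> 68,69,70,72,73,74,76,77,78
```

For instance, `expand_pemap 71-135:4` lists the communication cores 71,75,79,... used in the 17-process example above.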
Basic Charm++/OFI knobs (these should be added as NAMD command-line parameters)
- +ofi_eager_maxsize: (default: 65536) Threshold between buffered and RMA paths
- +ofi_cq_entries_count: (default: 8) Maximum number of entries to read from the completion queue with each call to fi_cq_read().
- +ofi_use_inject: (default: 1) Whether to use buffered sends.
- +ofi_num_recvs: (default: 8) Number of pre-posted receive buffers.
- +ofi_runtime_tcp: (default: off) During the initialization phase, the OFI endpoint (EP) names need to be exchanged among all nodes. By default, the exchange is done with both PMI and OFI; if this flag is set, the exchange is done with PMI only.
$MPPEXEC -n 2 -ppn 1 ./namd2 +ppn 1 <workload_path> +ofi_eager_maxsize 32768 +ofi_num_recvs 16
The best performance results were reported on clusters of up to 128 Intel® Xeon Phi™ processor nodes (ns/day; higher is better).
About the Authors
Alexander Bobyr is a CRT application engineer at the INNL lab at Intel, supporting and providing feedback for HPC deals and software tools. He serves as a technical expert and representative for SPEC HPG. Alexander holds a Bachelor's degree in Intelligent Systems and a Master's degree in Artificial Intelligence from the Moscow Power Engineering Institute, Russia.
Mikhail Shiryaev is a Software Development Engineer in the Software and Services Group (SSG) at Intel. He is part of the Cluster Tools team working on the development of the Intel MPI and Intel MLSL libraries. His major interests are high-performance computing, distributed systems, and distributed deep learning. Mikhail received his Master's degree and his Bachelor's degree in Software Engineering from Lobachevsky State University of Nizhny Novgorod, Russia.
Smahane Douyeb is currently working as a Software Apps Engineer in the Software and Services Group (SSG) at Intel. Part of her job is to run and validate recipes and benchmarks for various HPC platforms for competitive testing purposes. She also works on HPC Python application optimization on some Intel platforms. She received her Software Engineering Bachelor's degree from the Oregon Institute of Technology. She is very passionate about growing and learning to achieve her dream of becoming a Principal Engineer.