Recipe: Building NAMD on Intel® Xeon® and Intel® Xeon Phi™ Processors on a Single Node

By Alexander Bobyr, Mikhail Shiryaev, and Smahane Douyeb

For cluster runs, please refer to the companion recipe: Building NAMD on Intel® Xeon® and Intel® Xeon Phi™ Processors on a Cluster

Purpose

This recipe describes a step-by-step process for obtaining, building, and running NAMD (a scalable molecular dynamics code) on the Intel® Xeon Phi™ processor and Intel® Xeon® processor E5 family to achieve better performance.

Introduction

NAMD is a parallel molecular dynamics code designed for high-performance simulation of large biomolecule systems. Based on Charm++ parallel objects, NAMD scales to hundreds of cores for typical simulations and beyond 500,000 cores for the largest simulations. NAMD uses the popular molecular graphics program VMD for simulation setup and trajectory analysis, but is also file-compatible with AMBER, CHARMM, and X-PLOR.

NAMD is distributed free of charge with source code. You can build NAMD yourself or download binaries for a wide variety of platforms. Below are the details for how to build NAMD on the Intel Xeon Phi processor and Intel Xeon processor E5 family. You can learn more about NAMD at http://www.ks.uiuc.edu/Research/namd/.

Building and Running NAMD on the Intel® Xeon® Processor E5-2697 v4 (formerly Broadwell, BDW), Intel® Xeon Phi™ Processor 7250 (formerly Knights Landing, KNL), and Intel® Xeon® Gold 6148 Processor (formerly Skylake, SKX)

Download the Code

  1. Download the latest NAMD source code from this site: http://www.ks.uiuc.edu/Development/Download/download.cgi?PackageName=NAMD
  2. Download Charm++ version 6.7.1.

    a. You can get Charm++ from the NAMD nightly-build source code.

    b. Or download it separately: http://charmplusplus.org/download/

  3. Download the fftw3 version: http://www.fftw.org/download.html

    Version 3.3.4 is used in this run.

  4. Download the apoa1 and stmv workloads (a download sketch follows this list): http://www.ks.uiuc.edu/Research/namd/utilities/
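
The fftw and workload downloads can be scripted. Below is a minimal sketch; the tarball names are assumptions based on the pages linked above, and NAMD itself requires registration on its download page, so it is not fetched here:

    # Hypothetical download sketch; tarball names are assumptions taken
    # from the pages linked above. Download NAMD manually (registration
    # is required on its download page).
    wget http://www.fftw.org/fftw-3.3.4.tar.gz
    tar xzf fftw-3.3.4.tar.gz

    wget http://www.ks.uiuc.edu/Research/namd/utilities/apoa1.tar.gz
    wget http://www.ks.uiuc.edu/Research/namd/utilities/stmv.tar.gz
    tar xzf apoa1.tar.gz
    tar xzf stmv.tar.gz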

Build the Binaries

  1. Set the environment for compilation:
    CC=icc; CXX=icpc; F90=ifort; F77=ifort
    export CC CXX F90 F77
    source /opt/intel/compiler/<version>/compilervars.sh intel64
  2. Build fftw3:

    a. cd <fftw_root_path>

    b. ./configure --prefix=<fftw_install_path> --enable-single --disable-fortran CC=icc

    c. make CFLAGS="-O3 -xMIC-AVX512 -fp-model fast=2 -no-prec-div -qoverride-limits" clean install

    In the CFLAGS above, use -xMIC-AVX512 for KNL, -xCORE-AVX512 for SKX, and -xCORE-AVX2 for BDW.
  3. Build a multicore version of Charm++:

    a. cd <charm_root_path>

    b. In src/arch/multicore-linux64/conv-mach.h, set the CMK_TIMER_USE_RDTSC timer and unset the other timers:

    #define CMK_TIMER_USE_RDTSC                                1
    #define CMK_TIMER_USE_GETRUSAGE                            0
    #define CMK_TIMER_USE_SPECIAL                              0
    #define CMK_TIMER_USE_TIMES                                0

    c. ./build charm++ multicore-linux64 iccstatic --with-production "-O3 -ip"
  4. Build NAMD:

    a. Modify arch/Linux-x86_64-icc.arch to look like the following (select one of the FLOATOPTS lines depending on the CPU type):

    NAMD_ARCH = Linux-x86_64
    CHARMARCH = multicore-linux64-iccstatic
    
    # For KNL
    FLOATOPTS = -ip -xMIC-AVX512  -O3 -g -fp-model fast=2 -no-prec-div -qoverride-limits -DNAMD_DISABLE_SSE
    
    # For SKX
    FLOATOPTS = -ip -xCORE-AVX512  -O3 -g -fp-model fast=2 -no-prec-div -qoverride-limits -DNAMD_DISABLE_SSE
    
    # For BDW
    FLOATOPTS = -ip -xCORE-AVX2  -O3 -g -fp-model fast=2 -no-prec-div -qoverride-limits -DNAMD_DISABLE_SSE
    
    CXX = icpc -std=c++11 -DNAMD_KNL
    CXXOPTS = -static-intel -O2 $(FLOATOPTS)
    CXXNOALIASOPTS = -O3 -fno-alias $(FLOATOPTS) -qopt-report-phase=loop,vec -qopt-report=4
    CXXCOLVAROPTS = -O2 -ip
    CC = icc
    COPTS = -static-intel -O2 $(FLOATOPTS)
    

    b. Compile NAMD (the whole build flow is consolidated in the sketch after this list):

    i. ./config Linux-x86_64-icc --charm-base <charm_root_path> --charm-arch multicore-linux64-iccstatic --with-fftw3 --fftw-prefix <fftw_install_path> --without-tcl --charm-opts -verbose

    ii. gmake -j
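
The four steps above can be chained into one script. The following is a minimal sketch for a KNL build, assuming the conv-mach.h and arch-file edits from steps 3b and 4a have already been made; the <...> paths are placeholders to fill in:

    #!/bin/bash
    # Sketch of the full build flow for KNL; swap -xMIC-AVX512 for
    # -xCORE-AVX512 (SKX) or -xCORE-AVX2 (BDW) as appropriate.
    set -e

    export CC=icc CXX=icpc F90=ifort F77=ifort
    source /opt/intel/compiler/<version>/compilervars.sh intel64

    # 1. fftw3 (single precision, no Fortran wrappers)
    cd <fftw_root_path>
    ./configure --prefix=<fftw_install_path> --enable-single --disable-fortran CC=icc
    make CFLAGS="-O3 -xMIC-AVX512 -fp-model fast=2 -no-prec-div -qoverride-limits" clean install

    # 2. Charm++ multicore build (conv-mach.h already set to the RDTSC timer)
    cd <charm_root_path>
    ./build charm++ multicore-linux64 iccstatic --with-production "-O3 -ip"

    # 3. NAMD (arch file already edited as in step 4a)
    cd <namd_root_path>
    ./config Linux-x86_64-icc --charm-base <charm_root_path> \
        --charm-arch multicore-linux64-iccstatic \
        --with-fftw3 --fftw-prefix <fftw_install_path> \
        --without-tcl --charm-opts -verbose
    cd Linux-x86_64-icc
    gmake -j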

Other System Setup

  1. Change the kernel settings for KNL: "nmi_watchdog=0 rcu_nocbs=2-271 nohz_full=2-271". Here is one way to change the settings (this could differ from system to system):

    a. To be safe, first save your original grub.cfg:

    cp /boot/grub2/grub.cfg /boot/grub2/grub.cfg.ORIG

    b. In /etc/default/grub, append the following to the GRUB_CMDLINE_LINUX line:

    nmi_watchdog=0 rcu_nocbs=2-271 nohz_full=2-271

    c. Save your new configuration:

    grub2-mkconfig -o /boot/grub2/grub.cfg 

    d. Reboot the system. After logging in, verify the settings with "cat /proc/cmdline".

  2. Change the following lines in the *.namd file of each workload (a scripted version follows this list):

    numsteps 1000

    outputtiming 20

    outputenergies 600
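
If you prefer to script this edit, here is a hedged sketch using GNU sed. It assumes the three keywords already appear in apoa1.namd and stmv.namd; the trailing I flag makes the match case-insensitive, since NAMD configuration keywords are case-insensitive:

    # Apply the run-length and output settings to both workload files.
    for f in apoa1/apoa1.namd stmv/stmv.namd; do
        sed -i -e 's/^numsteps[[:space:]].*/numsteps 1000/I' \
               -e 's/^outputtiming[[:space:]].*/outputtiming 20/I' \
               -e 's/^outputenergies[[:space:]].*/outputenergies 600/I' "$f"
    done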

Run NAMD

  • On SKX/BDW (ppn = 40 / ppn = 72, respectively; see the wrapper sketch after this list):
    ./namd2 +p $ppn apoa1/apoa1.namd +pemap 0-$(($ppn-1))
  • On KNL (ppn = 136, i.e., 2 hyper-threads per core; MCDRAM in flat mode, with similar performance in cache mode):
    numactl -p 1 ./namd2 +p $ppn apoa1/apoa1.namd +pemap 0-$(($ppn-1))
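
The $(( )) arithmetic above is ordinary shell expansion. A minimal wrapper that makes it explicit (the script name run_namd.sh is hypothetical; the ppn values follow the recipe):

    #!/bin/bash
    # Hypothetical run wrapper (run_namd.sh): ppn is 40 on SKX, 72 on BDW,
    # and 136 on KNL; pass it as the first argument.
    ppn=${1:-40}
    last=$((ppn - 1))          # +pemap takes an inclusive range of cores
    ./namd2 +p "$ppn" apoa1/apoa1.namd +pemap 0-"$last"

On BDW, for example, invoke it as ./run_namd.sh 72; on KNL, prefix the namd2 line with numactl -p 1 as shown above.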

KNL Example

numactl -p 1 <namd_root_path>/Linux-KNL-icc/namd2 +p 136 apoa1/apoa1.namd +pemap 0-135
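
Here numactl -p 1 makes allocations prefer NUMA node 1, which is where the 16 GB of MCDRAM appears in flat mode on this system. You can confirm the node numbering before running; MCDRAM shows up as a CPU-less node:

    numactl -H    # in flat mode, MCDRAM is the NUMA node listed with no CPUs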

Performance results reported in the Intel Salesforce repository (ns/day; higher is better):

Workload | 2S Intel® Xeon® Processor E5-2697 v4, 18c, 2.3 GHz (ns/day) | Intel® Xeon Phi™ Processor 7250 bin1 (ns/day) | Intel® Xeon Phi™ Processor 7250 vs. 2S Intel® Xeon® Processor E5-2697 v4 (speedup)
stmv | 0.45 | 0.55 | 1.22x
apoa1 | 5.5 | 6.18 | 1.12x

Workload | 2S Intel® Xeon® Gold 6148 Processor, 20c, 2.4 GHz (ns/day) | 2S Intel® Xeon® Gold 6148 Processor vs. 2S Intel® Xeon® Processor E5-2697 v4 (speedup)
stmv | 0.73 | 1.44x
apoa1 original | 7.68 | 1.43x
apoa1 | 8.70 | 1.44x

Systems configuration

Parameter | Intel® Xeon® Processor E5-2697 v4 | Intel® Xeon® Gold 6148 Processor | Intel® Xeon Phi™ Processor 7250
Stepping | 1 (B0) | 1 (B0) | 1 (B0) Bin1
Sockets / TDP | 2S / 290W | 2S / 300W | 1S / 215W
Frequency / Cores / Threads | 2.3 GHz / 36 / 72 | 2.4 GHz / 40 / 80 | 1.4 GHz / 68 / 272
DDR4 | 8x16 GB 2400 MHz (128 GB) | 12x16 GB 2666 MHz (192 GB) | 6x16 GB 2400 MHz (96 GB)
MCDRAM | N/A | N/A | 16 GB, flat
Cluster / Snoop Mode / Mem Mode | Home | Home | Quadrant / flat
Turbo | On | On | On
BIOS | GRRFSDP1.86B0271.R00.1510301446 | | GVPRCRB1.86B.0010.R02.1608040407
Compiler | ICC-2017.0.098 | ICC-2016.4.298 | ICC-2017.0.098
Operating System | Red Hat Enterprise Linux* 7.2 (3.10.0-327.el7.x86_64) | Red Hat Enterprise Linux 7.3 (3.10.0-514.6.2.0.1.el7.x86_64.knl1) | Red Hat Enterprise Linux 7.2 (3.10.0-327.22.2.el7.xppsl_1.4.1.3272.x86_64)

About the Authors

Alexander Bobyr is a CRT application engineer at the INNL lab at Intel, supporting HPC deals and providing feedback for software tools. He serves as a technical expert and representative for SPEC HPG. Alexander holds a Bachelor's degree in Intelligent Systems and a Master's degree in Artificial Intelligence from the Moscow Power Engineering Institute, Russia.

Mikhail Shiryaev is a software development engineer in the Software and Services Group (SSG) at Intel. He is part of the Cluster Tools team, working on the development of the Intel MPI and Intel MLSL libraries. His main interests are high-performance computing, distributed systems, and distributed deep learning. Mikhail received his Master's and Bachelor's degrees in Software Engineering from Lobachevsky State University of Nizhny Novgorod, Russia.

Smahane Douyeb is currently working as a software applications engineer in the Software and Services Group (SSG) at Intel. Part of her job is to run and validate recipes and benchmarks on various HPC platforms for competitive testing purposes. She also works on optimizing HPC Python applications for some Intel platforms. She received her Bachelor's degree in Software Engineering from the Oregon Institute of Technology. She is passionate about growing and learning, with the goal of becoming a Principal Engineer.

For more complete information about compiler optimizations, see our Optimization Notice.