NWChem* for the Intel® Xeon Phi™ Coprocessor

Purpose

This code recipe describes how to get, build, and use the NWChem* code that includes support for the Intel® Xeon Phi™ Coprocessor with Intel® Many-Integrated Core (MIC) architecture.

Introduction

NWChem provides scalable computational chemistry tools. NWChem codes treat large scientific computational chemistry problems efficiently, and they can take advantage of parallel computing resources, from high-performance parallel supercomputers to conventional workstation clusters.

NWChem software handles

  • Biomolecules, nanostructures, and solid-state
  • From quantum to classical, and all combinations
  • Ground and excited-states
  • Gaussian basis functions or plane-waves
  • Wide scalability, from one to thousands of processors
  • Properties and relativistic effects

NWChem is actively developed by a consortium of developers and maintained by the Environmental Molecular Sciences Laboratory (EMSL) located at the Pacific Northwest National Laboratory (PNNL) in Washington State. The code is distributed as open-source under the terms of the Educational Community License version 2.0 (ECL 2.0).

The current version of NWChem can be downloaded from http://www.nwchem-sw.org. Support for Intel® Xeon Phi™ coprocessors is included in NWChem 6.5 or later. The latest development version, which can be downloaded at https://svn.pnl.gov/svn/nwchem/trunk, might contain additional NWChem modules with support for the Xeon Phi coprocessor. Please check the release notes and NWChem documentation for further information.

Code Access

NWChem uses the Intel® Language Extensions for Offload to offload operations from the Intel® Xeon® processor (referred to as ‘host’ in this document) to the Intel Xeon Phi coprocessor (referred to as ‘coprocessor’ in this document), both in a single node and in a cluster environment.

To download NWChem, please go to http://www.nwchem-sw.org/index.php/Download and download the latest version.  It is advisable to download the source code version, so that you can configure NWChem for your system as desired.

Build Directions

The build of NWChem with offload support for Intel Xeon Phi coprocessors is split into three steps.

  1. Configure NWChem for your system.
  2. Enable offload support.
  3. Build NWChem.

Configure

Set the following configuration options (shown in bash syntax):

export ARMCI_NETWORK=OPENIB
export ARMCI_DEFAULT_SHMMAX_UBOUND=65536
export USE_MPI=y
export NWCHEM_MODULES=all\ python
export USE_MPIF=y
export USE_MPIF4=y
export MPI_HOME=$I_MPI_HOME/intel64
export MPI_INCLUDE="$MPI_HOME"/include
export MPI_LIB="$MPI_HOME"/lib
export LIBMPI="-lmpi -lmpigf -lmpigi -lrt -lpthread"
export MKLROOT=/msc/apps/compilers/intel/14.0/composer_xe_2013_sp1.1.106/mkl/
export SCALAPACK_LIB=" -mkl -openmp -lmkl_scalapack_ilp64 -lmkl_blacs_intelmpi_ilp64 -lpthread -lm"
export SCALAPACK="$SCALAPACK_LIB"
export LAPACK_LIB="-mkl -openmp  -lpthread -lm"
export BLAS_LIB="$LAPACK_LIB"
export BLASOPT="$LAPACK_LIB"
export USE_SCALAPACK=y
export SCALAPACK_SIZE=8
export BLAS_SIZE=8
export LAPACK_SIZE=8
export PYTHONHOME=/usr
export PYTHONVERSION=2.6
export PYTHONLIBTYPE=so
export USE_PYTHON64=y
export USE_CPPRESERVE=y
export USE_NOFSCHECK=y
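
Because the build reads these settings from the environment, a quick check for unset variables can catch typos before a long compile. The following is a minimal sketch, not part of NWChem: the helper name missing_vars is hypothetical, and the list of variables passed to it is just a subset of the block above.

```shell
# Sketch: report which of the recipe's key variables are still unset.
# missing_vars is a hypothetical helper name, not an NWChem tool.
missing_vars() {
    # Print the name of every argument that is unset or empty.
    for name in "$@"; do
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "$name"
        fi
    done
}

# Variables taken from the configuration block above.
missing_vars ARMCI_NETWORK USE_MPI MPI_HOME MKLROOT BLASOPT USE_SCALAPACK
```

An empty result means every listed variable has a value; it does not verify that the values are correct for your system.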

Enable Offload Support

Set the following environment variables to enable offload support:

export USE_OPENMP=1
export USE_OFFLOAD=1

Build

To build NWChem, issue the following commands:

cd $NWCHEM_TOP/src
make FC=ifort CC=icc AR=xiar
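
The build commands rely on $NWCHEM_TOP, which the configuration step above does not set. The sketch below assumes it should point at the top of the unpacked NWChem source tree and checks that, together with the two offload switches, before printing the build commands. The helper name preflight_build is hypothetical, and the function only prints the commands rather than running them.

```shell
# Sketch: refuse to start the build unless the source tree and the
# offload switches from this recipe are in place.
# preflight_build is a hypothetical helper name, not an NWChem tool.
preflight_build() {
    if [ -z "${NWCHEM_TOP:-}" ] || [ ! -d "$NWCHEM_TOP/src" ]; then
        echo "NWCHEM_TOP must point at the top of the NWChem source tree" >&2
        return 1
    fi
    if [ "${USE_OPENMP:-}" != "1" ] || [ "${USE_OFFLOAD:-}" != "1" ]; then
        echo "USE_OPENMP=1 and USE_OFFLOAD=1 are required for offload support" >&2
        return 1
    fi
    # Dry run: print the build commands instead of executing them.
    echo "cd $NWCHEM_TOP/src && make FC=ifort CC=icc AR=xiar"
}

preflight_build || echo "preflight failed; fix the environment before building"
```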

This will build NWChem with support for Intel Xeon Phi coprocessors for the CCSD(T) method (as of 12 August 2014). Coprocessor support for more NWChem methods will follow in the future.

If you are running a cluster based on Intel® True Scale Fabric, please check the NWChem documentation for the correct configuration settings to use.

Running Workloads Using NWChem CCSD(T) Method

To run the CCSD(T) method you will need to use a proper NWChem input file that triggers this module. You can find an example input file in the Appendix of this document. Other input files that use the CCSD(T) method can be found on the NWChem website at http://www.nwchem-sw.org.

To run the code only on hosts in the traditional mode using plain Global Arrays (GA), run the following command:

$ OMP_NUM_THREADS=1 mpirun -np 768 -perhost 16 nwchem input.nw

This command will execute NWChem using a file called “input.nw” with 768 GA ranks and 16 processes per node (a total of 48 machines).
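
The machine count follows from dividing the total number of GA ranks by the ranks per node. As a small illustration (nodes_needed is a hypothetical helper, not an NWChem or MPI tool):

```shell
# nodes_needed: total ranks divided by ranks per node.
# Hypothetical helper for illustration only.
nodes_needed() {
    np=$1
    perhost=$2
    if [ $((np % perhost)) -ne 0 ]; then
        echo "np ($np) is not a multiple of perhost ($perhost)" >&2
        return 1
    fi
    echo $((np / perhost))
}

nodes_needed 768 16   # prints 48
```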

To enable OpenMP* threading on the host and use fewer total GA ranks run the following command:

$ OMP_NUM_THREADS=2 mpirun -np 384 -perhost 8 nwchem input.nw

This directs NWChem to use eight GA ranks per node and launches two threads for each process on the node. Because it uses fewer GA ranks, less communication takes place; thus, you should observe a speed-up compared to the plain method above.
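
The rank counts in this command can be derived from the hardware: with 16 host cores per node and two threads per rank, eight ranks fill a node, and 48 nodes give 384 ranks in total. The sketch below reproduces that arithmetic and prints the resulting launch line without running it. The helper name ranks_per_node is hypothetical, and the assumption that ranks times threads should equal the core count is a rule of thumb rather than an NWChem requirement.

```shell
# ranks_per_node: cores per node divided by OpenMP threads per rank.
# Hypothetical helper for illustration only.
ranks_per_node() {
    echo $(( $1 / $2 ))
}

cores=16     # two 8-core sockets, per the server configuration in this recipe
threads=2    # OMP_NUM_THREADS
nodes=48

perhost=$(ranks_per_node "$cores" "$threads")
np=$((nodes * perhost))

# Dry run: print the launch line instead of executing it.
echo "OMP_NUM_THREADS=$threads mpirun -np $np -perhost $perhost nwchem input.nw"
```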

Our next step is to enable offloading to the Intel Xeon Phi coprocessor, by executing this command:

$ NWC_RANKS_PER_DEVICE=2 OMP_NUM_THREADS=4 mpirun -np 384 -perhost 8 nwchem input.nw

The NWC_RANKS_PER_DEVICE environment variable enables offloading if it is set to an integer larger than 0. It also controls how many GA ranks from the host will offload to each of the compute node’s coprocessors.

In the example, we assume that the node contains two coprocessors and that NWChem should allocate two GA ranks per coprocessor. Hence, 4 out of 8 GA ranks assigned to a particular compute node will offload to the coprocessors. During offload, a host core is idle; thus, we double the number of OpenMP threads for the host (OMP_NUM_THREADS=4) in order to fill the idle core with work from another GA rank.
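
The split between offloading and host-only ranks is simple arithmetic on the values in the command. A minimal sketch follows; offload_split is a hypothetical helper, and the host_only figure is simply the remainder, not something NWChem reports.

```shell
# offload_split: how many GA ranks per node offload, and how many do not.
# Hypothetical helper. Arguments:
#   $1  GA ranks per node (-perhost)
#   $2  coprocessors per node
#   $3  NWC_RANKS_PER_DEVICE
offload_split() {
    perhost=$1
    devices=$2
    ranks_per_device=$3
    offloading=$((devices * ranks_per_device))
    echo "offloading=$offloading host_only=$((perhost - offloading))"
}

offload_split 8 2 2   # prints offloading=4 host_only=4
```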

NWChem itself automatically detects the available coprocessors in the system and properly partitions them for optimal use.

For best performance, you should also enable turbo mode on both the host system and the coprocessors, plus set the following environment variable to use large pages on the coprocessor devices:

export MIC_USE_2MB_BUFFER=16K

In all of the above cases, NWChem will produce the output files as requested in the input file.

Among the last lines that NWChem prints to the console log, you will find a line that reports the total runtime consumed:

Total times  cpu:           wall: 

The reported runtimes will show considerable speedup for the OpenMP threaded version, as well as the offload version. Of course, the exact runtimes will depend on your system configuration. Experiment with the above settings to control OpenMP and offloading in order to find the best possible values for your system.
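
When comparing several runs, it can help to pull the wall time out of each console log automatically. The sketch below assumes the value follows the "wall:" label on the "Total times" line and may carry a trailing "s"; check this against your own logs. The helper name wall_seconds is hypothetical, and the sample line is made up for illustration, not a measured result.

```shell
# wall_seconds: read a console log on stdin and print the value that
# follows "wall:" on the last "Total times" line.
# Hypothetical helper; the log layout is an assumption of this sketch.
wall_seconds() {
    awk '/Total times/ {
             for (i = 1; i < NF; i++)
                 if ($i == "wall:") { v = $(i + 1); sub(/s$/, "", v); last = v }
         }
         END { if (last != "") print last }'
}

# Made-up sample line for illustration only.
printf ' Total times  cpu:      100.0s     wall:      120.5s\n' | wall_seconds
```

In practice the function would read a saved log, for example `wall_seconds < run.log`, where run.log is a placeholder for wherever you captured the console output.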

Performance Testing

The following chart shows the speedups achieved on NWChem using the configuration listed below. Your performance may be different, depending on configurations of your systems, system optimizations, and NWChem settings described above.

 

Testing Platform Configurations

Nodes   Intel® Xeon® processor cores   Intel® Xeon Phi™ coprocessor cores   Heterogeneous cores
130     2080                           15600                                17680
230     3680                           27600                                31280
360     5760                           43200                                48960
450     7200                           54000                                61200

Server Configuration:

  • Atipa Visione vf442, 2-socket/16 cores, Intel® C600 IOH
  • Processors: Two Intel® Xeon® processor E5-2670 @ 2.60GHz (8 cores) with Intel® Hyper-Threading Technology
  • Operating System: Scientific Linux* 6.5
  • Memory: 128GB DDR3 @ 1333 MHz
  • Coprocessors: 2X Intel® Xeon Phi™ Coprocessor 5110P, GDDR5 with 3.6 GT/s, Driver v3.1.2-1, FLASH image/micro OS 2.1.02.390
  • Intel® Composer XE 14.0.1.106 
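
The core counts in the platform table follow from this server configuration: 16 host cores per node, plus two coprocessors per node. The sketch below recomputes them as a consistency check. The figure of 60 cores per Intel Xeon Phi 5110P is not stated in this recipe; it is inferred here from the table values, and cores_for_nodes is a hypothetical helper.

```shell
# cores_for_nodes: print node count, host cores, coprocessor cores, and total.
# Hypothetical helper for illustration only.
cores_for_nodes() {
    nodes=$1
    host=$((nodes * 16))        # 2 sockets x 8 cores per node
    phi=$((nodes * 2 * 60))     # 2 coprocessors x 60 cores (inferred, see above)
    echo "$nodes $host $phi $((host + phi))"
}

for n in 130 230 360 450; do
    cores_for_nodes "$n"
done
```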

Appendix: Example Input File

start  example

title example

echo

memory stack   4800 mb heap 200 mb global 4800 mb noverify

geometry units angstrom noprint
symmetry c1
C     -0.7143     6.0940    -0.00
C      0.7143     6.0940    -0.00
C      0.7143    -6.0940     0.00
C     -0.7143    -6.0940     0.00
C      1.4050     4.9240    -0.00
C      1.4050    -4.9240     0.00
C     -1.4050    -4.9240     0.00
C     -1.4050     4.9240     0.00
C      1.4027     2.4587    -0.00
C     -1.4027     2.4587     0.00
C      1.4027    -2.4587    -0.00
C     -1.4027    -2.4587     0.00
C      1.4032    -0.0000    -0.00
C     -1.4032     0.0000     0.00
C      0.7258     1.2217    -0.00
C     -0.7258     1.2217     0.00
C      0.7258    -1.2217     0.00
C     -0.7258    -1.2217     0.00
C      0.7252     3.6642    -0.00
C     -0.7252     3.6642     0.00
C      0.7252    -3.6642     0.00
C     -0.7252    -3.6642     0.00
H     -1.2428     7.0380    -0.00
H      1.2428     7.0380    -0.00
H      1.2428    -7.0380     0.00
H     -1.2428    -7.0380     0.00
H      2.4878     4.9242    -0.00
H     -2.4878     4.9242     0.00
H      2.4878    -4.9242    -0.00
H     -2.4878    -4.9242     0.00
H      2.4862     2.4594    -0.00
H     -2.4862     2.4594     0.00
H      2.4862    -2.4594    -0.00
H     -2.4862    -2.4594     0.00
H      2.4866    -0.0000    -0.00
H     -2.4866     0.0000     0.00
end

basis spherical noprint
H    S
     13.0100000              0.0196850
      1.9620000              0.1379770
      0.4446000              0.4781480
H    S
      0.1220000              1.0000000
H    P
      0.7270000              1.0000000
#BASIS SET: (9s,4p,1d) -> [3s,2p,1d]
C    S
   6665.0000000              0.0006920             -0.0001460
   1000.0000000              0.0053290             -0.0011540
    228.0000000              0.0270770             -0.0057250
     64.7100000              0.1017180             -0.0233120
     21.0600000              0.2747400             -0.0639550
      7.4950000              0.4485640             -0.1499810
      2.7970000              0.2850740             -0.1272620
      0.5215000              0.0152040              0.5445290
C    S
      0.1596000              1.0000000
C    P
      9.4390000              0.0381090
      2.0020000              0.2094800
      0.5456000              0.5085570
C    P
      0.1517000              1.0000000
C    D
      0.5500000              1.0000000
#END
end

scf
#thresh 1.0e-10
#thresh 1.0e-4
#tol2e 1.0e-10
#tol2e 1.0e-8
#noscf
singlet
rhf
vectors input atomic output pent_cpu_768d.movecs
direct
noprint "final vectors analysis" multipole
end

tce
freeze atomic
ccsd(t)
thresh 1e-4
maxiter 10
io ga
tilesize 24
end

set tce:pstat t
set tce:nts  t

task tce energy

 

For more complete information about compiler optimizations, see our Optimization Notice.

2 comments

Jeff (Intel):

The Xeon Phi code does not call MKL.  The NWChem build system should use the "-mmic" flag where it is appropriate to do so.

Wadud M.:

How come the MKL linking is using the CPU version of the library, and when do we specify the "-mmic" flag? It looks like the instructions above are for the CPU and not the Phi.
