GTC-P (Gyrokinetic Toroidal Code - Princeton) for Intel® Xeon Phi™ Coprocessor

Authors: Rezaur Rahman (Intel Corporation, OR), Bei Wang (Princeton University, NJ)

Code Access

The GTC-P code is maintained by the Princeton Plasma Physics Laboratory (PPPL) and is available on request under the PPPL Theory Code Licensing agreement. The code supports symmetric-mode operation of the Intel® Xeon® processor (referred to as ‘host’ in this document) with the Intel® Xeon Phi™ coprocessor (referred to as ‘coprocessor’ in this document), both in a single node and in a cluster environment.

To get access to the code:

  1. Submit a request on this web site, indicating that you want the GTC-P code: http://theorycodes.pppl.wikispaces.net/Theory+Department+Codes.
  2. Request access to the Intel® Xeon Phi™ coprocessor version of the code. You need version 2.1 or later.

Build Directions

  1. You will need the Intel® Composer XE 2013 or newer C/C++ and Fortran compilers and Intel® MPI Library 4.1.1 or newer.
  2. Get the 2.1 or newer version of GTC-P from the PPPL team.
  3. Set environment variables for Intel Composer XE and Intel MPI.
  4. To build the host version of the MPI/OpenMP* executable, run:
    1. $ make ARCH=icc.xeon clean
    2. $ make ARCH=icc.xeon
    3. This creates the binary bench_gtc.
  5. To build the coprocessor version of the MPI/OpenMP executable, run:
    1. $ make ARCH=icc.mic clean
    2. $ make ARCH=icc.mic
    3. This creates the binary bench_gtc.mic, which you can execute natively on the coprocessor.
  6. To build the MIC-symmetric version of the MPI/OpenMP executable, run:
    1. $ make ARCH=icc.symmetric clean
    2. $ make ARCH=icc.symmetric
    3. This creates the binary bench_gtc.symmetric. You can execute this binary in symmetric mode, where MPI processes run on the host and the coprocessor simultaneously.
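The three builds above follow the same clean-then-build pattern, so they can be wrapped in one helper. The following is a minimal sketch, not part of the GTC-P distribution: the function name build_all and the MAKE_CMD override are hypothetical, and it assumes that running the clean target for one ARCH leaves the binaries of the other builds in place.

```shell
#!/bin/sh
# Hypothetical helper: build the host, coprocessor, and symmetric variants in turn.
# MAKE_CMD is an assumed override for dry runs (for example MAKE_CMD=echo);
# when it is unset or empty, the real make is used.

build_all() {
    for arch in icc.xeon icc.mic icc.symmetric; do
        # Same two commands as in the build steps: clean, then build.
        ${MAKE_CMD:-make} ARCH="$arch" clean || return 1
        ${MAKE_CMD:-make} ARCH="$arch" || return 1
    done
}
```

Calling build_all from the GTC-P source directory, after the compiler and MPI environment variables are set, runs the six make commands in order and stops at the first failure.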

Run Directions

Symmetric Mode Execution on a Cluster

GTC-P currently supports symmetric mode execution on clusters based on the Intel® Xeon® processor and the Intel® Xeon Phi™ coprocessor, meaning that MPI ranks run on both the processor and the coprocessor. You need to build both bench_gtc and bench_gtc.symmetric to execute them in parallel on the processor and coprocessor. You can find instructions for setting up a cluster with Intel® Xeon Phi™ coprocessor cards here: http://software.intel.com/en-us/articles/configuring-intel-xeon-phi-coprocessors-inside-a-cluster.

To run on a cluster with one coprocessor card per node, do the following:

  1. Set up the workload with npe_radiald = 2.
  2. Set up the hostfile to contain the host nodes and corresponding coprocessor nodes. For example, to use two nodes in a cluster, your setup may look like this:
    1. Node1
    2. Node1-mic0
    3. Node2
    4. Node2-mic0
  3. Set export I_MPI_MIC=enable. This will allow MPI ranks to run on the coprocessor and communicate with host MPI ranks.
  4. Set export I_MPI_MIC_POSTFIX=.mic. This automatically appends a postfix (.mic) to the executable name when the mpirun script launches the MPI job on the Intel® Xeon Phi™ coprocessor cards.
  5. Set the environment variables to invoke MPI runtime on host.
  6. Start the application run as follows:
    1. Enter mpiexec.hydra -r ssh -genv I_MPI_FABRICS shm:dapl -genv I_MPI_DAPL_PROVIDER ofa-v2-mlx4_0-1 -prepend-rank -perhost 1 -f hostfile -n 4 ~/gtc/run
    2. Where: the “-genv I_MPI_FABRICS shm:dapl -genv I_MPI_DAPL_PROVIDER ofa-v2-mlx4_0-1” options set environment variables that select the InfiniBand fabric to reduce communication overhead.
    3. The hostfile contains the list of processor and coprocessor nodes to execute on.
    4. ~/gtc/run tells the MPI runtime to execute the “run” script on the host processor and “run.mic” on the coprocessor from the ~/gtc folder accessible from both locations.
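The hostfile from step 2 can be generated rather than typed by hand. This is a minimal sketch under stated assumptions: the function name make_hostfile is hypothetical, and it assumes one coprocessor card per node whose hostname is the node name followed by -mic0, as in the example above.

```shell
#!/bin/sh
# Hypothetical helper: print a hostfile that lists each host node
# followed by its coprocessor, assumed to be named <node>-mic0.

make_hostfile() {
    for node in "$@"; do
        printf '%s\n%s-mic0\n' "$node" "$node"
    done
}
```

For example, make_hostfile Node1 Node2 > hostfile writes the four entries shown in step 2. Keeping each host immediately before its coprocessor matters, because the launch command places one rank per hostfile entry.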

The script files are given below for your reference:

runsymmetric.sh: invoke this script with the number of nodes to run on, including MIC nodes, for example “./runsymmetric.sh 4”.

export I_MPI_MIC=enable

export I_MPI_MIC_POSTFIX=.mic

export KMP_AFFINITY=scatter

source /opt/intel/impi/latest/bin64/mpivars.sh

mpiexec.hydra -r ssh -genv I_MPI_FABRICS shm:dapl -genv I_MPI_DAPL_PROVIDER ofa-v2-mlx4_0-1 -prepend-rank -perhost 1 -f hostfile -n $1 ~/gtc/run

run: script to start the run on host nodes. Its job is to set up the path and environment variables specific to the host run. ./bench_gtc is the executable to run on the host.

export OMP_NUM_THREADS=24

export KMP_AFFINITY=scatter

source /opt/intel/impi/latest/bin64/mpivars.sh

source /opt/intel/compiler/latest/bin/compilervars.sh intel64

./bench_gtc A.txt 200 1

run.mic: script to start the run on MIC coprocessors. Its job is to set up the path and environment variables specific to the MIC run. ./bench_gtc.symmetric is the executable to run on the MIC.

export PATH=/opt/intel/impi/latest/mic/bin:$PATH

export LD_LIBRARY_PATH=/opt/intel/itac/latest/mic/slib:/opt/intel/impi/latest/mic/lib:/opt/intel/compiler/2013_sp1.1.106/composerxe/lib/mic:~/:$LD_LIBRARY_PATH;

export KMP_AFFINITY=compact

export OMP_NUM_THREADS=240

~/gtc/bench_gtc.symmetric A.txt 200 1

GTC-P Parallelism

GTC-P includes three levels of decomposition: domain decomposition in the toroidal dimension, domain decomposition in the radial dimension, and particle decomposition within each subdomain. The number of toroidal domains is given as a command line argument, ntoroidal. The number of particle copies in each subdomain is given in the input file as npe_radiald. The number of radial domains is calculated dynamically as total_pe/(ntoroidal * npe_radiald), where total_pe is the total number of MPI processes in the simulation.
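The radial domain count can be checked before a job is launched. The sketch below only reproduces the formula above in shell arithmetic. The function name radial_domains is hypothetical, and the divisibility check is an added assumption, not something GTC-P is known to enforce in this way.

```shell
#!/bin/sh
# Hypothetical helper: compute total_pe / (ntoroidal * npe_radiald).
# Arguments: total_pe ntoroidal npe_radiald

radial_domains() {
    total_pe=$1
    ntoroidal=$2
    npe_radiald=$3
    denom=$((ntoroidal * npe_radiald))
    # Assumed sanity check: the rank count should divide evenly.
    if [ $((total_pe % denom)) -ne 0 ]; then
        echo "total_pe must be a multiple of ntoroidal * npe_radiald" >&2
        return 1
    fi
    echo $((total_pe / denom))
}
```

With the two-node symmetric example in this document (four MPI processes, ntoroidal=2, npe_radiald=2), radial_domains 4 2 2 gives one radial domain.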

When running the code on the host only or on the coprocessor only, we usually set npe_radiald=1 (particle decomposition turned off). However, when running in symmetric mode with one MIC per node, it is important to set npe_radiald=2. This ensures that the host and the MIC share the same subdomain, with each carrying half of the particles in that subdomain. Sharing a subdomain between the host and the MIC avoids repeating some grid-based subroutines on the MIC, since those subroutines usually run more efficiently on the host. In addition, when running in symmetric mode, we set TOROIDAL_FIRST=0 in bench_gtc_opt.h. When TOROIDAL_FIRST=0, the MPI ranks are placed first along the particle decomposition dimension. This guarantees that the two MPI processes with the same toroidal domain and radial domain rank numbers are placed on the host and the MIC, respectively. For example, if you are using two nodes with four MPI processes with ntoroidal=2, npe_radiald=2, the processors and their associated process IDs are:
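A sketch of the placement described above, under stated assumptions: ranks are assigned one per hostfile entry in order (as with -perhost 1), the hostfile alternates host and MIC entries, npe_radiald=2, and the particle copy index varies fastest because TOROIDAL_FIRST=0. The function name rank_layout is hypothetical, and the output is an illustration derived from those assumptions, not output from GTC-P.

```shell
#!/bin/sh
# Hypothetical helper: print an assumed rank-to-device layout.
# Arguments: total_pe npe_radiald

rank_layout() {
    total_pe=$1
    npe_radiald=$2
    rank=0
    while [ "$rank" -lt "$total_pe" ]; do
        # Particle copies vary fastest, so consecutive ranks share a subdomain.
        subdomain=$((rank / npe_radiald))
        copy=$((rank % npe_radiald))
        # Assumption: copy 0 lands on the host entry, any other copy on the MIC entry.
        if [ "$copy" -eq 0 ]; then device=host; else device=mic; fi
        echo "rank $rank: subdomain $subdomain, particle copy $copy, $device"
        rank=$((rank + 1))
    done
}
```

Under these assumptions, rank_layout 4 2 pairs ranks 0 and 1 on the first subdomain (host and MIC of the first node) and ranks 2 and 3 on the second.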

GTC-P Performance

The following runs were done on the Endeavor cluster at Intel.

Platform Configurations

 

Intel, the Intel logo, Ultrabook, and Core are trademarks of Intel Corporation in the US and/or other countries.
Copyright © 2014 Intel Corporation. All rights reserved.
*Other names and brands may be claimed as the property of others.

 
