This code recipe describes how to get, build, and use the Quantum ESPRESSO code that includes support for the Intel® Xeon Phi™ coprocessor with Intel® Many-Integrated Core (MIC) architecture. This recipe focuses on how to run this code using explicit offload.
Quantum ESPRESSO is an integrated suite of open source computer codes for electronic-structure calculations and materials modeling at the nanoscale. It is based on density-functional theory, plane waves, and pseudo potentials. The Quantum ESPRESSO code is maintained by Quantum ESPRESSO Foundation and is available under the GPLv2 licensing agreement. The code supports the offload mode of operation of the Intel® Xeon® processor (referred to as ‘host’ in this document) with the Intel® Xeon Phi™ coprocessor (referred to as ‘coprocessor’ in this document) in a single node and in a cluster environment.
To get access to the code and test workloads:
- Download the latest Quantum ESPRESSO version from http://www.quantum-espresso.org/download/
- Clone the linear algebra package libxphi from GitHub:
$ git clone https://github.com/cdahnken/libxphi
- Untar the Quantum ESPRESSO tarball
$ tar xzf espresso-5.1.tar.gz
- Source the Intel® compiler and Intel® MPI Library
$ source /opt/intel/composer_xe_2013_sp1.4.211/bin/compilervars.sh intel64
$ source /opt/intel/impi/latest/bin64/mpivars.sh
- Change to the espresso directory and run the configure script
$ cd espresso-5.1
$ export SCALAPACK_LIBS="-lmkl_scalapack_lp64 -lmkl_blacs_intelmpi_lp64"
$ export LAPACK_LIBS="-mkl=parallel"
$ export BLAS_LIBS="-mkl=parallel"
$ export FFT_LIBS="-mkl=parallel"
$ export MPIF90=mpiifort
$ export AR=xiar
$ ./configure --enable-openmp
- Edit make.sys to make sure it contains the following configuration:
MANUAL_DFLAGS = -D__KNC_OFFLOAD
DFLAGS = -D__INTEL -D__FFTW -D__MPI -D__PARA -D__SCALAPACK -D__OPENMP $(MANUAL_DFLAGS)
MOD_FLAG = -I<PATH_TO_PW> -I
MPIF90 = mpif90
CC = icc
F77 = ifort
BLAS_LIBS = "-mkl=parallel"
BLAS_LIBS_SWITCH = external
LAPACK_LIBS = "-mkl=parallel"
LAPACK_LIBS_SWITCH = external
SCALAPACK_LIBS = -lmkl_scalapack_lp64 -lmkl_blacs_intelmpi_lp64
FFT_LIBS = "-mkl=parallel"
You can add "-xHost -ansi-alias" to both CFLAGS and FFLAGS.
- Build the Quantum ESPRESSO PW binary
$ make pw -j16
You should now have bin/pw.x
- Change to the directory you cloned libxphi into and run the build script. Make sure you do this in a shell where the Intel compiler and Intel MPI Library have been sourced.
$ cd libxphi
$ ./build-library.sh
You should now find two libraries: libxphi.so and libmkl_proxy.so.
The build process is now complete.
Running Quantum ESPRESSO on a single node
The Quantum ESPRESSO binary compiled above already contains support for accelerated 3D FFT. Additionally, the library libxphi.so contains a number of linear algebra routines invoked by Quantum ESPRESSO, particularly the numerically intensive ZGEMM BLAS3 routine for complex matrix-matrix multiplication. Instead of executing this routine on the host via the Intel® Math Kernel Library (Intel MKL), libxphi blocks the matrices and streams the blocks asynchronously to the coprocessor, where Intel MKL executes the multiplication of the blocks and transfers the result back. When the Quantum ESPRESSO binary is created with the build instructions above, it contains dynamic calls to the ZGEMM routine, which are usually satisfied by Intel MKL. To get offloaded ZGEMM in place, libxphi.so needs to be preloaded:
$ export LD_LIBRARY_PATH=$PATH_TO_LIBXPHI:$LD_LIBRARY_PATH
$ LD_PRELOAD="$PATH_TO_LIBXPHI/libxphi.so" ./pw.x <pw arguments>
The last line executes the Quantum ESPRESSO binary pw.x with offloaded ZGEMM support. To make this easier, we provide a shell script that facilitates this preloading and just takes the binary and its arguments as input, so that the execution of an offloaded run would look like this:
$ <PATH_TO_LIBXPHI>/xphilibwrapper.sh <PATH_TO_PW>/pw.x <pw arguments>
In this case Quantum ESPRESSO will execute a single instance with OpenMP* threads (by default as many as you have cores) and offload FFT and ZGEMM to all the cores of the Intel Xeon Phi coprocessor.
Tuning the linear algebra offloading
To tune the offloading process, we need to understand the ZGEMM routine, which executes the matrix-matrix multiplication

C ← αAB + βC

where α and β are complex numbers and C, A, and B are matrices of dimension M×N, M×K, and K×N, respectively. The library libxphi.so blocks this matrix-matrix multiplication, so that the resulting block-matrix multiplication consists of smaller blocks that are continuously streamed to the coprocessor and back. The sizes of those blocks are defined by three parameters, m, n, and k, where m×n, m×k, and k×n are the dimensions of the C-, A-, and B-blocks, respectively. By default, libxphi blocks the matrices with m = n = k = 1024. You can experiment with these values to achieve better performance, depending on your workload size. We have found that making m and n somewhat larger (m = n = 2048) and varying k (between 512 and 1024) can yield very good results.
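The blocking idea can be illustrated with a minimal pure-Python sketch. This is our own didactic reconstruction, not libxphi's implementation: the asynchronous buffering and the transfer to the coprocessor are omitted, and the function and parameter names are hypothetical. The block sizes bm, bn, and bk play the role of QE_MIC_BLOCKSIZE_M/N/K.

```python
def zgemm_blocked(alpha, A, B, beta, C, bm=2, bn=2, bk=2):
    """Compute C <- alpha*A*B + beta*C by blocks (didactic sketch).

    C is M x N, A is M x K, B is K x N; all are lists of lists of
    complex numbers. In libxphi, each block product below would be
    streamed to the coprocessor and multiplied there by Intel MKL.
    """
    M, K, N = len(A), len(B), len(B[0])
    # Scale C by beta once up front.
    for i in range(M):
        for j in range(N):
            C[i][j] *= beta
    # Walk the block grid; each (i0, j0, k0) triple is one block product.
    for i0 in range(0, M, bm):
        for j0 in range(0, N, bn):
            for k0 in range(0, K, bk):
                for i in range(i0, min(i0 + bm, M)):
                    for j in range(j0, min(j0 + bn, N)):
                        s = 0j
                        for k in range(k0, min(k0 + bk, K)):
                            s += A[i][k] * B[k][j]
                        C[i][j] += alpha * s
    return C
```

Because each block product is independent of the others in A and B, the blocks can be shipped to the card and computed while the next blocks are still in flight, which is what makes the streaming scheme effective.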
Block size can be set via the environment variables QE_MIC_BLOCKSIZE_M, QE_MIC_BLOCKSIZE_N, and QE_MIC_BLOCKSIZE_K. For example:
$ export QE_MIC_BLOCKSIZE_M=2048
$ export QE_MIC_BLOCKSIZE_N=2048
$ export QE_MIC_BLOCKSIZE_K=512
An additional setting is required to avoid offloading small matrices, which might be computed more efficiently on the host than on the coprocessor. With QE_MIC_OFFLOAD_THRESHOLD you can define the minimum amount of floating-point work a matrix multiplication must involve in order to be offloaded. The setting
$ export QE_MIC_OFFLOAD_THRESHOLD=20
achieves good results.
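The exact metric libxphi compares against QE_MIC_OFFLOAD_THRESHOLD is internal to the library. The sketch below is a hypothetical reconstruction that assumes the threshold is compared against the base-2 logarithm of the ZGEMM flop count (roughly 8·M·N·K real floating-point operations for a complex matrix-matrix product), which matches the magnitude of the suggested value of 20; the function name is ours.

```python
import math

def should_offload(m, n, k, threshold=20):
    """Hypothetical offload decision: compare log2 of the flop count
    of an m x n x k complex GEMM against the threshold. A complex
    multiply-add costs about 8 real flops, hence the factor of 8."""
    flops = 8 * m * n * k
    return math.log2(flops) >= threshold
```

Under this assumption, a large product such as 2048×2048×512 (about 2^34 flops) is offloaded, while a tiny 16×16×16 product stays on the host, avoiding transfer latency that would dwarf the compute time.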
Partitioning the coprocessor
Partitioning the coprocessor leverages the advantages of multi-processing over multi-threading. It is somewhat similar to running Message Passing Interface (MPI) ranks on the coprocessor (a.k.a. the symmetric usage model), although here the MPI ranks run only on the host. Varying the number of ranks on the host can be used to partition each coprocessor into independent sets of threads. The vehicle for creating independent thread partitions is the KMP_PLACE_THREADS environment variable. In addition, the OFFLOAD_DEVICES environment variable makes it possible to use multiple coprocessors within the same system. Of course, there is nothing wrong with using plain OpenMP instead of this proposed method; however, we found that partitioning the coprocessor unlocks more performance: it simply trades the implicit synchronization at the end of parallel regions for fully independent executions. To ease the tuning process, a script is provided that generates the appropriate "mpirun" command line.
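To make the partitioning concrete, the helper below (our own illustration, not part of the provided mpirun.sh) sketches how non-overlapping KMP_PLACE_THREADS values could be generated, one per host rank. It assumes the "<cores>c,<threads>t,<offset>o" form of KMP_PLACE_THREADS and that one coprocessor core is reserved for the OS and the offload daemons; both are assumptions about defaults, not guarantees.

```python
def mic_partitions(num_parts, mic_cores=61, reserved=1, threads_per_core=4):
    """Return one KMP_PLACE_THREADS value per partition, e.g. '15c,4t,0o'.

    Reserves `reserved` cores for the OS/offload runtime and splits the
    remaining cores into equal, non-overlapping ranges so that each host
    rank offloads into its own slice of the coprocessor.
    """
    usable = mic_cores - reserved
    per_part = usable // num_parts
    return [f"{per_part}c,{threads_per_core}t,{i * per_part}o"
            for i in range(num_parts)]
```

With four partitions on a 61-core 7120A, each host rank would get a disjoint 15-core slice, so the four offload streams never contend for the same hardware threads.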
$ ~/mpirun/mpirun.sh -h
-n: list of comma separated node names
-p: number of processes per socket (host)
-q: number of processes per mic (native)
-s: number of sockets per node
-d: number of devices per node
-e: number of CPU cores per socket
-t: number of CPU threads per core
-m: number of MIC cores per device
-r: number of MIC cores reserved
-u: number of MIC threads per core
-a: affinity (CPU) e.g., compact
-b: affinity (MIC) e.g., balanced
-c: schedule, e.g., dynamic
-0: executable (rank-0)
-x: executable (host)
-y: executable (mic)
-z: prefixed mic name
-i: inputfile (<)
The script "mpirun.sh" inspects the system hardware in order to provide defaults for all of the above arguments. It then launches "mpirun.py," which actually builds and launches the command line for "mpirun." This initial inspection, for example, avoids using multiple host sockets in case there is only one coprocessor attached to the system (avoiding data transfers to a "remote" coprocessor). Any default provided by the launcher script "mpirun.sh" can be overridden on the command line (while still being able to leverage all other default settings). Please note that the script also supports symmetric execution ("-y", etc.), which is beyond the scope of this recipe.
Here is an example of running QE with four partitions on each of the coprocessor(s):
$ ./mpirun.sh -p4 -w <PATH_TO_LIBXPHI>/xphilibwrapper.sh -x <PATH_TO_PW>/pw.x -i <input-file.in>
Any argument passed at the end of the command line is simply forwarded to the next underlying mechanism if not consumed by option processing. If you need to pass arguments to the executable using “<”, you can use the script’s “-i” option; otherwise, options for the executable can be simply appended to the above command line.
The number of ranks per host socket ("-p") divides not only the number of cores per host processor but also each coprocessor's number of cores. Some ratios therefore leave a few coprocessor cores unused. On the other hand, the coprocessor usually has more cores than a single host socket/processor, so a few unused cores are likely acceptable and are, in any case, part of tuning the number of partitions.
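The divisibility argument above is simple arithmetic; the sketch below (our own helper, with the 12-core E5-2697 v2 and 61-core 7120A figures taken from the configuration listed with Figure 1, and one coprocessor core assumed reserved) shows how a given "-p" value splits both sides.

```python
def partition_summary(ranks_per_socket, host_cores=12, mic_cores=61, reserved=1):
    """How one '-p' choice divides a host socket and a coprocessor.

    Returns host threads per rank, coprocessor cores per rank, and how
    many coprocessor cores are left unused by an uneven split.
    """
    usable = mic_cores - reserved
    return {
        "host_threads_per_rank": host_cores // ranks_per_socket,
        "mic_cores_per_rank": usable // ranks_per_socket,
        "mic_cores_unused": usable % ranks_per_socket,
    }
```

For example, two ranks per socket (the setting used for Figure 1) gives 6 host threads and 30 coprocessor cores per rank with nothing wasted, while seven ranks per socket would strand 4 of the 60 usable coprocessor cores.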
Figure 1: Performance of Quantum ESPRESSO executing the GRIR443 benchmark on 16 Intel® Xeon® E5-2697 v2 processors and 16 Intel® Xeon Phi™ 7120A coprocessors.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations, and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
- 2 MPI ranks/socket
- 6 threads/MPI rank
- Intel® Xeon® processor E5-2697 v2
- 64GB DDR3-1600
- RHEL 6.4
- Intel® Turbo Boost Technology /EIST/SMT/NUMA enabled
- Intel® Xeon Phi™ coprocessor 7120A, 61 cores, 1.238GHz
- MPSS 2.1.6720-16
- ECC enabled, Turbo disabled
- icc 14.0.0 update 1, Intel MPI Library 14.1.1.036