This code recipe describes how to get, build, and use the Quantum ESPRESSO code that includes support for the Intel® Xeon Phi™ coprocessor with Intel® Many-Integrated Core (MIC) architecture. This recipe focuses on how to run this code using explicit offload.
Quantum ESPRESSO is an integrated suite of open source computer codes for electronic-structure calculations and materials modeling at the nanoscale. It is based on density-functional theory, plane waves, and pseudopotentials. The Quantum ESPRESSO code is maintained by the Quantum ESPRESSO Foundation and is available under the GPLv2 licensing agreement. The code supports the offload mode of operation of the Intel® Xeon® processor (referred to as ‘host’ in this document) with the Intel® Xeon Phi™ coprocessor (referred to as ‘coprocessor’ in this document) in a single node and in a cluster environment.
To get access to the code and test workloads:
Clone the linear algebra package libxphi from GitHub:
$ git clone https://github.com/cdahnken/libxphi
Unpack the Quantum ESPRESSO 5.1 source tarball:
$ tar xzf espresso-5.1.tar.gz
Source the Intel® compiler and Intel® MPI Library environments:
$ source /opt/intel/composer_xe_2013_sp1.4.211/bin/compilervars.sh intel64
$ source /opt/intel/impi/latest/bin64/mpivars.sh
$ cd espresso-5.1
$ export SCALAPACK_LIBS="-lmkl_scalapack_lp64 -lmkl_blacs_intelmpi_lp64"
$ export LAPACK_LIBS="-mkl=parallel"
$ export BLAS_LIBS="-mkl=parallel"
$ export FFT_LIBS="-mkl=parallel"
$ export MPIF90=mpiifort
$ export AR=xiar
$ ./configure --enable-openmp
After configure finishes, edit the generated make.sys so that the relevant variables read as follows:

MANUAL_DFLAGS = -D__KNC_OFFLOAD
DFLAGS = -D__INTEL -D__FFTW -D__MPI -D__PARA -D__SCALAPACK -D__OPENMP $(MANUAL_DFLAGS)
MOD_FLAG = -I
MPIF90 = mpif90
CC = icc
F77 = ifort
BLAS_LIBS = "-mkl=parallel"
BLAS_LIBS_SWITCH = external
LAPACK_LIBS = "-mkl=parallel"
LAPACK_LIBS_SWITCH = external
SCALAPACK_LIBS = -lmkl_scalapack_lp64 -lmkl_blacs_intelmpi_lp64
FFT_LIBS = "-mkl=parallel"
You can add “-xHost -ansi-alias” to CFLAGS as well as FFLAGS.
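If you prefer scripting the edit rather than opening make.sys by hand, the changes above can be applied with sed. This is a minimal sketch: it operates on a tiny stand-in file (make.sys.demo) so nothing real is overwritten; point the same sed invocation at espresso-5.1/make.sys to edit the actual file.

```shell
# Create a tiny stand-in for the configure-generated make.sys
# (illustrative content only).
cat > make.sys.demo <<'EOF'
MANUAL_DFLAGS =
CFLAGS = -O3 $(DFLAGS) $(IFLAGS)
FFLAGS = -O2 -assume byterecl
EOF

# Set MANUAL_DFLAGS for offload and prepend the optional flags to
# CFLAGS and FFLAGS; a .bak backup of the original is kept.
sed -i.bak \
    -e 's/^MANUAL_DFLAGS *=.*/MANUAL_DFLAGS = -D__KNC_OFFLOAD/' \
    -e 's/^CFLAGS = /CFLAGS = -xHost -ansi-alias /' \
    -e 's/^FFLAGS = /FFLAGS = -xHost -ansi-alias /' \
    make.sys.demo

grep -E '^(MANUAL_DFLAGS|CFLAGS|FFLAGS)' make.sys.demo
```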
$ make pw -j16
You should now have the executable bin/pw.x.
$ cd libxphi
$ ./build-library.sh
You should now find two libraries: libxphi.so and libmkl_proxy.so.
The build process is now complete.
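Before moving on, a quick sanity check can confirm that all three build artifacts exist. The paths below assume both source trees were checked out side by side in the same directory:

```shell
# Check that the pw.x binary and the two libxphi libraries were built;
# each file is reported as found or missing.
for f in espresso-5.1/bin/pw.x libxphi/libxphi.so libxphi/libmkl_proxy.so; do
  if [ -e "$f" ]; then echo "found: $f"; else echo "MISSING: $f"; fi
done
```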
The Quantum ESPRESSO binary compiled above already contains support for offloading the 3D FFT. Additionally, the library libxphi.so contains a number of linear algebra routines invoked by Quantum ESPRESSO, particularly the numerically intensive ZGEMM BLAS3 routine for complex matrix-matrix multiplication. Instead of executing this routine via Intel® Math Kernel Library (Intel MKL) on the host, libxphi blocks the matrices and streams them asynchronously to the coprocessor, where Intel MKL then executes the multiplication of the blocks and transfers the result back. When the Quantum ESPRESSO binary is created with the build instructions above, it contains dynamic calls to the ZGEMM routine, which are usually satisfied by Intel MKL. To get offloaded ZGEMM in place, libxphi.so needs to be preloaded:
$ export LD_LIBRARY_PATH=$PATH_TO_LIBXPHI:$LD_LIBRARY_PATH
$ LD_PRELOAD="$PATH_TO_LIBXPHI/libxphi.so" ./pw.x <pw arguments>
The last line executes the Quantum ESPRESSO binary pw.x with offloaded ZGEMM support. To make this easier, we provide a shell script that facilitates this preloading and just takes the binary and its arguments as input, so that the execution of an offloaded run would look like this:
$ <PATH_TO_LIBXPHI>/xphilibwrapper.sh <PATH_TO_PW>/pw.x <pw arguments>
In this case Quantum ESPRESSO will execute a single instance with OpenMP* threads (by default as many as you have cores) and offload FFT and ZGEMM to all the cores of the Intel Xeon Phi coprocessor.
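The essence of such a preload wrapper can be sketched in a few lines of shell. This is a hypothetical reconstruction, not the actual contents of xphilibwrapper.sh: it prepends the library directory to the loader search path, preloads libxphi.so so its ZGEMM symbol shadows the Intel MKL one, and then execs the target binary with its arguments unchanged.

```shell
# Write a minimal stand-in wrapper (illustrative; the real
# xphilibwrapper.sh may differ in its details).
cat > wrap.sh <<'EOF'
#!/bin/sh
# Directory containing this script (and, in a real setup, libxphi.so).
XPHI_DIR="$(cd "$(dirname "$0")" && pwd)"
export LD_LIBRARY_PATH="${XPHI_DIR}${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"
# Preload libxphi.so so its ZGEMM is resolved before Intel MKL's,
# then replace this process with the target binary and its arguments.
LD_PRELOAD="${XPHI_DIR}/libxphi.so" exec "$@"
EOF
chmod +x wrap.sh

# Usage: any binary and its arguments can follow the wrapper.
./wrap.sh /bin/echo "offloaded run"
```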
To tune the offloading process, we need to understand the ZGEMM routine, which executes the matrix-matrix multiplication

C ← αAB + βC

where α and β are complex numbers and C, A, and B are matrices of dimension M×N, M×K, and K×N, respectively. The library libxphi.so blocks this matrix-matrix multiplication, so that the resulting block-matrix multiplication consists of smaller blocks that are continuously streamed to the coprocessor and back. The block sizes are defined by three parameters m, n, and k, where m×n, m×k, and k×n are the dimensions of the C-, A-, and B-blocks, respectively. By default, libxphi blocks the matrices with m = n = k = 1024. You can experiment with these values to achieve better performance, depending on your workload size. We have found that making m and n somewhat larger (m = n = 2048) and varying the size of k (between 512 and 1024) can yield very good results.
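A quick back-of-the-envelope calculation shows how the blocking plays out. The matrix dimensions below are purely illustrative; the block sizes are the ones suggested above:

```shell
# How many blocks does a single ZGEMM call split into?
M=5000; N=5000; K=5000        # full matrix dimensions (illustrative)
m=2048; n=2048; k=512         # block sizes (the QE_MIC_BLOCKSIZE_* values)

# Ceiling division: number of blocks along each dimension.
blocks_m=$(( (M + m - 1) / m ))
blocks_n=$(( (N + n - 1) / n ))
blocks_k=$(( (K + k - 1) / k ))

echo "C is tiled into ${blocks_m}x${blocks_n} blocks; each accumulates ${blocks_k} partial products"
# → C is tiled into 3x3 blocks; each accumulates 10 partial products
```

Each of those block products is small enough to be buffered to the coprocessor while the previous one is still being computed, which is what hides the transfer cost.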
Block size can be set via the environment variables QE_MIC_BLOCKSIZE_M, QE_MIC_BLOCKSIZE_N, and QE_MIC_BLOCKSIZE_K. For example:
$ export QE_MIC_BLOCKSIZE_M=2048
$ export QE_MIC_BLOCKSIZE_N=2048
$ export QE_MIC_BLOCKSIZE_K=512
An additional setting is required to avoid offloading small matrices, which might be computed more efficiently on the host than on the coprocessor. With QE_MIC_OFFLOAD_THRESHOLD you can define the minimum amount of floating-point work a multiplication must involve in order to be offloaded. The setting
$ export QE_MIC_OFFLOAD_THRESHOLD=20
achieves good results.
Partitioning the coprocessor leverages the advantages of multi-processing vs. multi-threading. It is somewhat similar to running Message Passing Interface (MPI) ranks on the coprocessor (a.k.a. the symmetric usage model), although here the MPI ranks live only on the host. Varying the number of ranks on the host partitions each coprocessor into independent sets of threads. The vehicle to achieve independent thread partitions is the KMP_PLACE_THREADS environment variable. In addition, the environment variable OFFLOAD_DEVICES allows multiple coprocessors within the same system to be utilized. Of course there is nothing wrong with using plain OpenMP instead of this proposed method; however, we found that partitioning the coprocessor unlocks more performance: it trades the implicit locks at the end of parallel regions for fully independent executions. To ease the tuning process, a script is provided that generates the appropriate “mpirun” command line.
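As an illustration of the mechanism, one partition of such a setup could be expressed through the environment variables just mentioned. The concrete values below are assumptions for the sake of the example (rank 1 of 4 host ranks sharing one 60-core coprocessor), not output of the provided script, which derives equivalent settings automatically:

```shell
# Hypothetical per-rank offload environment for rank 1 of 4.
export MIC_ENV_PREFIX=MIC                  # forward MIC_* variables to the coprocessor
export MIC_KMP_PLACE_THREADS=15c,4t,15O    # 15 cores, 4 threads/core, core offset 15
export OFFLOAD_DEVICES=0                   # pin this rank to the first coprocessor
```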
$ ~/mpirun/mpirun.sh -h
-n: list of comma separated node names
-p: number of processes per socket (host)
-q: number of processes per mic (native)
-s: number of sockets per node
-d: number of devices per node
-e: number of CPU cores per socket
-t: number of CPU threads per core
-m: number of MIC cores per device
-r: number of MIC cores reserved
-u: number of MIC threads per core
-a: affinity (CPU) e.g., compact
-b: affinity (MIC) e.g., balanced
-c: schedule, e.g., dynamic
-0: executable (rank-0)
-x: executable (host)
-y: executable (mic)
-z: prefixed mic name
-i: inputfile (<)
The script “mpirun.sh” inspects the system hardware in order to provide defaults for all of the above arguments. It then launches “mpirun.py,” which builds and launches the actual “mpirun” command line. This initial inspection, for example, avoids using multiple host sockets when only one coprocessor is attached to the system (and thereby avoids data transfers to a “remote” coprocessor). Any default provided by the launcher script “mpirun.sh” can be overridden on the command line (while still leveraging all other default settings). Please note that the script also supports symmetric execution (“-y”, etc.), which is not covered in this recipe.
Here is an example of running QE with four partitions on each of the coprocessor(s):
$ ./mpirun.sh -p4 -w <PATH_TO_LIBXPHI>/xphilibwrapper.sh -x <PATH_TO_PW>/pw.x -i <input-file.in>
Any argument passed at the end of the command line is simply forwarded to the next underlying mechanism if not consumed by option processing. If you need to pass arguments to the executable using “<”, you can use the script’s “-i” option; otherwise, options for the executable can be simply appended to the above command line.
The number of ranks per host socket (“-p”) divides not only the number of cores per host processor but also each coprocessor’s number of cores. Some ratios therefore leave a few coprocessor cores unused. On the other hand, the coprocessor usually has more cores than a single host socket/processor, so this is likely acceptable and is in any case part of tuning the number of partitions.
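The remainder effect can be made concrete with a little arithmetic. The core counts below are illustrative (a 61-core coprocessor with one core reserved for the offload runtime, shared by 7 host ranks):

```shell
# Dividing the usable coprocessor cores among the host ranks;
# the remainder stays idle.
MIC_CORES=61; RESERVED=1; PARTITIONS=7
usable=$(( MIC_CORES - RESERVED ))
per_part=$(( usable / PARTITIONS ))
unused=$(( usable - per_part * PARTITIONS ))
echo "each rank drives ${per_part} coprocessor cores; ${unused} cores stay idle"
# → each rank drives 8 coprocessor cores; 4 cores stay idle
```

With 4 partitions instead of 7, the same 60 usable cores divide evenly (15 per rank), which is one reason the partition count is worth tuning.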
Figure 1: Performance of Quantum ESPRESSO executing the GRIR443 benchmark on 16 Intel® Xeon® E5-2697 v2 processors and 16 Intel® Xeon Phi™ 7120A coprocessors.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations, and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804