## Overview

This case study demonstrates the ease of porting the MiniFE application, part of the Sandia National Laboratories* Mantevo* Project, to the Intel® Xeon Phi™ coprocessor.

## Introduction

Sandia National Laboratories’ Mantevo Project is a collection of mini-applications designed to mimic the core functionality of widely used high-performance computing (HPC) algorithms. The implicit finite element method is modeled by the mini-application HPCCG and its derivatives, including a mini-application named MiniFE, which encapsulates the most significant performance characteristics of an implicit finite element method application in 1,500 lines of C++ code.

In this case study we share our experience porting an MPI parallel implementation of MiniFE to the Intel Xeon Phi coprocessor.

The MiniFE application is small enough to experiment with a variety of parallel and vectorization schemes, without the complexity of the typical finite element application with more than one million lines of code. The iterative solver at the heart of MiniFE exhibits the familiar performance characteristics of larger simulators modeling fluid and structural dynamics. MiniFE serves as a proxy for these full-scale applications.

MiniFE is a self-contained, stand-alone code. It carries out a full finite element generation, assembly, and solution. The physical domain is a three-dimensional box modeled by hexahedral elements (sometimes called “brick” elements). The box is discretized as a structured grid but treated as unstructured. The domain is decomposed for parallel execution using recursive coordinate bisection (RCB).
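The recursive coordinate bisection step can be illustrated with a minimal sketch. This is not MiniFE's own decomposition code: the `Box` type and the `rcb` and `cell_count` functions are hypothetical names, and the sketch assumes a structured box that is always cut across its longest dimension, with cells divided in proportion to the number of parts on each side.

```cpp
#include <cassert>
#include <vector>

// Axis-aligned box of grid cells: [lo[d], hi[d]) in each dimension d.
struct Box {
    int lo[3];
    int hi[3];
};

// Number of cells contained in a box.
long long cell_count(const Box& b) {
    long long n = 1;
    for (int d = 0; d < 3; ++d) n *= (b.hi[d] - b.lo[d]);
    return n;
}

// Recursively bisect `box` into `nparts` sub-boxes, appending them to `out`.
// Each step cuts the longest dimension so that the two sides hold cells
// roughly in proportion to the number of parts assigned to each side.
void rcb(const Box& box, int nparts, std::vector<Box>& out) {
    assert(nparts >= 1);
    if (nparts == 1) {
        out.push_back(box);
        return;
    }

    // Pick the longest dimension to cut (ties go to the lowest index).
    int dim = 0;
    for (int d = 1; d < 3; ++d) {
        if (box.hi[d] - box.lo[d] > box.hi[dim] - box.lo[dim]) dim = d;
    }

    const int left_parts = nparts / 2;
    const int right_parts = nparts - left_parts;
    const long long len = box.hi[dim] - box.lo[dim];

    // Place the cut so the left side gets left_parts/nparts of the length.
    const int cut = box.lo[dim] + static_cast<int>(len * left_parts / nparts);

    Box left = box;
    Box right = box;
    left.hi[dim] = cut;
    right.lo[dim] = cut;
    rcb(left, left_parts, out);
    rcb(right, right_parts, out);
}
```

Calling `rcb` on a 200 x 200 x 200 box with 16 parts yields 16 sub-boxes of 50 x 100 x 100 cells each. For part counts that are not powers of two the sub-boxes differ slightly in size, and a production partitioner would also guard against empty sub-boxes on very small grids.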

With its domain decomposed for distributed-memory execution, MiniFE provides an opportunity to employ the Intel® MPI Library for Linux* OS, which supports execution of MPI tasks on both Intel Xeon Phi coprocessors and the host system.

## Analysis

MiniFE’s dominant computational kernel is a conjugate gradient iterative solver with no preconditioning. The conjugate gradient algorithm consists of a sparse matrix-vector product, plus some dot product operations and vector updates. Most time is spent in the sparse matrix-vector product, whose performance is generally known to be limited by available memory bandwidth.
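To make the structure of the solver concrete, here is a minimal serial sketch of an unpreconditioned conjugate gradient iteration built from the three kinds of kernel named above. It is an illustration under stated assumptions, not MiniFE's source: the `CsrMatrix` type and the `spmv`, `dot`, `waxpby`, and `cg_solve` names are hypothetical, the matrix is assumed symmetric positive definite, and the initial guess is zero.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Square sparse matrix in Compressed Sparse Row form (illustrative type).
struct CsrMatrix {
    std::vector<int> rowoffsets;  // size n + 1; row r owns [rowoffsets[r], rowoffsets[r+1])
    std::vector<int> cols;        // column index of each stored value
    std::vector<double> coefs;    // stored values
};

// Sparse matrix-vector product: y = A * x.
void spmv(const CsrMatrix& A, const std::vector<double>& x, std::vector<double>& y) {
    const int n = static_cast<int>(A.rowoffsets.size()) - 1;
    for (int row = 0; row < n; ++row) {
        double sum = 0.0;
        for (int i = A.rowoffsets[row]; i < A.rowoffsets[row + 1]; ++i) {
            sum += A.coefs[i] * x[A.cols[i]];
        }
        y[row] = sum;
    }
}

// Dot product of two vectors of equal length.
double dot(const std::vector<double>& a, const std::vector<double>& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

// Vector update: w = alpha * u + beta * v (w may alias u or v).
void waxpby(double alpha, const std::vector<double>& u,
            double beta, const std::vector<double>& v,
            std::vector<double>& w) {
    for (std::size_t i = 0; i < w.size(); ++i) w[i] = alpha * u[i] + beta * v[i];
}

// Unpreconditioned CG starting from x = 0.
// Stops when the residual 2-norm drops to tol or max_iter is reached.
// Returns the number of iterations performed.
int cg_solve(const CsrMatrix& A, const std::vector<double>& b,
             std::vector<double>& x, double tol, int max_iter) {
    const std::size_t n = b.size();
    assert(A.rowoffsets.size() == n + 1);

    x.assign(n, 0.0);
    std::vector<double> r(b);   // residual r = b - A*0 = b
    std::vector<double> p(b);   // search direction
    std::vector<double> Ap(n);
    double rr = dot(r, r);

    int iter = 0;
    while (iter < max_iter && std::sqrt(rr) > tol) {
        spmv(A, p, Ap);                        // one SpMV per iteration
        const double alpha = rr / dot(p, Ap);
        waxpby(1.0, x, alpha, p, x);           // x = x + alpha * p
        waxpby(1.0, r, -alpha, Ap, r);         // r = r - alpha * A*p
        const double rr_new = dot(r, r);
        waxpby(1.0, r, rr_new / rr, p, p);     // p = r + beta * p
        rr = rr_new;
        ++iter;
    }
    return iter;
}
```

Each iteration performs one sparse matrix-vector product, two dot products, and three vector updates. Only the matrix-vector product streams the whole matrix through memory, which is consistent with it dominating the solver time.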

The sparse matrix-vector product at the heart of MiniFE is available as an Intel® Math Kernel Library (Intel® MKL) subroutine call, leading to the question, “Will the Intel MKL routine improve the performance of the sparse matrix-vector product in MiniFE?”

Intel MKL is a highly optimized math library especially suitable for computationally intensive scientific, engineering and financial applications. Core math functions such as BLAS, LAPACK, sparse solvers, and Fast Fourier Transforms are extensively threaded and optimized for Intel Xeon Phi coprocessors. Developers can take advantage of Intel Xeon Phi coprocessors by linking their applications with Intel MKL.

The sparse matrix-vector product for a compact matrix storage format called Compressed Sparse Row (CSR) is the doubly nested loop shown in Figure 1.

    for (int row = 0; row < n; row++) {
        double sum = 0.0;
        for (int i = Arowoffsets[row]; i < Arowoffsets[row+1]; i++) {
            sum += Acoefs[i] * x[Acols[i]];
        }
        y[row] = sum;
    }

Figure 1. Sparse Matrix-Vector Product, y=Ax
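A small worked example may help show what the three CSR arrays hold. The sketch below converts a dense matrix to CSR and then applies the Figure 1 loop. The array names follow Figure 1; the `Csr` struct and the `dense_to_csr` and `csr_multiply` helpers are hypothetical and are not part of MiniFE.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// The three CSR arrays, named as in Figure 1.
struct Csr {
    std::vector<int> Arowoffsets;  // start of each row in Acols/Acoefs, plus one end entry
    std::vector<int> Acols;        // column index of each stored nonzero
    std::vector<double> Acoefs;    // value of each stored nonzero
};

// Convert a row-major dense n x n matrix to CSR, keeping only the nonzeros.
Csr dense_to_csr(const std::vector<double>& dense, int n) {
    assert(dense.size() == static_cast<std::size_t>(n) * static_cast<std::size_t>(n));
    Csr A;
    A.Arowoffsets.push_back(0);
    for (int row = 0; row < n; ++row) {
        for (int col = 0; col < n; ++col) {
            const double v = dense[static_cast<std::size_t>(row) * n + col];
            if (v != 0.0) {
                A.Acols.push_back(col);
                A.Acoefs.push_back(v);
            }
        }
        // Each row ends where the next one begins.
        A.Arowoffsets.push_back(static_cast<int>(A.Acoefs.size()));
    }
    return A;
}

// y = A * x using the doubly nested loop of Figure 1.
std::vector<double> csr_multiply(const Csr& A, const std::vector<double>& x) {
    const int n = static_cast<int>(A.Arowoffsets.size()) - 1;
    std::vector<double> y(n, 0.0);
    for (int row = 0; row < n; ++row) {
        double sum = 0.0;
        for (int i = A.Arowoffsets[row]; i < A.Arowoffsets[row + 1]; ++i) {
            sum += A.Acoefs[i] * x[A.Acols[i]];
        }
        y[row] = sum;
    }
    return y;
}
```

For the 3 x 3 matrix with rows (10, 0, 2), (0, 3, 0), and (1, 0, 4), the conversion gives `Arowoffsets = {0, 2, 3, 5}`, `Acols = {0, 2, 1, 0, 2}`, and `Acoefs = {10, 2, 3, 1, 4}`, and multiplying by x = (1, 2, 3) gives y = (16, 6, 13). Each stored nonzero costs one multiply and one add but requires loading a coefficient, a column index, and an indirectly addressed entry of `x`, which is why the loop tends to be limited by memory bandwidth rather than arithmetic.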

A goal of this effort is to exceed the performance of the sparse matrix-vector product on a two-socket server with Intel® Xeon® E5-2670 processors.

## Implementation

This implementation uses a pre-production Intel® Xeon Phi™ coprocessor with 61 cores running at 1.091 GHz, 8 GB GDDR5-2750 RAM at 5.5 GT/s, µOS version 2.6.34.11-g4af9302 with flash version 2.1.01.0372 and software stack version 2.1.3653-8. Error-correcting code (ECC) memory mode is enabled. The host system consists of dual eight-core Intel® Xeon® E5-2670 2.6 GHz processors, 32 GB DDR3-1600 RAM, with QPI at 8.0 GT/s, Linux* OS version 2.6.32-220.el6.x86_64. For both systems we compile using Intel® C++ Composer XE version 2013.0.079.

These commands build MiniFE for the server with Intel Xeon processors and the Intel Xeon Phi coprocessor:

    mpiicc -restrict -farray-notation -O3 -ansi-alias -fno-alias -xAVX -o miniFE.x

    mpiicc -restrict -farray-notation -O3 -ansi-alias -fno-alias -mmic -o miniFE.x.mic

The *-mmic* option instructs the Intel C++ Composer XE to cross-compile for native execution on the Intel Xeon Phi coprocessor.

We launch MiniFE on the server with Intel Xeon processors and the Intel Xeon Phi coprocessor using these commands:

    mpiexec -n 16 ./miniFE.x nx=200 ny=200 nz=200

    mpiexec -host mic0 -n 60 -wdir /tmp ./miniFE.x.mic nx=200 ny=200 nz=200

For a second test, we build an MPI plus OpenMP* hybrid, which uses threading to parallelize the outer loop of the sparse matrix-vector product, as in Figure 2.

    #pragma omp parallel for firstprivate(n)
    for (int row = 0; row < n; row++) {
        double sum = 0.0;
        for (int i = Arowoffsets[row]; i < Arowoffsets[row+1]; i++) {
            sum += Acoefs[i] * x[Acols[i]];
        }
        y[row] = sum;
    }

Figure 2. Sparse Matrix-Vector Product, Outer Loop OpenMP Parallel
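The outer loop in Figure 2 can be threaded with a plain `parallel for` because each iteration writes a different `y[row]`. A dot product, the other reduction-style kernel in conjugate gradient, accumulates into a single scalar and so needs a reduction clause. The source states only that the sparse matrix-vector product is threaded; the sketch below is an assumption about how a dot product could be threaded in the same style, and `dot_omp` is a hypothetical name.

```cpp
#include <cassert>
#include <vector>

// Dot product threaded with an OpenMP reduction.
// Each thread accumulates a private partial sum; OpenMP combines the
// partial sums into `sum` when the loop finishes.
double dot_omp(const std::vector<double>& a, const std::vector<double>& b) {
    assert(a.size() == b.size());
    const long n = static_cast<long>(a.size());
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < n; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}
```

If the code is compiled without OpenMP enabled, the pragma is ignored and the loop runs serially. With threading, the order of the floating-point additions can vary, so results may differ in the last bits between runs with different thread counts.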

Adding *-openmp* to the compile commands enables the OpenMP directive.

    mpiicc -restrict -farray-notation -O3 -ansi-alias -fno-alias -openmp -xAVX -o miniFE.x

    mpiicc -restrict -farray-notation -O3 -ansi-alias -fno-alias -openmp -mmic -o miniFE.x.mic

We execute the MPI-OpenMP hybrid MiniFE with these commands for the server with Intel Xeon processors and the Intel Xeon Phi coprocessor.

    mpiexec -n 16 -env OMP_NUM_THREADS=1 ./miniFE.x nx=200 ny=200 nz=200

    mpiexec -host mic0 -n 60 -env I_MPI_PIN_DOMAIN=omp -env OMP_NUM_THREADS=4 -wdir /tmp ./miniFE.x.mic nx=200 ny=200 nz=200

Finally, we substitute the Intel MKL routine CSRGEMV (CSR format GEneral Matrix-Vector product) for the doubly nested loop (Figure 3). A detailed description of CSRGEMV appears in the Intel MKL Reference Manual.

    #include <mkl_spblas.h>

    mkl_cspblas_dcsrgemv("N", &n, Acoefs, Arowoffsets, Acols, x, y);

Figure 3. Intel MKL Sparse Matrix-Vector Product
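To show what the call in Figure 3 computes, the sketch below is a plain C++ stand-in that takes its arguments in the same order: operation flag, pointer to the matrix dimension, values, row offsets, column indices, input vector, output vector. It is not Intel MKL code and makes no attempt to match its performance. The name `reference_dcsrgemv` is hypothetical, zero-based CSR indexing and 32-bit integer indices are assumed, and only the `"N"` (y = Ax) and `"T"` (y = Aᵀx) flags are handled.

```cpp
#include <cassert>

// Reference (unoptimized) CSR matrix-vector product with the same argument
// order as the call in Figure 3. Zero-based indexing is assumed.
//   transa : "N" computes y = A * x, "T" computes y = transpose(A) * x
//   m      : pointer to the number of rows (the matrix is square)
//   a      : nonzero values
//   ia     : row offsets, length *m + 1
//   ja     : column index of each nonzero
void reference_dcsrgemv(const char* transa, const int* m,
                        const double* a, const int* ia, const int* ja,
                        const double* x, double* y) {
    const int n = *m;
    const bool transpose = (transa[0] == 'T' || transa[0] == 't');
    assert(transpose || transa[0] == 'N' || transa[0] == 'n');

    if (!transpose) {
        // Row-oriented gather: the doubly nested loop of Figure 1.
        for (int row = 0; row < n; ++row) {
            double sum = 0.0;
            for (int i = ia[row]; i < ia[row + 1]; ++i) {
                sum += a[i] * x[ja[i]];
            }
            y[row] = sum;
        }
    } else {
        // Transposed product: scatter each row's contribution into y.
        for (int row = 0; row < n; ++row) y[row] = 0.0;
        for (int row = 0; row < n; ++row) {
            for (int i = ia[row]; i < ia[row + 1]; ++i) {
                y[ja[i]] += a[i] * x[row];
            }
        }
    }
}
```

A stand-in like this can serve as a correctness check: running both it and the library routine on the same small matrix should give matching results to within floating-point rounding.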

The Intel MKL subroutine is OpenMP parallel when linked with the *-mkl=parallel* option.

    mpiicc -restrict -farray-notation -O3 -ansi-alias -fno-alias -xAVX -openmp -mkl=parallel -o miniFE.x

    mpiicc -restrict -farray-notation -O3 -ansi-alias -fno-alias -mmic -openmp -mkl=parallel -o miniFE.x.mic

This version of MiniFE is executed using the same command as the MPI-OpenMP hybrid version.

MiniFE reports the elapsed time spent in the conjugate gradient solver and also reports the time in the sparse matrix-vector product within the solver. Comparing all three versions of MiniFE on the two platforms in Tables 1 and 2, we see that the sparse matrix-vector product is indeed taking most of the solver time.

| MiniFE* Double Precision | Two Intel® Xeon® E5-2600 Processors | Pre-Production Intel® Xeon Phi™ Coprocessor |
|---|---|---|
| MPI Original | 1.80 seconds | 4.13 seconds |
| MPI-OpenMP* Hybrid | 1.80 seconds | 1.90 seconds |
| MPI-Intel® MKL Hybrid | 1.80 seconds | 1.29 seconds |

Table 1. Sparse Matrix-Vector Product Time

| MiniFE* Double Precision | Two Intel® Xeon® E5-2600 Processors | Pre-Production Intel® Xeon Phi™ Coprocessor |
|---|---|---|
| MPI Original | 2.31 seconds | 4.60 seconds |
| MPI-OpenMP* Hybrid | 2.31 seconds | 2.34 seconds |
| MPI-Intel® MKL Hybrid | 2.31 seconds | 1.77 seconds |

Table 2. Conjugate Gradient Iterative Solver Time

With the Intel MKL sparse matrix-vector product, the Intel Xeon Phi coprocessor exceeds the performance of the server with two Intel Xeon processors: 1.29 seconds versus 1.80 seconds for the sparse matrix-vector product, and 1.77 seconds versus 2.31 seconds for the full solver. The Intel MKL routine improves the coprocessor’s performance because it is specially tuned for the Intel Xeon Phi coprocessor using software prefetching and intrinsic functions. Intrinsics are functions provided by the Intel C++ Compiler XE that give C and C++ code access to SIMD instructions without writing assembly code.

The MiniFE simulation can be executed more quickly using the Intel Xeon processors and the Intel Xeon Phi coprocessor together as a heterogeneous cluster. The following MPI command executes MiniFE on both the host processors and the coprocessor.

    mpiexec -host localhost -n 16 -env OMP_NUM_THREADS=1 ./miniFE.x nx=200 ny=200 nz=200 : \
            -host mic0 -n 20 -env I_MPI_PIN_DOMAIN=omp -env OMP_NUM_THREADS=12 -wdir /tmp ./miniFE.x.mic

Table 3 shows the performance of the heterogeneous system.

| MiniFE* Double Precision | Two Intel® Xeon® E5-2600 Processors | Pre-Production Intel® Xeon Phi™ Coprocessor |
|---|---|---|
| MPI-Intel® MKL Hybrid (each platform alone) | 2.31 seconds | 1.77 seconds |
| MPI-Intel® MKL Hybrid (both platforms together) | 1.17 seconds | (combined with host) |

Table 3. Conjugate Gradient Iterative Solver Time

Adding the pre-production Intel Xeon Phi coprocessor to the server with Intel Xeon processors improves the iterative solver performance by nearly a factor of two, from 2.31 seconds to 1.17 seconds.
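The arithmetic behind this comparison can be worked through from the Table 2 and Table 3 solver times. The measured speedup of the heterogeneous run over the host alone is

$$
\frac{2.31\ \text{s}}{1.17\ \text{s}} \approx 1.97 .
$$

As a rough reference point, and only under the assumption that the two platforms' standalone throughputs simply add (perfect load balance, no extra communication cost, and the same per-platform efficiency as in the standalone runs, none of which the measurements confirm), the best combined time would be

$$
t_{\text{ideal}} = \left( \frac{1}{2.31\ \text{s}} + \frac{1}{1.77\ \text{s}} \right)^{-1} \approx 1.00\ \text{s},
\qquad
\frac{t_{\text{ideal}}}{t_{\text{measured}}} \approx \frac{1.00}{1.17} \approx 0.86 .
$$

Under the same assumption, the host would contribute about $0.433 / (0.433 + 0.565) \approx 43\%$ of the combined throughput. If each of the 36 MPI ranks receives an equal share of the domain, the 16 host ranks handle $16/36 \approx 44\%$ of the work, which is close to that figure. This is an observation about the numbers, not a statement of how the rank counts were chosen.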

## Conclusion

The MPI application MiniFE is easily ported to the Intel Xeon Phi coprocessor. The commands for building and running MiniFE are nearly identical for Intel Xeon processors and Intel Xeon Phi coprocessors.

Increasing the parallelism in MiniFE by adding OpenMP directives brings the performance of a pre-production Intel Xeon Phi coprocessor to parity with two Intel Xeon processors. Substituting an Intel MKL routine for the sparse matrix-vector product makes the coprocessor faster. The Intel MKL subroutine also provides an easy path forward, as the Intel MKL is continually updated to perform well on new architectures.

For an MPI parallel application, the Intel MPI Library enables rapid performance improvement when adding an Intel Xeon Phi coprocessor. Existing code is easily recompiled for the coprocessor. Familiar MPI commands launch MPI processes on both host and coprocessor.

## Additional Resources

Intel® Math Kernel Library (Intel® MKL) Reference Manual, Document Number 630813-052US

Intel® MPI Library for Linux* OS Reference Manual, Document number 315399-011

Intel® C++ Compiler XE 13.0 User and Reference Guides, Document number 323273-130US

Sandia National Laboratories* Mantevo* Project https://software.sandia.gov/mantevo/

## Acknowledgements

The authors thank the Intel MKL team, especially Sergey Pudov.

### About the Authors

Gregg Skinner is an Application Engineer with Intel Developer Relations Division, Software and Services Group (SSG).

Michael A. Heroux is a Distinguished Member of the Technical Staff at Sandia National Laboratories, Scalable Algorithms Department.
