Recipe: Building and Running VLPL-S for Intel® Xeon Phi™ Processors

I. Overview

This article provides a recipe for how to obtain, compile, and run an optimized version of VLPL-S on Intel® Xeon® processors and Intel® Xeon Phi™ processors.

The source for this version of VLPL-S, as well as the workload, can be obtained by contacting Prof. Minchen at minchen@sjtu.edu.cn.

II. Introduction

VLPL-S is an in-house code developed at HHUD (Germany) and SJTU (PRC), written in C++ and parallelized with MPI. It implements a Particle-in-Cell method for laser-plasma simulation, solving the equations of motion for the particles together with the current density distribution and Maxwell's equations. This version of VLPL-S is optimized for both Intel Xeon processors and Intel Xeon Phi processors. Optimizations in this package include:

  • Removing unnecessary computation and memory access
  • Improving the cache hit rate through prefetching
  • Avoiding unnecessary precision conversions of constants and function calls
  • Improving function call efficiency by removing virtual function calls and applying inter-procedural optimization
  • Vectorization

III. Preliminaries

  1. To build this package, install Intel® MPI Library 5.1 or higher and Intel® Parallel Studio XE 2016 or higher on your host system.
  2. Contact Prof. Minchen at minchen@sjtu.edu.cn to get the optimized VLPL-S source package and test workload. Please specify that you’d like the Intel Recipes version.
  3. Set up the Intel MPI Library and Intel® C++ Compiler environments.
     

    > source /opt/intel/compilers_and_libraries_<version>/linux/mpi/bin64/mpivars.sh

    > source /opt/intel/compilers_and_libraries_<version>/linux/bin/compilervars.sh intel64

  4. To run VLPL-S on the Intel Xeon Phi processor, reboot the system with SNC-4 cluster mode and cache memory mode via BIOS settings. Please refer to Intel® Xeon Phi™ Processor - Memory Modes and Cluster Modes: Configuration and Use Cases for more details on memory configuration.
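
Before building, it can help to confirm that the environment scripts took effect and that the node booted in the expected mode. The sketch below is an illustration only and is not part of the VLPL-S package: `require_tools` and `count_numa_nodes` are hypothetical helper names, and the sysfs path is the standard Linux location for NUMA node directories. In SNC-4 cluster mode with MCDRAM in cache mode, Linux is expected to report four NUMA nodes.

```shell
#!/bin/sh
# Hypothetical helpers; not shipped with VLPL-S.

# Print each missing command and return non-zero if any is absent from PATH.
require_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Count NUMA nodes by listing node<N> directories under a sysfs root.
# The optional argument exists so the function can be pointed at a test tree.
count_numa_nodes() {
    root=${1:-/sys/devices/system/node}
    count=0
    for dir in "$root"/node[0-9]*; do
        if [ -d "$dir" ]; then
            count=$((count + 1))
        fi
    done
    echo "$count"
}
```

After sourcing `mpivars.sh` and `compilervars.sh`, a call such as `require_tools mpirun icpc` should succeed (that the VLPL-S makefiles use `icpc` is an assumption), and `count_numa_nodes` run on the Intel Xeon Phi node should print 4 if the BIOS settings from step 4 are active.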

IV. Build VLPL-S for the Intel Xeon processor

  1. Unpack the source code into a directory under /home/<user>
     

    > tar xvf VLPL-S.tar.bz2

  2. Build the executables for the Intel Xeon processor.
     

    > cd /home/<user>/VLPL-S

    > cp Makfile.cpu Makefile

    > make

    The executable is written to /home/<user>/VLPL-S/v2d_sjtu.e.cpu

V. Build VLPL-S for the Intel Xeon Phi processor

  1. Build the executables for the Intel Xeon Phi processor.
     

    > cd /home/<user>/VLPL-S

    > cp Makfile.knl Makefile

    > make

    The executable is written to /home/<user>/VLPL-S/v2d_sjtu.e.knl
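
The two build recipes differ only in which makefile is copied into place, so they can be wrapped in one small function. This is a sketch under stated assumptions, not a script from the VLPL-S package: `build_vlpl` is a hypothetical name, the source directory defaults to `$HOME/VLPL-S`, the makefile names are taken from this recipe as written, and the build is assumed to be driven by a plain `make` invocation in the source directory.

```shell
#!/bin/sh
# Hypothetical wrapper around the two build recipes; not shipped with VLPL-S.

# Usage: build_vlpl <cpu|knl> [source_dir]
# Copies the target-specific makefile over Makefile, then runs make.
build_vlpl() {
    target=$1
    src_dir=${2:-"$HOME/VLPL-S"}
    case "$target" in
        cpu|knl) ;;
        *) echo "usage: build_vlpl <cpu|knl> [source_dir]" >&2; return 2 ;;
    esac
    # File names follow the recipe text as written ("Makfile.<target>").
    makefile="$src_dir/Makfile.$target"
    if [ ! -f "$makefile" ]; then
        echo "not found: $makefile" >&2
        return 1
    fi
    cp "$makefile" "$src_dir/Makefile" || return 1
    # MAKE can be overridden by the caller; plain make is the default.
    (cd "$src_dir" && ${MAKE:-make})
}
```

For example, `build_vlpl knl` would be expected to produce `v2d_sjtu.e.knl` in the source directory. Whether object files must be cleaned when switching between the two targets in the same tree depends on the makefiles, which this recipe does not describe.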

VI. Run VLPL-S on the Intel Xeon processor and Intel Xeon Phi processor

  1. Run VLPL-S with the test workload on the Intel Xeon processor.
     

    > cd /home/<user>/VLPL-S

    > mpirun -n 36 ./v2d_sjtu.e.cpu ./v2d.ini

  2. Run VLPL-S with the test workload on the Intel Xeon Phi processor. Make sure all binary and workload files are located on the Intel Xeon Phi (KNL) system.
     

    > cd /home/<user>/VLPL-S

    > mpirun -n 272 ./v2d_sjtu.e.knl ./v2d.ini
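
The rank counts in the two commands match the platform configurations listed in this article: 36 ranks is one per physical core on the two-socket Intel Xeon system (2 × 18 cores), and 272 ranks is one per hardware thread on the Intel Xeon Phi processor 7250 (68 cores × 4 threads). The sketch below makes that arithmetic explicit and prints the launch lines without running them. `rank_count` and `launch_line` are hypothetical helper names, not part of VLPL-S or Intel MPI.

```shell
#!/bin/sh
# Hypothetical helpers; not shipped with VLPL-S.

# Usage: rank_count <cores> <threads_per_core>
rank_count() {
    echo $(( $1 * $2 ))
}

# Build the launch line without executing it, so it can be inspected first.
# Usage: launch_line <ranks> <executable> <input_file>
launch_line() {
    printf 'mpirun -n %s %s %s\n' "$1" "$2" "$3"
}

# Two-socket Xeon E5-2697 v4: 36 physical cores, one rank per core.
launch_line "$(rank_count 36 1)" ./v2d_sjtu.e.cpu ./v2d.ini
# Xeon Phi 7250: 68 cores, 4 hardware threads per core.
launch_line "$(rank_count 68 4)" ./v2d_sjtu.e.knl ./v2d.ini
```

By the same reasoning the Intel Xeon Phi processor 7210 (64 cores, 256 threads) would use 256 ranks, although the article does not state the rank count used on that part.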

VII. Performance gain

For the test workload, the following graph shows the speedup achieved on the Intel Xeon Phi processor relative to the Intel Xeon processor. As you can see, we get:

  • Up to 1.55x faster with the Intel® Xeon Phi™ processor 7210 compared to the two-socket Intel® Xeon® processor E5-2697 v4.
  • Up to 2.00x faster with the Intel® Xeon Phi™ processor 7250 compared to the two-socket Intel Xeon processor E5-2697 v4.

Figure: VLPL-S Performance Improvement with the Intel® Xeon Phi™ Processor

Comments on performance improvement on Intel Xeon Phi:

  • VLPL-S has good parallel scalability and benefits from more cores. The best performance on the Intel® Xeon Phi™ processor 7250 is achieved with 272 MPI ranks, which means the code makes good use of all 272 logical cores (68 cores × 4 hardware threads).
  • VLPL-S code can be vectorized, so the 512-bit vectors of Intel® AVX-512 improve performance significantly.
  • VLPL-S also benefits from MCDRAM because it is bound by memory bandwidth.
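
To reproduce a speedup figure of the kind quoted in this section, one would time the same workload on both platforms and divide. The sketch below assumes that "x faster" means the ratio of wall-clock times (baseline divided by candidate), which is the usual convention but is not spelled out in the article. `speedup` is a hypothetical helper, and the timings in the example call are placeholders, not measurements.

```shell
#!/bin/sh
# Hypothetical helper; the timings used below are placeholders, not measurements.

# Usage: speedup <baseline_seconds> <candidate_seconds>
# Prints baseline/candidate with two decimals; fails if candidate is not positive.
speedup() {
    awk -v base="$1" -v cand="$2" 'BEGIN {
        if (cand <= 0) { exit 1 }
        printf "%.2f\n", base / cand
    }'
}

# Placeholder values: a 200 s baseline run against a 100 s candidate run.
speedup 200 100   # prints 2.00
```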

Testing platform configuration:

Intel Xeon processor E5-2697 v4: Dual-Socket Intel Xeon processor E5-2697 v4, 2.3 GHz, 18 cores/socket, 36 cores, 72 threads (HT and Turbo ON), DDR4 128 GB, 2400 MHz, Oracle Linux* Server release 6.7

Intel Xeon Phi processor 7210 (64 cores): Intel Xeon Phi processor 7210, 64 cores, 256 threads, 1300 MHz core freq. (HT and Turbo ON), 1600 MHz uncore freq., MCDRAM 16 GB 6.4 GT/s, BIOS 10D42, DDR4 96 GB, 2133 MHz, Red Hat 7.2, SNC-4 cluster mode, MCDRAM cache memory mode.

Intel Xeon Phi processor 7250 (68 cores): Intel Xeon Phi processor 7250, 68 cores, 272 threads, 1400 MHz core freq. (HT and Turbo ON), 1700 MHz uncore freq., MCDRAM 16 GB 7.2 GT/s, BIOS 10R00, DDR4 96 GB, 2400 MHz, Red Hat 7.2, SNC-4 cluster mode, MCDRAM cache memory mode.

For more complete information about compiler optimizations, see our Optimization Notice.