Recipe: Building and Running Soft Sphere Simulation for Intel® Xeon® Processors and Intel® Xeon Phi™ Processors

I. Overview

This article provides a recipe for obtaining, compiling, and running the Soft Sphere Simulation sample code with the sample workload on Intel® Xeon® processors and Intel® Xeon Phi™ processors.

II. Introduction

Soft Sphere Simulation is a 3D molecular dynamics simulation tool developed by IPE-CAS (Institute of Process Engineering, Chinese Academy of Sciences), China. It uses soft spherical particles to simulate structured molecules, and this in-house code implements the BKS (van Beest-Kramer-van Santen) potential model.

This version of Soft Sphere Simulation is optimized for performance on both Intel Xeon processors and Intel Xeon Phi processors. Optimizations in this package include:

  • Replace division with multiplication by the reciprocal.

  • Use OpenMP* for multithreading, with the schedule clause for load balancing.

  • Use SIMD reduction to vectorize the force computation.

  • Reorder particles in memory to reduce cache misses.

  • Tune compiler options.

III. Preliminaries

1. To build this package, install Intel® MPI Library 5.0 and Intel® Parallel Studio XE 2016, or later versions, on your host system. To get the sample source code and workload packages, contact Prof. Ge at wge@ipe.ac.cn and specify that you are interested in the version used for the KNL (Knights Landing, the code name of the Intel Xeon Phi processor x200) recipes.

2. Set up the Intel MPI Library and Intel® compiler environments:

> source /opt/intel/compilers_and_libraries_<version>/linux/mpi/bin64/mpivars.sh

> source /opt/intel/compilers_and_libraries_<version>/linux/bin/compilervars.sh intel64
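
After sourcing both scripts, it can be worth confirming that the tools used later in this recipe are on PATH. The following is a minimal sketch, not part of the original package: check_tools is a hypothetical helper name, and it only tests that a command can be found, not that its version is suitable.

```shell
# Report whether each named command is on PATH.
# Returns non-zero if at least one command is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# mpiicpc is the compiler wrapper used in section VI of this recipe.
check_tools mpiicpc || echo "Source mpivars.sh and compilervars.sh again."
```

If the wrapper is reported missing, the most likely cause is that the scripts were sourced in a different shell session.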

3. Configure the BIOS.

Enter BIOS → Advanced → Uncore configuration, then set <cluster mode> to <quadrant> and <memory mode> to <cache>.

Please refer to Intel® Xeon Phi™ Processor x200- Memory Modes and Cluster Modes: Configuration and Use Cases for more details on memory configuration.

If you do not run the Soft Sphere Simulation on an Intel Xeon Phi processor, the BIOS configuration is not necessary.

IV. Build Soft Sphere Simulation with Intel® Advanced Vector Extensions 2 (Intel® AVX2) for both Intel Xeon and Xeon Phi processors

1. Unpack the source code under /home/<user>. The following steps assume this location.

> tar xvfj SSS_opt_20160512.tar.bz2

This will create the SSS_build directory.

2. Build the executable with the Intel AVX2 compiler option.

> cd /home/<user>/SSS_build

> make -f Makefile.opt ARCH=hsw

This builds an Intel AVX2 executable that runs with OpenMP threading on both Intel Xeon and Intel Xeon Phi processors. The executable is /home/<user>/SSS_build/main_opt.avx2.

V. Build Soft Sphere Simulation for Intel Xeon Phi processor with Intel® Advanced Vector Extensions 512 (Intel® AVX-512) instructions

Build the executables for the Intel Xeon Phi processor.

> cd /home/<user>/SSS_build

> make -f Makefile.opt ARCH=knl

This builds an Intel AVX-512 executable that runs with OpenMP threading on the Intel Xeon Phi processor. The executable is /home/<user>/SSS_build/main_opt.avx512.
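
Before running either executable, it can help to confirm that the target CPU advertises the matching instruction set. The sketch below assumes a Linux system that exposes /proc/cpuinfo. cpu_has_flag is a hypothetical helper, and avx2 and avx512f are the Linux kernel's flag names for Intel AVX2 and the Intel AVX-512 foundation instructions.

```shell
# Return 0 if the CPU flags line contains the given flag as a whole word.
# The optional second argument is a cpuinfo-style file (default: /proc/cpuinfo).
cpu_has_flag() {
    flag=$1
    file=${2:-/proc/cpuinfo}
    grep -m1 '^flags' "$file" 2>/dev/null | grep -qw -- "$flag"
}

# Suggest which executable from this recipe matches the current machine.
if cpu_has_flag avx512f; then
    echo "avx512f present: main_opt.avx512 should be runnable"
elif cpu_has_flag avx2; then
    echo "avx2 present: main_opt.avx2 should be runnable"
else
    echo "neither avx2 nor avx512f found in the CPU flags"
fi
```

The word match (-w) prevents a shorter name such as avx from matching inside avx2.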

VI. Run Soft Sphere Simulation with the sample workload on the Intel Xeon processor

1. Prepare the initial data file.

      > cd /home/<user>/SSS_build

      > mpiicpc -o init init.cpp

      > ./init

This creates the initial input data file par0.dat in /home/<user>/SSS_build. The workload has a scale of 100x100x100 particles, with a 5x cut-off radius and 200 timesteps.

Note: Do this step only once if the simulation results are to be comparable among multiple runs.

2. Run Soft Sphere Simulation with the sample workload on the Intel Xeon processor.

      > cd /home/<user>/SSS_build

      > export OMP_NUM_THREADS=70

      > export KMP_AFFINITY=scatter,granularity=thread

      > ./main_opt.avx2

This will run Soft Sphere Simulation with 70 OpenMP threads.

3. Check the performance. The screen output reports a time every 20 timesteps and the total time on the last line, both in seconds. The following is an example output.

finished 20: amaxmove:0.00200282 t: 0.06902

finished 40: amaxmove:0.0040116 t: 0.075454

finished 60: amaxmove:0.00602861 t: 0.0672879

finished 80: amaxmove:0.00805818 t: 0.0673699

finished 100: amaxmove:0.0101044 t: 0.0672908

finished 120: amaxmove:0.0121708 t: 0.0673959

finished 140: amaxmove:0.0142609 t: 0.0673859

finished 160: amaxmove:0.0163804 t: 0.076736

finished 180: amaxmove:0.0185331 t: 0.0676651

finished 200: amaxmove:0.0207227 t: 0.067317

run time: 14.0609

The total time is 14.0609 seconds in this run.
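
To compare several runs, the output can be captured to a file and summarized. The following is a rough sketch that assumes the output format shown above and a hypothetical log file name, run.log, created with a command such as ./main_opt.avx2 | tee run.log. The summarize_log helper is not part of the package, and the mean of the t values is only a consistency check, since this recipe does not define exactly what t measures.

```shell
# Summarize a captured run: count the progress lines, average their
# t values, and echo the total reported on the "run time:" line.
summarize_log() {
    awk '
        /^finished/  { sum += $NF; n++ }   # last field is the t value
        /^run time:/ { total = $NF }       # last field is the total in seconds
        END {
            if (n > 0) printf "progress lines: %d, mean t: %.6f\n", n, sum / n
            if (total != "") printf "reported run time: %s s\n", total
        }
    ' "$1"
}

# run.log is an assumed file name; nothing happens if it does not exist.
if [ -f run.log ]; then
    summarize_log run.log
fi
```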

This section also applies to the Intel Xeon Phi processor. Interested users can run this executable on the Intel Xeon Phi processor with more threads to observe how the performance changes. The Intel AVX2 result obtained here can also serve as a baseline for the Intel AVX-512 result measured in the next section.

VII. Run Soft Sphere Simulation with the sample workload on the Intel Xeon Phi processor

1. Prepare the initial data file as described in section VI, step 1. If you have already done so, skip this step.

2. Run Soft Sphere Simulation with the sample workload on the Intel Xeon Phi processor.

      > cd /home/<user>/SSS_build

      > export OMP_NUM_THREADS=136

      > export KMP_AFFINITY=scatter,granularity=thread

      > ./main_opt.avx512

This will run Soft Sphere Simulation with 136 OpenMP threads.

3. Check the performance as described in section VI, step 3.

VIII. Performance gain

For the sample workload, the following graph shows the speedup of the Intel Xeon Phi processor over the Intel Xeon processor. The performance improvements are substantial:

  • Up to 1.60x faster with the Intel® Xeon Phi™ processor 7210 compared to the 2-socket Intel® Xeon® processor E5-2697 v4.

  • Up to 1.81x faster with the Intel® Xeon Phi™ processor 7250 compared to the 2-socket Intel Xeon processor E5-2697 v4.
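
A speedup figure of this kind is the baseline run time divided by the new run time. The following sketch shows the arithmetic for two "run time:" values. The speedup helper is hypothetical, 14.0609 is the sample total shown in section VI, and 8.0 is a made-up value used only for illustration, not a measured result.

```shell
# Print baseline_time / new_time with two decimals.
# Both arguments are run times in seconds; the second must be non-zero.
speedup() {
    awk -v t_base="$1" -v t_new="$2" 'BEGIN { printf "%.2f\n", t_base / t_new }'
}

# 14.0609 s: sample total from section VI. 8.0 s: illustrative value only.
speedup 14.0609 8.0
```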

 

 

The performance improvement on the Intel Xeon Phi processor suggests the following:

  • Soft Sphere Simulation has good parallel scalability and benefits from more cores. However, since the kernel loop is constrained by the division resource and is sensitive to L1 cache misses, 2 threads per core outperform 4 threads per core. Therefore, the best performance on the Intel Xeon Phi processor 7250 is achieved with 136 threads even though there are as many as 272 threads available.

  • Soft Sphere Simulation is well vectorized, so the wider registers available with Intel AVX-512 improve performance significantly.

  • Soft Sphere Simulation also benefits moderately from MCDRAM, although it is not memory-bandwidth bound. Its memory footprint fits entirely in MCDRAM.
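
The thread count used in section VII follows from this reasoning: 68 cores with 2 threads per core gives 136 threads. The following sketch derives OMP_NUM_THREADS that way. omp_threads is a hypothetical helper, and the best threads-per-core value for other processors is an assumption to verify by measurement.

```shell
# Compute an OpenMP thread count from the number of physical cores
# and the number of hardware threads to use on each core.
omp_threads() {
    echo $(( $1 * $2 ))
}

# 68 cores x 2 threads per core for the Intel Xeon Phi processor 7250.
OMP_NUM_THREADS=$(omp_threads 68 2)
export OMP_NUM_THREADS
echo "OMP_NUM_THREADS=$OMP_NUM_THREADS"   # prints OMP_NUM_THREADS=136
```

Passing 4 as the second argument gives the full 272 hardware threads, which this recipe reports as slower than 136 for this workload.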

Testing platform configuration:

Intel Xeon processor E5-2697 v4: Dual-socket Intel Xeon processor E5-2697 v4, 2.3 GHz, 18 cores/socket, 36 cores, 72 threads (HT and Turbo ON), DDR4 128 GB, 2400 MHz, Oracle Linux* Server release 6.7.

Intel Xeon Phi processor 7210 (64 cores): Intel Xeon Phi processor 7210, 64 cores, 256 threads, 1300 MHz core freq. (HT and Turbo ON), 1600 MHz uncore freq., MCDRAM 16 GB 6.4 GT/s, BIOS 10D28, DDR4 96 GB, 2133 MHz, Red Hat 7.2, quad cluster mode, MCDRAM cache memory mode.

Intel Xeon Phi processor 7250 (68 cores): Intel Xeon Phi processor 7250, 68 cores, 272 threads, 1400 MHz core freq. (HT and Turbo ON), 1700 MHz uncore freq., MCDRAM 16 GB 7.2 GT/s, BIOS 10D28, DDR4 96 GB, 2400 MHz, Red Hat 7.2, quad cluster mode, MCDRAM cache memory mode.

 

For more complete information about compiler optimizations, see our Optimization Notice.