Recently, a group of students from the Computer Science department at California Polytechnic State University (Cal Poly), were assigned to design and write programs to solve different High Performance Computing (HPC) problems as their final projects. One project used the Intel® Xeon Phi™ coprocessor to solve an N-Body problem (see “Parallelization and the Intel® Xeon Phi™ for the N-Body Gravitation Problem”).
The goal of the N-Body problem is to predict the motion of a set of n objects interacting with each other by some force, e.g. the gravitational force. N-Body simulations have been used in particles simulation such as astrophysical and molecular dynamics simulations. There are a number of approaches for solving the N-Body problem, such as the Barnes-Hut algorithm, the Fast Multipole method, the Parallel Multipole Tree Algorithm method and the direct calculation method. In the direct calculation approach, for every particle in the system, the force between that particle and every other particle in the system is calculated, resulting in an O(n2) operation for each time step where n is the number of particles in the system.
The acceleration of each particle at each time step is calculated from the forces applied to that particle by every particle in the system (the mutual gravity force between two particles is a function of distance between them). From this acceleration, the new position of each particle is calculated. The process is repeated for some number of time steps. This code portion is highly parallel and is a good candidate for offloading to the coprocessor for computing.
In their project, the authors used the following techniques to solve their N-Body problem:
Explicit Offload to the coprocessor: the host system transferred the scalar n and arrays of known size n to the coprocessor for processing. The arrays contain the current positions of n particles. At the start of the offload, the memory on the coprocessor was allocated, and the value of n and particle position arrays are copied into the coprocessor memory. The coprocessor performed all the calculations and stored the results in the particle acceleration arrays on the coprocessor. At the end of the offload, these acceleration arrays were copied back to the host system. The host used acceleration information to update the new positions of these particles.
Thread Parallelism: OpenMP* directives were used to take advantage of the large number of threads available on the coprocessor (each core has four hardware threads and about 60 cores available in a coprocessor) to speed the program execution up through parallelism. The parallel code was launched in the target system (coprocessor) and a team of OpenMP threads executed the parallel code (calculated the acceleration values of the particles).
Data Parallelism: SIMD (Single Instruction Multiple Data) vector instructions were used to accelerate execution within each thread. Intel® Xeon Phi™ coprocessors support Intel ® IMCI (Intel® Initial Many Core Instructions) instructions which include 512-bit wide SIMD registers and a fast vector instruction set. That is, it applied the same mathematical operation in parallel to an array of input (arrays of positions) instead of each input separately.
Data Alignment: data were placed at a memory address which was a multiple of the length of the vector register in bytes (for the coprocessor 64-byte alignment is required) in order to increase vectorization opportunities (all the position arrays and acceleration arrays are aligned).
Use of multiple coprocessors:the parallel code was divided in half and offloaded into two coprocessors to speed up data processing.
Structure-of-Array (SOA) data layout: the data offloaded to the coprocessor were organized as a Structure-of-Array where all x coordinates were in one array, y coordinates were in a separate array, and z coordinates were in yet another array. This data organization maximizes cache hits when data were processed.
As part of the project, the students allowed the optimization incrementally to evaluate the benefit of each. First, they ran the program on the host system (Intel® Xeon® processor E5-2609 v2) with no optimization as the baseline. They then added vectorization (data parallelism) which gave a 1.6x speedup. Next, they ran two threads on the host with vectorization which gave a 3.2x speedup. Offloading the parallel code to a coprocessor gave a 22x speedup. Finally, offloading the parallel code to two coprocessors gave a speedup of 42x over the baseline. The experiments were performed using a range of n from 100 to 32,000 particles. Completed details are shown in their project paper (“Parallelization and the Intel® Xeon Phi™ for the N-Body Gravitation Problem”).
In summary, the Intel® Xeon Phi™ coprocessor has a large number of cores with 4 threads each and 512-byte SIMD instructions to support thread-parallel, and data-parallel programming. The students at California Polytechnic State University mastered the following techniques in order to complete their final projects using the Intel® Xeon Phi™ processor: explicit offload to the coprocessor, vectorization, data alignment to take advantage of memory access, Structure-of-Array (SoA vs AoS), and using OpenMP to take advantage of the large number of core in the coprocessor. This work gave the students a successful experience and encouragement to pursue further optimizations. For more information on the Intel® Xeon Phi™ coprocessor, readers can visit its web page here.