White paper: N-body simulation on Intel Xeon Phi Coprocessors

Hi All,

I am a rookie in the Intel Developer Zone, but I was fortunate to have early access to Intel Xeon Phi coprocessors for some time. Now that Intel has lifted the Xeon Phi coprocessor's veil, I am happy to share my user experience and demonstrate some examples.

I have written a white paper investigating a toy model, the N-body simulation, on an Intel Xeon Phi coprocessor. The PDF file can be downloaded here: http://research.colfaxinternational.com/file.axd?file=2013%2f1%2fColfax_... . The paper contains code samples, assembly listings, and performance results, and discusses some programming practices for the efficient utilization of the coprocessor. Most importantly, it demonstrates that a single C/C++ code can run efficiently on both Xeon processors and Xeon Phi coprocessors. Note that the performance results were obtained on a pre-production sample; the specifications and performance of the final product may differ.

If this paper has some utility to fellow developers and scientists, I will be thrilled. I am also looking forward to discussions on this forum and learning about the experience of other people with Xeon Phi programming.





Thank you for contributing to the community.


Hi Andrey,

May I get the source code that runs on Xeon processors and Xeon Phi coprocessors?




Part of the source code is in the N-body simulation white paper.

@Andrey Vladimirov

Thank you for publishing your white paper about the N-body simulation. It helped me a lot to understand how to implement an N-body simulation.


I hope to get the whole code and try to run it on my own platform.


The part of the code that is missing is the calls to the particle initializers.



iliyapolak wrote:

The part of the code that is missing is the calls to the particle initializers.






Thank you!

I managed to add Windows multithreading; now I am trying to properly synchronize the threads. I am running it on a Haswell CPU.

Hmm, I thought that we had included a complete code listing in the paper. The particle initialization step uses a random number generator from MKL. It is not a very physically meaningful setup, just some numbers to crunch.

In any case, here is code that can be run:

#include <math.h>
#include <mkl_vsl.h>
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

struct ParticleSystemType {
  float *x, *y, *z, *vx, *vy, *vz;
};

int main(const int argc, const char** argv) {
  const int nParticles = 30000, nSteps = 10; // Simulation parameters
  const float dt = 0.01f; // Particle propagation time step
  ParticleSystemType p; // Particle system stored as a structure of arrays
  float *buf = (float*) malloc(6*nParticles*sizeof(float)); // Malloc all data
  p.x  = buf+0*nParticles; p.y  = buf+1*nParticles; p.z  = buf+2*nParticles;
  p.vx = buf+3*nParticles; p.vy = buf+4*nParticles; p.vz = buf+5*nParticles;

  // Initialize particles
  VSLStreamStatePtr rnStream;  vslNewStream( &rnStream, VSL_BRNG_MT19937, 1 );
  vsRngUniform(VSL_RNG_METHOD_UNIFORM_STD, rnStream, 6*nParticles, buf, -1.0f, 1.0f);

  // Propagate particles
  printf("Propagating particles using %d threads...\n", omp_get_max_threads());
  double rate = 0.0, dRate = 0.0; // Benchmarking data
  const int skipSteps = 1; // Skip the first iteration: it is warm-up on the Xeon Phi coprocessor
  for (int step = 1; step <= nSteps; step++) {
    const double tStart = omp_get_wtime(); // Start timing
#pragma omp parallel for schedule(dynamic)
    for (int i = 0; i < nParticles; i++) { // Parallel loop over particles that experience force
      float Fx = 0.0f; float Fy = 0.0f; float Fz = 0.0f; // Components of force on particle i

      for (int j = 0; j < nParticles; j++) { // Vectorized loop over particles that exert force
        if (j != i) {
          // Newton's law of universal gravity calculation.
          const float dx = p.x[j] - p.x[i];
          const float dy = p.y[j] - p.y[i];
          const float dz = p.z[j] - p.z[i];
          const float drSquared  = dx*dx + dy*dy + dz*dz;
          const float drPowerN32 = 1.0f/(drSquared*sqrtf(drSquared));

          // Reduction to calculate the net force
          Fx += dx * drPowerN32;  Fy += dy * drPowerN32;  Fz += dz * drPowerN32;
        }
      }

      // Move particles in response to the gravitational force
      p.vx[i] += dt*Fx;  p.vy[i] += dt*Fy;  p.vz[i] += dt*Fz;
    }

    for (int i = 0; i < nParticles; i++) { // Not much work, serial loop
      p.x[i] += p.vx[i]*dt; p.y[i] += p.vy[i]*dt; p.z[i] += p.vz[i]*dt;
    }

    const double tEnd = omp_get_wtime(); // End timing
    if (step > skipSteps) // Collect statistics
      { rate += 1.0/(tEnd - tStart); dRate += 1.0/((tEnd - tStart)*(tEnd - tStart)); }
    printf("Step %d: %.3f seconds\n", step, (tEnd - tStart));
  }

  rate /= (double)(nSteps - skipSteps);
  dRate = sqrt(dRate/(double)(nSteps - skipSteps) - rate*rate);
  printf("Average rate for iterations %d through %d: %.3f +- %.3f steps per second.\n",
         skipSteps + 1, nSteps, rate, dRate);

  vslDeleteStream(&rnStream);
  free(buf);
  return 0;
}

The compilation commands on Linux are:

icpc -xhost -O3 -openmp -mkl -fimf-domain-exclusion=8 -o nbody nbody.c
icpc -O3 -openmp -mkl -fimf-domain-exclusion=8 -o nbody-mic nbody.c -mmic

The flag "-fimf-domain-exclusion=8" disables denormal number handling in math functions.


For your information: with version 14 of the Intel compiler, the -fimf-domain-exclusion=8 option no longer improves the code speed.

Thanks for the interesting comments on the code optimization. 








