White paper: N-body simulation on Intel Xeon Phi Coprocessors

White paper: N-body simulation on Intel Xeon Phi Coprocessors

Hi All,

I am a rookie in the Intel Developer Zone, but I was fortunate to have early access to Intel Xeon Phi coprocessors for some time. Now that Intel has lifted the Xeon Phi coprocessor's veil, I am happy to share my user experience and demonstrate some examples.

I have written a white paper investigating a toy model, the N-body simulation, on an Intel Xeon Phi coprocessor. The PDF file can be downloaded here: http://research.colfaxinternational.com/file.axd?file=2013%2f1%2fColfax_... . The paper contains code samples, assembly listings and performance results, and discusses some programming practices for the efficient utilization of the coprocessor. Most importantly, it demonstrates that a single C/C++ code can efficiently run on Xeon processors and Xeon Phi coprocessors. Mind that the performance results are obtained on a pre-production sample, and the specifications and performance of the final product may be different.

If this paper has some utility to fellow developers and scientists, I will be thrilled. I am also looking forward to discussions on this forum and learning about the experience of other people with Xeon Phi programming.

Cheers,

Andrey.

12 posts / 0 new
Last post
For more complete information about compiler optimizations, see our Optimization Notice.

Andrey,

Thank you for contributing to the community.

Regards
--
Taylor

Hi Andrey

May I get your source code run on Xeon processors and Xeon Phi coprocessors.

Regards,

zhu.j

@zhu.j

The part of the source code is in N-body simulation white paper.

@Andrey Vladimirov

Thank you for publishing your white paper about the N-body simulation.It helped me a lot to understand how to implement N-body simulation.

@iliyapolak 

I hope to have the whole code and try to run it on my own platform。

thanks

The part of the code which is missing are calls to particles initializers.

thanks

Quote:

iliyapolak wrote:

The part of the code which is missing are calls to particles initializers.

 

thanks

Quote:

iliyapolak wrote:

The part of the code which is missing are calls to particles initializers.

 

Thank you!

I managed to add Windows multithreading now I am trying to properly synchronize the threads.I run it on Haswell CPU.

Hmm, I thought that we included a complete code listing in the paper. The particle initialization step uses a random number generator for MKL. It is not a very physically meaningful setup, just some numbers to crunch.

In any case, here is a code that can be run (sorry if spacing is off):

#include <math.h>

	#include <mkl_vsl.h>

	#include <omp.h>

	#include <stdio.h>
struct ParticleSystemType {
  float *x, *y, *z, *vx, *vy, *vz;
};
int main(const int argc, const char** argv) {

	  const int nParticles = 30000, nSteps = 10; // Simulation parameters

	  const float dt = 0.01f; // Particle propagation time step

	  ParticleSystemType p; // Particle system stored as a structure of arrays

	  float *buf = (float*) malloc(6*nParticles*sizeof(float)); // Malloc all data

	  p.x  = buf+0*nParticles; p.y  = buf+1*nParticles; p.z  = buf+2*nParticles;

	  p.vx = buf+3*nParticles; p.vy = buf+4*nParticles; p.vz = buf+5*nParticles;
 
  // Initialize particles

	  VSLStreamStatePtr rnStream;  vslNewStream( &rnStream, VSL_BRNG_MT19937, 1 );

	  vsRngUniform(VSL_RNG_METHOD_UNIFORM_STD, rnStream, 6*nParticles, buf, -1.0f, 1.0f);

	 

	  // Propagate particles

	  printf("Propagating particles using %d threads...n", omp_get_max_threads());

	  double rate = 0.0, dRate = 0.0; // Benchmarking data

	  const int skipSteps = 1; // Skip first iteration is warm-up on Xeon Phi coprocessor

	  for (int step = 1; step <= nSteps; step++) {

	    const double tStart = omp_get_wtime(); // Start timing
 
    #pragma omp parallel for schedule(dynamic)

	    for (int i = 0; i < nParticles; i++) { // Parallel loop over particles that experience force

	      float Fx = 0.0f; float Fy = 0.0f; float Fz = 0.0f; // Components of force on particle i

	      

	      for (int j = 0; j < nParticles; j++) { // Vectorized loop over particles that exert force

	    if (j != i) {

	      // Newton's law of universal gravity calculation.

	      const float dx = p.x[j] - p.x[i];

	      const float dy = p.y[j] - p.y[i];

	      const float dz = p.z[j] - p.z[i];

	      const float drSquared    = dx*dx + dy*dy + dz*dz;

	      const float drPowerN32   = 1.0f/(drSquared*sqrtf(drSquared));

	      

	      // Reduction to calculate the net force

	      Fx += dx * drPowerN32;  Fy += dy * drPowerN32;  Fz += dz * drPowerN32;

	    }

	      }

	      // Move particles in response to the gravitational force

	      p.vx[i] += dt*Fx;        p.vy[i] += dt*Fy;        p.vz[i] += dt*Fz;

	    }

	    for (int i = 0; i < nParticles; i++) { // Not much work, serial loop

	      p.x [i] += p.vx[i]*dt; p.y [i] += p.vy[i]*dt; p.z [i] += p.vz[i]*dt;

	    }

	    

	    const double tEnd = omp_get_wtime(); // End timing

	    if (step > skipSteps) // Collect statistics

	      { rate  += 1.0/(tEnd - tStart); dRate += 1.0/((tEnd - tStart)*(tEnd-tStart)); }

	    printf("Step %d: %.3f secondsn", step, (tEnd-tStart));

	  }

	  rate/=(double)(nSteps-skipSteps); dRate=sqrt(dRate/(double)(nSteps-skipSteps)-rate*rate);

	  printf("Average rate for iterations %d through %d: %.3f +- %.3f steps per second.n",

	     skipSteps+1, nSteps, rate, dRate);

	  free(buf);

	}
 

 

The compilation commands on Linux are:

icpc -xhost -O3 -openmp -mkl -fimf-domain-exclusion=8 -o nbody nbody.c
icpc -O3 -openmp -mkl -fimf-domain-exclusion=8 -o nbody-mic nbody.c -mmic

The flag "-fimf-domain-exclusion=8" disables denormal number handling in math functions.

Hi!

For information with version 14 of Intel compiler the -fimf-domain-exclusion=8 option is no longer improving the code speed. 

Thanks for the interesting comments on the code optimization. 

 

 

 

 

 

 

 

 

Login to leave a comment.