# Monte Carlo Pi using Xeon Phi


Hello,

I would like to port the following toy example to the Xeon Phi. It's a Monte Carlo simulation that estimates pi using Intel TBB and MKL random number generators in parallel. How would I need to rewrite the example so that the whole Monte Carlo simulation is offloaded to the Xeon Phi and uses an optimal, dynamically determined number of TBB tasks?

```cpp
#include <iostream>
#include <vector>

#include "mkl_vsl.h"
#include "tbb/task_group.h"
#include "tbb/task_scheduler_init.h"
#include "tbb/tick_count.h"

class PiCalculator {
public:
    long numpoints;
    long& in;
    VSLStreamStatePtr stream;

    PiCalculator(long numpoints, long& in, VSLStreamStatePtr stream) :
        numpoints(numpoints), in(in), stream(stream) {
        in = 0; // make sure to initialize to zero
    }

    void operator()() const {
        const int block_size = 2048;
        double variates[2 * block_size];
        long nblocks = numpoints / block_size;
        long ntail = numpoints - nblocks * block_size;
        for (long j = 0; j < nblocks; j++) {
            vdRngUniform(VSL_RNG_METHOD_UNIFORM_STD, stream, 2 * block_size, variates, 0.0, 1.0);
            for (int i = 0; i < block_size; i++) {
                double x = variates[2 * i + 0];
                double y = variates[2 * i + 1];
                if (x * x + y * y <= 1.0)
                    ++in;
            }
        }
        vdRngUniform(VSL_RNG_METHOD_UNIFORM_STD, stream, 2 * ntail, variates, 0.0, 1.0);
        for (long i = 0; i < ntail; i++) {
            double x = variates[2 * i + 0];
            double y = variates[2 * i + 1];
            if (x * x + y * y <= 1.0)
                ++in;
        }
    }
};

int main() {
    const long samples = 10000000000l;
    // let TBB pick a task count matching the available hardware threads
    const int tasks = tbb::task_scheduler_init::default_num_threads();
    const int seed = 1;
    std::vector<VSLStreamStatePtr> stream(tasks);
    std::vector<long> results(tasks);

    for (int i = 0; i < tasks; i++) {
        int errorcode = vslNewStream(&stream[i], VSL_BRNG_MCG59, seed);
        if (errorcode) {
            return 1;
        }
        // give each task a disjoint subsequence of the generator,
        // otherwise every task would draw identical variates
        errorcode = vslSkipAheadStream(stream[i], 2 * (samples / tasks) * i);
        if (errorcode) {
            return 1;
        }
    }

    tbb::tick_count t0 = tbb::tick_count::now();

    tbb::task_group group;
    for (int i = 0; i < tasks; i++) {
        group.run(PiCalculator(samples / tasks, results[i], stream[i]));
    }
    group.wait();

    tbb::tick_count t1 = tbb::tick_count::now();

    long result = 0;
    for (int i = 0; i < tasks; i++) {
        result += results[i];
    }

    std::cout << "pi = " << 4.0 * result / samples << std::endl;
    std::cout << "time [s]: " << (t1 - t0).seconds() << std::endl;

    for (int i = 0; i < tasks; i++)
        vslDeleteStream(&stream[i]);

    return 0;
}
```


I rewrote the code sample using OpenMP with offload to MIC. Could someone please confirm that the code is optimal, as I will use it as an example in a presentation on the Xeon Phi. Thanks!

```cpp
#include <omp.h>
#include "mkl_vsl.h"

double MonteCarloPi(long simulations)
{
    long inCircle = 0;

    #pragma offload target(mic) in(simulations) inout(inCircle)
    {
        const int numThreads = omp_get_max_threads();
        const long samplesPerThread = simulations / numThreads;
        VSLStreamStatePtr streams[256]; // enough slots for all Phi threads
        const int seed = 1;

        // initialize random generator streams, one independent BRNG per thread
        for (int i = 0; i < numThreads; i++)
            vslNewStream(&streams[i], VSL_BRNG_MT2203 + i, seed);

        #pragma omp parallel reduction(+:inCircle)
        {
            const int block_size = 2048;
            double variates[2 * block_size];
            const int tid = omp_get_thread_num();
            long nblocks = samplesPerThread / block_size;
            long ntail = samplesPerThread - nblocks * block_size;
            for (long j = 0; j < nblocks; j++) {
                vdRngUniform(VSL_RNG_METHOD_UNIFORM_STD, streams[tid], 2 * block_size, variates, 0.0, 1.0);
                for (int i = 0; i < block_size; i++) {
                    double x = variates[2 * i + 0];
                    double y = variates[2 * i + 1];
                    if (x * x + y * y <= 1.0)
                        inCircle++;
                }
            }
            vdRngUniform(VSL_RNG_METHOD_UNIFORM_STD, streams[tid], 2 * ntail, variates, 0.0, 1.0);
            for (long i = 0; i < ntail; i++) {
                double x = variates[2 * i + 0];
                double y = variates[2 * i + 1];
                if (x * x + y * y <= 1.0)
                    inCircle++;
            }
        } // end omp parallel

        for (int i = 0; i < numThreads; i++)
            vslDeleteStream(&streams[i]);
    } // end offload

    return 4.0 * inCircle / simulations;
}
```

Hi Anwar,

1. Generally, this approach to computing pi on the Intel(R) Xeon Phi(TM) looks reasonable, as it exploits parallelization, data locality, and vectorization.

2. Please pay attention to the problem size, i.e. the `simulations` parameter. If it is small, the code will not use the Intel(R) Xeon Phi(TM) effectively.

3. You may want to tune the performance of the code further, keeping in mind several parameters: the L1/L2 cache sizes, the number of threads you are going to use, the size of the array for random numbers, and the data type. For example, if you are going to use 4 threads per Phi core, it might make sense to decrease the total size of the array of double random numbers; say, check 2x512 or 2x256 (that is, a `block_size` of 512 or 256). If you use a smaller number of threads per core, say one, check 2x1024 in addition to 2x2048.

4. Align the array `variates`, which is used to store the random numbers: `__declspec(align(64)) double variates[]`.

5. In addition to MT2203 you may want to try, say, the MCG59 BRNG and apply the SkipAhead technique to get random streams ready for parallel Monte Carlo simulations. The number of random numbers to skip, `nskip`, is known to you in advance.

6. Try additional OpenMP affinity settings, which help assign threads to the cores of the Intel(R) Xeon Phi(TM) in the way most effective for your application. Additional details are available, for example, at http://software.intel.com/sites/products/documentation/studio/composer/en-us/2011Update/compiler_c/optaps/common/optaps_openmp_thread_affinity.htm

Please let me know if this helps or if you have additional questions.

Andrey

Hi,

You guys might find the following case study helpful:

http://software.intel.com/en-us/articles/case-study-achieving-high-performance-on-monte-carlo-european-option-on-intel-xeon-phi

Also, the slides and labs published at iXPTC 2013 describe the process of optimizing Monte Carlo for the Intel Xeon Phi coprocessor. This material can be found at:

http://software.intel.com/en-us/articles/ixptc-2013-presentations

I hope this helps.