parallel random number generation

parallel random number generation

Аватар пользователя Anwar Ludin

I would like to use the vsl random number generators in a parallel monte carlo simulation, ie the possibility to distribute the simulation on all processor cores. Regarding this I have 2 different cases: - In the first case I would like to accelerate a simulation by distributing it on multiple processor cores. For example, let s say I need to simulate 10000 runs with each run containing 5000 timesteps. That means that I need to generate 10000*5000 random variates. My simulation would look something like this:

#define SEED 1

#define BRNG VSL_BRNG_MCG31

#define METHOD VSL_RNG_METHOD_GAUSSIAN_ICDF

// initialize vsl random generator stream

VSLStreamStatePtr stream;

double a=0.0,sigma=0.3;

errcode = vslNewStream( &stream, BRNG, SEED );

for(int i=0; i<9999; i++){

// simulate one path by generating 5000 variates.

double r[5000];

vdRngGaussian( METHOD, stream, N, r, a, sigma );

for (int j=0;j<4999;j++){

// simulate random walk using the variates

}

}

I would like to parallelize the outer loop. My question is: is it safe to call vdRngGaussian from multiple threads? And am I guaranteed to have independant variates?


The second scenario would be to parallelize multiple simulations. In this case I would like to do one full simulation per thread and I need to to generate independant variates for all simulations. In this case my question would be what is the approach to generating the random variates? Should I use one rng per thread and initialize them with different seeds? I have been told that this is not the best way of getting independant variates. Another method would be to use the leapfrog method. What is best?



anwar

34 posts / 0 новое
Последнее сообщение
Пожалуйста, обратитесь к странице Уведомление об оптимизации для более подробной информации относительно производительности и оптимизации в программных продуктах компании Intel.
Аватар пользователя Andrey Nikolaev (Intel)
Best Reply

Hello Anwar,

Intel MKL Random Number Generators support parallel Monte Carlo simulations by means of the following methodologies:

1. Block-splitting which allows you to split the original sequence into k non-overlapping blocks, where k - number of independent streams. The first stream would generate the random numbers x(1),...,x(nskip), the second stream would generate the numbers x(nskip+1),...,x(2nskip), etc. Service function vslSkipAheadStream( stream, nskip ) is way to use this methodology in your app.

VSLStreamStatePtr stream[nstreams];
int k;
for ( k=0; k{
vslNewStream( &stream[k], brng, seed );
vslSkipAheadStream( stream[k], nskip*k );
}

2. Leap-Frogging that allows you to split the original sequence into k disjoint subsequences, where k - number of independent streams. The first stream would generate the random numbers x(1), x(k+1),x(2k+1),... the second stream would generate the numbers x(2), x(k+2),x(2k+2),...,etc. Service function vslLeapfrogStream( stream, k, nstreams ) will help you to parallelize the application by means of this approach.

VSLStreamStatePtr stream[nstreams];
int k;
for ( k=0; k{
vslNewStream( &stream[k], brng, seed );
vslLeapfrogStream( stream[k], k, nstreams );
}

3. Generator family which supports parallel Monte Carlo simulations by design: Whichmann-Hill Basic Random Number Generator (BRNG) helps you to get up to 273 independent random streams, and Mersenne Twsiter MT2203 - up to 6024 independent random streams

#define nstreams 6024
VSLStreamStatePtr stream[nstreams];

int k;
for ( k=0; k< nstreams; k++ )
{
vslNewStream( &stream[k], VSL_BRNG_MT2203+k, seed );
}

All those techniques will help you to have streams of indepedent variates. As soon as you create nstreams random streams you can call safely call MKL Gaussian or any other RNG with k-th stream stream[k] in thread safe way. Below are the additional notes related to parallelization of Monte-Carlo simulations:

1. To parallelize one simulation in your code you might use Block-Splitting or LeapFrog methodologies with MCG31, MCG59 or other MKL BRNG which supports them. Before the simulations please check that BRNG period addresses needs of your application in random numbers.

2. Please, avoid using the same VSL RNG per one thread initialized with different seeds. It can results in the biased output of the Monte-Carlo simulations. Instead, please, use one of the methodologies described above.

3. To parallelize multiple simulations you can use one of the methodologies above. Considerations related to BRNG period are applicable to this scenario too. Also, please, keep inmind the following aspects:

- to apply SkipAhead technique you will need to compute number of variates per one simulation, parameter nskip. If this number is not available (or difficult to compute) for some reasons in advance you might want to use Leapfrog technique

- Leapfrog technique, however, is recommended when number of independent streams k is fairly small

- Use of WH or MT2203 BRNG family could probably be the suitable option for your needs

You can find the details behind each methodology in VSLNotes, http://software.intel.com/sites/products/documentation/hpc/mkl/vsl/vslnotes.pdf, section 7.3.5 and in Intel MKL Manual, Chapter Statistical Functions, Section Random Number Generators, Service Routines, LeapFroagStream and SkipAheadStream and evaluate which of the methodologies fits your environment in the best way.

Also, it might be helpful to have a look at KB article that considers aspects for seed choice for of MKL BRNG initialization, http://software.intel.com/en-us/articles/initializing-Intel-MKL-BRNGs

Please, let me know if this addresses your questions. Feel free to ask more questions on parallelization of Monte Carlo simulations with MKL RNGs, we will be happy to help.

Best,
Andrey

Аватар пользователя Anwar Ludin

Andrey,

Thanks a lot for your feedback and please forgive me for this delayed response. Indeed your explanations clarified things a lot. I will now play around with the various generators using Intel TBB and come back to you with further questions :)

Anwar

Аватар пользователя Anwar Ludin

OOps sorry messed up with the rating system, wanted to give your answer a 4 star and clicked by accident on 2 stars...and now it seems i can t change the rating anymore...but it s definitely a 4 star!

Аватар пользователя Andrey Nikolaev (Intel)

Hello Anwar,
Sure, please, let us know when you have further questions on Statistical Component of Intel Math Kernel Library, we will be glad to help.
Best,
Andrey

Аватар пользователя Anwar Ludin

Andrey, Just to make sure I understood everything correctly. Is the following assertion correct: - Even in the case of one simulation distributed on many cores, I should initialize one stream per thread using leap-frogging or block-splitting. If I share the same stream amongst several threads, then not only it's not thread-safe but I will also introduce correlations in the simulation paths. Correct? Thanks in advance for your answer. Regards, Anwar

Аватар пользователя Andrey Nikolaev (Intel)

Anwar,
To share the same random streamamong several threads in thread-safe and "correlation-free" wayyou would need to manage the access to the random number generationthrough thread syncronization primitives (so, that at any time only one thread uses the stream for the producing random numbers). Otherwise, you would use the stream in not thread-safe manner whichpotentially canput correlations in your results.
To splitone simulation across cores I would suggestto use one of the parallelization approaches supportedby Intel MKL Random Number Generators and shortly described above. Use of those methodologies would also be advantageous from perspective of performance/effective use of multi-core resources and would help to avoid unnecessary thread syncronization.
Thanks,
Andrey

Аватар пользователя Anwar Ludin

Andrey,

OK thanks for your clarifications and thank a lot for your feedback!

Аватар пользователя Anwar Ludin

Andrey,

I'm hitting another stumbling block. I m trying to use the random generators with intel tbb and I m not sure how/when to initialize them correctly. As an example let s consider thecalculation of pi through monte carlo. I would like to use a tbbparallel_reduce algorithm for this. Here's an example i've taken from the tina rng manual that I would like to adapt to mkl. For one thing I can t know in advance how many streams I will need because this is taken care of by tbb. My guesswould be that I need to pass adifferent stream to each splitting constructor, but how can I know in advance how manystreams I willnead? Any help would be greatly appreciated!Regards, Anwar

#include
#include
#include
#include
#include

class parallel_pi {

trng::uniform01_dist<> u; // random number distribution

long in;

const trng::yarn2 &r;

public:

void operator()(const tbb::blocked_range &range) {

trng::yarn2 r_local; // local copy of random number engine

r_local.jump(2*range.begin()); // jump ahead

for (long i=range.begin(); i!=range.end(); ++i) {
double x=u(r_local), y=u(r_local); // choose random x and y coordinates
if (x*x+y*y<=1.0) // i s point in circle ?
++in; // increase thread local counter
} //for
}//operator

// join threads and counters

void join(const parallel_pi &other) {
in+=other.in;
}

long in_circle() const {
return in;
}

parallel_pi(const trng::yarn2 &r) : r, in(0) {

}

parallel_pi(const parallel_pi &other, tbb::split) : r(other.r), in(0) {

}

};

int main(int argc, char *argv[]) {

tbb::task_scheduler_init init;

const long samples=1000000l; // total number of points in square

trng::yarn2 r; // random number engine

parallel_pi pi; // functor for parallel reduce

// parallel MC computation of pi

tbb::parallel_reduce(tbb::blocked_range(0, samples), pi, tbb::auto_partitioner());

// print result

std::cout << "pi = " << 4.0*pi.in_circle()/samples << std::endl;

return EXIT_SUCCESS;

}

Аватар пользователя Andrey Nikolaev (Intel)

Hi Anwar,

I have quick suggestions you might want to have a look at.
1.You have the number of samples whose processing you want to split across threads, samples = 1M. I assume that you can check the maximal number of cores for your CPU, say ncores. So, you can assign the processing of range of size k=samples/ncore to one core.

For TBB blocked range you can specify the grainsize parameter: your blocked_range will not be split into two sub-ranges if the size of the range less than grainsize (please, see Section 4.2.1 of Intel TBB Manual for more details).

If you set the grain size to kyouwill avoid undersubcription (that is, the number of logical threads is not enough to keep physical threads working) and oversubsription (number of logical threads exceeds number of physical threads). On the next step youwould associate VSL Random stream with a thread in the operator()(const tbb::blocked_range &range) by using the boundaries of the range, e.g., idx=range.end())/k - 1. So, on high level the general scheme could look like this (the modification of this scheme are possible):
a) Determine number of cores on CPU ncore
b)Create ncore MKL Random Streams by applying one of VSL parallelization techniques
c)Construct object of type parallel_pi by providing number of cores, number of samples, and array of random streams
d)Compute index idx of the block being processed and obtain random numbers from the stream indexed idx. Use nsamples/ncore for grainsize in blocked_range.

2. Also, please, have a look at Chapter "Catalog of Recommended task Patterns". The methodology described there would allow creating and spawning k tasks; each task would process the random numbers obtained from specific VSL Random Stream created by using one of parallelization techniques.

Please, let me know how those approaches work for you.

Best,
Andrey

Аватар пользователя Anwar Ludin

Hi Andrey, I really want thank you for your time in answering all of my questions :). I tried to implement the second method, ie spawning tasks and using the leapfrog method.I must be doing something wrong because my program crashes just when the rng is called before generating the variates.

#include

#include "mkl_vsl.h"

#include "tbb/task_scheduler_init.h"

#include "tbb/task_group.h"

class PiCalculator {

public:

long numpoints;

long in;

VSLStreamStatePtr stream;

PiCalculator(long numpoints, long& in, VSLStreamStatePtr stream) :

numpoints(numpoints), in(in), stream(stream) {

}

void operator()(){

double variates[2*numpoints]; //we need 2 random variates per point

// crashes here EXC_BAD_ACCESS: Could not access memory

vdRngUniform(VSL_RNG_METHOD_UNIFORM_STD, stream, 2*numpoints, variates, 0.0, 1.0);

for(int i=0; i

double x=variates[i];

double y=variates[numpoints+i];

if(x*x+y*y<=1.0) ++in;

}

};

};

int main() {

int errorcode;

const long samples = 1000000l;

int seed = 1;

int nstreams = 2;

VSLStreamStatePtr stream[nstreams];

for (int i=0; i

{

errorcode = vslNewStream( &stream[i], VSL_BRNG_MCG31, seed );

if(errorcode){

return 1;

}

errorcode = vslLeapfrogStream( stream[i], i, nstreams );

if(errorcode){

return 1;

}

}

tbb::task_scheduler_init init;

tbb::task_group group;

long result1 = 0;

long result2 = 0;

group.run(PiCalculator(samples/2, result1, stream[0]));

group.run(PiCalculator(samples/2, result2, stream[1]));

group.wait();

std::cout << "pi = " << 4.0*(result1+result2)/samples << std::endl;

for(int i=0;i

vslDeleteStream(&stream[i]);

return 0;

}

Аватар пользователя Anwar Ludin

OK reduced the samples from 1M to 100000 and now the program no longer crashes. Also there was a bug in class PiCalculator: variable "in" should be declared as long& and not long. However I still don t understand why the program crashes if the sample size is large? normallyVSL_BRNG_MCG31 has a period of 2^31 which should be enough in this case. Besides from my understanding if the period was not long enough the variates would simply repeat?...

Аватар пользователя Andrey Nikolaev (Intel)

Hi Anwar,

Period of Intel MKL MCG31m1 BRNG should be sufficient for goals of this demo application. However, ifthe application requestsnumberof random variatesthat exceeds its period you would see repeated random numbers in the sequence.
The root of the crash seems to be in the operator() - the size of the buffer which is allocated on stackand is used for random number is 2 * samples * sizeof( double ) = 2 * 1M * 8 = 16 MB.
To avoid the issues with buffer size and to improve perfromance of the application Imodified your code as shown below. The essense of the changes is use of the buffer of the fixed size for random numbers.Size of the buffer is chosen to get the best performance of the application (you would probably need several experiments to choose the best size of the buffer).

void operator()( )
{
const int block_size = 1024;
double variates[2*block_size];
int nblocks, ntail, i, j;

nblocks = numpoints / block_size;
ntail = numpoints - nblocks * block_size;

for( j = 0; j < nblocks; j++ )
{
vdRngUniform(VSL_RNG_METHOD_UNIFORM_STD, stream, 2*block_size, variates, 0.0, 1.0 );
for( i = 0; i < block_size; i++ )
{
double x = variates[2*i + 0];
double y = variates[2*i + 1];
if(x*x+y*y<=1.0) ++(in);
}
}

vdRngUniform(VSL_RNG_METHOD_UNIFORM_STD, stream, 2*ntail, variates, 0.0, 1.0 );
for( i = 0; i < ntail; i++ )
{
double x = variates[2*i + 0];
double y = variates[2*i + 1];
if(x*x+y*y<=1.0) ++(in);

}
}

Also, I holdin memberof the class PiCalculatorby reference:

public:
long& in;

Please, let me know how those modifications work on your side.

Thanks,
Andrey

Аватар пользователя brscth

Hi Andrey, I'm trying to do something similar in Fortran, but am having a bit of a hard time following the above posts (and the vslnotes.pdf file) as I know nothing about C. In particular, I'd like to parallelize a simple simulation using OpenMP. The idea is to generate NSIM independent samples of N observations, calculating some statistic of interest for each sample. For example: PROGRAM MC IMPLICIT NONE INTEGER I, N, NSIM PARAMETER (N=100, NSIM=1000000) DOUBLE PRECISION X(N), MUHAT(NSIM) !$OMP PARALLEL SHARED(MUHAT) PRIVATE(I,X) !$OMP DO DO I=1,NSIM CALL RANDOM_NUMBER(X) MUHAT(I) = SUM(X)/DBLE(N) END DO !$OMP END DO !$OMP END PARALLEL END PROGRAM MC I have no idea whether RANDOM_NUMBER() is actually thread-safe, but, in reality, I would be using some non-uniform (e.g., normal) generator from the MKL anyway. Any suggestions on which BRNG to use and whether to use leapfrogging or blocksplitting (or neither) would be very helpful. To put the problem in context, NSIM will be a very large (1 million), N is medium-sized (100 or 1000), and my machine has 12 cores (i.e., the parallel parts above are spread across 12 threads). Thanks in advance.

Аватар пользователя Anwar Ludin

Hi Andrey, Thanks a lot for your advice! I modified the code accordingly and now everything works like a charm! Your trick of breaking down the simulation in blocks in order to fit in memory is quite nice. I also played around with the number of tbb tasks and I get good speedup results with 50 tasks. Simulation runs in 70 sec with one task and in 19 sec using 50 tasks. Looking at my cpu bar I can see that all processor cores are used while running the simulation, showing that tbb correctly distributes the tasks on all physical threads. Here's the final version of the code with your modifications in case anyone wants to play around with it:

#include

#include "mkl_vsl.h"

#include "tbb/task_scheduler_init.h"

#include "tbb/task_group.h"

#include "tbb/tick_count.h"

class PiCalculator {

public:

long numpoints;

long& in;

VSLStreamStatePtr stream;

PiCalculator(long numpoints, long& in, VSLStreamStatePtr stream) :

numpoints(numpoints), in(in), stream(stream) {

in = 0; // make sure to initialize to zero.

}

void operator()() {

const int block_size = 2048;

double variates[2 * block_size];

int nblocks, ntail, i, j;

nblocks = numpoints / block_size;

ntail = numpoints - nblocks * block_size;

for (j = 0; j < nblocks; j++) {

vdRngUniform(VSL_RNG_METHOD_UNIFORM_STD, stream, 2 * block_size,variates, 0.0, 1.0);

for (i = 0; i < block_size; i++) {

double x = variates[2 * i + 0];

double y = variates[2 * i + 1];

if (x * x + y * y <= 1.0)

++(in);

}

}

vdRngUniform(VSL_RNG_METHOD_UNIFORM_STD, stream, 2 * ntail, variates,0.0, 1.0);

for (i = 0; i < ntail; i++) {

double x = variates[2 * i + 0];

double y = variates[2 * i + 1];

if (x * x + y * y <= 1.0)

++(in);

}

};

};

int main() {

int errorcode;

const long samples = 10000000000l;

int seed = 1;

int tasks = 50;

VSLStreamStatePtr stream[tasks];

for (int i = 0; i < tasks; i++) {

errorcode = vslNewStream(&stream[i], VSL_BRNG_MCG59, seed);

if (errorcode) {

return 1;

}

errorcode = vslLeapfrogStream(stream[i], i, tasks);

if (errorcode) {

return 1;

}

}

tbb::task_scheduler_init init;

tbb::task_group group;

long results[tasks];

long samplesPerTasks = samples/tasks;

tbb::tick_count t0 = tbb::tick_count::now();

for (int i = 0; i < tasks; i++) {

group.run(PiCalculator(samplesPerTasks, results[i], stream[i]));

}

group.wait();

tbb::tick_count t1 = tbb::tick_count::now();

long result = 0;

for(int i=0;i

result += results[i];

}

std::cout << "pi = " << 4.0 * result / samples << std::endl;

std::cout << "time [s]: " << (t1-t0).seconds();

for (int i = 0; i < tasks; i++)

vslDeleteStream(&stream[i]);

return 0;

}

Аватар пользователя Andrey Nikolaev (Intel)

Hi Brscth,

At first glance, the problem you solve is similar to the problem Anwar has earlier described. You need to parallelize Monte Carlo simulations, and process random numbers using some algorithm(say, compute some statistics).

The dimensions you mentioned in the postindicatethat MKL Random Number Generatorstogether with OpenMP* look suitable choice for your simulations. The methodology for parallelization of the simulations with Intel MKL Random Number Generators is "language independent".Before choosing type of Basic RNG and parallelization methodology you would need to better understand requirements to the generator:
1. How many numbers would you need (especially if number of simulationswould potentially increase)?

2. What are the performance requirements?RNG performance data available at
http://software.intel.com/sites/products/documentation/hpc/mkl/vsl/vsl_performance_data.htm
could be useful to dosome perfromance estimatesfor your codewith different types ofRNGs.

3. Number of cores/threads you plan to use today (or even tomorrow).

4.Any other aspects that reflect specifics of your problem.

The list of the requirements would help you to choose MKL BRNG that meets requirements of your problem.If you choose MT2203BRNG which supports up to 6024 threads (in MKL 10.3.3)and has period~10^663 youmight wanttohave array of ncore VSL Random streams (ncore is 12 in your case). Each of them is initizalized in the usual way by means of Fortran version of the function NewStream vslnewstream. Youthen need to split NSIM simulations acrossncore threads that is, assign blocks of simulations to each thread. Assume, for simplicityNSIM=120. Then random stream #1 would serve block of simulations 1-10, second - block of 11-20, that is number of simulations per core is 10. Using RNG you will generate arrays of the observations in each block.On high levelit would look like this

do i=1:ncore
status = vslnewstream( stream(i), VSL_BRNG_MT2203+i, seed)
do j=1:sim_per_core
status=vdrnggaussian( VSL_RNG_METHOD_GAUSSIAN_ICDF, stream(i), n, r, a, sigma )
process array r of size n=1000
end do
status = vsldeletestream( stream(i) )
end do

Please, note that it would be more effective from perspective of performance to call VSL RNGs on vector lengths like few thousands. If number of observations is 100 you might want to groupseveralvectors of the observationsinto one call to the generator.

If you choose BRNG with skip-ahead or leapfrog feature for your simulations, e.g. MCG59BRNGthe computations would look similar.The only change is initialization by means of service functions vslleapfrogstream or vslskipaheadstream. MKL installationdirectory contains example/vslfwith Fortran examples that demonstrate RNG use; Skip-Ahead and Leap-Frog features are among them.

If you need to compute simple statistical estimates like mean or covariance you might want to do it with Summary Statistics ofVSL. As VSL RNG it provides Fortran API and can be called from your application. Please, note that Summary Stats routines incorporate threadingfeature while processingthe dataset of size p x n when appropriate.

Please, let me know when/if you parallelize the computations or ask more questions on Statistical Feature of MKL.

Best,
Andrey

Аватар пользователя Andrey Nikolaev (Intel)

Hi Anwar,
It is great to know that you got speed-up of the app with MKL RNGs,TBB on multi-core CPU.
Please, feel free to ask questions on Stat Component of Intel MKL, we would be ready to discuss and help.
Best,
Andrey

Аватар пользователя Anwar Ludin

Hi there,

I don't know if this can be useful, but here's a simple C program using OpenMP that I guess you could easily translate to Fortran. Here's a quick overview of what it does. First I start by finding how many threads I have available using omp_get_numprocs(). This will give me the number of random streams to initialize using an array of VSLStreamStatePtr. Then each stream is created and "configured" using the leapfrog method. In the parallel region I use once again the OpenMP runtime function omp_get_thread_num
to determine the treadid. This will give me the index of the corresponding stream to use from the array of streams previously initialized. I m also doing a toy reduction but this really could be anything you decide to calculate using the variates. The result of the reduction is available at the end of the parallel region and printed to the console.

[C]#include 
#include "omp.h"
#include "mkl_vsl.h"

int main() {

	int seed = 1;
	int tid;
	int numthreads = omp_get_num_procs();
	VSLStreamStatePtr stream[numthreads];
	int variatesPerThread = 5000;
	double variates[variatesPerThread];
	double a = 1.0;
	double sigma = 0.2;

	int result = 0;

	for (int i = 0; i < numthreads; i++) {
		int errorCode=0;

		errorCode = vslNewStream(&stream[i], VSL_BRNG_MCG59, seed);
		if(errorCode){
			printf("vslNewStream failed\n");
			return 1;
		}
		errorCode = vslLeapfrogStream(stream[i], i, numthreads);
		if(errorCode){
					printf("vslLeapfrogStream failed\n");
					return 1;
		}
	}

	#pragma omp parallel private(tid, variates) reduction(+:result)
	{
		tid = omp_get_thread_num();
		// generate the random samples and do something interesting.
		vdRngGaussian(VSL_RNG_METHOD_GAUSSIAN_ICDF, stream[tid],variatesPerThread, variates, a, sigma);
		// reduce the result
		result = result+tid;
	}
printf("result is: %d", result);
for (int i = 0; i < numthreads; i++)
		vslDeleteStream(&stream[i]);
	
}
[/C]
Аватар пользователя brscth
Hi Andrey, That is very helpful - thank you. I have included my modified code below, but still have a few questions. (1) Is it fine to set SEED as a shared scalar that is initialized outside my OMP directives? Other options would be to (a) set it is a private scalar that is initialized within each thread, or (b) set it a shared vector (of length NCORE) that is initialized outside my OMP directives. (2) Similar to (1), is it fine that STREAM is set as a private variable? I guess these questions have more to do with OMP, but I'm not sure how to deal with them when using the VSL. (3) Can I delete each stream independently? On the second-last line of your code snippet, you had status = vsldeletestream( stream ) but I think status = vsldeletestream( stream(i) ) would be more appropriate. More generally, am I correct to think that methods like skip-ahead or leapfrog would only be useful if there were more threads than streams? In my case, since the number of threads and streams are both equal to NCORE, I can't see any benefit to such methods. Thanks again! PROGRAM MC IMPLICIT NONE INCLUDE "mkl_vsl.f77" INTEGER N, NSIM, NCORE PARAMETER (N=1000, NSIM=100000, NCORE=10) c make sure that NSIM/NCORE is an integer INTEGER STREAM(NCORE), SEED, I, J, STATUS DOUBLE PRECISION X(N), MUHAT(NCORE,NSIM/NCORE), MU, SIGMA SEED = 42 MU = 0.0D0 SIGMA = 1.0D0 CALL OMP_SET_NUM_THREADS(NCORE) !$OMP PARALLEL DEFAULT(PRIVATE) SHARED(SEED, MU, SIGMA) !$OMP DO DO I=1,NCORE STATUS=VSLNEWSTREAM(STREAM(I),VSL_BRNG_MT2203+I,SEED) DO J=1,(NSIM/NCORE) STATUS=VDRNGGAUSSIAN(VSL_RNG_METHOD_GAUSSIAN_ICDF, & STREAM(I),N,X,MU,SIGMA) c now do whatever with x END DO STATUS=VSLDELETESTREAM(STREAM(I)) END DO !$OMP END DO !$OMP END PARALLEL END PROGRAM MC
Аватар пользователя Andrey Nikolaev (Intel)

Hi, thanks for the questions.

1. I do not expect any issues if seed would be a shared scalar - in the threads the library reads this value once to initialize the state of the basic random generator. Also, as another option you could initialize the array of streams prior toyour OpenMP directives, in serial part of the program.

2. It is important that i-the entry of the array stream is assigned to threadindexed i; andthe expectation is that i-th thread/core would use only i-th entry (stream)ofthe array to produce random numbers and will not use j-th entry.From this perspective, the array can be shared among the threads.

3. Yes, you can delete i-th stream in i-th thread independently. You are correct - Imodified my previous post to have status=vsldeletestream( stream(i) ).

4. Choice of the parallelization methodology (Skip-ahead, Leap-Frog, BRNG family) is entirely defined by you and requirements of your problem. In some cases (due to specifics of the problem)it might be moresuitable to useLeap-Frog or Skip-Ahead methodologies for parallelization of Monte Carlo simulations. How many streams (or blocks) to use inyour environementwill be definedon your side - youinitialize as manyIntel MKL Random Streams as you need/want.If the example we are considering for 12 core based computer you could set number ofVSL random streamsto 6 if, say, you plan to use the rest cores for other computations.

Please, let me know if this answers your questions.

Best,
Andrey

Аватар пользователя brscth

THanks Andrey, this has been very helpful.

Аватар пользователя Andrey Nikolaev (Intel)

You are welcome. Please, let me know if you have any questions on Statistical features of Intel MKL.
Best,
Andrey

Аватар пользователя Matthias R.

Similar Parallel Simulation Problem

Hi, I need to find the default probability of a derivative via Monte-Carlo simulation with C++ in VS2010 with Intel TBB and MKL installed and only 1GB of memory.

Let S(t) denote the price of the derivative at time t. The current price is S(0) = 100.

For simplicity, the derivative is defined by (the real derivative is more complex):
With a probability of 99 percent S(t+1) = S(t) * exp(X), with X ~ N(mu=0, sigma=0.01)
With a probability of 1 percent the derivative jumps down to S(t+1) = S(t) * 0.4

If the derivative falls below 10 somewhere in 0 < t <=250 the following happens:
With ha probability of 80 percent the price of the derivative at S(t) is set to 1.5 * S(t) or with a probability of 10 percent we will have a default.

Let's say I need to run at least 10mn simulations of this derivative and count all losses.
Then my simulated default probability is #defaults / 10mn. The serial code somehow looks like:

for j = 1 to #simulation/{
for t = 1 to 250{
generate S(t+1)
check if S(t+1) defaults
}
}

How do I parallelize the code?
What is the best strategy for the random number generation? I cannot generate the random numbers of type double a priori, because 10mn * 250 * 2 (at least) = 500mn random number * 8Byte = 4GB.

Is it a good idea to dived the number of simulations into chunks of 10 * number of processors?

VSLStreamStatePtr stream[10*#processors];
for i = 1 to 10*#processor{
vslNewStream( &stream[i], VSL_BRNG_MT2203+i, seed );
}

tbb_parallel_for i = 1 to 10*#processors{
use stream[i] for random number generation
generate 10mn / (10*#processor) * 250 random numbers ~ N(0, 0.01) and store them in a vector.
generate 10mn / (10*#processor) * 250 random numbers ~ Bernoulli(0.01) and store them in a vector.
generate 10mn / (10*#processor) * 250 random numbers ~ Bernoulli(0.01) and store them in a vector.
for j = 1 to #simulation/(10*#processors){
use
for t = 1 to 250{
generate S(t+1) using the vectors filled with random numbers
check if S(t+1) defaults
}
}
}

Any help would be appreciated...
Matt

Аватар пользователя Andrey Nikolaev (Intel)

Hi Matt,

there are several quick observations and thoughts are below:

1. The blocking approach you look at is worth to try, but could be further improved in terms of memory usage and performance. Say, if your system has 8 processors, then you would need 1mn / 8 * 250 * 8 ~ 250 MB for Gaussian numbers. You also need Bernoulli numbers what means additional memory.

2. To split simulations between processors it makes sense to keep in mind number of processors, cores on each processor, size of last level cache, in addition to size of memory (I assume you use shared memory computer). I would also pay attention to 1st level cache. The important aspect is that your threads should not "compete" for cache.

3. It appears that second set of Bernoulli numbers is used once S(t) < 10 at any time t. It means that number of Bernoulli variates required to process one path is not exactly 250. It can be smaller than 250, say 1,2 or even zero (we do not know in advance). This reveals opportunity to save memory and random numbers. The same is the case for Gaussian numbers (howerver, we expect that 99% of Gaussian array will be used).

So, I would imagine this approach (which, of course, does not exclude other options, as missed details can be also important).
Let M - number of cores (we also can dettach M from number of cores and try smaller or larger values)

for (i = 0; i < M; i++)
{
vslNewStream( &stream[i], VSL_BRNG_MT2203+i, seed );
}

paths = 10000000;
paths_per_core = paths / M;
// for simplicity assume that paths % M = 0; if this is not the case, the remaining paths should be properly processed

const int T = 250; // number of Gaussian numbers per one path
const int K = 4; // number of paths generated per one call to Gaussian RNG. We need to try different values from 1 till several units
const int N = 1024; //number of Bernoulli of random numbers used to simulate a default. Can try different values.

int b_idx = N;

int paths_per_core1 = path_per_core / K;
// remember about case path_per_core % K <> 0

// this loop admit parallelization using OMP or TBB, etc
for ( i = 0; i < M; i++ )
{
double rd[T*K];
int ri[T*K];
int ri1[N];
// total amount of memory for random numbers is M*(2*T*K+N), that is, ~108 KB for 8 core machine

// loop over paths assigned to i-th core
for ( j = 0; j < paths_per_core1; j++ )
{
// block of calls to RNGs
status = vdRngGaussian( VSL_RNG_METHOD_GAUSSIAN_ICDF, stream[i], T*K, rd, a, sigma );
status = viRngBernoulli( method, stream, T*K, ri, p );

// loop over K paths
for ( kk = 0; kk < K; k++ )
{
// loop over time
for (t = 0; t < T; t++ )
{
// use kk block of arrays rd and ri to simulate price S
S[t+1] = ...
if ( b_idx >= N )
{
status = viRngBernoulli( method, stream, N, ri1, p1 );
b_idx = 0;
}
def = ri1[b_idx];
if ( def < 0.8 ) ...
}
}
} // for ( j = 0; j < paths_per_core1; j++ )

if ( paths_per_core1 * K < path_per_core )
{
// process remaining paths similarly to the algorithm above
}
}

Another quick observation: if you used Intel(R) C++ compiler the loop over time could be vectorized, at first glance, and you would get additional performance speed-up. However, this is the case only if you would generate maximal number of Bernoulli variates for a default in advance in the block of calls to RNGs (both calls to Bernoulli RNG can be combined in one). In this case the loop over time would not contain any calls to MKL functions.

As your algorithm uses exp() function, you also might want to exploit opportunity to use vector math functions available in Intel(R) MKL.
The algorithm appears to estimate probability of a default, and one possible way to do this is to use Intel(R) MKL Summary Stats aglorithm for comutation of mean.

Please, let me know if this help or you have additional questions.

Thanks,
Andrey

Аватар пользователя Matthias R.

Hi Andrey,

thank you so much for your very helpful reply. I followed some of your instructions and finally divided the number of simulations into chunks of 100,000 and stored all random numbers needed for one chunk in vector<double> data types.
Furthermore, I introduced another serial for loop (serial_iter) to avoid memory problems...

Code:
const int number_iterations =  number_sims / 100000;
const int processors = 20;
const int number_sims_chunk = number_sims / processors / number_iterations;
VSLStreamStatePtr stream[6024];
for (int i = 0; i < 6024; ++i) {
     vslNewStream( &stream[i], VSL_BRNG_MT2203+i, seed );
}
for (int serial_iter = 0; serial_iter < number_iterations; ++serial_iter) {
     #pragma omp parallel for
     for (int p = 0; p < processors; ++p) {
          //Random numbers used for standard case (99% of these random numbers are used in code)
          vector<double> vecUniform(number_sims_chunk * number_days, 0.0);
          vector<double> vecGaussian(number_sims_chunk * number_days, 0.0);
          //Random numbers needed for some check
          //(< 1% of these random numbers are used in code)
          //--> need to be improved
          vector<double> vecUniform2(number_sims_chunk * number_days, 0.0);
          //Some other Random numbers only needed for postprocessing of default (again < 1% )
          vector<double> vecGaussian3(number_sims_chunk * liq_period, 0.0);
          vector<double> vecUniform3(number_sims_chunk * liq_period, 0.0);  
          vdRngUniform( VSL_RNG_METHOD_UNIFORM_STD_ACCURATE, stream[serial_iter * processors + p], vecUniform.size(), &vecUniform[0], 0.0, 1.0 );
          vdRngGaussian( VSL_RNG_METHOD_GAUSSIAN_ICDF, stream[serial_iter * processors + p], vecGaussian.size(), &vecGaussian[0], 0.0, 1.0 );
          vdRngUniform( VSL_RNG_METHOD_UNIFORM_STD_ACCURATE, stream[serial_iter * processors + p], vecUniform2.size(), &vecUniform2[0], 0.0, 1.0 );
          vdRngUniform( VSL_RNG_METHOD_UNIFORM_STD_ACCURATE, stream[serial_iter * processors + p], vecUniform3.size(), &vecUniform3[0], 0.0, 1.0 );
          vdRngGaussian( VSL_RNG_METHOD_GAUSSIAN_ICDF, stream[serial_iter * processors + p], vecGaussian3.size(), &vecGaussian3[0], 0.0, 1.0 );
          for (int z1 = 1; z1 <= number_sims_chunk; ++z1) {
                 serial code with variable loss changed from double data type to tbb::atomic<double> to sum up losses
                 ...
               

The output was a nice Excel library (xll) containing one function. Calling the function from Excel gave me a speedup of 5.5 on two Xeon X5650 in comparison to the serial code.
I was happy to see Excel using 11 cores at rougly 80% for a couple of minutes :-)
Unfortunately, I cannot use the Intel(R) C++ compiler in my company (we only have the MKL library avaialble) and couldn't figure out how to do the memory allignment needed for SIMD (SSE) in VS2010.
Do you have any ideas how to vectorize my code?
Can I make the following run faster using other data types having memory allignment?
      vector<double> vecUniform(number_sims_chunk * number_days, 0.0);
      vdRngUniform( VSL_RNG_METHOD_UNIFORM_STD_ACCURATE, stream[serial_iter * processors + p], vecUniform.size(), &vecUniform[0], 0.0, 1.0 );
    
Anyways, thank you so much again for your help.
All the best, Matt

Аватар пользователя Andrey Nikolaev (Intel)

Hi Matt,

I'm  glad to know that you developed the algorithm which uses HW resources in the effective way, excellent progress!

To my understanding, in many (not all) cases Monte Carlo simulations can demonstrate close to linear scalability over number of cores. You mentioned about 5.5x speed-up on ~11 cores. Does it mean that your code still spends visible amount of time in serial sections? Which scalability do you expect for your algorithm, say 10x for 11 cores or different?

One possible way to vectorize the code in your environment is to use SSE2 intrinsics, see, for example, http://msdn.microsoft.com/en-us/library/9b07190d(v=vs.80).aspx . However, intrinics based approach is not "scalable": once you want to migrate your code to Intel(R) AVX with 256 bit registers or try Intel(R) Xeon Phi with 512 bit vector registers, you would need to re-develop your vectorized code again to make effective use of wider vector registers. Intel(R) C++ compiler does this work for you.

Yes, please, apply memory alingment to your arrays when ever possible (however, I'm not sure that this would make the application faster in that specific part of the code).

You also might want to try VSL_RNG_METHOD_UNIFORM_STD method for generation of uniform random numbers only if you do not have reasons for use of the accurate method.

Please, let me know if you have more questions.

Thanks,

Andrey

Аватар пользователя Ray

Hi Andrey

I tried running an example with independent streams in Intel Fortran 12.1.5, and I'm a little surprised by the results. Why are the STARTING bolded values from streams 1 and 5, and the italicized values from streams 2 and 4 so similar? Is this normal? The output and the code are given below:

Thanks,

Ray

stream 1
0.36630
0.78807
0.40149
1.61842
1.04421

stream 2
-0.50169
0.86743
-1.32459
-0.29372
-0.16414

stream 3
-0.52681
0.84276
0.85682
-0.25308
-0.17656

stream 4
-0.50169
0.86565
0.92212
-0.09324
-0.00586

stream 5
0.36630
0.78807
0.23478
1.61870
0.69967

And here's the code

include 'mkl_vsl.f90'
include "errcheck.inc"
program MKL_VSL_TEST
USE MKL_VSL_TYPE
 USE MKL_VSL
 integer, parameter :: n = 1000, nstreams = 5
 integer(kind=4) errcode
 real(kind=4) :: a,sigma,r(n)
 integer :: brng,method,seed
 integer :: loopCount
 TYPE (VSL_STREAM_STATE), dimension(nstreams) :: stream
 brng=VSL_BRNG_MT2203
 method=VSL_RNG_METHOD_GAUSSIAN_ICDF
 seed=777
a=0.0
 sigma=1.0
do loopCount = 1,nstreams
 ! ***** Initialize *****
 errcode=vslnewstream( stream(loopCount), brng + loopCount-1, seed )
 call CheckVslError(errcode)
! ***** Call RNG *****
 errcode=vsrnggaussian( method, stream(loopCount), n, r, a, sigma)
 call CheckVslError(errcode)
! *******Print Stuff *****
 print '(A,i2)','stream',loopCount
 print '(f8.5)', (r(i), i = 1,5)
 print *
! ***** Deinitialize *****
 errcode=vsldeletestream( stream(loopCount) )
 call CheckVslError(errcode)
 end do
end program MKL_VSL_TEST

Further, I need to run one indepedent Markov Chain Monte Carlo sequence per stream. The problem is that beforehand, I do not know exactly how many random numbers I will need (perhaps I can calculate a minimum number of random numbers needed) in each chain (stream). Is using the vector statistical library still the right approach?

Аватар пользователя Andrey Nikolaev (Intel)

Hi Ray,

Your first question on repeated random numbers returned by MT2203 BRNGs is discussed in the article http://software.intel.com/en-us/articles/initializing-Intel-MKL-BRNGs

Answer to your second question - yes: you can use streams of independent random numbers generated by MT2203 BRNGs to produce your MCMC. If, for some reasons, MT2203 approach does not work, you might want to try, for example, MCG59 properly initialized with leapfrog service function. The specifics of leapfrog blocking approach is that you do not need to know in advance how many random numbers would be necessary for one path/chain. The parameter you need to provide to the function is a stride which is basically a number of threads you are going to use to parallelize your application.

Please let me know if you need more details.

Andrey

Аватар пользователя Ray

Hi Andrey,

Thanks for your reply and the link explaining the specific nature of certain seeds like 777. I can initialize the streams with a different seed to get widely dispersed starting values. I will proceed now based on your advice on the following matter. For a Markov chain decisions to make a move to the next sampling point is made by the use of ONE random number, based on a ratio of probabilities of the previous to the current sampling point. What scheme would you recommend I use to generate ONE random number at a time, for each independent sequence? Also, I intend to use openMPI or something like it to run the code in parallel on at least 56 cores on a cluster.

This is a most useful thread. Many thanks in advance for your help.

Ray

Аватар пользователя Andrey Nikolaev (Intel)

Hi Ray,

I would not recommend to use Intel(R) MKL RNG in a scalar mode, that is when vector size is one (see, for example, the graph http://software.intel.com/sites/products/documentation/doclib/mkl_sa/11/vsl/functions/uniform_cont.htm which shows how performance of the generator depends on vector size). In other words, to benefit from performance of Intel(R) MKL RNGs you need to generate array r[] of at least several hundreed random numbers:

vdRngUniform( method, stream, n, r, a, b );

 Once random numbers are available in array r you can process them in vector way:

for ( i = 0; i < n; i++ ) {

   if ( r[i] < p[k-1] / p[k] ) { chain[j] = state[k]; j++; k++; ...}; // ratio of p[k]'s can be computed in advance

}

If you need N chains (say, N > Ncores) split them between cores by assigning ~ N / Ncores chains to each core. In its turn, each core should be supplied by source of random numbers, which are used to generate those chains. One possible option is to use Ncore streams of MT2203, each of them would serve a different core (maximal number of streams supported by MT2203 is ~6K). Use of random numbers and blocking technique described above you would be able to generate chains, one by one on each core, and all cores would produce blocks of the chains in parallel.

Of course, there should be more options to parallelize your MCMC code, choice of the most suitable would be defined by requirements and details of your app, for example

- would number of nodes/cores and/or number of paths increase in future?

- do you expect the the output would not change numerically when you add additional cores/nodes to your cluster?

- which properties (in particular, period) do you need from source fo random numbers?

- etc...

Please, let me know, if this helps or more details are necessary to discuss.

Andrey

Аватар пользователя Ray

Hi Andrey,

From what I understand then, I should pre-initialize the random numbers for each stream (one stream per Markov chain) in an array, and then use these numbers in each step of the MCMC chain. I do indeed intend to use Ncore streams of the Mersenne Twister. It is just that I will have to be a little smart about how many numbers per stream (where Ncores = Nstreams = N_parallel_chains) to initialize in each chain, because I have more than one probable move per step - and sometimes the ratio does not need to be calculated if the proposed candidate sample is outisde the prior probability bounds. However, this is not a major issue as I can safely calculate the maximum number of random numbers per chain (~10^7) I will need, and can have them pre-initialized in an array (per chain) as you suggested.

About the questions you raise:

1) I do not expect the number of cores to go up significantly beyond 96 cores

2) I do not expect the output to change significantly on adding more than 96 cores

3) I am not sure I need an extremely large period, as this is not purely random Monte Carlo sampling, but Markov Chain Monte Carlo sampling which is a directed random walk, the directions being given by a model likelihood function. If the starting models are dispersed widely enough, then a period as low as 10^7 (per chain) is fine for me.

Thanks,

Ray

Аватар пользователя Andrey Nikolaev (Intel)

Hi Ray,

Based on the additional details you provided MT2203 streams appear to be reasonable way to start parallelization of your MCMC. I wonder if you would able to apply the blocking similar to what I described earlier? Also, another question is about number of chains - what are the reasons you correlate number of MCMC chains and number of cores, Ncores =  N_chains? What if number of chains you would need is much greater than number of available cores?

Andrey

Аватар пользователя Ray

Hi Andrey,

Perhaps the blocking technique will be useful, but I just tried the separate streams approach you outlined above (pre-initialize the random numbers before MCMC): Separate streams are quite easy to implement. For each chain, I implemented 2 separate streams, one with vRngUniform, and the other with vRngGaussian, as I needed both uniform and normally distributed Gaussian numbers (see code below)

I correlate the number of chains and number of cores (N_cores=N_chains) because we're working on massive clusters where a lot of cores are available at any one time. Also, we're working with geophysical problems where one independent chain on one core takes a day to finish. Running more than one chain on one core will take too long, I suspect.

Thanks for all your help,

Ray

subroutine init_intel_random(brng,seed,nProcs,streamU,streamN,methodU,methodN,rU,rN)
integer, intent(in) :: nProcs
real(RealPrec), intent(out), dimension(:,:) :: rU, rN
integer, intent(out) :: brng,seed,methodU,methodN 
TYPE (VSL_STREAM_STATE), intent(out), dimension(nProcs) :: streamU, streamN
integer :: loopCount, nRandsU, nRandsN
integer(kind=4) :: errcode
brng=VSL_BRNG_MT2203 !starting stream
methodN=VSL_RNG_METHOD_GAUSSIAN_ICDF !Method for Normal variates
methodU=VSL_RNG_METHOD_UNIFORM_STD !Method for Uniform variates
seed=1
nRandsU = size(rU,1) !Number of Uniform variates
nRandsN = size(rN,1) !Number of Normal variates
!for the Uniform random variables
do loopCount = 1,nProcs
!Initialize nProc Streams
errcode = vslnewstream( streamU(loopCount), brng + loopCount-1, seed )
call CheckVslError(errcode)
!Initialize in each column of rU, nRandsU uniform variates
errcode=vdrnguniform( methodU, streamU(loopCount), nRandsU, rU(:,loopCount), 0d0, 1d0)
call CheckVslError(errcode)
end do
!not the smartest way, but least confusing
!now shift the stream number nProcs higher for the Normal Random Variates
brng=VSL_BRNG_MT2203 + nProcs
!for the Normal random variables
do loopCount = 1,nProcs
!Initialize the next nProc Streams
errcode = vslnewstream( streamN(loopCount), brng + loopCount-1, seed )
call CheckVslError(errcode)
!Initialize in each column of rN, nRandsN standard normal uniform variates
errcode=vdrngGaussian( methodN, streamN(loopCount), nRandsN, rN(:,loopCount), 0d0, 1d0)
call CheckVslError(errcode)
end do
end subroutine init_intel_random

Аватар пользователя Andrey Nikolaev (Intel)

Hi Ray, it is good to know that you implemented separate stream approach. Please, let me know when/if further help on RNGs/other Intel(R) MKL functionality is necessary from our side. Andrey  

Зарегистрируйтесь, чтобы оставить комментарий.