Generating thread safe random number generator for TBB

Generating thread safe random number generator for TBB

Hello,

I have a function object for parallelizing a for_each() algorithm using Thread Building Blocks,The function object uses a random number generator RND whose operator method () generates a UNIFORM RANDOM number between 0 and 1

Problem: I need a random number number generator to

1) initialize only once in the function object

2) should be threadsafe and

3) can be provided same seed so that same results could be obtained.

I dont know much about generating thread safe random number generators in function objects as such. I tried using my own random generator class (using engine, distribution and generators) (using boost libraries) but I need something simple such as erand() ? or something like that for which we dont need to do write separate code.

 

    struct func {
     public:
      func() {  }
      func(int t_) : t(t_) {  }  

     template <typename house_t>
     void operator()(house_t  house)  const  {

       if ( RND() )
       {  //do something with the house }

     } //operator

     private:
       int t;
     };


     //Parallel for_each
     tbb::parallel_for_each( house.begin(), house.end(), func(t) );

 

As per my understanding of TBB, the function object is used by every thread (as and when it gets scheduled by the task scheduler), however, if the seed of that random number generator is same, should it produce same results ? Can anyone provide or suggest any such random generator

8 posts / 0 new
Last post
For more complete information about compiler optimizations, see our Optimization Notice.

If you won't be able to reproduce the distribution of work over threads, or the mutual timing of the threads, probably the best you can do is share a generator and get the same overall set of values, but would that be enough for your purposes? Also, you'd have to disable it for production, because sharing that generator would not be scalable.

Hi Raf,

To share the generator amongst the threads, I wrote the following code with two possible options 1. erand(48)  and 2. BOOST . Here is the code, (it contains both the approaches.) Can you please tell which is is correct or more reliable ?

 

#include <boost/random/variate_generator.hpp>
#include <boost/random/uniform_real.hpp>
#include <boost/random/mersenne_twister.hpp>

#include <cstdlib>

class uniformReal {
  public:
    //Approach 1 
    typedef boost::mt19937                          ENG; //engine
    typedef boost::uniform_real <>                  UDIST; //distribution
    typedef boost::variate_generator <ENG, UDIST>  UGEN; //generator

    uniformReal (int a, int b) : engine(100), range(a,b), generator(engine, range)  //fixed seed 100 (can change)
        {  
           //approach 2 
           initialize();
        }

inline double operator()()  {
    return erand48(xsubi);      // APPROACH 2
    return generator();              // APPROACH 1
 }

 void initialize()  {
   struct timeval t;
   gettimeofday(&t, 0);
   xsubi[0] = 1;   //for fixed seeding
   xsubi[1] = 2;
   xsubi[2] = 3;
    
   //xsubi[0] = time(0);  //for random seeding
   //xsubi[1] = t.tv_usec ^ t.tv_sec;
   //xsubi[2] = t.tv_usec ^ t.tv_sec * time(0);
  }

private:
 //Approach 1
 ENG   engine;
 UDIST range;
 UGEN  generator;

 short unsigned int xsubi[3];  //APPROACH  2
};




In my function object, I instantiate the generator as follows:

#include "my_random_class.h"  //above class

struct func {
 public:
  func() {  }
  func(int t_) : t(t_), RND(0,1)  {  }   

 template <typename house_t>
 void operator()(house_t  house)  const  {

   double p = probability(house);

   if ( p > RND() )
   {  //do something with the house }

 } //operator

 private:
   uniformReal  RND;
   int t;
 };

 

 

In the above code, if a computed probability function for house object, is greater than RND(), then do something in the loop. The RND is now shared amongst the threads. Is this valid ? The reason I ask is, I fixed the seed (for above both random generators) and I did a parallel_for_each() for my container. The result was same as for that of serial (1thread) or (std::for_each)   implementation.  Although the generator is shared now amongst the threads, is it a coincidence, that I get the same result for serial and parallel ? Or the calling of RND() is serialized in this case, so it affects performance,

Here are my performance numbers for 6 core Xeon processor for 1-6 threads.

threads
time (sec)
speed up

1
267.68
1

2
136.08
1.9670781893

3
91.93
2.9117807027

4
68.97
3.881107728

5
57.07
4.6903802348

6
48.57
5.5112209183

 

For this, the result is fairly good,

 

My only concern now is correctness - i.e. if 1 thread and 4 thread implementations yield same result, is the random number implementation correct? Is it call to operator()()  serialized ? If yes, how do I get over it?

If you would please help me out in this, that would really be great.  If you need any more information, I will be happy to provide.

Also, could you please explain why sharing the generator is not scalable  ? and how to get around with it ? I have 2 solutions in mind (a) supply a container full of random numbers and store it in the shared function object as reference. let each thread use one random number.
and (b) in the operator()()  define your own temporary random number generator --> this would affect performance I believe in the long run. (Any other ideas ?)

 

thanks a lot

Aniket

 

threads    time    speed up
1          267.68    1
2          136.08    1.9670781893
3           91.93    2.9117807027
4           68.97    3.881107728
5           57.06    4.6903802348
6            48.57    5.5112209183

updated performance number table

If each set of related boost random instances is only ever accessed by one thread, or similarly synchronised, then that is enough. There is no global common state that would prevent such use (or it is internally protected if there should be, but that seems unlikely).

As for reproducibility, if each thread starts from a known seed, then the mutual order of the thread executions does not matter.

So if you have parallel_for_each() where each item gets its own random data from a fixed seed, you're fine. If you have a generator that is used by different tasks in a parallel_for_each() or a different algorithm, however, then you have to use a mutex for correct operation, but you won't have reproducibility anymore. The former seems to be the case here.

If you use fixed seeding in each thread, you have to take into account that there might be interactions between the generated data, so you might want to start with a distribution of seed data from a separate generator to each parallel_for_each() element, probably before parallel_for_each() is invoked.

In the situation as I understand it now, it would be bad to share a generator, for reproducibility as well as scalability. Scalability would be hurt because all generating work has to be serialised, so unless you can have one thread fill a queue of random data that is cheaply distributed to its consumers, the threads have to wait for each other. Furthermore, synchronisation itself has a cost even without contention. Reproducibility would be lost compared to the situation where each parallel_for_each() element has its own generator.

tbb::parallel_for_each may simultaneously apply the same function object to multiple items. I.e. in the code above it's possible that multiple calls to RND() are performed simultaneously. Thus, the implementation is as thread-safe as its underlying random number generator - erand48() or boost::variate_generator. You need to read corresponding documentation to know their thread safety guarantees. I tend to think that typically pseudo-random number generators have some modified state; but I am not an expert in RNG so take it with a grain of salt please, and do your own search. If this is correct, than either it's not thread-safe or there is internal synchronization; perhaps the latter, i.e. calls to RNG() are likely serialized (maybe partially). 

What's perhaps worse for you is that, as the same generator is shared between threads, the implementation is not deterministic. I think the results for single-threaded and multi-threaded runs being the same is a coincidence.

If determinism is important, you need to make sure that each item processed by parallel_for_each() always receives the same pseudo-random number. There are several ways to do that, and you mentioned some before:

  • pre-generate all the random numbers before calling parallel_for_each, and make sure there is a deterministic 1-1 correspondence between these numbers and the items to process. Note that sharing a reference to the generated numbers in the function object won't give you that deterministic correspondence because you do not know in which order items are processed. You need to pair a number and an item together somehow (a container of pairs, a zipped iterator on two containers, storing the number inside the item, using a deterministic hash function to find the index of the right number, etc.).
  • create an instance of RNG locally inside functor's operator(). For determinism, you need fixed seeds; for good distribution, you need separate seeds for each instance - so you now need a deterministic method to produce a different seed for each item. And even in that case, is it guaranteed that multiple instances together produce uniformly distributed numbers? Separate RNG instances are good for performance, but the degree of correctness is unclear to me.
  • there is another solution I can suggest: a two-stage pipeline. It's funny but this is the 4th time I suggest it today at the forum to solve different problems :) In this case, the serial input stage would use the same RNG instance to produce a random number, take an item out of the container, tie them together into e.g. std::pair, and pass down the pipeline. With a fixed RNG seed, deterministic behavior is guaranteed - same numbers will always be matched to same items. The second parallel filter would do all the processing, comparing the computed probability to the given random number. However, whether this solution will perform as well as parallel_for_each is a question. In particular, if the container to iterate over supports random access (e.g. std::vector or std::deque), parallel_for_each may take advantage of that but parallel_pipeline may not.

To conclude, it seems a certain degree of serialization (and so performance and scalability) is the cost you need to pay for determinism. You should decide which one to take. At a glance, I would question the need for absolute determinism from run to run; since it uses pseudo-random numbers anyway so the results is [pseudo-]random, would not it be sufficient if it is statictically deterministic by some definition, such as approximately the same number of items passing the probability check, or some such?

Having determinism in a non-determinant system is an oxymoron.

Alexey's suggestion of using pre-generated arrays of random numbers for each iteration of a parallel_for is workable, provided that the parallel_for's are not nested within a non-determinate parallel task. Example: if the number of task is dependent on the number of available threads and progress of those threads, the number of and (recursive) nest levels of the parallel_for's is non-deterministic, therefore you cannot have and use the same pre-generated set of random numbers (as you will have differing numbers of sets).

You will have to determine if the requirements for all circumstances (not just a few test runs). This will require writing a proof.

Good suggestions Alexey

Jim Dempsey

www.quickthreadprogramming.com

I'm not sure where we agree or disagree anymore, and I may have incorrectly assumed that each element's generator was going to be used for multiple random numbers (if not then it is not very efficient to give each element its own generator), but I have an additional suggestion to speed up the deterministic distribution of random data, either as the final product or as the seed for each element's generator. (Using a two-stage pipeline as suggested by Alexey is less efficient because it uses one task for each element.)

First generate number_of_elements/number_of_elements_at_once random values from an initial generator, and then use parallel_for to use each of these values as the seed for a new generator that in turn initialises number_of_elements_at_once elements.

Some determinism can be rescued from the jaws of non-deterministic parallel execution (even TBB provides parallel_deterministic_reduce()). You just have to be a little bit careful...

Leave a Comment

Please sign in to add a comment. Not a member? Join today