I'm a brand-new user of TBB, currently trying to parallelize some code so that it runs faster on a 48-core machine.
I'm working on parallelizing a sequential code with TBB, but the parallelized version is much slower. I don't know whether that's because of the size of the data I use (which might be too small), but I really doubt it, because I tried with other data and always got very poor performance compared to the sequential code.
I implemented the code with TBB in two ways: using parallel_reduce and using parallel_for. As expected I get slightly better performance with parallel_for, but it is still much slower than the sequential code.
Information about the two versions I implemented:
In the parallel_reduce version no data is shared except a few references (no arrays are copied), and the results are merged in the join method (the merging costs almost nothing because I'm using std::list).
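To illustrate why the merge is cheap, here is a stripped-down sketch of the join step; Result and ReduceBody are stand-ins for my real types. std::list::splice moves the nodes in O(1), so nothing is copied:

```cpp
#include <cassert>
#include <list>

// "Result" is a placeholder for the real per-cell output type.
struct Result { int id; };

struct ReduceBody {
    std::list<Result> v;

    // Called by tbb::parallel_reduce to merge a right-hand body into this one.
    // splice() relinks the nodes in O(1) -- no copying, no per-element allocation.
    void join(ReduceBody& rhs) {
        v.splice(v.end(), rhs.v);
    }
};
```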
In the parallel_for version plenty of references to arrays are shared, so I have to use tbb::concurrent_vector to avoid conflicts.
Time spent:
I manually chose the number of threads/cores to use, and after a quick check I noticed that I get the best performance with only one thread/core (out of 48!). But even then, the one-core run is still much slower than the original sequential version of the code.
Typically I spend 7 ms with the original sequential version, 17 ms with the parallel_reduce version and 12 ms with the parallel_for version.
Classes and code:
Here are some parts of the code to help you find the problem:
For the parallel_reduce version, this is my structure:
class parallelReduceVersion {
    // NOTE: the template arguments below were eaten by the formatting when I
    // pasted the code; "Vec3" is a stand-in for my real vertex/normal type.
    CenteredGrid3D &data;
    const float c;
    std::list<Vec3> v;
    std::list<Vec3> n;
    std::vector<Vec3*> &nPointers;
    std::vector<Vec3*> &xVPointers;
    std::vector<Vec3*> &yVPointers;
    std::vector<Vec3*> &zVPointers;
    int mx, my, mz;
    /* optimization */
    unsigned char lookupTableEntry;
    unsigned char caseId;      // originally "case", renamed here since case is a C++ keyword
    unsigned char config;
    unsigned char subConfig;

public:
    parallelReduceVersion(CenteredGrid3D &data, float c,
                          std::vector<Vec3*> &nPointers,
                          std::vector<Vec3*> &xVPointers,
                          std::vector<Vec3*> &yVPointers,
                          std::vector<Vec3*> &zVPointers)
        : data(data), c(c), nPointers(nPointers), xVPointers(xVPointers),
          yVPointers(yVPointers), zVPointers(zVPointers)
    {
        mx = data.geometry().nx();
        my = data.geometry().ny();
        mz = data.geometry().nz();
    }

    // splitting constructor required by parallel_reduce
    parallelReduceVersion(const parallelReduceVersion& smc, tbb::split)
        : data(smc.data), c(smc.c), nPointers(smc.nPointers),
          xVPointers(smc.xVPointers), yVPointers(smc.yVPointers),
          zVPointers(smc.zVPointers)
    {
        mx = smc.data.geometry().nx();
        my = smc.data.geometry().ny();
        mz = smc.data.geometry().nz();
    }

    /* methods */
    void operator()(const tbb::blocked_range<int>& r);
    void join(parallelReduceVersion &smc);
};
parallel_reduce is invoked like this:
parallelReduceVersion PR(data, c, nPointers, xVPointers, yVPointers, zVPointers);
tbb::parallel_reduce(tbb::blocked_range<int>(0, mz), PR);
The () operator:
void parallelReduceVersion::operator()(const tbb::blocked_range<int>& r)
{
    int i, j, k;
    /* first pass over the sub-range of z-slices */
    for (k = r.begin(); k < r.end() && k < mz; k++)
        for (j = 0; j < my; j++)
            for (i = 0; i < mx; i++)
            {
                // ... per-cell work ...
            }
    /* second pass over the sub-range */
    for (k = r.begin(); k < r.end() && k < mz - 1; k++)
        for (j = 0; j < my - 1; j++)
            for (i = 0; i < mx - 1; i++)
            {
                // ... per-cell work ...
            }
}
Do you think I am using TBB the right way to get the best performance?