My TBB code is slower than the sequential code.

azmodai

Hello,

I'm a new TBB user, and I am trying to parallelize some code so that it runs faster on a 48-core machine.

I'm currently parallelizing a sequential code with TBB, but the parallelized version is much slower. I don't know whether the data I use is simply too small, but I doubt it: I tried other data sets and always got very poor performance compared to the sequential code.

I implemented the code with TBB in two ways: with parallel_reduce and with parallel_for. I get slightly better performance with parallel_for, but it is still much slower than the sequential code.

Information about the two versions I implemented :
------------------------------------------------------------

For the parallel_reduce version, no data is shared except a few references (no arrays are copied), and the results are merged
in the join method (the merge costs almost nothing because I'm using std::list).

For the parallel_for version, many references to arrays are shared, so I have to use tbb::concurrent_vector to avoid conflicts.

Time spent :
----------------

I manually chose the number of threads/cores to use. After a quick check, I noticed that I get the best performance with only one thread (out of 48 cores!), but even with one thread the run is still much slower than the original sequential version of the code.

Typically, the original sequential version takes 7 ms, the parallel_reduce version takes 17 ms, and the parallel_for version takes 12 ms.
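With run times of only a few milliseconds, a single measurement can be dominated by one-time costs and timer noise. The following is a minimal sketch of a timer that repeats the work and keeps the fastest run. It uses only the standard library; `best_time_ms` and `sum_values` are hypothetical names standing in for the real harness and kernel, and are not part of TBB or of the code in this post.

```cpp
#include <cassert>
#include <chrono>
#include <vector>

// Runs `work` once as an untimed warm-up, then `repeats` more times, and
// returns the fastest wall-clock time of the timed runs in milliseconds.
template <typename Work>
double best_time_ms(Work work, int repeats) {
    assert(repeats > 0);
    work();  // warm-up: pays one-time costs such as thread start-up
    double best = -1.0;
    for (int r = 0; r < repeats; ++r) {
        const auto start = std::chrono::steady_clock::now();
        work();
        const auto stop = std::chrono::steady_clock::now();
        const double ms =
            std::chrono::duration<double, std::milli>(stop - start).count();
        if (best < 0.0 || ms < best) best = ms;
    }
    return best;
}

// Hypothetical stand-in for the real kernel: sums a small array.
inline long long sum_values(const std::vector<int>& values) {
    long long total = 0;
    for (int v : values) total += v;
    return total;
}
```

Timing the sequential and the parallel versions through the same function, with the same number of repeats, makes figures such as 7 ms versus 17 ms easier to compare.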

Classes and code :
-------------------------

Here are some parts of the code to help you find the problem :

For the parallel_reduce version, this is my structure :

struct parallelReduceVersion
{
int mx;
int my;
int mz;

CenteredGrid3D &data;
const float c;

std::list > v;
std::list > n;
std::list
t;

std::vector* > &nPointers;
std::vector* > &xVPointers;
std::vector* > &yVPointers;
std::vector* > &zVPointers;

/* optimization */
int x,y,z;
float a[8];
unsigned char lookupTableEntry;
unsigned char case;
unsigned char config;
unsigned char subConfig;

parallelReduceVersion(CenteredGrid3D &data, float c,
std::vector* > &nPointers,
std::vector* > &xVPointers,
std::vector* > &yVPointers,
std::vector* > &zVPointers)
: data(data),c(c),nPointers(nPointers),xVPointers(xVPointers),yVPointers(yVPointers),zVPointers(zVPointers)
{
mx = data.geometry().nx();
my = data.geometry().ny();
mz = data.geometry().nz();
}

parallelReduceVersion(const parallelReduceVersion& smc, tbb::split)
: data(smc.data), c(smc.c),nPointers(smc.nPointers),xVPointers(smc.xVPointers),yVPointers(smc.yVPointers),zVPointers(smc.zVPointers)
{
mx = smc.data.geometry().nx();
my = smc.data.geometry().ny();
mz = smc.data.geometry().nz();
}

/* methods */
.....
.....
.....
.....

void operator()(const tbb::blocked_range& r);

void join(parallelReduceVersion &smc)
{
v.splice(v.end(),smc.v);
n.splice(n.end(),smc.n);
t.splice(t.end(), smc.t);
}
};
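To make the merge step concrete, here is a minimal sketch of a body object whose `join` splices one list onto another, in the same shape as the `join` method above. It is deliberately not a TBB program: the two chunks are processed one after the other in plain C++, and `Collector` with its "multiples of 3" filter is a hypothetical stand-in for the real computation.

```cpp
#include <cassert>
#include <list>

// Hypothetical reduction body: collects the multiples of 3 found in a chunk.
struct Collector {
    std::list<int> found;

    // Processes the half-open chunk [begin, end) into this body's own list.
    void operator()(int begin, int end) {
        assert(begin <= end);
        for (int k = begin; k < end; ++k)
            if (k % 3 == 0) found.push_back(k);
    }

    // Appends the other body's results. Splicing a whole list relinks its
    // nodes in constant time and leaves `other.found` empty; no element is
    // copied.
    void join(Collector& other) {
        found.splice(found.end(), other.found);
    }
};
```

Since the splice itself is constant time, the cost of this pattern sits in the `push_back` calls made while each chunk is processed, not in `join`.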

parallel_reduce is used like this :

parallelReduceVersion *PR = new parallelReduceVersion(data, c, nPointers, xVPointers, yVPointers, zVPointers);
tbb::parallel_reduce(tbb::blocked_range(0,nz,round(nz/nbCores)), *PR);
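One detail of the grain size `round(nz/nbCores)` is worth checking. Assuming `nz` and `nbCores` are both integers (the post does not show their types), the division is truncated before `round` sees it, so the result is 0 whenever `nz` is smaller than `nbCores`, whereas `tbb::blocked_range` expects a positive grain size. A minimal sketch of a helper that rounds up and never returns 0 follows; `chunk_size` is a hypothetical name, not a TBB function.

```cpp
#include <cassert>
#include <cstddef>

// Returns ceil(n / workers), clamped to at least 1 so that the result can
// always be used as a grain size.
inline std::size_t chunk_size(std::size_t n, std::size_t workers) {
    assert(workers > 0);
    const std::size_t chunk = (n + workers - 1) / workers;  // integer ceiling
    return chunk > 0 ? chunk : 1;
}
```

For example, `chunk_size(100, 48)` gives 3, while the truncating division `100 / 48` gives 2, and `chunk_size(10, 48)` gives 1 where `10 / 48` gives 0.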

The () operator :

void parallelReduceVersion::operator()(const tbb::blocked_range& r)
{
for(k = r.begin() ; k < r.end() && k < mz ; k++)
{
for(j = 0 ; j < my ; j++)
{
for(i = 0 ; i < mx ; i++)
{
// CODE
}
}
}
for(k = r.begin() ; k < r.end() && k < mz-1 ; k++)
{
for(j = 0 ; j < my-1 ; j++)
{
for(i = 0 ; i < mx-1 ; i++)
{
// CODE
}
}
}
}

Do you think I am using TBB the right way to get the best performance?

Thanks!

azmodai

Ok....
Do you think it is a good idea to have many methods in the class? Should I reduce the number of methods as much as possible to avoid function-call overhead?

I'm performing a lot of local push_backs on several std::lists. When I comment out the push_backs I get good performance; however, I'm pretty sure I got similarly poor performance using plain arrays, so this looks like a dead end.
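One way to see what the push_backs cost is to count heap allocations. The sketch below is an assumption-laden illustration, not a measurement of the real code: `CountingAllocator`, `allocation_count`, and the two helper functions are hypothetical names, and the counter is a plain global that is not thread-safe. It compares a `std::list`, which allocates one node per element, with a `std::vector` that reserves its storage up front.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <new>
#include <vector>

// Global counter of allocations made through CountingAllocator (not thread-safe).
inline std::size_t& allocation_count() {
    static std::size_t count = 0;
    return count;
}

// Minimal allocator that forwards to operator new and counts each call.
template <typename T>
struct CountingAllocator {
    using value_type = T;
    CountingAllocator() = default;
    template <typename U>
    CountingAllocator(const CountingAllocator<U>&) noexcept {}
    T* allocate(std::size_t n) {
        ++allocation_count();
        return static_cast<T*>(::operator new(n * sizeof(T)));
    }
    void deallocate(T* p, std::size_t) noexcept { ::operator delete(p); }
};
template <typename T, typename U>
bool operator==(const CountingAllocator<T>&, const CountingAllocator<U>&) { return true; }
template <typename T, typename U>
bool operator!=(const CountingAllocator<T>&, const CountingAllocator<U>&) { return false; }

// Number of allocations needed to push n ints onto a std::list.
inline std::size_t allocations_for_list(std::size_t n) {
    allocation_count() = 0;
    std::list<int, CountingAllocator<int>> items;
    for (std::size_t i = 0; i < n; ++i) items.push_back(static_cast<int>(i));
    return allocation_count();
}

// Number of allocations needed to push n ints onto a pre-reserved std::vector.
inline std::size_t allocations_for_reserved_vector(std::size_t n) {
    allocation_count() = 0;
    std::vector<int, CountingAllocator<int>> items;
    items.reserve(n);
    for (std::size_t i = 0; i < n; ++i) items.push_back(static_cast<int>(i));
    return allocation_count();
}
```

If the allocation count turns out to matter in the real code, TBB also ships `tbb::scalable_allocator`, which can be passed as the allocator argument of standard containers; whether it helps here is something only a measurement can tell.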
