Slow performance compared to raw pthreads

I am seeing severe performance problems with TBB compared to raw pthreads: the TBB version runs at about 10% of the speed of the pthreads version. vmstat 1 shows a lot of interrupts for TBB and hardly any for pthreads. The CPUs are 100% busy. I'm perplexed as to why the same worker code runs at 10% of the speed under TBB. Any clues? I've tried altering the buffer sizes to see if it's a cache alignment effect, but it makes no real difference. Forcing TBB to a single thread by passing an explicit thread count makes no difference either.

Here's the code:

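#include <iostream>
#include <vector>
#include <pthread.h>
#include <zlib.h>
#include <tbb/blocked_range.h>
#include <tbb/parallel_do.h>
#include <tbb/parallel_for.h>
#include <tbb/task_scheduler_init.h>
#include <tbb/tick_count.h>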

const uLongf bufSize = 256 * 1024;
const int numTasks = 16;

// Common worker code - soaks up CPU using zlib compression function
void * execute(void *) {
    Bytef outbuf[bufSize], inbuf[bufSize];
    tbb::tick_count start = tbb::tick_count::now();
    for (int i = 0; i != 1000; i++) {
        uLongf outLen;
        compress(outbuf, &outLen, inbuf, bufSize);
    }
    tbb::tick_count end = tbb::tick_count::now();
    std::cout << pthread_self() << " " << (end-start).seconds() << std::endl;
    return 0;
}

struct Task {
    void run() {
        ::execute(0); // run same worker code as pthreads
    }
};

struct TaskRunner {
    void operator()(Task & task) const {
        task.run();
    }
};

std::vector<Task> tasks(numTasks);

struct BlockRunner {
    void operator()(const tbb::blocked_range<int> & range) const {
        for (int i = range.begin(); i != range.end(); i++)
            tasks[i].run();
    }
};

int main()
{
#if 1
    tbb::task_scheduler_init init;
    tbb::parallel_do(tasks.begin(), tasks.end(), TaskRunner());
#elif 0
    tbb::task_scheduler_init init;
    tbb::parallel_for(tbb::blocked_range<int>(0, numTasks, numTasks/2), BlockRunner());
#else
    pthread_t t[numTasks];
    pthread_attr_t attr;
    pthread_attr_init(&attr);
    pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);
    for (int i = 0; i != numTasks; i++) {
        pthread_create(&t[i], &attr, execute, 0);
    }
    for (int i = 0; i != numTasks; i++) {
        void * p;
        pthread_join(t[i], &p);
    }
    pthread_exit(0);
#endif
}

The outputs of "g++ -O3 test.cpp -ltbb -ltbbmalloc -lpthread -lrt -lz && time ./a.out" are shown below for each of the three implementations in main().

// parallel_do
3076811632 3.74102
3077867744 3.76923
3076811632 3.77188
3077867744 3.75799
3077867744 3.75111
3076811632 3.76593
3077867744 3.74415
3076811632 3.79351
3077867744 4.00687
3076811632 4.12728
3077867744 3.85296
3076811632 4.02766
3077867744 3.97775
3076811632 3.87468
3077867744 3.78757
3076811632 3.73723

real 0m30.846s
user 0m58.374s
sys 0m1.188s

// parallel_for
3076434800 3.80896
3077490912 3.96733
3076434800 4.05173
3077490912 4.12096
3076434800 3.84406
3077490912 3.75456
3076434800 3.85637
3077490912 4.05792
3076434800 3.79944
3077490912 3.94887
3076434800 3.81814
3077490912 3.88321
3076434800 3.81466
3077490912 3.80893
3076434800 3.82112
3077490912 3.76532

real 0m31.312s
user 0m58.344s
sys 0m1.052s

// pthreads
3062643568 3071036272 0.0142380.0107834

3029334896 0.0335929
2962193264 3020942192 0.0464237
0.00820379
3039820656 0.0555536
3012549488 0.0607128
2978978672 0.0338541
2970585968 0.0464155
2987371376 0.0564107
2953800560 0.0664894
3052403568 0.103507
2945407856 0.0780483
2995764080 0.13502
3004156784 0.159363
3079428976 3.80797 // hmm, interesting that the last one takes the same time as under TBB

real 0m3.811s
user 0m3.731s
sys 0m0.182s

Platform details:
Intel T5900 Core 2 Duo 2.2GHz, 2MB cache
Fedora 14
3GB RAM
g++ 4.5.1
32-bit kernel 2.6.35.11-83.fc14.i686.PAE
TBB tbb-2.2-2.20090809.fc14.i686


The problem appears to be linked to the g++ -O3 optimisation switch. If I remove this switch and also replace the call to zlib's compress function with a plain busy loop (incrementing a local variable 10^9 times), then the pthreads version runs at the same speed as either of the TBB versions. Hmm, very interesting...
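
A caveat when repeating the busy-loop experiment with optimisation turned back on: a loop that only increments a local variable whose value is never used can be removed entirely by the compiler at -O3, in which case the timing measures nothing. The sketch below is one way to write a busy loop that survives optimisation. It is a stand-in for the worker, not the code used in this thread; busyLoop and BusyResult are hypothetical names, and C++11 std::chrono is assumed in place of tbb::tick_count.

```cpp
#include <chrono>

// Hypothetical result type: the final counter value and the elapsed
// wall-clock time in seconds.
struct BusyResult {
    unsigned long count;
    double seconds;
};

// Hypothetical stand-in for the zlib-based worker: count up to `iterations`.
BusyResult busyLoop(unsigned long iterations) {
    // `volatile` forces every read and write of the counter to be performed,
    // so the optimiser cannot collapse or delete the loop.
    volatile unsigned long counter = 0;

    std::chrono::steady_clock::time_point start = std::chrono::steady_clock::now();
    for (unsigned long i = 0; i != iterations; ++i) {
        counter = counter + 1;
    }
    std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();

    BusyResult result = { counter, std::chrono::duration<double>(end - start).count() };
    return result;
}
```

Calling busyLoop(1000000000UL) from execute() in place of the compress() loop gives every thread a fixed, known amount of work, which makes the TBB and pthreads timings directly comparable with or without -O3.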
