Is there any documentation available to help with troubleshooting performance issues when using TBB? (I haven't yet got the O'Reilly book, so that might be the short answer.)
I'm trying to use parallel_for and auto_partitioner to parallelise some existing calculation code under Windows XP. The modified code runs fine and Task Manager shows that all cores are being used throughout the calculation, but the runtime is slightly longer than the single-threaded version (the PC has two logical cores: one physical core plus hyperthreading).
Running under a profiler and looking at the wait_for_all function, roughly half the time is being spent inside my body function but half is "inside" the Windows Sleep function, which is being called around 200,000 times compared to 93 calls to my body function.
The range is not completely uniform: it has 252 elements (taking just over a second in total), of which one block of 145 elements takes 95% of the time.
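To put rough numbers on that imbalance, here is a minimal sketch in standard C++ only (no TBB calls). It models the range as 252 elements with one contiguous block of 145 carrying 95% of a total of about one second. The helper names make_costs and static_split_speedup, and the position of the heavy block (heavy_begin), are assumptions for illustration rather than anything from my real code. It estimates the best speed-up an even, contiguous split with no work stealing could give, so it is a reference point and not a model of what the TBB scheduler actually does.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Build an estimated per-element cost model, in seconds.
// heavy_begin is a hypothetical position for the expensive block;
// requires 0 < heavy_count < n.
std::vector<double> make_costs(std::size_t n, std::size_t heavy_begin,
                               std::size_t heavy_count, double heavy_share,
                               double total_seconds) {
    assert(heavy_count > 0 && heavy_count < n);
    const double heavy_cost = total_seconds * heavy_share / heavy_count;
    const double light_cost =
        total_seconds * (1.0 - heavy_share) / (n - heavy_count);
    std::vector<double> costs(n, light_cost);
    for (std::size_t i = heavy_begin; i < heavy_begin + heavy_count && i < n; ++i)
        costs[i] = heavy_cost;
    return costs;
}

// Speed-up when the range is cut into `workers` equal contiguous chunks,
// one chunk per worker, with no stealing: total work / slowest chunk.
double static_split_speedup(const std::vector<double>& costs,
                            std::size_t workers) {
    assert(workers > 0);
    const std::size_t n = costs.size();
    double total = 0.0;
    double slowest = 0.0;
    for (std::size_t w = 0; w < workers; ++w) {
        const std::size_t begin = w * n / workers;
        const std::size_t end = (w + 1) * n / workers;
        double chunk = 0.0;
        for (std::size_t i = begin; i < end; ++i)
            chunk += costs[i];
        total += chunk;
        slowest = std::max(slowest, chunk);
    }
    return slowest > 0.0 ? total / slowest : 1.0;
}
```

Calling static_split_speedup(make_costs(252, 0, 145, 0.95, 1.0), 2) puts the heavy block at the start of the range, and an even two-way split then comes out at only about 1.2x. So the imbalance alone limits a naive split, but it would not explain a run that is slower than the single-threaded version.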
Changing allocator, grainsize, partitioner and debug/release build doesn't seem to affect these relative times much, and DO_TBB_ASSERT doesn't flag any problems. Even running on a four-core PC gives no speed-up. I've generated a TBB_TRACE file but I'm not sure how to interpret the output.
Any pointers welcome. I suspect the answer is (a) read the book and (b) build up / reduce to a simple example, but it's frustrating being so close and so I hoped someone might recognise the symptoms and be able to tell me what I've done wrong.
Thanks
Bryan




