We are trying to use parallel_for for a loop that gets called many, many times. We implemented it and we are now pegging all 4 cores of a quad-core PC (Intel S5000VSA motherboard, of course), but it's 10 times slower!
Using VTune, I see the following:
Threads (Inside OurExecutable.exe)
Thread Process Timer%
Thread131 OurExecutable.exe 52.93%
Thread125 OurExecutable.exe 46.52%
Modules (in either of the above threads; below is one of them)
Module Process Timer%
tbb.dll OurExecutable.exe 65.31%
OurDLL.dll OurExecutable.exe 17.65%
It seems like most of the time is spent inside tbb.dll?
Any thoughts? Maybe we are trying to parallelize a loop that is already very tight, yet it's called many, many times.
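To make that suspicion concrete, here is a rough sketch of how the number of scheduled pieces grows as the grain size shrinks. It is not TBB code: count_chunks is a hypothetical helper, and it assumes the splitting rule I understand blocked_range to use with a simple partitioner (a range is halved for as long as it is larger than the grain size). Other partitioners may split less aggressively.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper, not part of TBB.
// Counts the leaf subranges produced when a range of `size` iterations is
// halved repeatedly for as long as it is still larger than `grain`.
// Assumption: this mirrors blocked_range splitting under a simple partitioner.
std::size_t count_chunks(std::size_t size, std::size_t grain) {
    assert(grain > 0);                 // a grain size of 0 is not meaningful
    if (size <= grain) {
        return 1;                      // not divisible: the loop body runs once on this piece
    }
    std::size_t left = size / 2;       // split in the middle
    return count_chunks(left, grain) + count_chunks(size - left, grain);
}
```

Under these assumptions, count_chunks(1000, 1) gives 1000 pieces while count_chunks(1000, 100) gives 16. If each iteration is only a handful of instructions, the small-grain case would spend far more time splitting and scheduling than doing the actual work, which would be consistent with most of the time showing up in tbb.dll.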
We were hoping we could optimize using parallel_for, but maybe we are not using it correctly.
Any help would be greatly appreciated.