The problem I'm facing is probably not directly related to TBB, but you may already have faced something similar.
I get very different performance for the same program on two different machines. To test the performance of my freshly parallelized program I have two machines: a classic 2-core machine and a 48-core machine (HP Proliant DL585 G7).
On the first machine (2 cores), with TBB configured to use 2 threads, the computation is 2x faster. So far so good. On the second machine, the program is about 3 to 4 times slower with TBB configured to use either only 2 cores or all 48. I have to configure TBB to use 32 cores to get a speed-up of 2x!
(I've attached a picture of the processor architecture of the HP Proliant DL585 G7)
1) The HP Proliant DL585 G7 is made of 4 groups of 12 processors, and each group has two local shared memories. I first tried to explain the slow-down by the architecture: I thought my program was accessing the local shared memories of other groups. To check this I used the "taskset" program to force my program to use only certain processors, and I noticed only a slight speed-up. But I'm not sure I used "taskset" the right way, so I think I have to dig into this more.
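To make the pinning experiment reproducible from inside the program rather than through "taskset", here is a minimal sketch using the Linux sched_getaffinity / sched_setaffinity calls. The helper names allowed_cpus and pin_to_cpus are my own (hypothetical), and which core numbers belong to which group is machine-specific, so the list of CPUs has to be filled in by hand. As far as I know, threads created after the call inherit the mask, so it would have to run before TBB starts its workers. Note that this only restricts where threads run, not where memory is allocated.

```cpp
#include <sched.h>   // sched_getaffinity, sched_setaffinity, CPU_* macros (Linux/glibc)
#include <cassert>
#include <vector>

// Hypothetical helper: CPUs the calling thread is currently allowed to run on.
// Returns an empty vector if the affinity mask cannot be read.
std::vector<int> allowed_cpus() {
    cpu_set_t set;
    CPU_ZERO(&set);
    std::vector<int> cpus;
    if (sched_getaffinity(0, sizeof(set), &set) != 0) return cpus;
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu)
        if (CPU_ISSET(cpu, &set)) cpus.push_back(cpu);
    return cpus;
}

// Hypothetical helper: restrict the calling thread to the given CPUs.
// Returns true on success. Threads created afterwards inherit this mask.
bool pin_to_cpus(const std::vector<int>& cpus) {
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int cpu : cpus) {
        assert(cpu >= 0 && cpu < CPU_SETSIZE);
        CPU_SET(cpu, &set);
    }
    return sched_setaffinity(0, sizeof(set), &set) == 0;
}
```

Calling allowed_cpus() again after pin_to_cpus(...) is a quick way to check that the restriction really took effect, which is the part I was unsure about with "taskset".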
2) Each core of the 2-core machine has 4 MB of cache, while each core of the 48-core machine (4 groups of 12) has only about 400 KB. Could that be a problem? Because I am using a lot of memory, maybe I do not benefit from caching because the cache is too small?
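One way to check the cache hypothesis on both machines would be a small measurement like the sketch below: it sums a buffer of a given size over and over and reports the average time per element. The function name ns_per_element and the default number of accesses are assumptions of mine, not something from my real program. Sequential reads are prefetcher-friendly, so this mostly shows bandwidth differences rather than latency, and the numbers should be taken as indicative only.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical helper: average time in nanoseconds per element when a buffer
// of `bytes` bytes is summed repeatedly. Returns 0.0 for a buffer too small
// to hold a single element.
double ns_per_element(std::size_t bytes,
                      std::size_t total_accesses = std::size_t(1) << 24) {
    assert(total_accesses > 0);
    std::vector<std::uint32_t> buf(bytes / sizeof(std::uint32_t), 1u);
    if (buf.empty()) return 0.0;

    // Keep the total number of element reads roughly constant across sizes.
    const std::size_t passes =
        std::max<std::size_t>(1, total_accesses / buf.size());

    volatile std::uint64_t sink = 0;  // keeps the sums from being optimized away
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t p = 0; p < passes; ++p) {
        buf[0] = static_cast<std::uint32_t>(p);  // make each pass depend on p
        sink = sink + std::accumulate(buf.begin(), buf.end(), std::uint64_t{0});
    }
    const auto stop = std::chrono::steady_clock::now();

    const double ns =
        std::chrono::duration<double, std::nano>(stop - start).count();
    return ns / (static_cast<double>(passes) * static_cast<double>(buf.size()));
}
```

Running it for sizes from, say, 64 KB up to 64 MB on each machine should show at which working-set size the time per element starts to rise, which could then be compared with the per-core memory footprint of my program.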
3) Read/write accesses have been designed so that there are as few concurrency problems as possible in my program. However, I may be suffering from "pointer aliasing". A colleague of mine told me that a local alias could resolve some memory problems, but I'm not sure I understand that (?).
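As far as I understand the "local alias" idea, it would look roughly like the sketch below. Both functions are hypothetical illustrations, not code from my program. In the first one the compiler generally has to assume that out might point into in, so it may reload and store *out on every iteration; in the second one the running value lives in a local variable and memory is written only once at the end. Both assume non-null pointers and that out does not actually overlap in.

```cpp
#include <cassert>
#include <cstddef>

// Accumulates through the output pointer on every iteration.
// Because `out` and `in` have the same element type, the compiler may have
// to treat each write to *out as possibly changing in[i].
void sum_through_pointer(const float* in, std::size_t n, float* out) {
    assert(in != nullptr && out != nullptr);
    for (std::size_t i = 0; i < n; ++i)
        *out += in[i];
}

// Same result when `out` does not overlap `in`, but the accumulator is a
// local copy: it can stay in a register and *out is written once.
void sum_with_local(const float* in, std::size_t n, float* out) {
    assert(in != nullptr && out != nullptr);
    float acc = *out;
    for (std::size_t i = 0; i < n; ++i)
        acc += in[i];
    *out = acc;
}
```

If the output locations of different threads also sit close together in memory, writing once at the end instead of on every iteration might reduce cache-line traffic between cores as well, but that is a guess on my part.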
Have you experienced something similar, and do you have any recommendations for resolving such a problem?