I am using TBB for a numerical computing application. I have an outer loop that runs sequentially and an inner loop that is parallelized with TBB's parallel_for(); this inner loop does most of the application's work. I recently made a change in which a work-intensive function gets called one more time within the parallel_for body.

When the application starts, work is properly distributed across all available cores at close to 100% utilization. After the first iteration of the outer loop, however, the application appears to run serially, keeping only one core active at close to 100%.

Before this code change, I occasionally noticed the same behavior during test runs. I never understood exactly why it was happening, but tweaking the grain size or the number of iterations would get it to use all cores again. Now my test runs always use all cores through all iterations, but full executions drop down to one core after the first iteration every time.

I have experimented with many grain sizes, with not setting a grain size at all, with different iteration counts, with passing the number of processors as an argument to task_scheduler_init, and with both auto_partitioner and simple_partitioner. I also tested the application without the recent code change, and full executions complete using all cores at close to 100%. There is nothing unusual about the function call itself (that I know of): it was already called several times within the parallel_for body, and the only change is that it is now called one more time, which seems to have disrupted my delicate workload-distribution balance.

I have this application running on Linux and Windows and get the same behavior on both. I can't seem to find the magic combination to keep all cores active after the first iteration. Any suggestions are greatly appreciated.