I'm using parallel_reduce with the default auto_partitioner, and one thread starts with a large chunk of work that takes longer than the chunks on other threads (the red bands on the picture). Is there a way to adjust auto_partitioner to start with smaller blocks?
If you want roughly equal-size chunks, use simple_partitioner (with an appropriate grainsize).
Just out of curiosity, why is it a problem that auto_partitioner by design creates chunks of different sizes?
(Added) I don't see what the graph is supposed to show, but Robert seems to have an idea.
The graph you present seems to show a classic case of load imbalance: it looks like one of the LoadFeaturesTasks did not get partitioned, so no worker threads could come in and steal parts of the work to help balance the load. Or perhaps it is the DatasetTask that is not getting properly partitioned and is causing a delay because of a data dependence? The trace is somewhat revealing, but it omits some basic details that would help with its interpretation. Can you describe the general data flow and provide some details on the nature of the tasks shown and their relationships to each other?
Which version of TBB do you use? If it is older than 4.0, I recommend trying a newer version: in 4.0 the auto_partitioner was improved to better handle unbalanced workloads.
Thank you, guys. It seems I need to check whether there is a possible imbalance or interlock in LoadFeaturesTask (but that looks unlikely to me right now).
The workflow is like this: RenderStage1Task spawns and waits for some DatasetTasks, and each DatasetTask calls parallel_reduce with LoadFeaturesTask as the body. The other tasks shown can be ignored. TBB is version 4.0, update 3 or 4.
If you suspect auto_partitioner, you should actually make the comparison with simple_partitioner, otherwise it's just an idle accusation.
I've had another look with the new information. You seem to show 8 hardware threads, with a summary at the bottom. Can you make a distinction between the lifetimes of the LoadFeaturesTask instances (not really an appropriate name if it is the Body of a parallel_reduce(), and it has apparently misled Robert) and their potentially numerous invocations? The latter are more directly linked with the actions of the partitioner (if I'm not mistaken). Can you log the lengths of the executed chunks and divide by execution time to find out whether something is special there?
So check your assumptions, and don't expect us to guess what you mean or what the program does or is supposed to do.