Unfair auto_partitioner


olegmax

I'm using parallel_reduce with the default auto_partitioner, and it starts with a large chunk of work that takes longer than the chunks in other threads (the red bands in the picture). Is there a way to adjust auto_partitioner so that it starts with a smaller block?




Attachment: 1.png (30.1 KB)
Raf Schietekat

If you want roughly equal-size chunks, use simple_partitioner (with an appropriate grainsize).
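With TBB this means passing a `tbb::blocked_range` with an explicit grainsize and a `tbb::simple_partitioner()` as the last argument of `parallel_reduce`. To see why the resulting chunks come out roughly equal, here is a minimal sketch in plain C++ (no TBB required) that mimics the splitting rule: a range is halved as long as its size exceeds the grainsize. The function names are hypothetical and only illustrate the rule; this is not TBB's actual implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of TBB): halve [begin, end) recursively
// while the range is larger than grainsize, the same divisibility rule
// that blocked_range uses, and record the size of every leaf chunk.
void split_until_grainsize(std::size_t begin, std::size_t end,
                           std::size_t grainsize,
                           std::vector<std::size_t>& chunks) {
    assert(grainsize > 0);
    if (end - begin <= grainsize) {        // no longer divisible: a leaf chunk
        chunks.push_back(end - begin);
        return;
    }
    std::size_t middle = begin + (end - begin) / 2;
    split_until_grainsize(begin, middle, grainsize, chunks);
    split_until_grainsize(middle, end, grainsize, chunks);
}

// Convenience wrapper: chunk sizes for the iteration space [0, n).
std::vector<std::size_t> chunk_sizes(std::size_t n, std::size_t grainsize) {
    std::vector<std::size_t> chunks;
    split_until_grainsize(0, n, grainsize, chunks);
    return chunks;
}
```

For example, `chunk_sizes(1000, 100)` yields 16 chunks of 62 or 63 iterations each. In general every chunk ends up between about half the grainsize and the grainsize, so the grainsize must be chosen small enough to leave work for all threads.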

Just out of curiosity, why is it a problem that auto_partitioner by design creates chunks with different sizes?

(Added) I don't see what the graph is supposed to show, but Robert seems to have an idea.

robert-reed (Intel)

The graph you present seems to represent a classic case of load imbalance: it looks like one of the LoadFeaturesTasks did not get partitioned, so no worker threads could come in to steal parts of the work to help balance the load. Or perhaps it is the DatasetTask that is not getting properly partitioned and is causing a delay because of a data dependence? The trace is somewhat revealing but doesn't supply some basic details that might help with its interpretation. Can you describe the general data flow and provide some details on the nature of the tasks represented and their relationships to each other?

Alexey Kukanov (Intel)

Which version of TBB do you use? If it is older than 4.0, I recommend trying a newer version, because in 4.0 the auto_partitioner was improved to handle unbalanced workloads better.

olegmax

Thank you guys. It seems I need to check whether there is a possible imbalance or interlock in LoadFeaturesTask (though that looks unlikely to me right now).

The workflow is like this: RenderStage1Task spawns and waits for some DatasetTasks, and each DatasetTask calls parallel_reduce with LoadFeaturesTask as the body. The other tasks shown can be ignored. TBB is 4.0 update 3 of 4.


Raf Schietekat

If you suspect auto_partitioner, you should actually make the comparison with simple_partitioner, otherwise it's just an idle accusation.

I've had another look with the new information. You seem to show 8 hardware threads, with a summary at the bottom. Can you make a distinction between the lifetimes of the LoadFeaturesTask instances (not really an appropriate name if it is the Body of a parallel_reduce(), and it has apparently misled Robert) and their potentially several invocations, where the latter are more directly linked with the actions of the partitioner (if I'm not mistaken)? Can you log the lengths of the executed chunks and divide by execution time to find out if something is special there?
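One way to do that logging is sketched below, under loud assumptions: `LoggingSumBody`, `ChunkRecord`, `IndexRange` and `SplitTag` are hypothetical names, the summation stands in for the real feature-loading work, and `SplitTag` and `IndexRange` are plain stand-ins for `tbb::split` and `tbb::blocked_range<std::size_t>` so that the sketch compiles without TBB. Each invocation of `operator()` records one entry, so a Body instance that is invoked several times shows up as several chunks.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// One log entry per operator() invocation, i.e. per executed chunk.
struct ChunkRecord {
    std::size_t length;   // number of iterations in the chunk
    double seconds;       // wall-clock time spent on the chunk
};

struct SplitTag {};       // stand-in for tbb::split (assumption)

struct IndexRange {       // stand-in for tbb::blocked_range<std::size_t>
    std::size_t first, last;
    std::size_t begin() const { return first; }
    std::size_t end() const { return last; }
};

class LoggingSumBody {
public:
    explicit LoggingSumBody(const std::vector<double>& data) : data_(&data) {}
    // Splitting constructor: starts with an empty sum and an empty log.
    LoggingSumBody(LoggingSumBody& other, SplitTag) : data_(other.data_) {}

    template <typename Range>
    void operator()(const Range& r) {
        assert(r.end() <= data_->size());
        auto start = std::chrono::steady_clock::now();
        for (std::size_t i = r.begin(); i != r.end(); ++i)
            sum_ += (*data_)[i];           // placeholder for the real work
        std::chrono::duration<double> elapsed =
            std::chrono::steady_clock::now() - start;
        log_.push_back({r.end() - r.begin(), elapsed.count()});
    }

    // Merge the partial result and the chunk log of the right-hand body.
    void join(LoggingSumBody& rhs) {
        sum_ += rhs.sum_;
        log_.insert(log_.end(), rhs.log_.begin(), rhs.log_.end());
    }

    double sum() const { return sum_; }
    const std::vector<ChunkRecord>& log() const { return log_; }

private:
    const std::vector<double>* data_;
    double sum_ = 0.0;
    std::vector<ChunkRecord> log_;
};
```

After the reduction, dividing `length` by `seconds` for each entry gives a per-chunk throughput. A chunk whose throughput is far below the others points at the work itself rather than at the partitioner, while uniform throughput with very uneven lengths points at the partitioning.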

So check your assumptions, and don't just assume that we can just guess what you mean and what the program does or is supposed to do.
