Hi,I have been experimenting with the "wavefront" parallel design pattern, as presented in section 11.6 of the Tutorial. The task graph that my program builds and evaluates is that of Figure 11. I am trying to emulate a real computation by having each task receive data from its left and upper neighbours, execute a quite small workload (100 reg-to-reg FP operations), and then notify its right and lower neighbours by means of decreasing their ref. counters.I have implemented the following versions of this pattern, all based on the low-level task api:a) "baseline": this is exactly the one demonstrated in section 11.6, which builds statically the task graph and then evaluates it dynamicallyb) "dynamic": this version dynamically allocates (as roots) and spawns new tasks, each time their refcount becomes zero. I had to use atomic counters instead of tbb::task's native ref_count's, because, apparently, the task objects have not been created by that time.c) "dynamic+reuse": the same as 'b', and additionally tasks are recycled using recycle_as_continuation()On a 8-core Nehalem system I get the following times at 8 threads, when running a 10000 x 1000 graph (i.e. total of 10M tasks):a -> 2.42 secb -> 0.89 secc -> 0.26 secI verified that ~75% of the time of (a) is spent on allocation of tasks, which happens sequentially. This allocation occurs concurrently in cases (b) and (c), and additionally seems to scale pretty well. So I'm guessing that the library uses internally scalable allocators on each of these allocate_* task creation calls. Is this the case?
For more complete information about compiler optimizations, see our Optimization Notice.