The recursive task-based parallelism of TBB is usually working very well for me. However, occasionally there is a situation where a simple static split of the work into n_thread equal-sized parts, one per thread, is beneficial
I currently have such a situation. In step 1, a simple range-based parallelism is used to construct thread-local oct-trees from a list of positions. In steps 2&3, these thread-local trees are merged and linked to the final data product (all steps use tbb parallelism). In this process, an considerable speed-up (~factor 2) is obtained (in serial and parallel) if the initial positions are already ordered close to the tree order (in practice I use the order from the previous time step). Steps 2&3 (and perhaps also step 1), should benefit from an even better affinity, such as would emerge if in step 1 each thread a works on a contiguous partition of all positions.
How could/should I implement this in TBB. I was thinking about creating a range which splits off N_initial/n_thread until N<=n_thread, when it reverts to ordinary splitting down-the-middle. Would this give me the correct behaviour? In these initial splits, which side should keep the larger portion: the newly created or the old range?
Alternatively, I could spawn n_thread root tasks? would this be better (in the sense of faster distributing the work evenly amongst the threads as intended)?