"Preprocessing usually constitutes overhead by itself" That is not the situation here: we're assuming clearly clustered workloads (the parallel_for serves to enumerate over distinct tasks in the original sense of the word), and we want to reserve the small tasks to fill in the gaps, like paving a road by first dropping boulders and then pebbles on top (it's a flawed simile, of course), to minimise idle time at the barriers. My comment to Robert was that it does not serve to first drop the boulders, have a barrier, then drop the pebbles and have another barrier, because that will be worse than dropping everything at once (which is what TBB does now), which in turn is worse than the solution suggested above (which TBB does not yet provide).
Quoting - Raf Schietekat
Ok, let me explain a bit: new tasks get placed/constructed directly in the correct "bin" with no overhead (as Raf guessed correctly). So there is no runtime penalty in my case, but then again I had full control over the design of the tasks and the scheduler. So the preprocessing doesn't affect me, but it would be a fundamental factor to consider in a generic task scheduler.
Also, my custom system actually proved more robust than TBB's parallel_for when I benchmarked them a while ago. Basically I threw random loads at parallel_for and at my system, and yes, TBB outperformed my system many times, but it occasionally also really lost, while my system takes a consistent time for the same tasks (because it reduces randomness). So I chose my system for its robustness.
However, one very important accidental factor lies in my design: I did most of the foundational work and engineering for the system while prototyping with OMP, not TBB. It was only after the system was working and scalable that I got around to trying TBB, so I had already engineered the clustering part by the time I began introducing TBB. If I were to begin again from a blank slate, I'd likely have just used parallel_for and not invested the extra time required in designing clustering algorithms and fine-tuning variables, because the difference really isn't that big and clustering doesn't give me great speedups compared with parallel_for (though clustering is much faster than a simple OMP for loop). So I could say designing such a careful clustering algorithm is not cost-effective. But if it's already there, then it's cheap :)