Automatic Chunking

A parallel loop construct incurs overhead cost for every chunk of work that it schedules. Since version 2.2, Intel® Threading Building Blocks (Intel® TBB) chooses chunk sizes automatically, depending upon load balancing needs.[1] The heuristic attempts to limit overheads while still providing ample opportunities for load balancing.


As a rule of thumb, a loop needs to take at least a million clock cycles for parallel_for to be worthwhile. On a 2 GHz processor, a million cycles corresponds to 500 microseconds, so a loop that runs at least that long might benefit from parallel_for.
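The arithmetic behind this rule of thumb can be sketched as a small helper. This is only an illustration: the names kMinCyclesForParallelFor, cycles_to_microseconds, and worth_parallelizing are hypothetical and are not part of Intel TBB. The loop time passed in is assumed to be a serial running time that you measured yourself.

```cpp
#include <cassert>

// Rule-of-thumb threshold from the text: about one million clock cycles.
// (Hypothetical constant, not an Intel TBB identifier.)
constexpr double kMinCyclesForParallelFor = 1e6;

// Convert a cycle count to microseconds at the given clock rate in Hz.
double cycles_to_microseconds(double cycles, double clock_hz) {
    assert(clock_hz > 0.0);
    return cycles * 1e6 / clock_hz;
}

// Return true when a measured serial loop time (in microseconds) meets the
// rule-of-thumb threshold for the given clock rate.
bool worth_parallelizing(double loop_microseconds, double clock_hz) {
    return loop_microseconds >=
           cycles_to_microseconds(kMinCyclesForParallelFor, clock_hz);
}
```

For a 2 GHz processor, cycles_to_microseconds(1e6, 2e9) gives the 500 microsecond figure quoted above. The threshold is a guideline for when to start measuring, not a guarantee of speedup.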

The default automatic chunking is recommended for most uses. As with most heuristics, however, there are situations where controlling the chunk size more precisely might yield better performance.

[1] In Intel TBB 2.1, the default was not automatic. To get the old default behavior, compile with the preprocessor macro TBB_DEPRECATED defined as 1.
