A parallel loop construct incurs overhead cost for every chunk of work that it schedules. Since version 2.2, Intel® Threading Building Blocks (Intel® TBB) chooses chunk sizes automatically, depending upon load balancing needs. The heuristic attempts to limit overheads while still providing ample opportunities for load balancing.
Typically a loop needs to take at least a million clock cycles to make it worth using parallel_for. For example, a loop that takes at least 500 microseconds on a 2 GHz processor might benefit from parallel_for.
The default automatic chunking is recommended for most uses. As with most heuristics, however, there are situations where controlling the chunk size more precisely might yield better performance.
 In Intel TBB 2.1, the default was not automatic. Compile with TBB_DEPRECATED=1 to get the old default behavior.