Grainsize - Director's Cut

The parallel loop templates in Intel® TBB require a grainsize parameter. Ideally, we'd have some sort of profile-guided optimization. But that's tough to do within TBB's goal of working with standard-issue compilers.

It's really not that difficult to understand and set. I had this analogy in a draft of the Tutorial, but it ended up on the cutting room floor because it depended too much on familiarity with Western consumer culture:

The grainsize enables you to avoid excessive parallel overhead. A parallel loop construct involves "packaging and handling" overhead that is paid per subrange. If the subranges are too small, this overhead may dominate the useful work. By specifying a grainsize, you can keep the overhead reasonable. If you have ever bought something inexpensive by mail order, you probably encountered the problem that the "packaging and handling" can cost more than the item itself. By analogy, grainsize sets a "minimum order" threshold for parallelization.


Even OpenMP, normally full of automagic, has grainsize parameters. We're looking into ways to determine grainsize automatically, but it's not easy. One possibility is that on a machine with P cores, for an N-iteration loop, we could use N/(k*P) for some small fixed value of k. For many of the cases we've looked at, that works pretty well. But it has the disturbing property that as P grows large, the grains become very small, with the net effect that having more cores causes packaging-and-handling to swamp the useful work.

The issue is not limited to parallelism. Ever try writing a large file using a single fwrite and fflush per character? Picking appropriate chunk sizes is a common problem in programming.

A good grainsize for the TBB loop templates can usually be chosen independent of the number of processors, because the packaging-and-handling overhead is more or less per grain, not per processor. For example, pick grains large enough that the overhead drops to an acceptable percentage of the useful work. There's always going to be a tradeoff, because too large a grain limits parallelism. To return to the mail-order analogy, you can greatly reduce packaging-and-handling costs by placing a single monster lifetime order, but who wants to wait that long?

For more complete information about compiler optimizations, see our Optimization Notice.