I had assumed that grainsize referred to the smallest chunk of work into which a range would ever be divided. The TBB book says "The grainsize effectively sets a minimum threshold for parallelization" (p36). This would make it similar to the chunk size in OpenMP guided scheduling. However, looking at the code, a range is actually subdivided as long as its current size exceeds the grainsize. This means the smallest chunks can end up only slightly larger than half the specified grainsize.
I understand the grainsize does not need to be set exactly, and we normally end up picking a value by experimentation anyway, but the current behaviour is a little unexpected. I am not asking for any change; maybe the behaviour could just be clarified a little in the documentation.
And a minor detail: if the size is exactly equal to the grainsize, then a parallel sort will divide the range, but a parallel reduce, or any other operation using a blocked range, will not.