TBB: balanced splits vs unbalanced but simple splits

Hello,

this is just a word about a cautionary tale regarding writing your own Ranges for Intel's Threading Building Blocks library.

In our project we use a 2D iterator that iterates over all elements in an upper triangle (ie think an upper triangle matrix and you hit all cells != 0).

I've implemented two approaches. One using simple splits that splits the triangle in a 2:1 ratio:
\----|
\ --|
and
\-|
\|

and one version that splits it ~1:1:
\-|
\|
\-|
\|
and
|-|
|-|

The second version requires a more complex iterator, because you have to iterate over two triangles.

For big datasets and many cores both versions seem to run equally fast, because load balancing in TBB seems to work well with sufficiently many available tasks.
However for small- to medium-sized datasets the first version, which allows for very simple iterator code, is generally faster by 10-15%.

Take-away: Prefer simple code over fair splits and let TBB's scheduler perform load-balancing.

How the splits work:
The range is a state machine with 3 states each, when you think about it:
A upper triangle,
B block,
C truncated upper triangle or upper triangle without block

For the 1:1 ratio split:

  • an A range is split into a B range and a C range (as you can see above)

  • a B range is split into 2 B ranges

  • a C range is split into 2 A ranges


For the 2:1 ratio split:

  • an A range is split into a C range and a A range

  • a B range is split into 2 B ranges

  • a C range is split into an A range and a B range


Using these state transitions any valid range can be subsequently split into 1x1 cells.
Categories:
For more complete information about compiler optimizations, see our Optimization Notice.