Implementing a wave-front computation using the Intel® Threading Building Blocks flow graph

The Intel® Threading Building Blocks ( Intel® TBB )  flow graph is fully supported in Intel® TBB 4.0.  If you are unfamiliar with the flow graph, you can read an introduction here.

One node type available for use with the flow graph is continue_node<T>.  This node type is designed for implementing dependency graphs, where nodes wait for their predecessors to complete before beginning their own work.   A continue_node does not receive data messages from its predecessors, but instead counts the number of continue_msg signals that it receives.  Once it receives P messages, one for each predecessor, it executes its body which generates an output message of type T.  Often, the output type is also a continue_msg but it need not be. 

Pictorially, we draw a continue_node as below:


This symbol tries to convey important properties about a continue_node.  The input arc has lines above it to indicate that it counts incoming messages.  The interior of the circle contains f() to indicate that the body is a functor that is passed no argument.

Figure 1 shows an approach to implementing a wave front computation using a set of continue_node objects.  In this example, each computation must wait for the computation above it and the computation to its left to complete before it can start executing.  Most nodes have two predecessors and therefore will not start executing until they receive two continue_msg messages.  Nodes on the top and left edges have only a single predecessor and therefore wait for only a single message to arrive.

Figure 1: Using an Intel® Threading Building Blocks flow graph to express a wave-front calculation

I'll now provide the complete code necessary to implement an example that performs such a computation using an Intel® TBB flow graph. In this example, the computation at each node will update a block of a 2 dimensional matrix. If the values held by an element's left and upper neighbors are equal, then the element's value will be set to be 2 times that value. Otherwise, the element's value will be set to be the maximum of the two values. The top-left element is initialize with a value of 1. So for a 5x5 matrix, the results would be:

1 1 1 1 1
1 2 2 2 2
1 2 4 4 4
1 2 4 8 8
1 2 4 8 16

The code below includes the necessary headers, defines some parameters, and defines the function calc that performs the calculation on each matrix element. The constants M and N define the size of the matrix. The dimension of the blocks computed in each node is given by the blocksize, B, and the number of blocks in each dimenstion is computed and stored in MB and NB. The 2-D matrix, values, is where the results will be stored.

#include <algorithm> // for std::max
#include <cstdio>

#include "tbb/flow_graph.h"

using namespace tbb;
using namespace tbb::flow;

int M=1000, N=1000;
int B = 100;
int MB = (M/B) + (M%B>0);
int NB = (M/B) + (M%B>0);

double **value;

inline double calc( double v0, double v1 ) {
  if ( v0 == v1 )
    return 2*v0;
    return std::max(v0,v1);

The code below builds the flow graph that will apply the function calc to the matrix blocks, while respecting the dependencies shown in Figure 1. The 2-D array, node, is used to hold pointers to the continue_node objects. In BuildGraph, the doubly-nested for loop allocates the continue_node objects. Each node is constructed with a reference to the graph object g and lambda expression that applies calc to its corresponding block of elements. After each node is created, edges are made from it to its successors in the graph, setting up the required dependencies. Note that the loop indices move from the bottom right of Figure 1 to the top left, so each node's successors are allocated before it.

continue_node<continue_msg> ***node;

void BuildGraph( graph &g ) {
  value[M-1][N-1] = 0;
  for( int i=MB; --i>=0; )
    for( int j=NB; --j>=0; ) {
      node[i][j] =
        new continue_node<continue_msg>( g,
                         [=]( const continue_msg& ) {
                           int start_i = i*B;
                           int end_i = (i*B+B > M) ? M : i*B+B;
                           int start_j = j*B;
                           int end_j = (j*B+B > N) ? N : j*B+B;
                           for ( int ii = start_i; ii < end_i; ++ii ) {
                             for ( int jj = start_j; jj < end_j; ++jj ) {
                               double v0 = ii == 0 ? 0 : value[ii-1][jj];
                               double v1 = jj == 0 ? 0 : value[ii][jj-1];
                               value[ii][jj] = ii==0 && jj==0 ? 1 : calc(v0,v1);
                         } );
      if ( i + 1 < MB ) make_edge( *node[i][j], *node[i+1][j] );
      if ( j + 1 < NB ) make_edge( *node[i][j], *node[i][j+1] );

The function EvaluateGraph executes the flow graph. It does this by putting a continue_msg to the top-left element, and then waiting for the activity in the graph to stop. When the call to g.wait_for_all() returns, all of the nodes have been evaluated and the final result produced.

double EvaluateGraph( graph &g ) {
  return value[M-1][N-1];

Since we create a matrix of continue_node objects, we also have to delete them:

void CleanupGraph() {
  for( int i=0; i<MB; ++i )
    for( int j=0; j<NB; ++j )
     delete node[i][j];

Finally, the main function shown below invokes these functions to build, evaluate and clean up the flow graph.

int main(int argc, char *argv[]) {
  value = new double *[M];
  for ( int i = 0; i < M; ++i ) value[i] = new double [N];

  node = new continue_node<continue_msg> **[MB];
  for ( int i = 0; i < MB; ++i ) node[i] = new continue_node<continue_msg> *[NB];

  graph g;
  double result = EvaluateGraph(g);
  printf("%g\n", result);

  for ( int i = 0; i < M; ++i ) delete [] value[i];
  for ( int i = 0; i < MB; ++i ) delete [] node[i];
  delete [] value;
  delete [] node;

  return 0;

I hope that this example demonstrates that a flow graph can be used to easily express a depedency graph. The basic steps are (1) create a set of continue_node objects, (2) connect these nodes together using calls to make_edge, and (3) start the execution by sending a continue_msg to any nodes that do not have predecessors.

If you are interested in learning more about the Intel® Threading Building Blocks ( Intel® TBB ) flow graph, please check out the other blog articles at /en-us/blogs/tag/flow_graph or visit

For more complete information about compiler optimizations, see our Optimization Notice.


anonymous's picture

Great post of flow graph !
In the second paragraph "Once it receives P messages...", does 'P' mean number of its predecessor nodes ?


anonymous's picture

good post....

Add a Comment

Have a technical question? Visit our forums. Have site or software product issues? Contact support.