Using the Intel® Threading Building Blocks Graph Community Preview Feature: Creating a Simple Message Graph.

Intel® Threading Building Blocks (Intel® TBB) Version 3 Update 5 introduced the class graph as a Community Preview (CP) feature. There is an introductory post that provides an overview of the class and the nodes that can be used with it. You can download the open-source version of this release at and are encouraged to provide feedback about the graph via the forum. In a previous post, I provided an example that created a dependency graph. In this post, I describe how to make simple message graph using class graph.

This example will calculate the sum of x*x + x*x*x for all x = 1 to 10. This is a simple syntactic example only. Since each node in a graph may execute as an independent task, the granularity of each node should follow the general guidelines for tasks as described in Section 3.2.3 of the Intel® Threading Building Blocks Tutorial. But for demonstration purposes, I will use an artificial, tiny example here and inflate the time spent in each node by sleeping for 1 second after each operation.  I use the Linux function sleep for this. If you want to enter this example yourself, you'll have to use the appropriate sleep function for your system.

The basic layout of the graph that I’ll create is shown in the figure below. Each value enters through the input node. This node will broadcast the value to both squarer and cuber, which will calculate x*x and x*x*x respectively (and sleep for 1 second). The output of each of these nodes will be placed in an unbounded buffer. A tuple containing both values will be created by the join node and forwarded to summer, which will add both values to the running total (and sleep for 1 second). The squarer and cuber will allow unlimited concurreny, that is they will be allowed to process multiple values simultaneously. The final summer, which updates a shared total, will be only allowed process a single in-coming tuple at a time, eliminating the need for a lock around the shared value.

As with all Community Preview Features, the graph must be explicitly enabled. This is done by defining its macro, TBB_PREVIEW_GRAPH, before including the header file as shown below.

#include "tbb/graph.h"

This example performs three basic types of operations: square, cube and sum. The classes below define these operations and will be use as the body objects for my function_nodes.

struct square {
  int operator()(int v) {
    printf(“squaring %d\n”, v);
    return v*v;

struct cube {
  int operator()(int v) {
    printf(“cubing %d\n”, v);
    return v*v*v;

class sum {
  int &my_sum;
  sum( int &s ) : my_sum(s) {}
  int operator()( std::tuple<int,int> v ) {
    printf(“adding %d and %d to %d\n”, std::get<0>(v), std::get<1>(v), my_sum);
    my_sum += std::get<0>(v) + std::get<1>(v);
    return my_sum;

In function main, the graph is setup and then the values 1 – 10 are put into the input node. All the nodes in this example pass around values of type int. The nodes used below are all class templates and therefore can be used with any type that supports copy construction, including pointers and objects. It should be noted that values are copied as they pass between nodes, so passing around large objects should be avoided.

using namespace tbb;

int main() {
  int result = 0;

  graph g;
  broadcast_node<int> input;
  function_node<int,int> squarer( g, graph::unlimited, square() );
  buffer_node<int> square_buffer(g);
  function_node<int,int> cuber( g, graph::unlimited, cube() );
  buffer_node<int> cube_buffer(g);
  join_node<int,int> j( g );
  function_node<std::tuple<int,int>,int> summer( g, graph::serial, sum(result) );

  make_edge( input, squarer );
  make_edge( input, cuber );
  make_edge( squarer, square_buffer );
  make_edge( square_buffer, std::get<0>( j.inputs() ) );
  make_edge( cuber, cube_buffer );
  make_edge( cube_buffer, std::get<1>( j.inputs() ) );
  make_edge( j, summer );

  for (int i = 1; i <= 10; ++i)
  printf("Final result is %d\n", result);
  return 0;

Towards the top of the code above, the graph nodes are created: 1 broadcast_node, 3 function_node objects , 2 buffer_node objects, and 1 join_node. After the nodes are created, they are linked together using make_edge calls. Both the nodes and edges correspond directly to the figure presented earlier.

Once the graph has been setup, the values 1-10 are put into the input node. The wait_for_all call will block until there is no more activity in the graph. The result of the computation is then printed to stdout and should look something like:

cubing 10
squaring 1
cubing 1
squaring 2
cubing 2
cubing 3
squaring 4
squaring 3
squaring 5
cubing 5
cubing 4
squaring 6
cubing 6
squaring 7
cubing 7
squaring 8
squaring 10
adding 1 and 1000 to 0
cubing 8
squaring 9
cubing 9
adding 4 and 1 to 1001
adding 9 and 8 to 1006
adding 25 and 27 to 1023
adding 36 and 125 to 1075
adding 49 and 64 to 1236
adding 64 and 216 to 1349
adding 16 and 343 to 1629
adding 100 and 512 to 1988
adding 81 and 729 to 2600
Final result is 3410

One might note that most of the square and cube operations happened before any of the sum operations in the above run. This is an artifact of using a for loop to inject the values 1-10 into the graph. In a future example, I'll demonstrate how a source_node can be used to create a better execution ordering.

For more complete information about compiler optimizations, see our Optimization Notice.