How to make a pipeline with an Intel® Threading Building Blocks flow graph

The Intel® Threading Building Blocks (Intel® TBB) flow graph is fully supported in Intel TBB 4.0. If you are unfamiliar with the flow graph, you can read an introduction here.

A question was recently submitted about an implementation of a pipeline using a flow graph.  That question made me realize that pointing out some of the subtle differences between an Intel TBB pipeline and an Intel TBB flow graph might prevent some confusion among users.

As a running example, I will convert the pipeline example that ships with the Intel TBB distribution, examples/pipeline/square. This example uses the three-stage pipeline shown in Figure 1.



Figure 1: The three-stage pipeline used in examples/pipeline/square.

In the square example, the input filter reads blocks of strings from an input file. Each block is of a fixed size, and when a complete block has been read, it is sent to the next filter, transform. The second filter is a parallel filter that allocates an output buffer, converts each string in the input buffer to a long, squares the value, and then writes the result to the output buffer. It passes the output buffer to the final filter, output, which is a serial-in-order filter. The output filter processes buffers one at a time, in the order they were created by the input filter, and dumps the resulting squared numbers to an output file. Because the buffers are processed in order, the values written to the output file match the ordering of their corresponding values in the input file.
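To make the middle stage concrete, here is a rough sketch of what the transform filter might look like. TextSlice is the buffer type used by the shipped example, but the allocate(), begin(), end(), set_end(), and free() helpers shown here are assumptions standing in for the example's actual buffer interface, so treat this as an illustration of the filter's job rather than a copy of the shipped code.

#include <cstdio>
#include <cstdlib>
#include "tbb/pipeline.h"

// Sketch of the parallel transform filter; the TextSlice helpers are assumed.
class MyTransformFilter : public tbb::filter {
public:
    MyTransformFilter() : tbb::filter( parallel ) {}   // parallel: many buffers may be processed at once
    void* operator()( void* item ) {
        TextSlice* input = static_cast<TextSlice*>( item );
        TextSlice* out = TextSlice::allocate( 2*input->size() );  // assumed helper: new output buffer
        char* p = input->begin();
        char* q = out->begin();
        while ( p < input->end() ) {
            long x = std::strtol( p, &p, 10 );         // read the next number from the input buffer
            q += std::sprintf( q, "%ld\n", x*x );      // write its square to the output buffer
            ++p;                                       // skip the separator that strtol stopped on
        }
        out->set_end( q );    // assumed helper: record how much of the output buffer was used
        input->free();        // assumed helper: recycle the input buffer
        return out;           // the returned buffer flows on to the output filter
    }
};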

Figure 2 shows a straightforward conversion of this example to a flow graph. The input filter becomes a source_node, the second filter becomes a function_node with unlimited concurrency, and the final filter becomes a function_node with a concurrency of 1.   For the most part, the rectangles have become circles :)   But unfortunately, this implementation will both generate incorrect results and perform poorly.



Figure 2: A straightforward (but incorrect) translation to a flow graph.

First, let's address the correctness issue. The final node in our flow graph has a concurrency of 1 and so is serial; however, it operates on the buffers in first-in-first-out order. The pipeline version, on the other hand, used a serial-in-order filter, which allows it to re-establish the order in which the items were generated by the input filter. We can mimic this behavior by adding a sequencer_node, as shown in Figure 3.



Figure 3: A correct (but slow) translation to a flow graph.

Adding a sequencer_node to our flow graph requires that we assign a sequence number to each buffer as it is allocated at the input node. We must also write a function that returns that sequence number for a given buffer and provide this function to the sequencer_node.
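For example, if each buffer is stamped with a sequence number when the input node creates it, the body handed to the sequencer_node can be a small function object that simply reads that number back. The id() accessor below is an assumed addition to the example's TextSlice class, not something that exists in the original square example:

// A minimal sketch of a sequencer_node body; id() is an assumed accessor
// returning the sequence number stamped on the buffer by the input node.
struct sequencer_body {
    size_t operator()( TextSlice* slice ) {
        return slice->id();   // buffers are released to the successor in this order
    }
};

The sequencer_node buffers messages that arrive early and forwards each one only after all messages with smaller sequence numbers have been forwarded.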

The flow graph shown in Figure 3 still has a problem. The source_node input is directly attached to a function_node transform that has unlimited concurrency, which means that transform will accept everything that input sends to it. The input node will happily keep allocating new buffers, flooding the flow graph with inputs. In the best case, this will bog down the system, resulting in poor performance. In the worst case, it could cause the application to run out of memory and crash.

Such a situation is not possible in an Intel TBB pipeline because of its token-based scheduling mechanism. When a pipeline is run, the user provides a fixed number of tokens to the pipeline, and there will be at most one item per token in flight at any time. In the square example, the pipeline is run with four tokens per thread:

pipeline.run( nthreads*4 );

A flow graph does not have token-based scheduling. Because nodes in a flow graph can generate multiple outputs, join messages together, and split messages apart, it becomes difficult, if not impossible, to implement a token-based system that does not introduce the potential for deadlock. However, the flow graph does provide alternative methods for restricting memory use. One of these methods is the use of a limiter_node, as shown in Figure 4.



Figure 4: A correct and efficient translation to a flow graph.

In Figure 4, a limiter_node is inserted between the input node and the transform node. A limiter_node has a user-set threshold on the number of messages it will allow to pass through before it starts rejecting additional messages. It also has a second input port that can be used to decrement its internal count of the messages it has forwarded. If the limiter_node in Figure 4 is constructed with a threshold of nthreads*4, it will cap memory use much as the pipeline implementation does. The edge from the output node back to the limiter_node signals the node to allow additional inputs in from the input node.
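In code, the relevant pieces look like the two lines below; the complete listing appears later in this post, and g and output refer to the graph object and output node constructed there, with TextSlice* as the message type from the square example.

// Allow at most nthreads*4 buffers past this point at any one time.
tbb::flow::limiter_node< TextSlice * > limiter( g, nthreads*4 );

// Each message that reaches output signals the limiter to decrement its
// count, letting one more buffer in from the input node.
tbb::flow::make_edge( output, limiter.decrement );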

Although it is not demonstrated in the square example, a pipeline can also have a parallel input filter. In contrast, a source_node is always serial. This restriction is again due to the lack of token-based scheduling in the flow graph. The number of active copies of a pipeline’s input filter is bounded by the number of available tokens. In a flow graph there would be no such bound, and no practical way to determine an appropriate number of source_node bodies to activate concurrently. There are ways to mimic the behavior of a parallel input filter in a flow graph, but I’ll leave that discussion for a future blog post.

Below is the snippet of code from examples/pipeline/square that sets up the structure of the Intel TBB pipeline shown in Figure 1. Each of the three filters is created and then added to the pipeline. Finally, the pipeline is run with nthreads*4 tokens.

tbb::pipeline pipeline;

// Serial, in-order filter that reads blocks of text from the input file.
MyInputFilter input_filter( input_file );
pipeline.add_filter( input_filter );

// Parallel filter that squares each value in a block.
MyTransformFilter transform_filter;
pipeline.add_filter( transform_filter );

// Serial, in-order filter that writes the squared values to the output file.
MyOutputFilter output_filter( output_file );
pipeline.add_filter( output_filter );

// Run with at most nthreads*4 items in flight.
pipeline.run( nthreads*4 );
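For context, each filter in this version is a class derived from tbb::filter, and the serial-in-order behavior that the flow graph has to recreate with a sequencer_node comes from the mode passed to the tbb::filter base-class constructor. A paraphrased sketch of the input filter is shown below; ReadNextSlice() is an assumed helper standing in for the example's block-reading logic and is not part of the shipped code.

#include <cstdio>
#include "tbb/pipeline.h"

// Sketch of the serial-in-order input filter (paraphrased, not the shipped code).
class MyInputFilter : public tbb::filter {
public:
    MyInputFilter( FILE* f ) : tbb::filter( serial_in_order ), input_file(f) {}
    void* operator()( void* ) {
        TextSlice* slice = ReadNextSlice( input_file );  // assumed helper; returns NULL at end of file
        return slice;   // returning NULL tells the pipeline there is no more input
    }
private:
    FILE* input_file;
};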

Code that sets up the structure of the Intel TBB flow graph shown in Figure 4 is shown below. Each of the nodes is created, and then edges are added between them. Note that the limiter_node is passed nthreads*4 as its threshold and that a sequencer_node is placed before output. Finally, the source_node is activated, and then wait_for_all is called on the graph to wait for the work to complete.

tbb::flow::graph g;

// The limiter_node caps the number of buffers in flight at nthreads*4.
tbb::flow::limiter_node< TextSlice * > limiter( g, nthreads*4 );
// The sequencer_node re-establishes the order in which buffers were created.
tbb::flow::sequencer_node< TextSlice * > sequencer( g, sequencer_body() );

// The source_node is created inactive (false) and activated after the edges are made.
tbb::flow::source_node< TextSlice * > input( g, MyInputFilter( input_file ), false );
tbb::flow::function_node< TextSlice *, TextSlice * > transform( g, tbb::flow::unlimited, MyTransformFilter() );
tbb::flow::function_node< TextSlice *, tbb::flow::continue_msg > output( g, tbb::flow::serial, MyOutputFilter( output_file ) );

tbb::flow::make_edge( input, limiter );
tbb::flow::make_edge( limiter, transform );
tbb::flow::make_edge( transform, sequencer );
tbb::flow::make_edge( sequencer, output );
// Each message emitted by output decrements the limiter, admitting another input.
tbb::flow::make_edge( output, limiter.decrement );

input.activate();
g.wait_for_all();
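One detail worth calling out in the listing above: a pipeline input filter signals end-of-input by returning NULL, while a source_node body uses a different signature, writing its output through a reference parameter and returning false when there is nothing left to produce (the transform and output bodies similarly take and return typed messages rather than void*). A sketch of the input stage reworked as a source_node body is shown below; as before, ReadNextSlice() and set_id() are assumed helpers, with set_id() stamping the sequence number that the sequencer_node body reads back.

#include <cstdio>

// Sketch of the input stage reworked as a source_node body (assumed helpers).
class MyInputFilter {
public:
    MyInputFilter( FILE* f ) : input_file(f), next_id(0) {}
    bool operator()( TextSlice*& output ) {
        output = ReadNextSlice( input_file );   // assumed helper; NULL at end of file
        if ( output == NULL ) return false;     // false tells the source_node to stop calling the body
        output->set_id( next_id++ );            // assumed helper: sequence number used by the sequencer_node
        return true;                            // true means output holds a valid message
    }
private:
    FILE* input_file;
    size_t next_id;
};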

The flow graph code is clearly more verbose than the pipeline version, but their performance will be comparable. As described here, applications that are structured like pipelines will be more easily expressed using the Intel TBB pipeline, but the Intel TBB flow graph has a more flexible API and therefore can express applications that are not amenable to a linear pipeline.

In conclusion, there are subtle but important differences between an Intel TBB pipeline and an Intel TBB flow graph. First, a pipeline supports serial-in-order filters, while the flow graph requires a sequencer_node to get similar behavior. Second, and more importantly, a pipeline uses token-based scheduling while a flow graph does not. The flow graph API does include alternatives to token-based scheduling, such as the limiter_node discussed in this post. In an earlier post, A feature-detection example using the Intel® Threading Building Blocks flow graph, I demonstrated another alternative for managing memory use: a reserving join node. And finally, because of the lack of token-based scheduling, a source_node is always serial, and more complex methods must be used to mimic a pipeline's parallel input filter.

To learn more about other features in Intel® Threading Building Blocks 4.0, visit http://www.threadingbuildingblocks.org. To learn more about the Intel® TBB flow graph, check out the other blog articles at /en-us/blogs/tag/flow_graph/.
