I'm new to the tbb::flow graph and am trying to learn how to use the different nodes. As part of the learning process, I'm trying to work out the tbb::flow equivalent of the following program, which I originally wrote using tbb::pipeline.
The program is FERRET, a content-based image retrieval system from the PARSEC benchmark. The image below illustrates the different stages. The first and last stages are serial; the remaining stages are parallel. I've made each stage extend tbb::filter and marked it as serial/parallel as appropriate.
I then tried to implement it using tbb::flow and the different graph_nodes available. The first stage is a source_node (serial by default); the middle stages are function_nodes with queuing policy and flow::unlimited concurrency; the last stage is a function_node with 1 as the concurrency limit (serial). Here's the pseudo-code:
    /////////////////////////////////////////////
    // Create pipeline stages as tbb::flow graph
    /////////////////////////////////////////////
    tbb::flow::graph cbir;
    tbb::flow::source_node input(cbir, Read( query_dir, &cnt_enqueue ) );
    tbb::flow::function_node segmenter( cbir, tbb::flow::unlimited, SegmentImage() );
    tbb::flow::function_node extracter( cbir, tbb::flow::unlimited, ExtractFeatures() );
    tbb::flow::function_node querier( cbir, tbb::flow::unlimited, QueryIndex( ... ) );
    tbb::flow::function_node ranker( cbir, tbb::flow::unlimited, RankCandidates( ... ) );
    tbb::flow::function_node writer( cbir, 1, Write( ... ) );

    ///////////////////
    // Chain up stages
    ///////////////////
    tbb::flow::make_edge(input, segmenter);
    tbb::flow::make_edge(segmenter, extracter);
    tbb::flow::make_edge(extracter, querier);
    tbb::flow::make_edge(querier, ranker);
    tbb::flow::make_edge(ranker, writer);

    ////////////////
    // Run pipeline
    ////////////////
    cbir.wait_for_all();
When I run the tbb::flow version, it produces correct results. However, it is much slower (about 4 times slower on a 4-core machine) than the pthreads and tbb::pipeline versions. I suspect that I am doing something wrong, i.e., that I am not configuring the middle stages as parallel stages correctly. I would appreciate advice on the following:
- Is function_node with queuing policy and flow::unlimited the right graph_node to mimic the behavior of a parallel tbb::filter?
- If I am using function_node correctly, then could someone suggest other possible sources of performance bottlenecks that I should investigate?
As mentioned, I'm new to tbb::flow so it's quite possible that I'm missing something obvious.