What's New? Intel® Threading Building Blocks 4.4

Intel® Threading Building Blocks (Intel® TBB), one of the best-known C++ threading libraries, was recently updated to release 4.4. The updated version contains several key new features compared to the previous 4.3 release; some of them first appeared in Intel TBB 4.3 updates.

Some highlights of the new release:

  • Global control for better resource management (parallelism level and thread stack size)
  • New Flow Graph node types: composite_node and async_node. The flow graph also has improved “reset” functionality.
  • More C++11 features are utilized to improve performance.

Global control

Many use cases require controlling the number of worker threads in an application. Intel TBB lets users do that via an argument to a tbb::task_scheduler_init object:

tbb::task_scheduler_init my_scheduler(8);

However, an application may have multiple plugins or master threads, each using Intel TBB, and tbb::task_scheduler_init can be instantiated multiple times. In such cases, limiting the actual number of worker threads becomes complicated.

To solve this problem, Intel TBB introduces the tbb::global_control class. Instantiating an object of this class with the global_control::max_allowed_parallelism parameter limits the number of active worker threads. The main difference from tbb::task_scheduler_init is that the limit is application-wide: even if Intel TBB is initialized in multiple parts of the application, or tbb::task_scheduler_init objects are created in different master threads, the total number of running TBB threads stays within the specified value.

#include "tbb/parallel_for.h"
#include "tbb/task_scheduler_init.h"
#define TBB_PREVIEW_GLOBAL_CONTROL 1
#include "tbb/global_control.h"

using namespace tbb;

void foo()
{
    // The following code could use up to 16 threads.
    task_scheduler_init tsi(16);
    parallel_for( . . . );
}

void bar()
{
    // The following code could use up to 8 threads.
    task_scheduler_init tsi(8);
    parallel_for( . . . );
}

int main()
{
    {
        const size_t parallelism = task_scheduler_init::default_num_threads();
        // total parallelism that TBB can utilize is cut in half for the dynamic extension
        // of the given scope, including calls to foo() and bar()
        global_control c(global_control::max_allowed_parallelism, parallelism/2);
        foo();
        bar();
    } // restore previous parallelism limitation, if one existed
}

In this example, functions foo() and bar() initialize the TBB task scheduler locally, but the global_control object in main() sets an upper limit on the total number of active threads. If we used another task_scheduler_init object in main() instead of global_control, the re-initialization of TBB in foo() and bar() would not happen, since the main thread would already have an active task_scheduler_init object; the local settings would be ignored, and foo() and bar() would use the number of threads specified in main(). With a global_control object, a maximum can be enforced while local control within that maximum is retained.

global_control objects can be nested. A new instance can override the thread limit with a lower value, but cannot raise it above the limit set by an enclosing instance. Once an instance goes out of scope, the previous settings are restored.

tbb::global_control is a preview feature in Intel TBB 4.4. The class can also limit the thread stack size via the thread_stack_size parameter.


Flow Graph

composite_node

Intel TBB Flow Graph was extended with new node types. tbb::flow::composite_node can package any number of other nodes. This helps structure large applications with many nodes, since a composite_node can represent a big functional block with a defined interface (input and output ports).

This example shows the use of composite_node to encapsulate two flow graph nodes (a join_node and a function_node). It demonstrates the fact that the sum of the first n positive odd numbers equals n squared.

A class adder is defined. It contains a join_node j with two input ports and a function_node f. j receives a number at each of its input ports and sends a tuple of these numbers to f, which adds them. To encapsulate these two nodes, adder inherits from a composite_node type with two input ports and one output port, matching the two input ports of j and the one output port of f.

A split_node s is created to serve as the source of the positive odd numbers. The first four positive odd numbers 1, 3, 5 and 7 are used. Three adders a0, a1 and a2 are created. The first adder a0 receives 1 and 3 from the split_node. These are added and the sum forwarded to a1. The second adder a1 receives the sum of 1 and 3 on one input port and receives 5 on the other input port from the split_node. These are also added and the sum forwarded to a2. Likewise, the third adder a2 receives the sum of 1, 3 and 5 on one input port and receives 7 on the other input port from the split_node. Each adder reports the sum it computes which is the square of the count of numbers accumulated when that adder is reached in the graph.

#include "tbb/flow_graph.h"
#include <iostream>
#include <tuple>
using namespace tbb::flow;

class adder : public composite_node< tuple<int, int>, tuple<int> > {
    join_node< tuple<int, int>, queueing > j;
    function_node< tuple<int, int>, int > f;
    typedef composite_node< tuple<int, int>, tuple<int> > base_type;

    struct f_body {
        int operator()( const tuple<int, int> &t ) {
            int n = (get<1>(t) + 1) / 2;
            int sum = get<0>(t) + get<1>(t);
            std::cout << "Sum of the first " << n << " positive odd numbers is " << n << " squared: " << sum << std::endl;
            return sum;
        }
    };

public:
    adder( graph &g ) : base_type(g), j(g), f(g, unlimited, f_body()) {
        make_edge( j, f );
        base_type::input_ports_type input_tuple( input_port<0>(j), input_port<1>(j) );
        base_type::output_ports_type output_tuple(f);
        base_type::set_external_ports( input_tuple, output_tuple );
    }
};

int main() {
    graph g;
    split_node< tuple<int, int, int, int> > s(g);
    adder a0(g);
    adder a1(g);
    adder a2(g);
  
    make_edge(output_port<0>(s), input_port<0>(a0));
    make_edge(output_port<1>(s), input_port<1>(a0));

    make_edge(output_port<0>(a0),input_port<0>(a1));
    make_edge(output_port<2>(s), input_port<1>(a1));

    make_edge(output_port<0>(a1), input_port<0>(a2));
    make_edge(output_port<3>(s), input_port<1>(a2));

    s.try_put(std::make_tuple(1,3,5,7));
    g.wait_for_all();
    return 0;
}

async_node

The template class async_node allows users to coordinate with an activity that is serviced outside of the TBB thread pool. If your flow graph application needs to communicate with a separate thread, runtime, or device, async_node might be helpful. It has interfaces to commit results back, maintaining two-way asynchronous communication between a TBB flow graph and an external computing entity. The async_node class is a preview feature in Intel TBB 4.4.


Resetting flow graph

You can now reset Intel TBB flow graph state after an “unclean shutdown”, e.g. a thrown exception or an explicit graph cancellation. Call tbb::flow::graph::reset(reset_flags f) to perform a cleanup:

  • Removal of all edges of a graph (using reset(rf_clear_edges)).
  • Reset of all function bodies of a graph (using reset(rf_reset_bodies)).

Additionally, the following operations on flow graph nodes are available as preview functionality:

  • Extraction of an individual node from a flow graph.
  • Retrieval of the number of predecessors and successors of a node.
  • Retrieval of a copy of all predecessors and successors of a node.

C++ 11

C++11 move operations help avoid unnecessary data copies. Intel TBB 4.4 adds move-aware insert and emplace methods to the concurrent_unordered_map and concurrent_hash_map containers. concurrent_vector::shrink_to_fit was optimized for types that support C++11 move semantics.

The tbb::enumerable_thread_specific container gains a move constructor and a move assignment operator. Thread-local values can now be constructed from an arbitrary number of arguments, via a constructor that uses variadic templates.

The tbb/compat/thread header was updated to automatically include the C++11 <thread> header where available. Exact exception propagation is now enabled for the Intel C++ Compiler on OS X*.


You can download the latest Intel TBB version from http://threadingbuildingblocks.org or https://software.intel.com/en-us/articles/intel-tbb.

For more complete information about compiler optimizations, see our Optimization Notice.