Intel® Threading Building Blocks Release Notes and New Features

2018

Update 1

Release Notes

 

The updated version (Open Source release only) contains these additions:

  • lambda-friendly overloads for parallel_scan.
  • support of static and simple partitioners in parallel_deterministic_reduce.

We also introduced a few preview features:

  • initial Flow Graph Analyzer support for parallel_for.
  • reservation support in overwrite_node and write_once_node.

Bugs fixed

  • Fixed a potential deadlock scenario in the flow graph that affected Intel® TBB 2018 Initial Release.

Initial Release

Release Notes

 

Intel® Threading Building Blocks (Intel® TBB), one of the best-known C++ threading libraries, was recently updated to release 2018. The updated version contains several key new features compared to the previous 2017 Update 7 release (https://software.intel.com/en-us/articles/whats-new-intel-threading-building-blocks-2017-update-7).

Licensing

Intel® TBB outbound license for commercial support is Intel Simplified Software License: https://software.intel.com/en-us/license/intel-simplified-software-license. The license for open source distribution has not changed.

Tasks

Intel® TBB now fully supports the this_task_arena::isolate() function. In addition, the this_task_arena::isolate() function and the task_arena::execute() method were extended to pass on the value returned by the executed functor (this feature requires C++11). The task_arena::enqueue() and task_group::run() methods were extended to accept move-only functors.

Flow Graph

A flow graph now spawns all tasks into the same task arena, and waiting for graph completion also happens in that arena.

There are some changes affecting backward compatibility:

  • Internal layout changes in some flow graph classes.

  • Several undocumented methods were removed from the graph class, including set_active() and is_active().

  • Due to incompatible changes, the namespace version is updated for the flow graph; recompilation is recommended for all binaries that use the flow graph classes.

We also introduced a few preview features:

  • opencl_node can be used with any graph object; class opencl_graph is removed.

  • graph::wait_for_all() now automatically waits for all not yet consumed async_msg objects.

Flow Graph Analyzer (FGA) is available as a technology preview in Intel® Parallel Studio XE 2018 and as a feature of Intel® Advisor (https://software.intel.com/en-us/articles/getting-started-with-flow-graph-analyzer). Support for the FGA tool in async_node, opencl_node and composite_node has been improved.

Introduction of Parallel STL

Parallel STL, an implementation of the C++ standard library algorithms with support for execution policies, has been introduced. Parallel STL relies on Intel® TBB underneath. For more information, see Getting Started with Parallel STL (https://software.intel.com/en-us/get-started-with-pstl).

Additional support for Android*, UWP, macOS

  • Added support for Android* NDK r15, r15b.
  • Added support for Universal Windows Platform.
  • Increased minimally supported version of macOS* (MACOSX_DEPLOYMENT_TARGET) to 10.11.

Bugs fixed

  • Fixed a bug preventing use of streaming_node and opencl_node with Clang; inspired by a contribution from Francisco Facioni.
  • Fixed this_task_arena::isolate() function to work correctly with parallel_invoke and parallel_do algorithms.
  • Fixed a memory leak in composite_node.
  • Fixed an assertion failure in debug tbbmalloc binaries when TBBMALLOC_CLEAN_ALL_BUFFERS is used.

Downloads

You can download the latest Intel® TBB version from http://threadingbuildingblocks.org and https://software.intel.com/en-us/intel-tbb.

In addition, Intel® TBB can be installed from YUM and APT repositories.

Improved insights in Intel® VTune™ Amplifier 2018

Intel® VTune™ Amplifier 2018 (https://software.intel.com/en-us/vtune-amplifier-help) provides improved insight into parallelism inefficiencies for applications using Intel® Threading Building Blocks (Intel® TBB), with extended classification of high overhead and spin time: https://software.intel.com/en-us/articles/overhead-and-spin-time-issue-in-intel-threading-building-blocks-applications-due-to

CMake support

CMake support in Intel® TBB (https://github.com/01org/tbb/tree/tbb_2018/cmake) has been introduced as well.

Samples

All examples for the commercial version of the library were moved online: https://software.intel.com/en-us/product-code-samples. Examples are available as a standalone package or as part of the Intel® Parallel Studio XE or Intel® System Studio Online Samples packages.

Documentation

The following documentation for Intel® TBB is available:

2017

Update 8

Release Notes

Bugs fixed

  • Fixed an assertion failure in debug tbbmalloc binaries (commercial and Open Source releases) when TBBMALLOC_CLEAN_ALL_BUFFERS is used.

Update 7

Release Notes

 

The updated version contains a new bug fix compared to the previous Intel® Threading Building Blocks (Intel® TBB) 2017 Update 6 release. Information about the new features of the previous release can be found at the following link.

Added functionality:
  • In huge pages mode, the memory allocator is now also able to use transparent huge pages.
Preview Features:
  • Added support for Intel TBB integration into CMake-aware projects, with valuable guidance and feedback provided by Brad King (Kitware).
Bugs fixed:
  • Fixed scalable_allocation_command(TBBMALLOC_CLEAN_ALL_BUFFERS, 0) to process memory left after exited threads.

Intel TBB 2017 Update 7 is an open-source-only release; you can download it from https://github.com/01org/tbb/releases.

 

Update 6

Release Notes

 

The updated version contains several bug fixes compared to the previous Intel® Threading Building Blocks (Intel® TBB) 2017 Update 5 release. Information about the new features of the previous release can be found at the following link.

Added functionality:
  • Added support for Android* NDK r14.
Preview Features:
  • Added a blocking terminate extension to the task_scheduler_init class that allows an object to wait for termination of worker threads.
Bugs fixed:

Intel TBB is now available to install from YUM and APT repositories.

In addition, you can download the latest Intel TBB open source version from https://github.com/01org/tbb/releases.

 

Update 5

Release Notes

 

The updated version contains several bug fixes compared to the previous Intel® Threading Building Blocks (Intel® TBB) 2017 Update 4 release. Information about the new features of the previous release can be found at the following link.

Added functionality:
  • Added support for Microsoft* Visual Studio* 2017.
  • Added graph/matmult example to demonstrate support for compute offload to Intel(R) Graphics Technology in the flow graph API.
  • The "compiler" build option now allows specifying a full path to the compiler.
Changes affecting backward compatibility:
  • Constructors for many classes, including graph nodes, concurrent containers, thread-local containers, etc., are declared explicit and cannot be used for implicit conversions anymore.
Bugs fixed:
  • Added a workaround for bug 16657 in the GNU C Library (glibc) affecting the debug version of tbb::mutex.
  • Fixed a crash in pool_identify() called for an object allocated in another thread.

 

Intel TBB 2017 U5 is available as a part of Intel(R) Parallel Studio XE 2018 Beta and is installed with Parallel STL, an implementation of the C++ standard library algorithms with support for execution policies. For more information about Parallel STL, see Getting Started and Release Notes.

In addition, you can download the latest Intel TBB open source version from https://github.com/01org/tbb/releases.

Update 4

Release Notes

 

The updated version contains several bug fixes compared to the previous Intel® Threading Building Blocks (Intel® TBB) 2017 Update 3 release. Information about the new features of previous releases can be found at the following links.

Added functionality:
  • Added support for C++11 move semantics in parallel_do.
  • Added support for FreeBSD* 11.
Changes affecting backward compatibility:
  • The minimum compiler versions required for support of C++11 move semantics were raised to GCC 4.5, VS 2012, and Intel(R) C++ Compiler 14.0.
Bugs fixed:
  • The workaround for crashes in the library compiled with GCC 6 (-flifetime-dse=1) was extended to Windows*.

 

You can download the latest Intel TBB version from http://threadingbuildingblocks.org and https://software.intel.com/en-us/articles/intel-tbb.

Update 3

Release Notes

 

Changes (w.r.t. Intel TBB 2017 Update 2):

- Added support for Android* 7.0 and Android* NDK r13, r13b.

Preview Features:

- Added template class gfx_factory to the flow graph API. It implements
the Factory concept for streaming_node to offload computations to
Intel(R) processor graphics.

Bugs fixed:

- Fixed a possible deadlock caused by missed wakeup signals in
task_arena::execute().

Heterogeneous TBB (flow graph promotion):

- TBB flow graph: using streaming_node

Update 2

Release Notes

 

The updated version contains several bug fixes compared to the previous Intel® Threading Building Blocks (Intel® TBB) 2017 release. Information about the new features of previous releases can be found at the following links.

Obsolete

Removed the long-outdated support for Xbox* consoles.

Bugs fixed:
  • Fixed the issue with task_arena::execute() not being processed when the calling thread cannot join the arena.
  • Fixed dynamic memory allocation replacement failure on macOS* 10.12.
  • Fixed dynamic memory allocation replacement failures on Windows* 10 Anniversary Update.
  • Fixed emplace() method of concurrent unordered containers to not require a copy constructor.

You can download the latest Intel TBB version from http://threadingbuildingblocks.org and https://software.intel.com/en-us/articles/intel-tbb.

Update 1

Release Notes

 

Changes (w.r.t. Intel TBB 2017):

Bugs fixed:

- Fixed dynamic memory allocation replacement failures on Windows* 10
Anniversary Update.
- Fixed emplace() method of concurrent unordered containers not to
require a copy constructor.

Initial Release

Release Notes

 

Intel® Threading Building Blocks (Intel® TBB), one of the best-known C++ threading libraries, was recently updated to release 2017. The updated version contains several key new features compared to the previous 4.4 release. Some of them were already released in Intel® TBB 4.4 updates.

Licensing

Like Intel® TBB 2.0, Intel® TBB 2017 brings technical improvements and also becomes more open: with the switch to an Apache* 2.0 license, it should take root in more environments while continuing to simplify effective use of multicore hardware.

Parallel algorithms static_partitioner

Intel® TBB 2017 has expanded the set of partitioners with tbb::static_partitioner. It can be used in tbb::parallel_for and tbb::parallel_reduce to split the work uniformly among workers. The work is initially split into chunks of approximately equal size. The number of chunks is determined at runtime to minimize the overhead of work splitting while providing enough tasks for the available workers. Whether these chunks may be further split is unspecified. This reduces the overhead involved when the work is originally well-balanced. However, it limits available parallelism and, therefore, might result in performance loss for non-balanced workloads.

Tasks

Added the tbb::task_arena::max_concurrency() method, which returns the maximum number of threads that can work inside an arena. The amount of concurrency reserved for application threads at tbb::task_arena construction can be set to any value between 0 and the arena concurrency limit.

The tbb::this_task_arena namespace collects functionality for the arena where the current task is being executed. It has been extended with new functionality:

  • In previous releases, the static method tbb::task_arena::current_thread_index() was used to get the current thread slot index in the current arena. That method is now deprecated; the functionality has moved to tbb::this_task_arena, so use the tbb::this_task_arena::current_thread_index() function instead.
  • Added this_task_arena::max_concurrency(), which returns the maximum number of threads that can work in the current arena.
  • (Preview Feature) Use the tbb::this_task_arena::isolate() function to isolate execution of a group of tasks or an algorithm from other tasks submitted to the scheduler.
Memory Allocation

Improved dynamic memory allocation replacement on Windows* OS to skip DLLs for which replacement cannot be done, instead of aborting.

For 64-bit platforms, quadrupled the worst-case limit on the amount of memory the Intel® TBB allocator can handle.

Intel® TBB no longer performs dynamic replacement of memory allocation functions for Microsoft Visual Studio 2008 and earlier versions.

Flow Graph async_node

 

Now it’s a fully supported feature.

 

The tbb::flow::async_node is re-implemented using the tbb::flow::multifunction_node template. This makes it possible to specify a concurrency level for the node.

The class template tbb::flow::async_node allows users to coordinate with an activity that is serviced outside of the Intel® TBB thread pool. If your flow graph application needs to communicate with a separate thread, runtime or device, tbb::flow::async_node might be helpful. It has interfaces to commit results back, maintaining two-way asynchronous communication between an Intel® TBB flow graph and an external computing entity. The tbb::flow::async_node class was a preview feature in Intel® TBB 4.4.

 

async_msg

In Intel TBB 4.4 Update 3, a special tbb::flow::async_msg message type was introduced to support communication between the flow graph and external asynchronous activities.

opencl_node

Streaming workloads to external computing devices has been significantly reworked in Intel® TBB 2017 and is introduced as a preview feature. The Intel® TBB flow graph can now be used as a composability layer for heterogeneous computing.

A class template tbb::flow::streaming_node was added to the flow graph API. It allows a flow graph to offload computations to other devices through streaming or offloading APIs. The "streaming" concept uses several abstractions: a StreamFactory that produces instances of computational environments, a kernel that encapsulates a computing routine, and a device_selector that accesses a particular device.

The following example shows a simple OpenCL* kernel invocation.

File sqr.cl

__kernel
void Sqr( __global float *b2, __global float *b3   )
{
    const int index = get_global_id(0);
    b3[index] = b2[index]*b2[index];
}

File opencl_test.cpp

#define TBB_PREVIEW_FLOW_GRAPH_NODES 1
#define TBB_PREVIEW_FLOW_GRAPH_FEATURES 1

#include <cmath>
#include <vector>
#include "tbb/flow_graph_opencl_node.h"
using namespace tbb::flow;

bool opencl_test() {
    opencl_graph g;
    const int N = 1 * 1024 * 1024;
    opencl_buffer<cl_float> b2( g, N ), b3( g, N );
    std::vector<float> v2( N ), v3( N );

    auto i2 = b2.access();
    for ( int i = 0; i < N; ++i ) {
        i2[i] = v2[i] = float( i );
    }
    // Create an OpenCL program
    opencl_program<> p( g, PathToFile("sqr.cl") );
    // Create an OpenCL computation node with kernel "Sqr"
    opencl_node< tuple<opencl_buffer<cl_float>, opencl_buffer<cl_float>> > k2( g, p.get_kernel( "Sqr" ) );
    // define the iteration range
    k2.set_range( {{ N },{ 16 }} );
    // pass the input and output buffers to the node
    k2.try_put( std::tie( b2, b3 ) );
    // run the flow graph computations
    g.wait_for_all();

    // validation
    auto o3 = b3.access();
    bool comp_result = true;
    for ( int i = 0; i < N; ++i ) {
        comp_result = comp_result && ( std::abs( o3[i] - v2[i] * v2[i] ) < 0.1e-7 );
    }
    return comp_result;
}

 

Some other improvements in the Intel® TBB flow graph

  • Removed a few cases of excessive user data copying in the flow graph.
  • Reworked tbb::flow::split_node to eliminate unnecessary overheads.

Important note: Internal layout of some flow graph nodes has changed; recompilation is recommended for all binaries that use the flow graph.

 

Python

Intel® TBB 2017 introduces an experimental Python* module which unlocks additional performance for multi-threaded Python programs by enabling threading composability between two or more thread-enabled libraries.

Threading composability can accelerate programs by avoiding inefficient thread allocation (oversubscription), which occurs when there are more software threads than available hardware resources.

The biggest improvement is achieved when a task pool like ThreadPool from the standard library, or libraries like Dask or Joblib (used in multi-threading mode), execute tasks that call compute-intensive functions of NumPy/SciPy/pyDAAL, which in turn are parallelized using Intel® Math Kernel Library (Intel® MKL) and/or Intel® TBB.

The module implements a Pool class with the standard interface on top of Intel® TBB, which can be used to replace Python's ThreadPool. Thanks to the monkey-patching technique implemented in the Monkey class, no source code changes are needed to unlock additional speedups.

For more details, see: Unleash parallel performance of Python programs

Miscellaneous
  • Added TBB_USE_GLIBCXX_VERSION macro to specify the version of GNU libstdc++ when it cannot be properly recognized, e.g. when used with Clang on Linux* OS.
  • Added support for C++11 move semantics to the argument of tbb::parallel_do_feeder::add() method.
  • Added C++11 move constructor and assignment operator to tbb::combinable class template.

 

Samples

All examples for the commercial version of the library were moved online: https://software.intel.com/en-us/product-code-samples. Examples are available as a standalone package or as part of the Intel(R) Parallel Studio XE or Intel(R) System Studio Online Samples packages.

Added the graph/stereo example to demonstrate tbb::flow::async_msg and tbb::flow::opencl_node.

 

You can download the latest Intel TBB version from http://threadingbuildingblocks.org and https://software.intel.com/en-us/articles/intel-tbb.

4.4

Update 6

Release Notes

 

Changes (w.r.t. Intel TBB 4.4 Update 5):

- For 64-bit platforms, quadrupled the worst-case limit on the amount
of memory the Intel TBB allocator can handle.

Bugs fixed:

- Fixed a memory corruption in the memory allocator when it meets
internal limits.
- Fixed the memory allocator on 64-bit platforms to align memory
to 16 bytes by default for all allocations bigger than 8 bytes.
- Fixed parallel_scan to provide correct result if the initial value
of an accumulator is not the operation identity value.
- As a workaround for crashes in the Intel TBB library compiled with
GCC 6, added -flifetime-dse=1 to compilation options on Linux* OS.

Update 5

Release Notes

 

Changes (w.r.t. Intel TBB 4.4 Update 4):

- Modified graph/fgbzip2 example to remove unnecessary data queuing.

Preview Features:

- Added a Python* module which is able to replace Python's thread pool
class with the implementation based on Intel TBB task scheduler.

Bugs fixed:

- Fixed the implementation of 64-bit tbb::atomic for IA-32 architecture
to work correctly with GCC 5.2 in C++11/14 mode.
- Fixed a possible crash when tasks with affinity (e.g. specified via
affinity_partitioner) are used simultaneously with task priority
changes.

You can download Intel TBB 4.4 update 5 from open source site.

Update 4

Release Notes

 

Changes (w.r.t. Intel TBB 4.4 Update 3):

- Removed a few cases of excessive user data copying in the flow graph.
- Improved robustness of concurrent_bounded_queue::abort() in case of
simultaneous push and pop operations.

Preview Features:

- Added tbb::flow::async_msg, a special message type to support
communications between the flow graph and external asynchronous
activities.
- async_node modified to support use with C++03 compilers.

Bugs fixed:

- Fixed a bug in dynamic memory allocation replacement for Windows* OS.
- Fixed excessive memory consumption on Linux* OS caused by enabling
zero-copy realloc.
- Fixed performance regression on Intel(R) Xeon Phi(tm) coprocessor with
auto_partitioner.

Update 3

Release Notes

 

Changes (w.r.t. Intel TBB 4.4 Update 2):

- Modified parallel_sort to not require a default constructor for values
and to use iter_swap() for value swapping.
- Added support for creating or initializing a task_arena instance that
is connected to the arena currently used by the thread.
- graph/binpack example modified to use multifunction_node.
- For performance analysis, use Intel(R) VTune(TM) Amplifier XE 2015
and higher; older versions are no longer supported.
- Improved support for compilation with disabled RTTI, by omitting its use
in auxiliary code, such as assertions. However some functionality,
particularly the flow graph, does not work if RTTI is disabled.
- The tachyon example for Android* can be built using Android Studio 1.5
and higher with experimental Gradle plugin 0.4.0.

Preview Features:

- Added class opencl_subbuffer that allows using OpenCL* sub-buffer
objects with opencl_node.
- Class global_control supports the value of 1 for
max_allowed_parallelism.

Bugs fixed:

- Fixed a race causing "TBB Warning: setaffinity syscall failed" message.
- Fixed a compilation issue on OS X* with Intel(R) C++ Compiler 15.0.
- Fixed a bug in queuing_rw_mutex::downgrade() that could temporarily
block new readers.
- Fixed speculative_spin_rw_mutex to stop using the lazy subscription
technique due to its known flaws.
- Fixed memory leaks in the tool support code.

Update 2

Release Notes

 

Changes (w.r.t. Intel TBB 4.4 Update 1):

- Improved interoperability with Intel(R) OpenMP RTL (libiomp) on Linux:
OpenMP affinity settings do not affect the default number of threads
used in the task scheduler. Intel(R) C++ Compiler 16.0 Update 1
or later is required.
- Added a new flow graph example with different implementations of the
Cholesky Factorization algorithm.

Preview Features:

- Added template class opencl_node to the flow graph API. It allows a
flow graph to offload computations to OpenCL* devices.
- Extended join_node to use type-specified message keys. It simplifies
the API of the node by obtaining message keys via functions
associated with the message type (instead of node ports).
- Added static_partitioner that minimizes overhead of parallel_for and
parallel_reduce for well-balanced workloads.
- Improved template class async_node in the flow graph API to support
user settable concurrency limits.

Bugs fixed:

- Fixed a possible crash in the GUI layer for library examples on Linux.

 

Update 1

Release Notes

 

Changes (w.r.t. Intel TBB 4.4):

- Added support for Microsoft* Visual Studio* 2015.
- Intel TBB no longer performs dynamic replacement of memory allocation
functions for Microsoft Visual Studio 2005 and earlier versions.
- For GCC 4.7 and higher, the intrinsics-based platform isolation layer
uses __atomic_* built-ins instead of the legacy __sync_* ones.
This change is inspired by a contribution from Mathieu Malaterre.
- Improvements in task_arena:
Several application threads may join a task_arena and execute tasks
simultaneously. The amount of concurrency reserved for application
threads at task_arena construction can be set to any value between
0 and the arena concurrency limit.
- The fractal example was modified to demonstrate class task_arena
and moved to examples/task_arena/fractal.

Bugs fixed:

- Fixed a deadlock during destruction of task_scheduler_init objects
when one of destructors is set to wait for worker threads.
- Added a workaround for a possible crash on OS X* when dynamic memory
allocator replacement (libtbbmalloc_proxy) is used and memory is
released during application startup.
- Usage of mutable functors with task_group::run_and_wait() and
task_arena::enqueue() is disabled. An attempt to pass a functor
whose operator()() is not const will produce compilation errors.
- Makefiles and environment scripts now properly recognize GCC 5.0 and
higher.

Open-source contributions integrated:

- Improved performance of parallel_for_each for inputs allowing random
access, by Raf Schietekat.

Initial Release

Release Notes

 

Intel® Threading Building Blocks (Intel® TBB), one of the best-known C++ threading libraries, was recently updated to release 4.4. The updated version contains several key new features compared to the previous 4.3 release. Some of them were already released in Intel TBB 4.3 updates.

Some highlights of the new release:

  • Global control for better resource management (parallelism level and thread stack size)
  • New flow graph node types: composite_node and async_node. The flow graph also has improved "reset" functionality.
  • More C++11 features are utilized to improve performance.

Global control

Many use cases require controlling the number of worker threads in the application. Intel TBB allows users to do that via arguments to a tbb::task_scheduler_init object:

tbb::task_scheduler_init my_scheduler(8);

However, an application may have multiple plugins or master threads, each using Intel TBB, and tbb::task_scheduler_init can be instantiated multiple times. In such cases, limiting the actual number of worker threads may become complicated.

To solve this problem, Intel TBB introduced the tbb::global_control class. Instantiating an object of this class with the global_control::max_allowed_parallelism parameter limits the number of active worker threads. The main difference from tbb::task_scheduler_init is that the limit is application-wide: even if Intel TBB is initialized in multiple parts of the application, e.g. with tbb::task_scheduler_init objects created in different master threads, the total number of running TBB threads is limited to the specified value.

#include "tbb/parallel_for.h"
#include "tbb/task_scheduler_init.h"
#define TBB_PREVIEW_GLOBAL_CONTROL 1
#include "tbb/global_control.h"

using namespace tbb;

void foo()
{
    // The following code could use up to 16 threads.
    task_scheduler_init tsi(16);
    parallel_for( . . . );
}

void bar()
{
    // The following code could use up to 8 threads.
    task_scheduler_init tsi(8);
    parallel_for( . . . );
}

int main()
{
    {
        const size_t parallelism = task_scheduler_init::default_num_threads();
        // total parallelism that TBB can utilize is cut in half for the dynamic extension
        // of the given scope, including calls to foo() and bar()
        global_control c(global_control::max_allowed_parallelism, parallelism/2);
        foo();
        bar();
    } // restore previous parallelism limitation, if one existed
}

In this example, the functions foo() and bar() initialize the TBB task scheduler locally. However, the global_control object in main() sets the upper limit for the total number of active threads. If we used another task_scheduler_init object in main() instead of the global_control, re-initializing TBB in foo() and bar() wouldn't happen, since the main thread would already have an active task_scheduler_init object. The local settings would therefore be ignored, and foo() and bar() would use the same number of threads as was specified in main(). With a global_control object, a maximum can be enforced while local control within that maximum is retained.

global_control objects can be nested. A new instance can lower the thread limit (it cannot raise it), and once the instance goes out of scope, the previous settings are restored.

tbb::global_control is a preview feature in Intel TBB 4.4. The class also allows limiting the thread stack size via the thread_stack_size parameter.

 

Flow Graph

composite_node

The Intel TBB flow graph was extended with new node types. tbb::flow::composite_node can package any number of other nodes. Large applications with many nodes can be structured better, since composite_node can represent big functional blocks with defined interfaces (inputs and outputs).

This example shows the use of composite_node to encapsulate two flow graph nodes (a join_node and a function_node). The example demonstrates that the sum of the first n positive odd numbers equals n squared.

A class adder is defined. This class has a join_node j with two input ports and a function_node f. j receives a number at each of its input ports and sends a tuple of these numbers to f which adds the numbers. To encapsulate these two nodes, the adder inherits from a composite_node type with two input ports and one output port to match the two input ports of j and the one output port of f.

A split_node s is created to serve as the source of the positive odd numbers. The first four positive odd numbers 1, 3, 5 and 7 are used. Three adders a0, a1 and a2 are created. The first adder a0 receives 1 and 3 from the split_node. These are added and the sum forwarded to a1. The second adder a1 receives the sum of 1 and 3 on one input port and receives 5 on the other input port from the split_node. These are also added and the sum forwarded to a2. Likewise, the third adder a2 receives the sum of 1, 3 and 5 on one input port and receives 7 on the other input port from the split_node. Each adder reports the sum it computes which is the square of the count of numbers accumulated when that adder is reached in the graph.

#include "tbb/flow_graph.h"
#include <iostream>
#include <tuple>
using namespace tbb::flow;

class adder : public  composite_node<  tuple< int, int >,  tuple< int > > {
    join_node<  tuple< int, int >,  queueing > j;
    function_node<  tuple< int, int >, int > f;
    typedef  composite_node<  tuple< int, int >,  tuple< int > > base_type;

    struct f_body {
        int operator()( const  tuple< int, int > &t ) {
            int n = (get<1>(t)+1)/2;
            int sum = get<0>(t) + get<1>(t);
            std::cout << "Sum of the first " << n <<" positive odd numbers is  " << n <<" squared: "  << sum << std::endl; 
            return  sum;
        }
    };

public:
    adder( graph &g) : base_type(g), j(g), f(g,  unlimited, f_body() ) {
        make_edge( j, f );
        base_type::input_ports_type input_tuple(input_port<0>(j), input_port<1>(j));
        base_type::output_ports_type output_tuple(f);
        base_type::set_external_ports(input_tuple, output_tuple); 
    }
};

int main() {
    graph g;
    split_node< tuple<int, int, int, int> > s(g);
    adder a0(g);
    adder a1(g);
    adder a2(g);
  
    make_edge(output_port<0>(s), input_port<0>(a0));
    make_edge(output_port<1>(s), input_port<1>(a0));

    make_edge(output_port<0>(a0),input_port<0>(a1));
    make_edge(output_port<2>(s), input_port<1>(a1));

    make_edge(output_port<0>(a1), input_port<0>(a2));
    make_edge(output_port<3>(s), input_port<1>(a2));

    s.try_put(std::make_tuple(1,3,5,7));
    g.wait_for_all();
    return 0;
}

async_node

The template class async_node allows users to coordinate with an activity that is serviced outside of the TBB thread pool. If your flow graph application needs to communicate with a separate thread, runtime or device, async_node might be helpful. It has interfaces to commit results back, maintaining two-way asynchronous communication between a TBB flow graph and an external computing entity. The async_node class is a preview feature in Intel TBB 4.4.

 

Resetting flow graph

You can now reset the Intel TBB flow graph state after an "unclean shutdown", e.g. an exception thrown or an explicit graph cancellation. Call tbb::flow::graph::reset(reset_flags f) to perform a cleanup:

  • Removal of all edges of a graph (using reset(rf_clear_edges)).
  • Reset of all function bodies of a graph (using reset(rf_reset_bodies)).

Additionally, the following operations on a flow graph node are available as preview functionality:

  • Extraction of an individual node from a flow graph.
  • Retrieval of the number of predecessors and successors of a node.
  • Retrieval of a copy of all predecessors and successors of a node.

C++11

C++11 move operations help avoid unnecessary data copies. Intel TBB 4.4 adds move-aware insert and emplace methods to the concurrent_unordered_map and concurrent_hash_map containers. concurrent_vector::shrink_to_fit was optimized for types that support C++11 move semantics.

The tbb::enumerable_thread_specific container gained a move constructor and a move assignment operator. Thread-local values can now be constructed from an arbitrary number of arguments via a constructor that uses variadic templates.

The tbb/compat/thread header was updated to automatically include the C++11 <thread> header where available. Exact exception propagation is enabled for the Intel C++ Compiler on OS X*.

 

You can download the latest Intel TBB version from http://threadingbuildingblocks.org and https://software.intel.com/en-us/articles/intel-tbb.

For more complete information about compiler optimizations, see our Optimization Notice.