What's New? Intel® Threading Building Blocks 2017

Intel® Threading Building Blocks (Intel® TBB), one of the best-known C++ threading libraries, was recently updated to release 2017. The new version contains several key features compared to the previous 4.4 release. Some of them were already available in Intel® TBB 4.4 updates.

Licensing

Like Intel® TBB 2.0, Intel® TBB 2017 brings technical improvements and also becomes more open: it switches to an Apache* 2.0 license, which should enable it to take root in more environments while continuing to simplify effective use of multicore hardware.

Parallel algorithms

static_partitioner

Intel® TBB 2017 expands the set of partitioners with tbb::static_partitioner. It can be used in tbb::parallel_for and tbb::parallel_reduce to split the work uniformly among workers. The work is initially split into chunks of approximately equal size. The number of chunks is determined at runtime to minimize the overhead of work splitting while providing enough tasks for the available workers. Whether these chunks may be split further is unspecified. This reduces overhead when the work is well balanced to begin with. However, it limits the available parallelism and might therefore result in performance loss for unbalanced workloads.

Tasks

Added the tbb::task_arena::max_concurrency() method, which returns the maximum number of threads that can work inside an arena. The amount of concurrency reserved for application threads at tbb::task_arena construction can now be set to any value between 0 and the arena concurrency limit.

The tbb::this_task_arena namespace collects information about the arena in which the current task is executing. It has been extended with new functionality:

  • In previous releases, the static method tbb::task_arena::current_thread_index() returned the slot index of the current thread in the current arena. That method is now deprecated; use the tbb::this_task_arena::current_thread_index() function instead.
  • Added tbb::this_task_arena::max_concurrency(), which returns the maximum number of threads that can work in the current arena.
  • (Preview Feature) Use the tbb::this_task_arena::isolate() function to isolate the execution of a group of tasks or an algorithm from other tasks submitted to the scheduler.

Memory Allocation

Improved dynamic memory allocation replacement on Windows* OS to skip DLLs for which replacement cannot be done, instead of aborting.

For 64-bit platforms, quadrupled the worst-case limit on the amount of memory the Intel® TBB allocator can handle.

Intel® TBB no longer performs dynamic replacement of memory allocation functions for Microsoft Visual Studio 2008 and earlier versions.

Flow Graph

async_node

 

tbb::flow::async_node is now a fully supported feature; it was a preview feature in Intel® TBB 4.4.

The class template tbb::flow::async_node lets users coordinate with an activity that is serviced outside of the Intel® TBB thread pool. If your flow graph application needs to communicate with a separate thread, runtime, or device, tbb::flow::async_node might be helpful. It provides interfaces to commit results back to the graph, maintaining two-way asynchronous communication between an Intel® TBB flow graph and an external computing entity.

In this release, tbb::flow::async_node is re-implemented using the tbb::flow::multifunction_node template, which makes it possible to specify a concurrency level for the node.

 

async_msg

Starting with Intel® TBB 4.4 Update 3, a special message type, tbb::flow::async_msg, supports communication between the flow graph and external asynchronous activities.

opencl_node

Streaming of workloads to external computing devices has been significantly reworked in Intel® TBB 2017 and is introduced as a preview feature. The Intel® TBB flow graph can now be used as a composability layer for heterogeneous computing.

A class template tbb::flow::streaming_node was added to the flow graph API. It allows a flow graph to offload computations to other devices through streaming or offloading APIs. The “streaming” concept uses several abstractions: StreamFactory to produce instances of computational environments, kernel to encapsulate a computing routine, and device_selector to access a particular device.

The following example shows a simple OpenCL* kernel invocation.

File sqr.cl

__kernel
void Sqr( __global float *b2, __global float *b3   )
{
    const int index = get_global_id(0);
    b3[index] = b2[index]*b2[index];
}

File opencl_test.cpp

#define TBB_PREVIEW_FLOW_GRAPH_NODES 1
#define TBB_PREVIEW_FLOW_GRAPH_FEATURES 1

#include <cmath>
#include <iterator>
#include <tuple>
#include <vector>
#include "tbb/flow_graph_opencl_node.h"
using namespace tbb::flow;

bool opencl_test() {
    opencl_graph g;
    const int N = 1 * 1024 * 1024;
    opencl_buffer<float> b2( g, N ), b3( g, N );
    std::vector<float> v2( N );

    // Fill the input buffer and keep a host-side copy for validation
    auto i2 = b2.access<write_only>();
    for ( int i = 0; i < N; ++i ) {
        i2[i] = v2[i] = float( i );
    }
    // Create an OpenCL program
    // (PathToFile is a helper that is not shown in this listing;
    //  it is expected to resolve the location of sqr.cl)
    opencl_program<> p( g, PathToFile("sqr.cl") );
    // Create an OpenCL computation node with kernel "Sqr"
    opencl_node <tuple<opencl_buffer<float>, opencl_buffer<float>>> k2( g, p.get_kernel( "Sqr" ) );
    // Define the iteration range
    k2.set_range( {{ N },{ 16 }} );
    // Pass the input and output buffers to the node
    k2.try_put( std::tie( b2, b3 ) );
    // Run the flow graph computations
    g.wait_for_all();

    // Validation: compare the device results with squares computed on the host
    auto o3 = b3.access<read_only>();
    bool comp_result = true;
    for ( int i = 0; i < N; ++i ) {
        comp_result = comp_result && std::abs( o3[i] - v2[i] * v2[i] ) < 0.1e-7;
    }
    return comp_result;
}

 

Some other improvements in the Intel® TBB flow graph

  • Removed a few cases of excessive user data copying in the flow graph.
  • Reworked tbb::flow::split_node to eliminate unnecessary overheads.

Important note: Internal layout of some flow graph nodes has changed; recompilation is recommended for all binaries that use the flow graph.

 

Python

Intel® TBB 2017 adds an experimental Python module that unlocks additional performance for multi-threaded Python programs by enabling threading composability between two or more thread-enabled libraries.

Threading composability can accelerate programs by avoiding inefficient thread allocation (called oversubscription), which occurs when there are more software threads than available hardware resources.

The biggest improvement is achieved when a task pool such as the ThreadPool from the standard library, or a library such as Dask or Joblib (used in multi-threading mode), executes tasks that call compute-intensive functions of Numpy/Scipy/PyDAAL, which in turn are parallelized using Intel® Math Kernel Library (Intel® MKL) and/or Intel® TBB.

The module implements a Pool class with the standard interface on top of Intel® TBB; it can be used to replace Python’s ThreadPool. Thanks to the monkey-patching technique implemented in the class Monkey, no source code change is needed to unlock additional speedups.

For more details see: Unleash parallel performance of python programs 

Miscellaneous

  • Added the TBB_USE_GLIBCXX_VERSION macro to specify the version of GNU libstdc++ when it cannot be properly recognized, e.g. when used with Clang on Linux* OS.
  • Added support for C++11 move semantics to the argument of the tbb::parallel_do_feeder::add() method.
  • Added a C++11 move constructor and move assignment operator to the tbb::combinable class template.

 

Samples

All examples for the commercial version of the library have moved online: https://software.intel.com/en-us/product-code-samples. Examples are available as a standalone package or as part of the Intel® Parallel Studio XE or Intel® System Studio Online Samples packages.

Added the graph/stereo example to demonstrate tbb::flow::async_msg and tbb::flow::opencl_node.

 

You can download the latest Intel TBB version from http://threadingbuildingblocks.org and https://software.intel.com/en-us/articles/intel-tbb

For more complete information about compiler optimizations, see our Optimization Notice.