opencl_node basic interfaces and opencl_buffer

This post continues a series of articles describing the opencl_node, a new node available in the Intel® Threading Building Blocks (Intel® TBB) library since version 4.4 Update 2. This node allows OpenCL™ devices to be more easily utilized and coordinated by an Intel TBB flow graph. The first article in this series can be found here.

opencl_node basic interfaces

The generic declaration of opencl_node is

template< typename... PortTypes, typename Policy = queueing,  typename Factory = default_opencl_factory >
class opencl_node< tuple<PortTypes...>, Policy, Factory >;

opencl_node has a matching set of input and output ports, with one port for each type specified by tuple<PortTypes...>. This tuple of port types is the only mandatory template argument.

opencl_node supports buffering policies, similar to join_node. The valid policies for opencl_node are queueing (the default one) and key_matching.

The third template argument, Factory, abstracts the OpenCL context and allows us to specify the devices used for kernel execution. By default, opencl_node is instantiated with default_opencl_factory, which currently chooses the first device of the first OpenCL platform available on the system (this behavior might change in future versions).
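
Spelling out all three template arguments, an instantiation might look like the following sketch; the kernel file and kernel name are placeholders, and the defaulted arguments are normally omitted:

opencl_graph g;
opencl_node< tuple< cl_int, cl_int >,     // two ports, one per type in the tuple
             queueing,                    // default buffering policy
             default_opencl_factory >     // default factory
    n( g, "my_kernels.cl", "my_kernel" );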

When a message is put to an input port, the data associated with the message becomes available on the device. The OpenCL kernel is executed once all of the input ports have received messages. Messages might be put to input ports at different rates, but the total number of messages must be the same for each port. At some point after the kernel is submitted, the messages are forwarded to the node's successors.
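
For example, a node with two input ports does not submit its kernel until both ports have received a message. A minimal sketch, assuming a kernel k in two_args.cl that takes two int arguments:

opencl_graph g;
opencl_node< tuple< cl_int, cl_int > > n( g, "two_args.cl", "k" );
n.set_ndranges( { 1 } );

input_port<0>(n).try_put( 1 );  // only one port has data; the kernel is not submitted yet
input_port<1>(n).try_put( 2 );  // both ports now have a message, so the kernel can be submitted
g.wait_for_all();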

To create an instance of opencl_node, at least an opencl_graph, an opencl_program and a kernel name must be provided at construction time:

template <typename DeviceSelector>
opencl_node( opencl_graph &g,
    const opencl_program<Factory> &p,
    const std::string &kernel_name,
    DeviceSelector d = /* default device selector */,
    Factory &f = /* default OpenCL factory */ );

opencl_node accepts only one kernel name, i.e., it is a flow graph abstraction of one (and only one) OpenCL kernel. If the application needs to use more than one kernel, it can create several opencl_node objects and set the dependencies between them with the make_edge interface. For example:

#define TBB_PREVIEW_FLOW_GRAPH_NODES 1
#include "tbb/flow_graph_opencl_node.h"

#include <numeric>

int main() {
    using namespace tbb::flow;
    
    opencl_graph g;
    opencl_node<tuple<cl_int>> cl1( g, "simple_dependency.cl", "k1" );
    opencl_node<tuple<cl_int>> cl2( g, "simple_dependency.cl", "k2" );
    opencl_node<tuple<cl_int>> cl3( g, "simple_dependency.cl", "k3" );
        
    make_edge( output_port<0>(cl1), input_port<0>(cl2) );
    make_edge( output_port<0>(cl1), input_port<0>(cl3) );
 
    cl1.set_ndranges( { 1 } );
    cl2.set_ndranges( { 1 } );
    cl3.set_ndranges( { 1 } );
    input_port<0>(cl1).try_put( 0 );

    g.wait_for_all();
    
    return 0;
}

simple_dependency.cl:
kernel void k1( int b ) { printf("kernel #1\n"); }
kernel void k2( int b ) { printf("kernel #2\n"); }
kernel void k3( int b ) { printf("kernel #3\n"); }

For this code, the output log can be either

kernel #1
kernel #2
kernel #3

or

kernel #1
kernel #3
kernel #2

The graph is constructed in such a way that the second and third kernels depend on the first one but are independent of each other:

make_edge( output_port<0>(cl1), input_port<0>(cl2) );
make_edge( output_port<0>(cl1), input_port<0>(cl3) );

Thus, it is guaranteed that the kernel k1 will always be executed first, while the ordering of the second and third kernels is non-deterministic.

opencl_buffer

An OpenCL kernel can work with special memory objects allocated on the host. The flow graph provides the opencl_buffer template class, an abstraction over a strongly typed linear array:

template <typename T, typename Factory = default_opencl_factory>
class opencl_buffer {
public:
    typedef /* implementation-defined */ iterator;
    
    // Data accessors
    T* data();
    iterator begin();
    iterator end();
    T& operator[] ( ptrdiff_t k );
    size_t size();
    
    // Constructor
    opencl_buffer( Factory &f, size_t size );
    opencl_buffer( opencl_graph &g, size_t size );
};

To create a buffer, either an opencl_factory or an opencl_graph is required. In the latter case, the buffer is created with the default_opencl_factory (which is represented by the opencl_graph object). In addition, the number of elements in the array must be provided, e.g.:

opencl_graph g;
const int N = 1000;
opencl_buffer<cl_int> buf( g, N );

The above example creates an array of 1000 ints. To fill the buffer with the required data, the data accessors can be used on the host, e.g.:

std::iota( buf.begin(), buf.end(), 0 );

This fills the buffer with values from 0 to 999. To process the buffer in parallel on an OpenCL device, an ndrange of the same size is used:

std::vector<int> ndrange( 1, N );
clNode.set_ndranges( ndrange );

In contrast with the previous examples, here a std::vector is used as the ndrange. The vector contains one element equal to N. When passed to set_ndranges, it is treated as the one-dimensional range [0, N).

The set_ndranges function accepts any container that provides begin() and end() methods. Many standard C++ types provide these methods, including std::initializer_list, std::vector and std::array. The resulting range has as many dimensions as there are elements in the container, with each dimension's size set to the corresponding element's value.
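
For example, a two-dimensional range could be specified with any of the following containers; the 1024x768 sizes and the clNode2d node are illustrative placeholders:

std::vector<int> v = { 1024, 768 };
clNode2d.set_ndranges( v );                 // std::vector

std::array<int, 2> a = { 1024, 768 };
clNode2d.set_ndranges( a );                 // std::array

clNode2d.set_ndranges( { 1024, 768 } );     // std::initializer_list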

The whole example is:

#define TBB_PREVIEW_FLOW_GRAPH_NODES 1
#include "tbb/flow_graph_opencl_node.h"

#include <numeric>
#include <vector>
#include <cassert>

int main() {
    using namespace tbb::flow;
    
    opencl_graph g;
    const int N = 1000;
    opencl_buffer<cl_int> buf( g, N );
    std::iota( buf.begin(), buf.end(), 0 );
    
    opencl_node<tuple<opencl_buffer<cl_int>>> 
        clNode( g, "opencl_buffer.cl", "increment" );
            
    std::vector<int> ndrange( 1, N );
    clNode.set_ndranges( ndrange );
    input_port<0>(clNode).try_put( buf );
    
    g.wait_for_all();
    
    assert( std::accumulate( buf.begin(), buf.end(), 0 ) == N*(N+1)/2 );
    
    return 0;
}

opencl_buffer.cl:
kernel void increment ( global int* arr ) { 
    const int i = get_global_id(0);
    arr[i] += 1;
}

In this example, the array of ints is filled with values from 0 to 999, each element is then incremented by 1 in parallel using the opencl_node, and the result is checked by the assert.
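
Because an opencl_node forwards its messages to successors, the processed buffer could also be consumed by a downstream node instead of being read after wait_for_all(). A minimal sketch, assuming a function_node is added to the same graph (it is not part of the original example):

// Hypothetical successor: receives the processed buffer from the node's output port.
function_node< opencl_buffer<cl_int> > check( g, unlimited,
    [N]( opencl_buffer<cl_int> b ) {
        // The buffer contents are accessible on the host inside the body.
        assert( std::accumulate( b.begin(), b.end(), 0 ) == N*(N+1)/2 );
    } );
make_edge( output_port<0>(clNode), check );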

In the next article in this series, I will discuss how to select the devices used for kernel execution.
