opencl_node overview

Introduction

The Intel® Threading Building Blocks (Intel® TBB) library provides a set of algorithms that enables parallelism in C++ applications. Since Intel TBB 4.0, unstructured parallelism, dependency graphs and data flow algorithms can be expressed with flow graph classes and functions. The flow graph makes Intel TBB useful for cases that are not covered by its generic parallel algorithms, while keeping users away from lower-level peculiarities of its tasking API.

Increasingly, systems are becoming heterogeneous and incorporate not only the power of CPUs but also different kinds of accelerators that are suitable for particular sets of tasks. To address this changing landscape and to effectively program heterogeneous systems, Intel TBB 4.4 extended the flow graph interface with the async_node to allow flow graphs to efficiently communicate with an external activity managed by the user or another runtime. In addition, Intel TBB 4.4 Update 2 introduced the opencl_node, which is more specialized than async_node and enables devices that support OpenCL™ to be utilized and coordinated by a flow graph. Currently, both async_node and opencl_node are Preview features.

In this series of articles, I aim to give a reasonably detailed overview of the opencl_node functionality. I start with the simplest “Hello, World!” example and walk through it step by step. I then demonstrate the basic interfaces of the opencl_node, show how to work with memory objects and how to select a particular OpenCL device, and cover the interfaces in more detail.

Hello, World!

The “Hello, World!” example below demonstrates basic opencl_node usage.

hello_world.cpp:
#define TBB_PREVIEW_FLOW_GRAPH_NODES 1
#include "tbb/flow_graph_opencl_node.h"

#include <algorithm>

int main() {
    using namespace tbb::flow;
    
    opencl_graph g;
    opencl_node<tuple<opencl_buffer<cl_char>>>
        clPrint( g, "hello_world.cl", "print" );

    const char str[] = "Hello, World!";
    opencl_buffer<cl_char> b( g, sizeof(str) );
    std::copy_n( str, sizeof(str), b.begin() );

    clPrint.set_ndranges( { 1 } );
    input_port<0>(clPrint).try_put( b );

    g.wait_for_all();
    
    return 0;
}
hello_world.cl:
kernel void print( global char *str ) {
    // Print each character of the NUL-terminated string received from the host.
    printf("OpenCL says '");
    for ( ; *str; ++str ) printf("%c", *str);
    printf("'\n");
}

To compile the example on Microsoft* Windows* OS, either Microsoft* Visual Studio* 2013 or Intel® Composer XE 2015 is required.

To compile with the Microsoft* C++ Compiler:

>cl /EHsc hello_world.cpp /wd4503 /link OpenCL.lib

To compile with the Intel® C++ Compiler:

>icl /Qstd=c++11 hello_world.cpp /link OpenCL.lib

Note that an OpenCL SDK is required to compile the example.

In addition, any C++ compiler with C++11 support can be used to compile the example on Windows and other operating systems, e.g., Linux* or OS X*.
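
For example, on Linux* the example can typically be built with GCC as follows; the exact command is an assumption and depends on where the Intel TBB and OpenCL SDK headers and libraries are installed:

>g++ -std=c++11 hello_world.cpp -ltbb -lOpenCL -o hello_world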

After running the example, the output is:

>hello_world.exe
OpenCL says 'Hello, World!'

As expected, the example prints the “Hello, World!” message.

Now let’s look at the code in detail.

To use the opencl_node, a special header file should be included, with the appropriate preview macro defined before it:

#define TBB_PREVIEW_FLOW_GRAPH_NODES 1
#include "tbb/flow_graph_opencl_node.h"

In contrast with other flow graph nodes, an opencl_node can only be created with a special opencl_graph object:

opencl_graph g;    
opencl_node<tuple<opencl_buffer<cl_char>>>
    clPrint( g, "hello_world.cl", "print" );

However, this graph object can also be used to create any other flow graph nodes; it is not restricted to opencl_node objects. This special kind of graph is required only while the opencl_node is a Preview feature; it is expected that this restriction will be removed later.
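
For instance, an ordinary CPU node can be constructed with the same opencl_graph object. The sketch below is only an illustration of this point (it reuses the using namespace tbb::flow directive from the example above):

// Sketch: a regular flow graph node created with an opencl_graph.
function_node<int, int> doubler( g, unlimited,
    []( int x ) { return 2 * x; } );  // runs on the CPU as usual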

In this example, opencl_node is instantiated with tuple<opencl_buffer<cl_char>>, which creates a node with one input port and one output port, both of type opencl_buffer<cl_char>. The second and third constructor arguments are the OpenCL program file and the name of the kernel to be extracted from that program, respectively.
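
For instance, a kernel that takes two buffer arguments would be wrapped by a node instantiated with a two-element tuple, giving it two input ports and two output ports of the corresponding types. The kernel name "scale" below is a hypothetical example, not part of hello_world.cl:

// Hypothetical node for a two-argument kernel: port 0 carries a character buffer,
// port 1 carries an integer buffer.
opencl_node<tuple<opencl_buffer<cl_char>, opencl_buffer<cl_int>>>
    clScale( g, "hello_world.cl", "scale" );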

An OpenCL kernel can work with memory objects created on the host. To represent a linear array of characters, an opencl_buffer<cl_char> is created and filled with the characters of a C-style string:

const char str[] = "Hello, World!";
opencl_buffer<cl_char> b( g, sizeof(str) );
std::copy_n( str, sizeof(str), b.begin() );
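
Since opencl_buffer exposes begin() and end() iterators on the host, its contents can also be read back in the same way. For example, the following snippet (an illustration, assuming <string> is included and the graph has already finished) copies the buffer into a std::string:

// Read the buffer back on the host, e.g. after g.wait_for_all() has returned.
std::string result( b.begin(), b.end() );  // "Hello, World!" followed by the terminating '\0'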

Any OpenCL kernel invocation requires an iteration space (called an ndrange) over which the kernel is executed. In the example, the set_ndranges method sets the ndrange for the opencl_node:

clPrint.set_ndranges( { 1 } );

In this case, a one-dimensional range of size 1 is passed to the kernel, so the kernel body is executed by a single work item.
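
A kernel that processes one buffer element per work item would instead be given the number of elements as the range. The node name clSquare and the size N below are assumptions for illustration only:

// Hypothetical: launch one work item per element of an N-element buffer.
const int N = 1024;
clSquare.set_ndranges( { N } );  // one-dimensional range of N work items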

To run the kernel, the usual flow graph interface try_put is used:

input_port<0>(clPrint).try_put( b );

Although the opencl_node in this example has only one input port and one output port, it is in general a multi-input, multi-output node and therefore requires the input_port and output_port helper functions to access its ports.
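
The same helpers are used when connecting an opencl_node to other graph nodes with make_edge. The downstream function_node below is a hypothetical sketch of such a connection, not part of the example:

// Hypothetical: forward the buffer from output port 0 to a CPU node.
function_node<opencl_buffer<cl_char>, continue_msg> consumer( g, unlimited,
    []( const opencl_buffer<cl_char> &buf ) -> continue_msg {
        // Access buf.begin()/buf.end() on the host here.
        return continue_msg();
    } );
make_edge( output_port<0>(clPrint), consumer );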

try_put is an asynchronous call: it initiates execution but does not wait for completion. To wait for completion, the wait_for_all method of the graph is called:

g.wait_for_all();

After the main thread returns from wait_for_all, it is guaranteed that the kernel has finished and has sent “Hello, World!” to the standard output.

Further reading:

opencl_node basic interfaces and opencl_buffer

Device selection

opencl_program and argument binding

Ordering issues

Conclusion

These articles cover the key principles of using opencl_node within an Intel TBB flow graph. The overview does not claim to be complete: some features are only mentioned and not considered in detail, and the examples are intentionally simplified. However, I hope it will be useful as a starting point for learning the functionality.

As an extra caution, let me remind you that this functionality is provided as a preview and is subject to change, including incompatible modifications to the API and behavior.

If you have any remarks or suggestions about the overview or opencl_node itself, feel free to leave comments.
