• 04/03/2020
  • Public Content
Contents

General Recommendations on User Kernels

The most straightforward way to combine some legacy code with the OpenVX* is to execute an OpenVX graph, wait for completion, access its data on the host, and execute the user code on the data. Then potentially continue with OpenVX execution and so on. For OpenCV interoperability examples refer to the following sections:
Sometimes it makes sense to include your own code as graph nodes. Below are some user kernel tips:
  • Just like with any other kernel in a graph, there might be multiple node instances for the same user-defined kernel. These instances might be potentially executed in parallel. So your user-kernel code should be thread safe.
  • User kernels may act as implicit memory barriers. In most cases the OpenVX implementation should complete all prior computations and update memory objects that the user kernel depends on. This is not entirely performance friendly. So if you see a user kernel in the hotspots (Performance Analysis Using Intel® VTune™ Amplifier topic), consider the following:
    • If the kernel is heavy enough, leverage parallelism
      within
      the user kernel, for example by making the code parallel. Using Intel® Threading Building Blocks is strongly recommended as an API of choice for parallelism, as it would guarantee composability with the rest of OpenVX runtime. Specifically, using the TBB within a user kernel helps avoiding threading oversubscription.
    • Consider the Advanced Tiling extension (the Intel Extensions to the OpenVX* API: Advanced Tiling chapter). This way you specify the tiled version of the user kernel and let the runtime to schedule the things in the best way, including parallelism.
    • Merging kernels so that they combine steps on local data (kernel fusion) often provides better performance than any automatic approach.
Finally, if your legacy code is in OpenCL™, it is relatively straightforward to wrap calls to the OpenCL API into the user kernels. Just like other types of user code in OpenVX, beware of potential issues when multiple instances of the user kernel might exist in parallel. For example, your code would need to keep separate
cl_kernel
instances per each user node. This avoids clash from setting different parameters to the same
cl_kernel
from multiple threads.  Also, overheads associated with user nodes that use OpenCL might be more pronounced, since communication with GPU is involved. Eventually, this approach lacks data sharing, as OpenCL context and queues are not shared between OpenVX run-time and your code.
To summarize, instead of wrapping OpenCL calls with user kernels in OpenVX, consider using Custom OpenCL kernels (section Intel Extensions to the OpenVX* API: OpenCL™ Custom Kernels) for leveraging existing OpenCL kernels.

Product and Performance Information

1

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804