• 04/03/2020

General Tips on Heterogeneous Execution

  • As discussed in the Striving for Performance section, drawing performance conclusions from the execution time of individual kernels might be misleading. In most cases, the larger the subgraph you assign to an accelerator, the better the communication costs are amortized.
  • Generally, GPU performance is better on large images. If the amount of work is too small (less than 1 ms of execution time), run the graph on the CPU device instead, or fuse kernels.
  • Notice that using the GPU target introduces a one-time overhead (on the order of a few seconds) for compiling the OpenCL™ kernels. The compilation happens upon OpenVX* context creation and does not affect the execution.
  • A typical strategy to start with is to test the CPU-only and GPU-only scenarios first (section 9.2). Beware of situations in which some nodes are not supported by a particular target (refer to the Kernel Extensions document for the kernel support matrix). In this case, the only way is to schedule nodes individually and search for the optimal split by scheduling subgraphs (or independent branches in the graph) to different targets.
  • For scenarios where CPU and GPU targets are mixed in the graph, it is recommended to try enabling GPU tiling (which is set to OFF by default). That might unleash additional (data) parallelism between the two devices:
    $ export VX_CL_TILED_MODE=1
    For GPU-only scenarios, the option should definitely be reverted to OFF.
  • It is advised to do performance analysis (next chapter) to determine the “hotspot” nodes, which should be the first candidates for offloading to additional targets. At the same time, it is often more efficient to offload a reasonably sized sequence of kernels, rather than individual kernels, to minimize scheduling and other run-time overheads.
  • Notice that the GPU can be busy with other tasks (like rendering); similarly, the CPU can be in charge of general OS routines.
  • Device performance can be affected by dynamic frequency scaling. For example, running long kernels on both devices simultaneously might eventually result in one or both devices no longer using Intel® Turbo Boost Technology. This might decrease overall performance, even compared to a single-device scenario.
  • Similarly, even in the GPU-only scenario, a high interrupt rate and frequent synchronization with the host can raise the frequency of the CPU and drag the frequency of the GPU down.
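
The amortization argument in the tips above (offload a reasonably sized sequence of kernels rather than a single hotspot) can be illustrated with a small back-of-the-envelope model. The sketch below is not part of the OpenVX* API and does not measure anything. It assumes a linear chain of nodes executed sequentially with no CPU/GPU overlap, hypothetical per-node timings in milliseconds, and a hypothetical fixed transfer cost paid once when data enters the GPU segment and once when it leaves. Real numbers should come from the performance analysis on your own graph.

```python
from typing import List, Tuple


def chain_time(cpu_ms: List[float], gpu_ms: List[float],
               start: int, end: int, transfer_ms: float) -> float:
    """Estimated time of a linear chain when nodes [start, end) run on the
    GPU and all other nodes run on the CPU.

    Assumption: the single contiguous GPU segment pays transfer_ms once on
    entry and once on exit, regardless of how many nodes it contains.
    """
    if start == end:  # nothing offloaded: CPU-only scenario
        return sum(cpu_ms)
    on_gpu = sum(gpu_ms[start:end]) + 2 * transfer_ms
    return sum(cpu_ms[:start]) + on_gpu + sum(cpu_ms[end:])


def best_segment(cpu_ms: List[float], gpu_ms: List[float],
                 transfer_ms: float) -> Tuple[float, int, int]:
    """Brute-force search over contiguous segments to offload.

    Returns (time, start, end); (time, 0, 0) means CPU-only was best.
    """
    best = (sum(cpu_ms), 0, 0)
    n = len(cpu_ms)
    for start in range(n):
        for end in range(start + 1, n + 1):
            t = chain_time(cpu_ms, gpu_ms, start, end, transfer_ms)
            if t < best[0]:
                best = (t, start, end)
    return best


if __name__ == "__main__":
    # Hypothetical timings: node 1 is the CPU hotspot, node 3 is slower on GPU.
    cpu = [2.0, 6.0, 5.0, 2.0]
    gpu = [1.5, 3.0, 2.5, 4.0]
    transfer = 2.0

    print(chain_time(cpu, gpu, 0, 0, transfer))  # CPU-only: 15.0
    print(chain_time(cpu, gpu, 1, 2, transfer))  # hotspot only: 16.0
    print(chain_time(cpu, gpu, 1, 3, transfer))  # hotspot + neighbor: 13.5
    print(best_segment(cpu, gpu, transfer))      # (13.0, 0, 3)
```

With these made-up numbers, offloading only the hotspot node is slower than the CPU-only run because the two transfers outweigh the kernel speedup, while offloading a three-node sequence is the fastest option. Extending the segment to the last node, which runs slower on the GPU, gives up the gain again. The model ignores tiling, device overlap, and frequency scaling, so treat it only as a way to reason about where to start the search for a split.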
