• 04/03/2020

Striving for Performance

Intel OpenVX* Performance Promise
The OpenVX* approach to performance extends conventional one-off function acceleration with the notion of graphs. Graphs expose optimization possibilities that might not be available or obvious with traditional approaches. For example, with the Intel OpenVX implementation, different kernels that share data are not forced to use global (slow) memory. Rather, automatic tiling fits data to cache. Similarly, while parallelism itself is not directly expressed in the graph, the independent data flows are extracted from its structure.
OpenVX also creates a logical model where IP blocks of the Intel SoC fully share system resources such as memory; the code can be scheduled seamlessly on the block that is best able to execute it.
In addition to the global graph-level optimizations, the performance of the OpenVX vision functions also relies on optimized implementations that target a particular platform. For the CPU, this is done through Intel® Integrated Performance Primitives (Intel® IPP), which has code branches for different architectures. For the GPU, the mature stack of OpenCL* Just-In-Time compilation to the particular architecture is used.
To achieve good performance, the trade-offs implicit in the OpenVX model of computation must be well understood. This section describes general considerations for OpenVX with respect to performance.
Use OpenVX* Graph Mode to Estimate Performance
OpenVX supports a single-function execution model called immediate mode.
NOTE: Using the immediate-mode flavors of the vision functions (prefixed with vxu, for example, vxuColorConvert) still implies using graphs (each comprising just a single function) behind the scenes. Thus, graph verification and other costs like memory allocation are included in the timing and are not amortized over multiple nodes or iterations.
Still, immediate mode can be useful as an intermediate step, for example, when porting an application from OpenCV to OpenVX (see the Example Interoperability with OpenCV section).
Beware of Graph Verification Overheads
The graph verification step is a heavyweight operation and should be avoided inside tight loops at execution time. Note that changing the metadata (for example, size or type) of the graph inputs might invalidate the graph, so that it must be verified again. Refer to the Map/Unmap for OpenVX* Images section for some tips on updating the data.
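One way to keep verification out of the steady-state loop is to remember the input metadata the graph was last verified against and to verify again only when it changes. The following is a minimal sketch of that bookkeeping, not part of the OpenVX API: the input_meta and verify_tracker types and the should_verify function are hypothetical names introduced for illustration, and the caller is assumed to call vxVerifyGraph whenever should_verify returns true.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of a graph input; not an OpenVX type. */
typedef struct {
    uint32_t width;
    uint32_t height;
    uint32_t format;   /* for example, a vx_df_image value */
} input_meta;

/* Remembers which metadata the graph was last verified against. */
typedef struct {
    bool       verified;
    input_meta meta;
    unsigned   verify_count;   /* how many times verification was requested */
} verify_tracker;

static bool same_meta(const input_meta *a, const input_meta *b)
{
    return a->width == b->width && a->height == b->height &&
           a->format == b->format;
}

/* Returns true when the caller should verify the graph before processing
   a frame described by `next`, and records `next` as the verified state.
   Assumption: the caller then verifies the graph successfully. */
static bool should_verify(verify_tracker *t, const input_meta *next)
{
    assert(t != NULL && next != NULL);
    if (t->verified && same_meta(&t->meta, next))
        return false;          /* steady state: skip the heavyweight step */
    t->meta = *next;
    t->verified = true;
    t->verify_count++;
    return true;
}
```

In a frame loop, call should_verify once per frame with the current input description. For a stream of identically sized frames it returns true only for the first frame, so the verification cost is paid once.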
Comparing OpenVX Performance to Native Code
When comparing OpenVX performance with native code, for example, in C/C++ or OpenCL, make sure that both versions are as similar as possible:
  • Wrap exactly the same set of operations.
  • Do not include graph verification when estimating the execution time. Graph verification is intended to be amortized over multiple iterations of graph execution.
  • Track data transfer costs (reading/writing images, arrays, and so on) separately. Also, use data mapping when possible, since this is closer to the way data is passed in regular code (by pointers).
  • Demand the same accuracy.
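The points above can be sketched as a small timing harness that reports the one-time setup cost separately from the steady-state cost per iteration. This is an illustrative sketch under stated assumptions, not an Intel or Khronos utility: bench_fn, bench_result, run_benchmark, and the demo callbacks are hypothetical names, and the timer assumes a POSIX system that provides clock_gettime with CLOCK_MONOTONIC. For an OpenVX run, the setup callback would wrap vxVerifyGraph and the iterate callback would wrap vxProcessGraph; the native version would pass callbacks that wrap exactly the same set of operations.

```c
#define _POSIX_C_SOURCE 199309L  /* for clock_gettime; assumes POSIX */
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical callback: returns 0 on success. `user` carries the graph,
   buffers, or whatever the measured code needs. */
typedef int (*bench_fn)(void *user);

typedef struct {
    double setup_ms;     /* one-time cost, for example graph verification */
    double avg_iter_ms;  /* mean cost of one timed iteration */
    double min_iter_ms;  /* fastest timed iteration */
} bench_result;

static double elapsed_ms(const struct timespec *t0, const struct timespec *t1)
{
    return (double)(t1->tv_sec - t0->tv_sec) * 1e3 +
           (double)(t1->tv_nsec - t0->tv_nsec) / 1e6;
}

/* Times setup once, runs `warmup` untimed iterations, then times
   `iterations` iterations. Returns 0 on success, -1 if a callback fails. */
static int run_benchmark(bench_fn setup, bench_fn iterate, void *user,
                         int warmup, int iterations, bench_result *out)
{
    struct timespec t0, t1;
    double total = 0.0;

    assert(setup && iterate && out && warmup >= 0 && iterations > 0);

    clock_gettime(CLOCK_MONOTONIC, &t0);
    if (setup(user) != 0) return -1;
    clock_gettime(CLOCK_MONOTONIC, &t1);
    out->setup_ms = elapsed_ms(&t0, &t1);

    for (int i = 0; i < warmup; i++)
        if (iterate(user) != 0) return -1;

    out->min_iter_ms = 0.0;
    for (int i = 0; i < iterations; i++) {
        clock_gettime(CLOCK_MONOTONIC, &t0);
        if (iterate(user) != 0) return -1;
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double ms = elapsed_ms(&t0, &t1);
        total += ms;
        if (i == 0 || ms < out->min_iter_ms) out->min_iter_ms = ms;
    }
    out->avg_iter_ms = total / iterations;
    return 0;
}

/* Stand-in workload used only to exercise the harness. */
typedef struct { int setup_calls; int iter_calls; } demo_state;
static int demo_setup(void *user)   { ((demo_state *)user)->setup_calls++; return 0; }
static int demo_iterate(void *user) { ((demo_state *)user)->iter_calls++;  return 0; }
```

Running both the OpenVX and the native variant through the same harness, with the same warm-up and iteration counts, keeps the comparison symmetric. Data transfer can be timed with a third callback of the same type if it needs to be tracked separately.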
Enabling Performance Profiling per Node
So far, we have discussed the overall performance of the graph. To get per-node performance data, the OpenVX* 1.1 specification explicitly requires the application to enable performance profiling. There is a dedicated directive for that:
vx_status res = vxDirective(context, VX_DIRECTIVE_ENABLE_PERFORMANCE);
NOTE: Per-node performance profiling is enabled on a per-context basis. Because it might introduce overhead, disable it in production code and when measuring overall graph performance.
When the profiling is enabled, you can get performance information for a node:
vx_perf_t perf;
vxQueryNode(node, VX_NODE_ATTRIBUTE_PERFORMANCE, &perf, sizeof(perf));
/* perf.avg is in nanoseconds; convert to milliseconds for printing */
printf("Average exec time for the %p node: %0.3f ms\n", (void *)node, (double)perf.avg / 1000000.0);
For an example, refer to the Color Copy Pipeline Sample in the Sample Applications section.
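When reporting several nodes, it can help to keep the nanosecond-to-millisecond conversion and the formatting in one place. The following is a minimal sketch, assuming the avg, min, and num values have already been read from a vx_perf_t structure. The node_perf type is a hypothetical stand-in that mirrors only those three fields so that the example compiles without the OpenVX headers; ns_to_ms and format_node_perf are likewise illustrative names.

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical mirror of the vx_perf_t fields used here. avg and min are
   in nanoseconds; num is the number of measurements. */
typedef struct {
    uint64_t avg;
    uint64_t min;
    uint64_t num;
} node_perf;

/* Converts in double precision to avoid losing digits on large values. */
static double ns_to_ms(uint64_t ns)
{
    return (double)ns / 1e6;
}

/* Writes one report line into buf and returns the snprintf result. */
static int format_node_perf(char *buf, size_t size, const char *name,
                            const node_perf *p)
{
    assert(buf && name && p);
    return snprintf(buf, size,
                    "%s: avg %.3f ms, min %.3f ms over %" PRIu64 " runs",
                    name, ns_to_ms(p->avg), ns_to_ms(p->min), p->num);
}
```

The node name passed in is whatever label the application uses for the node; it is not queried from OpenVX in this sketch.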
NOTE: To get the performance data for nodes running on the GPU, set the following environment variable:
$ export VX_CL_PERFORMANCE_REPORTING=1
General Rules of Thumb
