User Guide

performance_analysis Samples

Forward Substitution with Trace

The forward_substitution.graphml sample shows the topology and behavior of a Threading Building Blocks (TBB) flow graph application that implements forward substitution on a lower-triangular matrix. The trace covers a single execution of the graph, using 4 threads for an 8192x8192 matrix with a block size of 128. The runtime trace of the application is contained in the matching forward_substitution.traceml file, which the Flow Graph Analyzer loads automatically.
[Figure: forward substitution on a lower-triangular matrix]
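
To make the block structure concrete, here is a minimal Python sketch of blocked forward substitution. It is an illustration only, not the sample's TBB implementation: the function name `forward_substitution_blocked`, the serial loop order, and the tiny 3x3 input are assumptions made for this example.

```python
def forward_substitution_blocked(L, b, block_size):
    """Solve L x = b for a lower-triangular L, block_size rows at a time."""
    n = len(b)
    x = list(b)  # start from a copy of b and update it in place
    for start in range(0, n, block_size):
        end = min(start + block_size, n)
        # Diagonal block: ordinary forward substitution inside the block.
        for i in range(start, end):
            for j in range(start, i):
                x[i] -= L[i][j] * x[j]
            x[i] /= L[i][i]
        # Rows below the block: subtract the contribution of the values
        # just solved. These row updates do not depend on each other,
        # which is the kind of work a flow graph could run in parallel.
        for i in range(end, n):
            for j in range(start, end):
                x[i] -= L[i][j] * x[j]
    return x


# Hypothetical 3x3 system whose exact solution is [1, 2, 3].
L = [[2, 0, 0],
     [1, 3, 0],
     [4, 5, 6]]
b = [2, 7, 32]
x = forward_substitution_blocked(L, b, block_size=2)
```

With a block size of 2, the first pass solves rows 0 and 1 and updates row 2; the second pass finishes row 2. The sample described above applies the same idea with a block size of 128 on an 8192x8192 matrix.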

Feature Detection with Trace

The feature_detection.graphml sample shows the topology and behavior of a Threading Building Blocks (TBB) flow graph application based on the example described in this blog post: https://software.intel.com/en-us/blogs/2011/09/09/a-feature-detection-example-using-the-intel-threading-building-blocks-flow-graph
This trace was collected using 8 threads, with 32 buffers provided to the buffer queue. The concurrency varies over time but never exceeds 8 threads.
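
One way to reason about concurrency that varies over time is to sweep over task start and end times and count how many tasks are active at each point. The Python sketch below assumes a hypothetical list of `(start, end)` task intervals; it does not read the .traceml format, and the function name `concurrency_profile` is made up for this example.

```python
def concurrency_profile(intervals):
    """Return a list of (time, active_tasks) pairs from (start, end) intervals."""
    events = []
    for start, end in intervals:
        events.append((start, 1))   # a task begins
        events.append((end, -1))    # a task ends
    # Sort by time; at equal times, ends (-1) are processed before starts (+1)
    # so that back-to-back tasks are not counted as overlapping.
    events.sort()
    profile = []
    active = 0
    for time, delta in events:
        active += delta
        profile.append((time, active))
    return profile


# Hypothetical task intervals, in arbitrary time units.
intervals = [(0, 4), (1, 3), (2, 5), (4, 6)]
profile = concurrency_profile(intervals)
peak = max(active for _, active in profile)
```

For a trace collected with 8 threads, the peak computed this way would be expected to stay at or below 8.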

Computer Vision with Trace

The computer_vision.graphml sample shows the topology and behavior of a Threading Building Blocks (TBB) flow graph application that represents a classic example of data flow parallelism. It is composed of three different computer vision (CV) algorithms that process the same input data. The data is a video input stream, and you can observe a resulting regular pattern in the timeline chart (the trace contains around 20 frames).
Notice the following:
Red outlined area #1
You can use the critical path calculation functionality (turquoise box) to identify bottlenecks in the data flow. When you run it, all nodes on the critical path are highlighted.
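
As a rough illustration of what a critical path calculation does, the Python sketch below finds the longest-duration path through a small directed acyclic graph. This is a generic longest-path computation, not the Flow Graph Analyzer's own algorithm; the `join` node and all durations are hypothetical, and the node names only loosely mirror this sample.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def critical_path(durations, predecessors):
    """Return (path, total_time) for the longest-duration path through a DAG."""
    finish = {}     # earliest possible finish time of each node
    best_pred = {}  # the predecessor that determines that finish time
    for node in TopologicalSorter(predecessors).static_order():
        preds = predecessors.get(node, ())
        prev = max(preds, key=lambda p: finish[p], default=None)
        best_pred[node] = prev
        finish[node] = durations[node] + (finish[prev] if prev is not None else 0.0)
    # Walk back from the node that finishes last.
    end = max(finish, key=finish.get)
    total = finish[end]
    path = []
    node = end
    while node is not None:
        path.append(node)
        node = best_pred[node]
    path.reverse()
    return path, total


# Hypothetical per-node durations (arbitrary units) and graph edges.
durations = {"source": 1.0, "limiter": 0.5, "CV_serial": 3.0,
             "CV_nested": 6.0, "CV_async": 2.0, "join": 1.0}
predecessors = {
    "limiter": ["source"],
    "CV_serial": ["limiter"],
    "CV_nested": ["limiter"],
    "CV_async": ["limiter"],
    "join": ["CV_serial", "CV_nested", "CV_async"],
}
path, total = critical_path(durations, predecessors)
```

With these made-up durations, the path runs through `CV_nested`, because it is the slowest of the three parallel branches.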
White box in the #2 area
Zoom in the timeline to analyze a single frame execution in detail. The frame execution flow is the following:
  1. The source node spawns a task. This is the first stage of the image processing pipeline.
  2. A limiter node is used to balance the pipeline. It forwards the frame only if the number of frames currently being processed is below a user-specified threshold.
  3. Three different algorithms are executed in parallel. Concurrency changes during the algorithm stage because less work is available. In the timeline, high concurrency is shown in green.
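
The limiter behavior in step 2 can be sketched in a few lines of Python. This is a simplified, single-threaded model, not the TBB `limiter_node` API: the class name `Limiter` and its methods are hypothetical, and a rejected frame is simply reported back to the caller rather than buffered.

```python
class Limiter:
    """Forward frames only while fewer than `threshold` are in flight."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.in_flight = 0

    def try_put(self, frame, forward):
        # Reject the frame when the pipeline is already full.
        if self.in_flight >= self.threshold:
            return False
        self.in_flight += 1
        forward(frame)
        return True

    def decrement(self):
        # Called when a frame has left the pipeline, freeing one slot.
        if self.in_flight > 0:
            self.in_flight -= 1


forwarded = []
limiter = Limiter(threshold=2)
results = [limiter.try_put(frame, forwarded.append) for frame in ("f0", "f1", "f2")]
limiter.decrement()                                   # one frame finished
retry = limiter.try_put("f2", forwarded.append)       # now there is room
```

Here the third frame is rejected until one of the first two completes, which is how a limiter keeps the number of frames in the pipeline bounded.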
Lower part of red outlined area #2
For a Threading Building Blocks (TBB) flow graph, an external activity can be encapsulated in a predefined async node. This activity represents offloading work to an accelerator (for example, an FPGA or a GPU). The beginning and end of this activity are displayed as green vertical lines in the timeline. You can find a single execution within a single frame for each CV algorithm (represented by the nodes CV serial, CV nested, CV async). CV nested represents a node with a nested TBB parallel_for algorithm that consumes most of the CPU time on average.
Red outlined area #3
The Treemap shows the average node weight. CV_nested includes a TBB parallel_for algorithm and consumes most of the CPU time.
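
To illustrate what an average node weight could look like, the Python sketch below averages task durations per node. It assumes that "weight" means the mean task duration and that the trace is available as `(node, start, end)` records; both are assumptions for this example, as are the timings and the `CV_serial` and `CV_async` spellings.

```python
from collections import defaultdict


def average_node_weight(tasks):
    """Return {node: mean task duration} from (node, start, end) records."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for node, start, end in tasks:
        totals[node] += end - start
        counts[node] += 1
    return {node: totals[node] / counts[node] for node in totals}


# Hypothetical records for two frames, in arbitrary time units.
tasks = [
    ("CV_nested", 0.0, 6.0),
    ("CV_serial", 0.0, 3.0),
    ("CV_async", 0.0, 1.0),
    ("CV_nested", 10.0, 14.0),
    ("CV_serial", 10.0, 13.0),
    ("CV_async", 10.0, 11.0),
]
weights = average_node_weight(tasks)
heaviest = max(weights, key=weights.get)
```

In this made-up data, `CV_nested` has the largest average, matching the observation that it consumes most of the CPU time.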
