OPTIMIZE MULTITHREADED PERFORMANCE

In modern multicore systems, threaded performance is critical for exploiting the full potential of the processor. Intel® VTune™ Amplifier helps you tune your software to make effective use of all cores.

screenshot of the locks and waits analysis interface
Figure 1

Find Common Causes of Slow Threaded Code
 

The Locks and Waits analysis helps you focus your tuning efforts and envision potential improvements. Use it to identify synchronization objects (locks) that prevent effective processor utilization, and to estimate each lock's wait time and its impact on application performance.

See a prioritized list of synchronization objects that negatively impact performance (see Fig. 1).
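To make the idea of per-lock wait time concrete, here is a minimal Python sketch. It is not how Intel® VTune™ Amplifier collects its data; the `TimedLock` wrapper, the lock names, and `run_demo` are all hypothetical, and `time.sleep` stands in for real work. Each lock records how long callers wait to acquire it, and the locks are then sorted by total wait time, similar in spirit to the prioritized list in Fig. 1.

```python
import threading
import time


class TimedLock:
    """Hypothetical lock wrapper that records how long callers wait for it."""

    def __init__(self, name):
        self.name = name
        self._lock = threading.Lock()
        self.total_wait = 0.0   # seconds spent waiting to acquire
        self.acquisitions = 0

    def __enter__(self):
        start = time.perf_counter()
        self._lock.acquire()
        # Statistics are updated while holding the lock, so no extra guard is needed.
        self.total_wait += time.perf_counter() - start
        self.acquisitions += 1
        return self

    def __exit__(self, exc_type, exc, tb):
        self._lock.release()
        return False


def run_demo(num_threads=4, hold_seconds=0.01):
    """Run workers that share one 'hot' lock and each own one 'cold' lock."""
    hot = TimedLock("hot")
    cold = [TimedLock(f"cold-{i}") for i in range(num_threads)]

    def worker(i):
        with hot:                 # every thread contends for this lock
            time.sleep(hold_seconds)
        with cold[i]:             # only one thread ever takes this lock
            time.sleep(hold_seconds)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # Prioritize: locks with the most accumulated wait time come first.
    return sorted([hot] + cold, key=lambda lock: lock.total_wait, reverse=True)


if __name__ == "__main__":
    for lock in run_demo():
        print(f"{lock.name}: waited {lock.total_wait * 1000:.1f} ms "
              f"over {lock.acquisitions} acquisitions")
```

In this sketch the shared lock rises to the top of the list because three of the four threads must wait for it, while the per-thread locks accumulate almost no wait time.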

Tune Parallel Performance
 

Intel® VTune™ Amplifier has the built-in ability to recognize parallel programming models (including OpenMP* 4.0 and Intel® Threading Building Blocks), making it easy to visualize and understand multithreading activity such as tasks beginning and ending, synchronizing, and waiting. Get the data you need to tune performance and see which parallel regions are inefficient and why (for example, imbalance, lock contention, and communication).

Detailed data for each OpenMP region highlights tuning opportunities (see Fig. 2).
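As a rough illustration of what imbalance means for a parallel region, the sketch below uses one common definition: the time the slowest thread runs beyond the average thread. This is an assumption made for illustration, and the metric definitions used by the tool may differ. The function name `load_imbalance` and the sample timings are hypothetical.

```python
def load_imbalance(thread_seconds):
    """Return (imbalance_seconds, imbalance_fraction) for one parallel region.

    thread_seconds: busy time of each thread inside the region.
    Assumed definition: the region lasts as long as its slowest thread, so
    the gap between the slowest thread and the mean is time that the other
    threads, on average, spend idle at the end of the region.
    """
    if not thread_seconds:
        raise ValueError("need at least one thread time")
    longest = max(thread_seconds)
    mean = sum(thread_seconds) / len(thread_seconds)
    wasted = longest - mean
    fraction = wasted / longest if longest > 0 else 0.0
    return wasted, fraction


if __name__ == "__main__":
    # One thread does twice the work of the other three.
    print(load_imbalance([4.0, 2.0, 2.0, 2.0]))  # (1.5, 0.375)
```

A perfectly balanced region yields zero under this definition, so larger values point to regions where redistributing work is more likely to pay off.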

screenshot of the OpenMP tuning interface
Figure 2

screenshot of the threading timeline interface
Figure 3

Visually Spot Inefficient Threading

Use the timeline to spot patterns of inefficient threading, such as coarse-grained locks. Figure 3 shows four threads, but only one (dark green) runs at any given time. Because of data sharing issues, no work is done in parallel, so thread concurrency is very low.
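The coarse-grained lock pattern described above can be reproduced in a few lines. The following Python sketch is illustrative only: `ConcurrencyMeter` and `peak_concurrency` are hypothetical helpers, `time.sleep` stands in for real work, and the fine-grained variant assumes each thread touches disjoint data so that separate locks are safe. It measures the peak number of threads doing work at the same time under one shared lock and under one lock per thread.

```python
import threading
import time


class ConcurrencyMeter:
    """Hypothetical helper that tracks how many threads are working at once."""

    def __init__(self):
        self._guard = threading.Lock()
        self.active = 0
        self.peak = 0

    def __enter__(self):
        with self._guard:
            self.active += 1
            self.peak = max(self.peak, self.active)

    def __exit__(self, exc_type, exc, tb):
        with self._guard:
            self.active -= 1
        return False


def peak_concurrency(locks, num_threads=4, work_seconds=0.05):
    """Run num_threads workers; worker i protects its work with locks[i % len(locks)]."""
    meter = ConcurrencyMeter()

    def worker(i):
        with locks[i % len(locks)]:
            with meter:
                time.sleep(work_seconds)  # stand-in for real work

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return meter.peak


if __name__ == "__main__":
    coarse = peak_concurrency([threading.Lock()])                   # one lock for everything
    fine = peak_concurrency([threading.Lock() for _ in range(4)])   # one lock per thread
    print(f"coarse-grained peak concurrency: {coarse}")
    print(f"fine-grained peak concurrency:   {fine}")
```

With a single shared lock the peak can never exceed one, which is the same picture the timeline shows: several threads exist, but only one makes progress at a time.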

See Lock Contention

Another common threading performance issue occurs when multiple threads contend for the same lock. This becomes obvious when the timeline is dominated by yellow transition lines. A high density of transitions may indicate lock contention and poor parallel performance (see Fig. 4).
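One common way to reduce this kind of contention, offered here as an assumption rather than a recommendation from the tool, is to have each thread accumulate results locally and take the shared lock only once. The Python sketch below compares the number of lock acquisitions for both approaches; `CountingLock`, `sum_per_item`, and `sum_batched` are hypothetical names used only for illustration.

```python
import threading


class CountingLock:
    """Hypothetical lock wrapper that counts how often it is acquired."""

    def __init__(self):
        self._lock = threading.Lock()
        self.acquisitions = 0

    def __enter__(self):
        self._lock.acquire()
        self.acquisitions += 1  # safe: updated while holding the lock
        return self

    def __exit__(self, exc_type, exc, tb):
        self._lock.release()
        return False


def _run(worker, chunks):
    """Start one thread per chunk and wait for all of them to finish."""
    threads = [threading.Thread(target=worker, args=(chunk,)) for chunk in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


def sum_per_item(chunks):
    """Contended version: every item takes the shared lock."""
    lock, total = CountingLock(), [0]

    def worker(chunk):
        for value in chunk:
            with lock:
                total[0] += value

    _run(worker, chunks)
    return total[0], lock.acquisitions


def sum_batched(chunks):
    """Lower-contention version: sum locally, then take the lock once per thread."""
    lock, total = CountingLock(), [0]

    def worker(chunk):
        local = sum(chunk)
        with lock:
            total[0] += local

    _run(worker, chunks)
    return total[0], lock.acquisitions


if __name__ == "__main__":
    chunks = [list(range(1000)) for _ in range(4)]
    print(sum_per_item(chunks))  # (1998000, 4000)
    print(sum_batched(chunks))   # (1998000, 4)
```

Both versions compute the same total, but the batched version acquires the lock four times instead of four thousand, which would correspond to far fewer transitions between threads.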

screenshot of lock contention interface
Figure 4

Additional Capabilities