Use the Threading analysis to identify how efficiently an application uses the available processor cores and to explore inefficiencies in threading runtime usage or contention on synchronization objects that forces threads to wait and prevents effective processor utilization.
Threading analysis combines and replaces the Concurrency and Locks and Waits analysis types available in previous versions of Intel® VTune™ Amplifier.
Intel® VTune™ Amplifier uses the Effective CPU Utilization metric as a main measurement of threading efficiency. The metric is built on how an application utilizes the available logical cores. For throughput computing, it is typical to load one logical core per physical core.
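As a rough illustration of what such a metric captures, the following Python sketch computes an average utilization figure from CPU time, elapsed time, and logical core count. It is a simplified model and not necessarily the exact formula the profiler applies. The function name and the sample numbers are hypothetical.

```python
def effective_cpu_utilization(effective_cpu_time_s, elapsed_s, logical_cores):
    """Average fraction of logical cores doing useful work during the run.

    Simplified model (an assumption, not the tool's documented formula):
    effective CPU time divided by the total core-seconds available.
    """
    if elapsed_s <= 0 or logical_cores <= 0:
        raise ValueError("elapsed_s and logical_cores must be positive")
    return effective_cpu_time_s / (elapsed_s * logical_cores)


# Hypothetical run: 10 s elapsed on 8 logical cores, 24 s of CPU time summed
# over all threads, of which 4 s was spin and overhead time.
cpu_time_s = 24.0
spin_and_overhead_s = 4.0
utilization = effective_cpu_utilization(cpu_time_s - spin_and_overhead_s, 10.0, 8)
print(f"{utilization:.0%}")  # 20 / (10 * 8) -> 25%
```

In this model, time spent spinning or in runtime overhead is subtracted before dividing, so a busy but unproductive thread does not count as effective use of a core.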
The following aspects of Threading Analysis provide possible reasons for poor CPU utilization:
- Thread count: a quick glance at the application thread count can give clues to threading inefficiencies, such as a fixed number of threads that might prevent the application from scaling to a larger number of cores or lead to thread oversubscription
- Wait time (trace-based or context switch-based): analyze threads waiting on synchronization objects or I/O
- Spin and overhead time: estimate threading runtime overhead or the impact of spin waits (busy or active waits)
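The thread count issue can be illustrated with a minimal Python sketch that sizes a worker pool from the machine's logical core count instead of a hard-coded number. The helper names `pick_worker_count` and `run_tasks` are hypothetical and exist only for this example.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def pick_worker_count(task_count, logical_cores=None):
    """Use at most one worker per logical core, and never more than the tasks."""
    cores = logical_cores or os.cpu_count() or 1
    return max(1, min(task_count, cores))


def run_tasks(tasks):
    """Run zero-argument callables on a pool sized for this machine."""
    workers = pick_worker_count(len(tasks))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda task: task(), tasks))


results = run_tasks([lambda i=i: i * i for i in range(16)])
print(results[:4])  # [0, 1, 4, 9]
```

A fixed pool of, say, four threads would leave cores idle on a larger machine, while a fixed pool of 64 would oversubscribe a small one. Note that in CPython, CPU-bound work is serialized by the global interpreter lock, so only the sizing logic carries over to native threaded code.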
The Threading Analysis provides two collection modes with major differences in thread wait time collection and interpretation:
- User-Mode Sampling and Tracing, which can recognize synchronization objects and collect thread wait time by objects using tracing. This is helpful in understanding thread interaction semantics and making optimization changes based on that data. There are two groups of synchronization objects supported by Intel VTune Amplifier: objects usually used for synchronization between threads (such as Mutex or Semaphore) and objects associated with waits on I/O operations (such as Stream).
- Hardware Event-Based Sampling and Context Switches, which collects thread inactive wait time based on context switch information. Although this mode does not identify synchronization objects, you can find problematic synchronization functions by using the wait time attributed to call stacks, at lower overhead than the previous collection mode. The analysis based on context switches also shows thread preemption time, which is useful in measuring the impact of thread oversubscription on a system.
How It Works: User-Mode Sampling and Tracing
With user-mode sampling and tracing collection, VTune Amplifier instruments threading and blocking APIs, intercepts the calls at run time, and builds the thread interaction flow by detecting synchronization objects. Using this mode, you can estimate the impact each synchronization object has on the application and understand how long the application had to wait on each synchronization object or in blocking APIs. The timeline view shows thread interaction as execution flow transitions from one thread to another, with synchronization objects being released and acquired.
If this mode brings significant overhead in the application runtime, try the Hardware Event-Based Sampling and Context Switches mode, which offers a less intrusive method of wait time collection.
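To make the idea of tracing-based wait time concrete, here is a minimal Python sketch that wraps a lock and records how long callers wait to acquire it, grouped by object name. It only mimics the concept at application level. The `TracedLock` class is hypothetical and is not how the profiler instruments code.

```python
import threading
import time
from collections import defaultdict


class TracedLock:
    """Lock wrapper that accumulates acquire wait time per named object."""

    wait_time_by_object = defaultdict(float)  # object name -> total wait (s)
    _stats_guard = threading.Lock()           # protects the shared table

    def __init__(self, name):
        self.name = name
        self._lock = threading.Lock()

    def __enter__(self):
        start = time.perf_counter()
        self._lock.acquire()                  # blocks while another thread holds it
        waited = time.perf_counter() - start
        with TracedLock._stats_guard:
            TracedLock.wait_time_by_object[self.name] += waited
        return self

    def __exit__(self, exc_type, exc, tb):
        self._lock.release()
        return False


def worker(lock):
    with lock:
        time.sleep(0.05)  # simulate work done while holding the lock


queue_lock = TracedLock("queue_lock")
threads = [threading.Thread(target=worker, args=(queue_lock,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

for name, waited in TracedLock.wait_time_by_object.items():
    print(f"{name}: {waited:.3f} s total wait")
```

Each intercepted call adds timing and bookkeeping work of its own, which is the kind of runtime overhead that can make the context switch-based mode preferable for heavily synchronized code.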
How It Works: Hardware Event-Based Sampling and Context Switches
Multitask operating systems execute all software threads in time slices (thread execution quanta). In the Hardware Event-Based Sampling and Context Switches mode, the profiler gains control whenever a thread gets scheduled on and then off a processor (that is, at thread quantum borders). This mode also determines a reason for thread inactivation, which includes an explicit request for synchronization or thread quantum expiration (when the operating system scheduler preempts the current thread to run a higher-priority thread instead).
The time during which a thread remains inactive is measured and called Inactive Wait Time. Inactive Wait Time is differentiated based on the reason for inactivity:
- Inactive Sync Wait Time is caused by a request for synchronization
- Preemption Wait Time is caused by preemption
Since context switch information is collected with call stacks, it is possible to explore reasons for Inactive Wait Time by wait functions with their call paths. The Hardware Event-Based Sampling and Context Switches mode shows the places in the code where the wait was induced by a synchronization object or I/O operation.
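The breakdown of Inactive Wait Time by reason and by wait function can be sketched in Python as a simple aggregation over context switch records. The record layout, the reason labels `"sync"` and `"preempt"`, and the sample data are all hypothetical and do not reflect the collector's actual data format.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SwitchOut:
    """One interval during which a thread was off the CPU (hypothetical layout)."""
    thread_id: int
    off_cpu_at: float    # time the thread was switched out, in seconds
    on_cpu_at: float     # time the thread was scheduled again, in seconds
    reason: str          # "sync" (requested wait) or "preempt" (quantum expired)
    wait_function: str   # top frame of the call stack at switch-out


def summarize(records):
    """Return inactive time per reason and sync wait time per wait function."""
    by_reason = defaultdict(float)
    sync_by_function = defaultdict(float)
    for record in records:
        inactive = record.on_cpu_at - record.off_cpu_at
        by_reason[record.reason] += inactive
        if record.reason == "sync":
            sync_by_function[record.wait_function] += inactive
    return dict(by_reason), dict(sync_by_function)


records = [
    SwitchOut(1, 0.00, 0.25, "sync", "pthread_mutex_lock"),
    SwitchOut(2, 0.50, 0.75, "preempt", "compute"),
    SwitchOut(1, 1.00, 1.50, "sync", "read"),
    SwitchOut(2, 2.00, 2.25, "sync", "pthread_mutex_lock"),
]
by_reason, sync_by_function = summarize(records)
print(by_reason)         # {'sync': 1.0, 'preempt': 0.25}
print(sync_by_function)  # {'pthread_mutex_lock': 0.5, 'read': 0.5}
```

Here the `"sync"` total corresponds to Inactive Sync Wait Time and the `"preempt"` total to Preemption Wait Time. No synchronization object appears anywhere in the records, only the function in which the thread stopped running.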
The Hardware Event-Based Sampling and Context Switches mode is based on the hardware event-based sampling collection and analyzes all the processes running on your system at the moment, providing context switching data on whole system performance. On Linux* systems, Inactive Wait Time collection is available with driverless Perf*-based collection on kernel version 4.4 or later. Inactive Time reasons are available in kernel 4.17 and later.
On 32-bit Linux* systems, the VTune Amplifier uses a driverless Perf*-based collection for the hardware event-based sampling mode.
Configure and Run Analysis
To configure options for the Threading analysis:
Prerequisites: Create a project and specify an analysis target.
Click the Configure Analysis button on the Intel® VTune™ Amplifier toolbar (standalone GUI or Visual Studio IDE).
The Configure Analysis window opens.
From the HOW pane, click the Browse button and select Threading.
Configure the collection options.
User-Mode Sampling and Tracing mode
Select to enable the user-mode sampling and tracing collection for synchronization object analysis. This collection mode uses a fixed sampling interval of 10 ms. If you need to change the interval, click the Copy button and create a custom analysis configuration. For intervals less than 10 ms, use the Hardware Event-Based Sampling and Context Switches mode.
Hardware Event-Based Sampling and Context Switches mode
Select to enable hardware event-based sampling and context switches collection.
Use the CPU sampling interval, ms option to specify the interval (in milliseconds) between CPU samples. Possible values for the hardware event-based sampling mode are 0.01-1000. The default is 1 ms.
When changing collection options, pay attention to the Overhead diagram on the right. It dynamically changes to reflect the collection overhead incurred by the selected options.
Expand/collapse a section listing the default non-editable settings used for this analysis type. If you want to modify or enable additional settings for the analysis, you need to create a custom configuration by copying an existing predefined configuration. VTune Amplifier creates an editable copy of this analysis type configuration.
You may generate the command line for this configuration using the Command Line button at the bottom.
Click the Start button to run the analysis.
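For scripted runs, the configuration steps above might translate into a command line along the lines of the following Python sketch, which only assembles the arguments and does not launch anything. The knob names `sampling-and-waits` and `sampling-interval` and the target `./my_app` are assumptions made for illustration. Treat the output of the Command Line button as the authoritative command for your installation.

```python
import shlex


def threading_command(target, mode="hw", interval_ms=None):
    """Build an argument list for a Threading analysis run.

    Assumptions: the 'amplxe-cl' executable is on PATH, and the knob names
    below match your product version. 'mode' is "hw" for Hardware Event-Based
    Sampling and Context Switches or "sw" for User-Mode Sampling and Tracing.
    """
    if mode not in ("hw", "sw"):
        raise ValueError("mode must be 'hw' or 'sw'")
    args = ["amplxe-cl", "-collect", "threading",
            "-knob", f"sampling-and-waits={mode}"]
    if interval_ms is not None:
        if mode != "hw":
            raise ValueError("the interval is only adjustable here in 'hw' mode")
        if not 0.01 <= interval_ms <= 1000:
            raise ValueError("interval_ms must be within 0.01-1000")
        args += ["-knob", f"sampling-interval={interval_ms}"]
    return args + ["--", *target]


command = threading_command(["./my_app"], interval_ms=1)
print(shlex.join(command))
```

Building the command as a list keeps the target's own arguments separate from the profiler options, so it can be passed to `subprocess.run` without shell quoting issues.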
The Threading analysis results appear in the Threading Efficiency viewpoint, which consists of the following windows/panes:
Summary window displays statistics on the overall application execution, identifying CPU time and processor utilization.
Bottom-up window displays hotspot functions in the bottom-up tree, CPU time and CPU utilization per function.
Top-down Tree window displays hotspot functions in the call tree, performance metrics for a function only (Self value) and for a function and its children together (Total value).
Caller/Callee window displays parent and child functions of the selected focus function.
Platform window provides details on CPU and GPU utilization, frame rate, memory bandwidth, and user tasks (if corresponding metrics are collected).
Start with the Summary window to explore the Effective CPU Utilization of your application and identify reasons for underutilization connected with synchronization, parallel work arrangement overhead, or incorrect thread count. Click the links associated with flagged issues for more detailed information. For example, clicking a sync object name in the Top Waiting Objects table takes you to that object in the Bottom-up window.
Analyze thread interaction on synchronization objects, using wait and signal stacks and transitions on the timeline. Explore CPU time spent in threading runtimes to classify inefficiencies in their use.
Modify your code to remove CPU utilization bottlenecks and improve the parallelism of your application.
Concentrate your tuning on objects with long Wait time where the system is poorly utilized (red bars) during the wait. Consider adding parallelism, rebalancing, or reducing contention. Ideal utilization (green bars) occurs when the number of running threads equals the number of available logical cores.
Re-run the analysis and use the comparison mode to verify your optimization and identify further areas for improvement.
For more information and interpretation tips, see Threading Efficiency View.