Intel® Parallel Amplifier is an excellent tool for identifying hotspots and measuring CPU utilization. Using Amplifier’s Concurrency analysis, it’s very easy to find the places in an application that utilize the CPU poorly, but finding the root cause of these issues is often more complex. To do this, you need to understand the runtime behavior of the application: how many threads are actually running, how the work is distributed among the threads, and where thread execution is “serialized”. In this post, I'll show how to use Intel Parallel Amplifier to analyze thread dependencies and how this information can be used to improve the overall performance of the application. If you want to follow along, you can download a free evaluation copy of Intel Parallel Amplifier from /en-us/intel-parallel-studio-home.
This information comes from Amplifier’s Locks and Waits analysis. The analysis identifies all the places where application threads execute "blocking calls" (e.g., waiting for locks, I/O operations, thread objects, etc.), calculates the wait times and wait counts, and attributes them to Synchronization Objects. For example, a wait for a Mutex is attributed to a Mutex object, a wait for an I/O operation is attributed to a Stream object, and so on. When a thread waits for (joins) another thread, the wait time is attributed to a Thread object that represents the thread being waited on. To keep everything clear and simple, we'll refer to such a thread as a Blocking thread and to the threads waiting for it as Blocked threads or simply Waiting threads.
To show how this information is used for understanding the runtime relationships between threads, we’ll use a simple application (you can download the source code here - deps.cpp). Our application’s main thread creates two worker threads (ShortTask and LongTask) and waits for them to finish. The dependencies between the threads are shown in the following diagram:
Let’s see how we can use Amplifier’s "Locks and Waits" analysis to figure out these dependencies. First, let’s see how many threads the application has. This is easy – everything is nicely summarized in the Wait Thread filter at the bottom of the screen:
We see that our application has three threads and that the total wait time is distributed between the main and the ShortTask threads. We also see that the LongTask thread does not wait for any other thread since its contribution to the total wait time is 0%.
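To make this thread layout concrete, here is a minimal sketch of an application with the same structure. It is not the actual deps.cpp: the durations are scaled down by a factor of ten, the names kLongWork, kShortWork and RunDeps are invented for illustration, and std::thread with std::shared_future is assumed in place of whatever threading API the original uses. A std::thread can be joined only once, so the sketch lets ShortTask wait on a future instead; a profiler would likely attribute that wait to a different kind of synchronization object than the Thread object discussed in this post.

```cpp
#include <cassert>
#include <chrono>
#include <future>
#include <thread>
#include <utility>

using Clock = std::chrono::steady_clock;
using Ms = std::chrono::milliseconds;

// Hypothetical durations, scaled down from the ~3.6 s and ~0.4 s in the post.
const Ms kLongWork(360);
const Ms kShortWork(40);

// Stand-in for the long-running worker: "works", then signals completion.
void LongTask(std::promise<void> done) {
    std::this_thread::sleep_for(kLongWork);
    done.set_value();  // wakes every thread waiting on LongTask
}

// Stand-in for the short worker: blocked until LongTask has finished.
void ShortTask(std::shared_future<void> longDone) {
    assert(longDone.valid());
    longDone.wait();                         // serialized behind LongTask
    std::this_thread::sleep_for(kShortWork); // ShortTask's own work
}

// Runs the scenario once and returns the elapsed wall-clock time in ms.
long long RunDeps() {
    const auto start = Clock::now();

    std::promise<void> done;
    std::shared_future<void> longDone = done.get_future().share();

    std::thread longThread(LongTask, std::move(done));
    std::thread shortThread(ShortTask, longDone);

    longThread.join();   // main waits for LongTask ...
    shortThread.join();  // ... and then for ShortTask

    return std::chrono::duration_cast<Ms>(Clock::now() - start).count();
}
```

Calling RunDeps() from main should take at least the sum of the two durations, because ShortTask cannot begin its own work until LongTask has finished.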
Next, we need to better understand the relationship between the threads. To do this, we’ll use the data view and the Call Stack view to examine both the waiting threads and the blocking threads. We’ll start the analysis from the blocking threads and work backwards to identify the waiting threads.
To view the blocking threads, switch to the Sync Object Name - Wait Thread - Wait Function grouping in the data view:
This tells Intel Parallel Amplifier to present the information grouped by Synchronization Objects, then by the Waiting Threads, then by the Waiting Functions and so on. You get the idea...
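Conceptually, this grouping is a nested aggregation of wait records. The sketch below illustrates the idea only and says nothing about how Amplifier is implemented: the WaitRecord and WaitStats types, the GroupWaits and SampleRecords functions, and the "<wait function>" placeholder are all hypothetical, and the sample values are the approximate wait times reported later in this post.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// One observed wait: who waited, on what, in which function, for how long.
struct WaitRecord {
    std::string syncObject;
    std::string waitThread;
    std::string waitFunction;
    double waitSeconds;
};

// Accumulated wait time and wait count for one leaf of the tree.
struct WaitStats {
    double totalSeconds = 0.0;
    int count = 0;
};

using ByFunction = std::map<std::string, WaitStats>;
using ByThread = std::map<std::string, ByFunction>;
using BySyncObject = std::map<std::string, ByThread>;

// Groups records by sync object, then waiting thread, then wait function.
BySyncObject GroupWaits(const std::vector<WaitRecord>& records) {
    BySyncObject tree;
    for (const WaitRecord& r : records) {
        assert(r.waitSeconds >= 0.0);
        WaitStats& stats = tree[r.syncObject][r.waitThread][r.waitFunction];
        stats.totalSeconds += r.waitSeconds;
        stats.count += 1;
    }
    return tree;
}

// Approximate values from this post; the wait function name is a placeholder.
std::vector<WaitRecord> SampleRecords() {
    return {
        {"Thread 0x7f107fbe", "mainCRTStartup", "<wait function>", 3.6},
        {"Thread 0x7f107fbe", "ShortTask", "<wait function>", 3.6},
        {"Thread 0x046d34ff", "mainCRTStartup", "<wait function>", 0.4},
    };
}
```

The first level of the resulting tree holds the blocking Thread objects and the second level holds the threads waiting on each of them, which is the shape of the tree view used in the rest of this post.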
To identify the blocking threads:
- Select the Thread object you want to examine (e.g., Thread 0x7f107fbe).
- From the drop-down list at the top of the Call Stack view, select the Signaling (User) entry (shown above). The bottom-most function in this view is the thread’s entry point (e.g., the LongTask function). This means that the Thread 0x7f107fbe object is actually the LongTask thread.
- Using the same technique, we'll find out that Thread 0x046d34ff is actually the ShortTask thread.
To find out where a blocking thread is created, select the Object Creation (User) entry from the Call Stack view’s drop-down list. The top-most function in this view is the location where the thread is created. The entry point of the creating thread appears at the bottom (see below).
We can now summarize the information in the following diagram:
Finding the waiting threads is straightforward. The information appears in the second level of the tree view:
- There are two threads waiting for Thread 0x7f107fbe (LongTask):
  - The main thread (mainCRTStartup) is waiting for ~3.6 seconds
  - The ShortTask thread is waiting for ~3.6 seconds
- The main thread is also waiting on Thread 0x046d34ff (ShortTask) for ~0.4 seconds
Let's add this information to our diagram:
Here are some important observations we can make by looking at the diagram:
- Most of the application execution time is spent in the LongTask thread. It is therefore an obvious candidate for further optimization, such as splitting LongTask into multiple short tasks that can run in parallel.
- The dependency between the LongTask and the ShortTask threads adds ~10% to the total execution time of these threads. Removing this dependency would let the two threads run in parallel and reduce the total execution time. It is worth examining this dependency and checking whether it is really needed.
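The ~10% figure can be checked with a little arithmetic. The sketch below assumes that LongTask runs for about 3.6 seconds and that ShortTask's own work takes about 0.4 seconds; the latter is inferred from the main thread's wait on ShortTask, not measured directly. It also assumes the two tasks could run on separate cores with no other dependency between them. ParallelSavings is a name made up for this example.

```cpp
#include <algorithm>
#include <cassert>

// Fraction of the serialized run time that would be saved if the two tasks
// ran fully in parallel instead of one after the other.
double ParallelSavings(double longSeconds, double shortSeconds) {
    assert(longSeconds > 0.0 && shortSeconds >= 0.0);
    const double serialized = longSeconds + shortSeconds;         // 3.6 + 0.4
    const double parallel = std::max(longSeconds, shortSeconds);  // 3.6
    return (serialized - parallel) / serialized;
}
```

With these assumed inputs, ParallelSavings(3.6, 0.4) evaluates to roughly 0.10: about 0.4 of the 4.0 serialized seconds would disappear if ShortTask did not have to wait for LongTask.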
To summarize, the Locks and Waits analysis is useful for understanding the runtime behavior of the application and identifying the thread dependencies which affect the overall performance.
- The sample source code is available here - deps.cpp
- A free evaluation copy of Intel Parallel Amplifier can be downloaded from /en-us/intel-parallel-studio-home
- By installing or copying all or any part of the software components in this site, you agree to the terms of the Intel Sample Source Code License Agreement