by Clay P. Breshears
Parallel Applications Engineer
In the past, software profiling tools have concentrated on measuring the execution time of functions and procedures within applications. For serial applications this was useful information that guided the programmer to those parts of the code that could most benefit from optimization or threading. For applications that have already been threaded, this type of information often has little value.
Information that relates directly to performance issues from threaded and parallel execution is of much more use. Valuable data that could be used to reduce the overall execution time of a threaded application would include the amount of idle processing time that is spent by threads at implicit barriers at the end of OpenMP* constructs or waiting time to gain access to locks or critical regions, time spent in critical regions of code, and overhead time required to execute iteration scheduling for parallel loops, just to mention a few. The structured nature of parallel and serial regions, and the thread execution model of OpenMP, strictly fork-join, lends itself very easily to the measurement of these and other potential performance problems by Intel® Thread Profiler.
Algorithms coded with explicit threading libraries, such as Win32* Threads, do not have the structured execution that is inherent in OpenMP. In fact, there are very few restrictions on how such threaded applications can be designed with regard to threads and their interactions. Even if a programmer were to mimic the execution of OpenMP threads, an automated tool would have a nearly impossible task of identifying this fact. Thus, a different method of analyzing the performance of explicitly threaded applications is required. Such a method has been developed and included in Intel Thread Profiler 2.0 in order to measure and identify performance problems in explicitly threaded code.
This is the first article in a two-part series that explains what Intel Thread Profiler for Win32 Threads does and how to use it effectively on your own explicitly threaded applications. Part One introduces the reader to the concepts and ideas used by the tool, as well as the kinds of data that are collected to measure processor utilization and to perform critical path analysis. Part Two will deal with the specifics of using the Thread Profiler tool on an example code to show how these concepts are put into practice to measure performance of threaded applications. Methodologies to identify and locate typical performance issues of threaded code with the tool will also be given.
The reader need not be acquainted with Intel® Threading Tools to understand the discussion in Part One. Familiarity with use of Intel Thread Profiler, however, will be assumed in Part Two.
Example of Threaded Execution
There are two main performance metrics that Intel Thread Profiler for Win32 Threads can measure and display: (1) Utilization of processing resources and (2) Characterization of how the execution of threads is impacted by other threads and events. Before covering the relevant metrics and concepts used by the tool, we present a simple example application using three threads on a dual-processor system.
Figure 1 shows an execution timeline of the three threads. At E0 the main process thread (T1) begins execution. At E1, thread T1 creates two other threads (T2 and T3). These threads begin execution at E2. Thread T1 calls WaitForMultipleObjects at E3, in order to wait until the created threads have completed execution. Some time after E2 and before E4, thread T2 acquires Lock L, so that when thread T3 requests Lock L at E4, it must wait. At E5, thread T2 releases Lock L and, at E6, thread T3 gains control of the lock. Thread T3 is blocked waiting for some external event at E7 that completes at E8. Thread T2 terminates at E9 while thread T3 terminates at E10. At E11, thread T1 has received signals that both threads being waited on have finished and so proceeds with some final computations before terminating itself at E12.
Figure 1. Example Threaded Execution
Utilization of Resources
Processor utilization is measured with the concurrency level at each stage of the threaded execution. Under Intel Thread Profiler for Win32 Threads, the concurrency level is defined to be the number of threads that are active at any given time -- that is, those threads that are executing or available for execution, not waiting or sleeping or blocked by any event or synchronization request. The concurrency level of the example application for the timeline in Figure 1 is given in the table below. Each entry corresponds to a separate time slice between two consecutive events and is labeled with the second event of the pair.
|Ending Event|Concurrency Level|Ending Event|Concurrency Level|
|---|---|---|---|
|E1|1|E7|2|
|E2|1|E8|1|
|E3|3|E9|2|
|E4|2|E10|1|
|E5|1|E11|0|
|E6|1|E12|1|
Table 1. Concurrency levels for threaded execution
Intel Thread Profiler for Win32 Threads defines five classifications of concurrency level:
- Idle – No threads are active; all are blocked or waiting for external events.
- Serial – A single thread is active. For some portion of the application, this will be a requirement (for example, startup, shutdown, initialization of global data). Unexpected or excessive time in serial execution may indicate the need for serial tuning or that the parallelism of the algorithm is not being effectively exploited.
- Undersubscribed – Fewer threads than available processors are active. If more threads were active, the processor resources would be better utilized, and the execution time of the application could be reduced. Time in this class may also indicate load imbalance between threads.
- Parallel – The same number of threads as processors are active. This is the ideal situation since all processing resources are being utilized by the application. Increasing the amount of time the application spends in this class while reducing the time spent in the previous classes should be a primary goal of threaded performance tuning.
- Oversubscribed – More threads are active than there are processors. While not ideal, this situation is not as detrimental as underutilization of processors. However, time spent in this class may indicate that the application could execute with fewer threads and still maintain current performance levels.
Intel Thread Profiler for Win32 Threads gives a histogram summary of the concurrency level of the application. Bar color is further used to visually indicate the class of the concurrency level based on the platform architecture that was used to execute the application. Figure 2 shows an idealized version of the concurrency level histogram given by Intel Thread Profiler for Win32 Threads from the example application on a two-processor system. Height of the bars indicates the time spent in each concurrency level. Gray indicates that no threads are active (idle class); red and yellow show serial and undersubscribed execution; green indicates full parallelism; and oversubscribed time is indicated with blue bars.
Figure 2. Example Concurrency Level Histogram
For a dual-processor system, as in our example, serial time is the only case of running fewer threads than processors. Running on a four-processor system would add further refinements of the undersubscribed class, namely, two and three active threads. In fact, Intel Thread Profiler for Win32 Threads displays a separate bar for each distinct number of active threads observed during execution of the application. The classification and color of each bar depend on the number of processors available on the execution platform.
Intel Thread Profiler for Win32 Threads defines an execution flow as the time a thread is running during the course of application execution. A flow ends when a thread waits or terminates. A flow will split into two separate flows when a thread creates a new thread or signals (unblocks) another thread. Thus, at any given time, there can be multiple flows coinciding within an execution and these flows may move across threads as the threads interact with each other and the synchronization objects shared between them.
For the example application, there are four distinct flows. The first is thread T1 from E0 to E3. This flow splits into two additional flows at E1 onto threads T2 and T3. The flow on thread T3 terminates at E4 when the thread waits for Lock L. The T2 flow splits into two flows at E5. The original T2 flow continues until it terminates at E9 with the termination of the thread. The flow that is split off migrates to thread T3 when T2 releases the lock. This flow continues on thread T3 across the external blocking event, since it is not another thread that causes T3 to pause. The flow attaches to thread T1 upon termination of thread T3, which unblocks T1. This flow finally ends when T1 terminates.
The longest execution flow is known as the critical path in Intel Thread Profiler for Win32 Threads. The critical path for the example application is the flow that starts on thread T1, moves to thread T2 at E2, moves to thread T3 at E6, and then back to thread T1 at E11. This is a different usage of the term critical path from how it is used in Call Graph analysis within the Intel® VTune™ Performance Analyzer. In that case, critical path is the path (set of branches) through the call graph tree that accounted for the most execution time.
The critical path for a given execution cannot be determined by Intel Thread Profiler for Win32 Threads until the application has ceased execution. Between E5 and E9, two paths are potential candidates for the critical path. If thread T3 had terminated before thread T2, the execution flow through T2 would have been the critical path. Once a path is no longer a candidate for the critical path, the data Intel Thread Profiler for Win32 Threads collected for it is discarded, in order to keep the amount of data collected and saved to a minimum.
If the execution time of the critical path can be shortened, then the entire application execution time will be shortened. By focusing data collection on the critical path, Intel Thread Profiler for Win32 Threads is able to watch which threads are on the critical path and which objects cause transitions of the critical path to other threads. Thus, the tool can identify which threads may be blocking other threads from running and which objects were used to block these threads.
Along the critical path, Intel Thread Profiler for Win32 Threads defines four classifications of thread interaction:
- Cruise Time - Time a thread does not delay the next thread on the critical path.
- Overhead Time - Threading synchronization or operating system scheduling overhead.
- Blocking Time - Time a thread spends waiting for an external event or blocking while still on the critical path (includes timeouts).
- Impact Time - Time a thread on the critical path delays the next thread from getting on the critical path.
For the example application, the time on the critical path between thread T1's creation of thread T2 and the start of T2's execution (E1 to E2), and the time between the termination of thread T3 and the resumption of thread T1 (E10 to E11), will be counted as Overhead Time. The time between E7 and E8, where thread T3 was blocked waiting for an external event, will be considered Blocking Time. Since thread T2 was holding Lock L, directly preventing thread T3 from proceeding, the portion of the critical path between E4 and E5 will be Impact Time.
The time on the critical path between E2 and E4 is Cruise Time. While the continued execution of threads T2 and T3 was collectively responsible for the blocking of thread T1, this is still Cruise Time, since no single thread on the critical path during that stage of the execution was blocking any other thread. The time between E9 and E10 will be Impact Time, since thread T3, on the critical path at that point, was the only thread preventing thread T1 from resuming execution.
Intel Thread Profiler for Win32 Threads uses a histogram summary for displaying the classifications of time on the critical path. This data is overlaid on the Concurrency Level histogram and indicated by intensity of color: Impact Time is drawn in a much brighter coloration, while Cruise Time is drawn in paler shades. Figure 3 shows an idealized concurrency histogram with the critical path classification data. The bright yellow indicates overhead when no threads were active (after T3 terminates, E10 to E11) and when a single thread was active (creation of T2 and T3, and after T2 releases Lock L) on the critical path. Most of the parallel time (2 threads) was Cruise Time (pale green). Only for a short span (T3 between E8 and E9) is one thread known to be directly impacting the execution of another thread (T1), shown in bright green.
Figure 3. Critical Path Data within Concurrency Histogram
This article has presented the two major concepts and metrics used by Intel Thread Profiler for Win32 Threads to measure the performance of explicitly threaded applications. Concurrency Level is a measure of how well the processing resources of a system were utilized throughout the run of a threaded code. The “sweet spot” is to have the same number of threads actively executing as there are processors available. Running fewer threads than processors leaves processing cycles idle, while running more threads at least keeps the processors busy. Oversubscription of resources can, however, be an indication that the same performance might be achieved with fewer threads.
Critical Path Analysis shows how threads interact with each other during execution. This analysis focuses on the execution flow whose performance characteristics would have the most impact on overall performance. Programmers should concentrate on the critical path when tuning the code. Reorganization of computations where one thread blocks another thread from execution or a thread is blocked by external events can lead to better performance.
In Part Two, we will examine how the ideas presented here are articulated within Intel Thread Profiler for Win32 Threads and how the information displayed within the tool can be used to point out and identify performance problems in threaded code.
The author wishes to thank Douglas Armstrong and the Intel Threading Tools Development Team for their assistance with facts, details, and examples that went into the research and writing of this article.
About the Author
Clay Breshears is currently a Course Architect for the Intel Academic Community, specializing in multi-core and multithreaded programming and training.