Develop a methodology for the tuning phase of the development cycle. The tuning phase increases performance incrementally where possible.
Identify potential performance problems using Intel® Thread Profiler. OpenMP* divides a program into structured parallel and serial regions and uses a strictly fork-join thread execution model, which makes such problems straightforward to measure with Intel Thread Profiler.
Algorithms coded with explicit threading libraries, such as Win32* Threads, do not have the structured execution inherent in OpenMP. In fact, there are very few restrictions on how such threaded applications can be designed with regard to threads and their interactions. Even if a programmer were to mimic the execution of OpenMP threads, an automated tool would find it nearly impossible to recognize this fact. Thus, a different method of analyzing the performance of explicitly threaded applications is required. Intel Thread Profiler includes such a method to measure and identify performance problems in explicitly threaded code.
Overviews of using Intel Thread Profiler for OpenMP and explicitly threaded applications are given below. Note that while this example focuses on Win32 threads, the Linux* developer will also benefit from the white paper "Intel® Tools for Thread-Oriented Development on Linux*."
Use Intel Thread Profiler for OpenMP applications: In order to view performance statistics collected for an OpenMP application, build the application with the /Qopenmp_profile option to link in the VTune™ environment for runtime replacement of the OpenMP library with the instrumented version.
After building the application with the /Qopenmp_profile option, run it from the command line; upon completion, it generates a '.gvs' file that can be viewed from within the VTune environment. Intel Thread Profiler supports viewing data from multiple runs and comparing them at the same time. It supports many views that enable the user to see a performance summary of the application or a thread-specific breakdown. Since OpenMP is structured, the application can also be viewed as various regions (parallel, serial, etc.). The following figure shows sample screenshots of Intel Thread Profiler, which illustrate the Threads view for an application with thread imbalance (left) and the same application with this problem fixed (right):
The threading tools come with OpenMP tuning advice online help, which provides suggestions for OpenMP performance problems.
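Thread imbalance of the kind described for the Threads view can also be expressed as a single number. The following is a minimal sketch, not a metric reported by Intel Thread Profiler: the function name `load_imbalance` and the formula (the gap between the busiest thread and the average thread, relative to the busiest thread) are assumptions chosen for illustration.

```python
def load_imbalance(busy_times):
    """Return (max - mean) / max for a list of per-thread busy times.

    0.0 means every thread did the same amount of work; values near 1.0
    mean one thread did almost all of it. This formula is an illustrative
    assumption, not something taken from Intel Thread Profiler.
    """
    if not busy_times:
        raise ValueError("busy_times must not be empty")
    longest = max(busy_times)
    if longest == 0:
        return 0.0  # no thread did any work, so nothing is imbalanced
    mean = sum(busy_times) / len(busy_times)
    return (longest - mean) / longest


# Hypothetical per-thread busy times (seconds) inside one parallel region.
imbalanced = load_imbalance([10, 4, 4, 2])
balanced = load_imbalance([5, 5, 5, 5])
print(imbalanced, balanced)
```

With the sample values, the first region reports an imbalance of 0.5 (the average thread is busy only half as long as the slowest one), while the second reports 0.0.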
Use Intel Thread Profiler for explicitly threaded applications: Processor utilization is measured with the concurrency level at each stage of the threaded execution. Under Intel Thread Profiler for Windows*, the concurrency level is defined to be the number of threads that are active at any given time; that is, those threads that are executing or available for execution, not waiting, sleeping, or blocked by any event or synchronization request.
Intel Thread Profiler for Win32 Threads defines five classifications of concurrency level:
- Idle – No threads are active; all are blocked or waiting for external events.
- Serial – A single thread is active. For some portion of the application, this will be a requirement (for example, startup, shutdown, initialization of global data). Unexpected or excessive time in serial execution may indicate the need for serial tuning or that the parallelism of the algorithm is not being effectively exploited.
- Undersubscribed – Fewer threads than available processors are active. If more threads were active, the processor resources would be better utilized, and the execution time of the application could be reduced. Time in this class may also indicate load imbalance between threads.
- Parallel – The same number of threads as processors are active. This is the ideal situation since all processing resources are being utilized by the application. Increasing the amount of time the application spends in this class while reducing the time spent in the previous classes should be a primary goal of threaded performance tuning.
- Oversubscribed – More threads are active than there are processors. While not ideal, this situation is not as detrimental as underutilization of processors. However, time spent in this class may indicate that the application could execute with fewer threads and still maintain current performance levels.
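The five classes above reduce to a comparison between the number of active threads and the number of processors. The sketch below illustrates that mapping; the function name `classify_concurrency` is hypothetical, and treating one active thread as serial even on a single-processor system is an assumption, since the definitions above do not cover that case.

```python
def classify_concurrency(active_threads: int, processors: int) -> str:
    """Map a count of active threads to one of the five concurrency classes."""
    if active_threads < 0 or processors < 1:
        raise ValueError("need active_threads >= 0 and processors >= 1")
    if active_threads == 0:
        return "idle"
    if active_threads == 1:
        # Assumption: a single active thread counts as serial even when
        # the machine has only one processor.
        return "serial"
    if active_threads < processors:
        return "undersubscribed"
    if active_threads == processors:
        return "parallel"
    return "oversubscribed"


# Classify several concurrency levels on a hypothetical 4-processor system.
for level in (0, 1, 3, 4, 6):
    print(level, classify_concurrency(level, processors=4))
```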
Intel Thread Profiler for Windows gives a histogram summary of the concurrency level of the application. The figure below shows an idealized version of the concurrency level histogram given by Intel Thread Profiler. The height of each bar indicates the time spent at that concurrency level. Gray indicates that no threads are active (idle class); red and yellow show serial and undersubscribed execution; green indicates full parallelism; and oversubscribed time is indicated with blue bars.
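To make the idea of a concurrency level histogram concrete, the following sketch computes the time spent at each concurrency level from a list of intervals during which threads were active. It is an illustration only, not how Intel Thread Profiler collects its data; the function name `concurrency_histogram` and the sample intervals are hypothetical.

```python
from collections import defaultdict


def concurrency_histogram(intervals):
    """Return {concurrency level: total time} for (start, end) active intervals.

    Each interval is one stretch of time during which some thread was active.
    Only the span between the earliest start and the latest end is counted.
    """
    events = []
    for start, end in intervals:
        if end < start:
            raise ValueError("interval ends before it starts")
        events.append((start, 1))   # a thread becomes active
        events.append((end, -1))    # a thread stops being active
    events.sort()

    histogram = defaultdict(float)
    level = 0
    previous = None
    for time, delta in events:
        if previous is not None and time > previous:
            histogram[level] += time - previous
        level += delta
        previous = time
    return dict(histogram)


# Hypothetical trace: one long-running thread, two short helpers, and a
# final serial phase after a 2-unit idle gap.
trace = [(0, 10), (2, 6), (4, 6), (12, 15)]
print(concurrency_histogram(trace))
```

For the sample trace, 2 time units are spent at level 0, 9 at level 1, 2 at level 2, and 2 at level 3. Comparing each level with the processor count then yields the idle, serial, undersubscribed, parallel, and oversubscribed bars.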
Intel Thread Profiler for Windows defines an execution flow as the time a thread is running during the course of application execution. A flow ends when a thread waits or terminates. A flow will split into two separate flows when a thread creates a new thread or signals (unblocks) another thread. Thus, at any given time, there can be multiple flows coinciding within an execution, and these flows may move across threads as the threads interact with each other and the synchronization objects shared between them.
The longest execution flow is known as the critical path. This usage of the term differs from its usage in Call Graph analysis within the VTune™ Performance Analyzer, where the critical path is the path (set of branches) through the call graph tree that accounts for the most execution time.
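One way to picture the critical path is as the longest chain through a graph of flow segments, where an edge means that one segment can continue into another (for example, because a thread creates or signals another thread). The sketch below is a simplified model under that assumption and does not reflect how Intel Thread Profiler tracks flows internally; the function name `critical_path` and the segment names are hypothetical.

```python
from functools import lru_cache


def critical_path(durations, successors):
    """Return (total_time, segments) for the longest chain of flow segments.

    durations:  {segment: time spent in that segment}
    successors: {segment: [segments that may follow it]}
    The graph is assumed to be acyclic.
    """
    @lru_cache(maxsize=None)
    def best_from(segment):
        # Longest chain that starts at `segment`.
        tails = [best_from(nxt) for nxt in successors.get(segment, ())]
        tail_time, tail = max(tails, key=lambda r: r[0], default=(0, ()))
        return durations[segment] + tail_time, (segment,) + tail

    return max((best_from(s) for s in durations), key=lambda r: r[0])


# Hypothetical run: the main thread initializes, starts two workers,
# and continues once both have finished.
durations = {"main:init": 3, "worker1": 5, "worker2": 8, "main:join": 2}
successors = {
    "main:init": ["worker1", "worker2"],
    "worker1": ["main:join"],
    "worker2": ["main:join"],
}
total, path = critical_path(durations, successors)
print(total, path)
```

In this example, the longest chain runs through `worker2`, so shortening `worker1` would not reduce the total run time, whereas shortening `worker2` would.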
The critical path for a given execution cannot be determined within Intel Thread Profiler until the application has ceased execution. As paths stop being potential candidates for the critical path, the full set of data that Intel Thread Profiler collected for them is discarded. This keeps the amount of data collected and saved to a minimum.
If the execution time of the critical path can be shortened, then the entire application execution time will be shortened. By focusing data collection on the critical path, Intel Thread Profiler is able to watch which threads are on the critical path and which objects cause transitions of the critical path to other threads. Thus, the tool can identify which threads may be blocking other threads from running, and which objects were used to block these threads. Along the critical path, Intel Thread Profiler defines these four classifications of thread interactions:
- Cruise Time – The time a thread does not delay the next thread on the critical path.
- Overhead Time – Threading synchronization or operating system scheduling overhead.
- Blocking Time – The time a thread spends waiting for an external event or blocking while still on the critical path (includes timeouts).
- Impact Time – The time a thread on the critical path delays the next thread from getting on the critical path.
Intel Thread Profiler uses a histogram summary, overlaid within the Concurrency Level histogram, to display the classifications of time on the critical path. Impact Time is indicated by a much brighter coloration, while Cruise Time is indicated with paler shades. The following figure shows an idealized concurrency histogram with the Critical Path classification data. The bright yellow color indicates overhead when no threads were active, and when a single thread was active on the Critical Path. Most of the parallel time was Cruise Time (pale green). Only for a short amount of time is one thread known to be directly impacting the execution of another thread, shown in bright green.
Additional techniques for use in the tuning phase for threaded applications are covered in separate items:
- VTune™ Performance Analyzer to Detect Idle Time
- Use the VTune™ Performance Analyzer to Track Down Overhead
- Threading Methodology: Principles and Practices
- Using Intel® Thread Profiler for Win32* Threads: Philosophy and Theory