OpenMP* Code Analysis Method
- Execution of serial portions (outside of any parallel region): When the master thread is executing a serial region, the worker threads are in the OpenMP runtime waiting for the next parallel region.
- Load imbalance: When a thread finishes its share of the workload in a parallel region, it waits at a barrier for the other threads to finish.
- Not enough parallel work: The number of loop iterations is less than the number of working threads, so several threads from the team wait at the barrier without doing any useful work.
- Synchronization on locks: When synchronization objects are used inside a parallel region, threads can wait on a lock release, contending with other threads for a shared resource.
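As an illustration of two of the patterns above, here is a minimal C sketch using OpenMP pragmas. The function names and workloads are hypothetical and chosen only to make the patterns visible; they are not taken from any VTune sample.

```c
#include <stddef.h>

/* Hypothetical kernel: iteration i performs i units of work. With a static
   schedule, the thread that owns the highest indices finishes last while
   the others wait at the implicit barrier (load imbalance). */
long triangular_sum_static(int n)
{
    long total = 0;
    #pragma omp parallel for schedule(static) reduction(+:total)
    for (int i = 0; i < n; ++i) {
        long local = 0;
        for (int j = 0; j < i; ++j)
            local += j;
        total += local;
    }
    return total;
}

/* Hypothetical counter: every matching iteration enters the same critical
   section, so threads contend for one lock (synchronization on locks). */
int count_even_critical(const int *values, int n)
{
    int count = 0;
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        if (values[i] % 2 == 0) {
            #pragma omp critical
            ++count;
        }
    }
    return count;
}
```

Both functions return correct results. The point is only that threads spend part of the parallel region waiting rather than working. If the code is built without OpenMP enabled (for example, without -fopenmp on GCC), the pragmas are ignored and the loops run serially with the same results.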
Compile Your Code with Recommended Options
- To analyze OpenMP parallel regions, make sure to compile and run your code with the Intel® Compiler 13.1 Update 2 or higher (part of the Intel Composer XE 2013 Update 2). If an obsolete version of the OpenMP runtime libraries is detected, VTune Profiler provides a warning message. In this case, the collection results may be incomplete. To access the newest OpenMP analysis options described in the documentation, make sure you always use the latest version of the Intel compiler.
- On Linux*, to analyze an OpenMP application compiled with GCC*, make sure the GCC OpenMP library (libgomp.so) contains symbol information. To verify, search for libgomp.so and use the nm command to check symbols, for example:
  nm libgomp.so.1.0.0
  If the library does not contain any symbols, either install/compile a new library with symbols or generate debug information for the library. For example, on Fedora* you can install GCC debug information from the yum repository:
  yum install gcc-debuginfo.x86_64
Configure OpenMP Analysis
- Click the Configure Analysis button on the Intel® VTune™ Profiler toolbar (in the standalone GUI or the Visual Studio IDE). The Configure Analysis window opens.
- From the HOW pane, click the Browse button and select an analysis type that supports OpenMP analysis: Threading, HPC Performance Characterization, Memory Access, or any Custom Analysis type.
- Select the Analyze OpenMP regions option, if it is not pre-selected (see the Details section to confirm).
- Click the Start button to run the analysis.
Explore Application-Level OpenMP Metrics
Identify Serial Code
Estimate Potential Gain
- Imbalance: threads finish their work at different times and wait at a barrier. If imbalance time is significant, try a dynamic type of scheduling. The Intel OpenMP runtime library from Intel Parallel Studio Composer Edition reports precise imbalance numbers, so this metric does not depend on statistical accuracy, unlike the other inefficiencies, which are calculated from sampling.
- Lock Contention: threads are waiting on contended locks or "ordered" parallel loops. If the time of lock contention is significant, try to avoid synchronization inside a parallel construct by using reduction operations, thread-local storage, or less costly atomic operations.
- Creation: overhead of arranging parallel work. If the time for parallel work arrangement is significant, try to make parallelism more coarse-grained by moving parallel regions to an outer loop.
- Scheduling: OpenMP runtime scheduler overhead of assigning parallel work to working threads. If scheduling time is significant, which often happens with dynamic types of scheduling, you can use a "dynamic" schedule with a bigger chunk size or a "guided" type of schedule.
- Atomics: OpenMP runtime overhead on performing atomic operations.
- Reduction: time spent on reduction operations.
- The maximum number of supported lexical parallel regions is 512. No region annotations are emitted for regions whose scope is reached after 512 other parallel regions have been encountered.
- Regions from nested parallelism are not supported. Only top-level items emit regions.
- VTune Profiler does not support static linkage of OpenMP libraries.
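The tuning suggestions given earlier for the Imbalance, Scheduling, and Lock Contention metrics can be sketched in C as follows. This is a minimal illustration, not a recommended implementation: the function names, the workload, and the chunk parameter are hypothetical, and a suitable chunk size depends on the application and should be checked by profiling.

```c
#include <stddef.h>

/* Hypothetical kernel: iteration i performs i units of work, so iterations
   are uneven. schedule(dynamic, chunk) lets idle threads pick up the next
   chunk of iterations, which reduces imbalance; a larger chunk reduces
   scheduling overhead. chunk is assumed to be positive. */
long triangular_sum_dynamic(int n, int chunk)
{
    long total = 0;
    #pragma omp parallel for schedule(dynamic, chunk) reduction(+:total)
    for (int i = 0; i < n; ++i) {
        long local = 0;
        for (int j = 0; j < i; ++j)
            local += j;
        total += local;
    }
    return total;
}

/* Hypothetical counter: a reduction gives each thread a private copy of
   count and combines the copies at the end of the loop, so no lock is
   taken inside the loop body. */
int count_even_reduction(const int *values, int n)
{
    int count = 0;
    #pragma omp parallel for reduction(+:count)
    for (int i = 0; i < n; ++i) {
        if (values[i] % 2 == 0)
            ++count;
    }
    return count;
}
```

Replacing schedule(dynamic, chunk) with schedule(guided) is the other option mentioned for high scheduling overhead. Which variant performs better is workload-dependent, so compare the metrics before and after the change.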