OpenMP* Analysis from the Command Line

Use the Intel® VTune™ Profiler command line interface for performance analysis of OpenMP* applications compiled with the Intel® Compiler.
Prerequisites:
  • To analyze OpenMP parallel regions, make sure to compile and run your code with the Intel® Compiler 13.1 Update 2 or higher (part of the Intel Composer XE 2013 Update 2). If an obsolete version of the OpenMP runtime libraries is detected, VTune Profiler provides a warning message. In this case the collection results may be incomplete.
    To access the newest OpenMP analysis options described in the documentation, make sure you always use the latest version of the Intel compiler.
  • On Linux*, to analyze an OpenMP application compiled with GCC*, make sure the GCC OpenMP library (libgomp.so) contains symbol information. To verify, search for libgomp.so and use the nm command to check symbols, for example:
    nm libgomp.so.1.0.0
    If the library does not contain any symbols, either install/compile a new library with symbols or generate debug information for the library. For example, on Fedora* you can install GCC debug information from the yum repository:
    yum install gcc-debuginfo.x86_64
OpenMP is a fork-join parallel model, which starts with an OpenMP program running with a single master serial-code thread. When a parallel region is encountered, that thread forks into multiple threads, which then execute the parallel region. At the end of the parallel region, the threads join at a barrier, and then the master thread continues executing serial code. It is possible to write an OpenMP program more like an MPI program, where the master thread immediately forks to a parallel region and constructs such as barrier and single are used for work coordination. But it is far more common for an OpenMP program to consist of a sequence of parallel regions interspersed with serial code.
Ideally, parallelized applications have working threads doing useful work from the beginning to the end of execution, utilizing 100% of available CPU core processing time. In real life, useful CPU utilization is likely to be less when working threads are waiting, either actively spinning (for performance, expecting to have a short wait) or waiting passively, not consuming CPU. There are several major reasons why working threads wait, not doing useful work:
  • Execution of serial portions (outside of any parallel region): When the master thread is executing a serial region, the worker threads are in the OpenMP runtime waiting for the next parallel region.
  • Load imbalance: When a thread finishes its part of the workload in a parallel region, it waits at a barrier for the other threads to finish.
  • Not enough parallel work: The number of loop iterations is less than the number of working threads, so several threads from the team wait at the barrier without doing any useful work.
  • Synchronization on locks: When synchronization objects are used inside a parallel region, threads can wait on a lock release, contending with other threads for a shared resource.
VTune Profiler, together with Intel Composer XE 2013 Update 2 or higher, helps you understand how an application utilizes available CPUs and identify causes of CPU underutilization.
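The waiting patterns listed above can be sketched with a small model. The Python snippet below is only an illustration, not VTune Profiler output or an OpenMP runtime API: the function name idle_time_per_thread and the input values are hypothetical, and the model assumes a single serial section run by the master thread (thread 0) followed by a single parallel region that ends at a barrier.

```python
def idle_time_per_thread(serial_time, work_per_thread):
    """Return (elapsed, idle) for one serial section plus one parallel region.

    Toy model (an assumption, not how VTune Profiler measures time):
    thread 0 is the master and runs the serial section alone, then every
    thread runs its share of the region and waits at the closing barrier.
    """
    # The barrier releases only when the slowest thread arrives.
    region_time = max(work_per_thread)
    elapsed = serial_time + region_time

    idle = []
    for tid, work in enumerate(work_per_thread):
        wait = region_time - work      # load imbalance, or no work at all
        if tid != 0:
            wait += serial_time        # workers wait during serial code
        idle.append(wait)
    return elapsed, idle


# Hypothetical input: 2 s of serial code, then four threads with uneven work.
# The last thread gets no iterations ("not enough parallel work").
elapsed, idle = idle_time_per_thread(2.0, [4.0, 3.0, 1.0, 0.0])
print(elapsed)  # 6.0
print(idle)     # [0.0, 3.0, 5.0, 6.0]
```

In this model only the master thread is busy for the whole run; every other thread accumulates wait time from both the serial section and the barrier.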

Configure and Run an Analysis

To run the OpenMP analysis from the command line, use the threading or hpc-performance analysis types. For example:
vtune -collect hpc-performance -- myApp
The HPC Performance Characterization analysis generates a summary report with OpenMP metrics and descriptions of detected performance issues.
For the Threading and HPC Performance Characterization analysis types, the OpenMP analysis option is enabled by default. You may also create a custom analysis and explicitly enable this knob option: analyze-openmp=true. For example:
vtune -collect-with runsa -knob analyze-openmp=true -knob enable-user-tasks=true -- myApp

View Summary Report Data

When the data collection is complete, VTune Profiler automatically generates the summary report. Similar to the Summary window, available in the GUI, the summary report provides overall performance data of your target.
Use the following syntax to generate the Summary report from a pre-existing result:
vtune -report summary -result-dir <result_path>
For HPC Performance Characterization analysis, the command-line summary report provides an issue description for metrics that exceed the predefined threshold. If you want to skip issues in the summary report, do one of the following:
  • Use the -report-knob show-issues=false option when generating the report, for example:
    vtune -report summary -r r001hpc -report-knob show-issues=false
  • Use the option to view the report in the CSV format, for example:
    vtune -report summary -r r001hpc -format=csv
Explore the OpenMP Analysis section of the summary report for inefficiencies in parallelization of the application:
Serial Time: 0.069s (0.3%)
Parallel Region Time: 23.113s (99.7%)
    Estimated Ideal Time: 14.010s (60.4%)
    OpenMP Potential Gain: 9.103s (39.3%)
     | The time wasted on load imbalance or parallel work arrangement is
     | significant and negatively impacts the application performance and
     | scalability. Explore OpenMP regions with the highest metric values.
     | Make sure the workload of the regions is enough and the loop schedule
     | is optimal.
This section shows the Collection Time as well as the duration of serial (outside of any parallel region) and parallel portions of the program. If the serial portion is significant, consider options to minimize serial execution, either by introducing more parallelism or by doing algorithm or microarchitecture tuning for sections that seem unavoidably serial. For high thread-count machines, serial sections have a severe negative impact on potential scaling (Amdahl's Law) and should be minimized as much as possible.
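To make the Amdahl's Law remark concrete, the following Python sketch computes the upper bound on speedup for a given serial fraction. It is a worked example of the textbook formula, not something VTune Profiler reports; the function name amdahl_speedup and the thread count of 24 are assumptions chosen for illustration.

```python
def amdahl_speedup(serial_fraction, n_threads):
    """Upper bound on speedup per Amdahl's Law.

    serial_fraction: share of single-threaded run time that stays serial (0..1).
    n_threads: number of threads working on the parallel part.
    """
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_threads)


# Hypothetical serial fractions on an assumed 24-thread machine.
for s in (0.003, 0.05, 0.10):
    print(f"serial fraction {s:.1%}: at most {amdahl_speedup(s, 24):.1f}x speedup")
```

Even a 10% serial portion caps the speedup on 24 threads at roughly 7x, which is why serial time matters more as the thread count grows.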

Estimate Potential Gain

To estimate the efficiency of CPU utilization in the parallel part of the code, use the Potential Gain metric. This metric estimates the difference in the Elapsed time between the actual measurement and an idealized execution of parallel regions, assuming perfectly balanced threads and zero overhead of the OpenMP runtime on work arrangement. Use this data to understand the maximum time that you may save by improving parallel execution.
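As a worked check of this definition, the Python sketch below recomputes Potential Gain from the sample summary report values shown earlier on this page. The function names are hypothetical, and the choice of Serial Time plus Parallel Region Time as the denominator for the percentage is an assumption inferred from the sample numbers rather than a documented formula.

```python
def potential_gain(parallel_region_time, estimated_ideal_time):
    """Difference between measured and idealized parallel region time."""
    return parallel_region_time - estimated_ideal_time


def percent_of_total(value, serial_time, parallel_region_time):
    """Share of (serial + parallel) time; the denominator is an assumption."""
    return 100.0 * value / (serial_time + parallel_region_time)


# Values copied from the sample summary report.
serial_time = 0.069
parallel_region_time = 23.113
estimated_ideal_time = 14.010

gain = potential_gain(parallel_region_time, estimated_ideal_time)
share = percent_of_total(gain, serial_time, parallel_region_time)
print(f"OpenMP Potential Gain: {gain:.3f}s ({share:.1f}%)")
```

With these inputs the result matches the 9.103s (39.3%) reported in the sample summary.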
Use the hotspots report to identify the hottest program units. Use the following command to list the top five parallel regions with the highest Potential Gain metric values:
vtune -report hotspots -result-dir r001hpc -group-by=region -sort-desc="OpenMP Potential Gain:Self" -column="OpenMP Potential Gain:Self" -limit 5
where
  • -report hotspots is the hotspots report type
  • -group-by=region is the action-option to group data in the report by OpenMP Regions
  • -sort-desc="OpenMP Potential Gain:Self" is the action-option to sort data by OpenMP Potential Gain in descending order
  • -column="OpenMP Potential Gain:Self" is the action-option to display only the OpenMP Potential Gain metric in the report
  • -limit 5 is the action-option to set the number of top items to include in the report
The command above produces the following output:
OpenMP Region                                                     OpenMP Potential Gain
----------------------------------------------------------------  ---------------------
compute_rhs_$omp$parallel:24@/root/work/apps/OMP/SP/rhs.f:17:433  3.417s
x_solve_$omp$parallel:24@/root/work/apps/OMP/SP/x_solve.f:27:315  0.920s
z_solve_$omp$parallel:24@/root/work/apps/OMP/SP/z_solve.f:31:321  0.913s
y_solve_$omp$parallel:24@/root/work/apps/OMP/SP/y_solve.f:27:310  0.806s
pinvr_$omp$parallel:24@/root/work/apps/OMP/SP/pinvr.f:20:41       0.697s
If Potential Gain for a region is significant, you can go deeper and analyze inefficiency metrics like Imbalance by barriers. Use the following command:
vtune -report hotspots -result-dir r001hpc -group-by=region,barrier -sort-desc="OpenMP Potential Gain:Self" -column="OpenMP Potential Gain" -limit 5
where
  • -report hotspots is the hotspots report type
  • -group-by=region,barrier is the action-option to group data in the report by OpenMP Regions and OpenMP Barrier-to-Barrier Segment
  • -sort-desc="OpenMP Potential Gain:Self" is the action-option to sort data by OpenMP Potential Gain in descending order
  • -column="OpenMP Potential Gain" is the action-option to display the metrics with the OpenMP Potential Gain string in their names (including OpenMP Potential Gain:Imbalance and others)
  • -limit 5 is the action-option to set the number of top items to include in the report
The command above produces the output that includes the following data:
OpenMP Region                                                     OpenMP Barrier-to-Barrier Segment                                  OpenMP Potential Gain  OpenMP Potential Gain:Imbalance  OpenMP Potential Gain:Lock Contention  OpenMP Potential Gain:Creation  OpenMP Potential Gain:Scheduling
----------------------------------------------------------------  -----------------------------------------------------------------  ---------------------  -------------------------------  -------------------------------------  ------------------------------  --------------------------------
compute_rhs_$omp$parallel:24@/root/work/OMP/SP/rhs.f:17:433       compute_rhs_$omp$loop_barrier_segment@/root/work/OMP/SP/rhs.f:285  0.985s                 0.982s                           0s                                     0s                              0.000s
x_solve_$omp$parallel:24@/home/root/work/OMP/SP/x_solve.f:27:315  x_solve_$omp$loop_barrier_segment@/root/work/OMP/SP/x_solve.f:315  0.920s                 0.904s                           0.012s                                 0.000s                          0.000s
z_solve_$omp$parallel:24@/root/work/OMP/SP/z_solve.f:31:321       z_solve_$omp$loop_barrier_segment@/root/work/OMP/SP/z_solve.f:321  0.913s                 0.910s                           0.000s                                 0.000s                          0.000s
y_solve_$omp$parallel:24@/root/work/OMP/SP/y_solve.f:27:310       y_solve_$omp$loop_barrier_segment@/root/work/OMP/SP/y_solve.f:310  0.806s                 0.803s                           0.000s                                 0.000s                          0.000s
Analyze the data in the OpenMP Potential Gain columns. These columns break down Potential Gain in the region by the cost (in elapsed time) of each inefficiency, normalized by the number of OpenMP threads. The elapsed time cost helps you decide whether addressing a particular type of inefficiency is worth the investment.
VTune Profiler can recognize the following types of inefficiencies:
  • Imbalance: threads finish their work at different times and wait at a barrier. If imbalance time is significant, try a dynamic type of scheduling. The Intel OpenMP runtime library from Intel Parallel Studio Composer Edition reports precise imbalance numbers, so these metrics do not depend on statistical accuracy, unlike the other inefficiencies, which are calculated based on sampling.
  • Lock Contention: threads are waiting on contended locks or "ordered" parallel loops. If the time of lock contention is significant, try to avoid synchronization inside a parallel construct by using reduction operations, thread local storage, or less costly atomic operations for synchronization.
  • Creation: overhead on a parallel work arrangement. If the time for parallel work arrangement is significant, try to make parallelism more coarse-grained by moving parallel regions to an outer loop.
  • Scheduling: OpenMP runtime scheduler overhead on a parallel work assignment for working threads. If scheduling time is significant, which often happens for dynamic types of scheduling, you can use a "dynamic" schedule with a bigger chunk size or a "guided" type of schedule.
  • Atomics: OpenMP runtime overhead on performing atomic operations.
  • Reduction: time spent on reduction operations.
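The effect of the schedule type on imbalance can be approximated with a toy simulation. The Python sketch below is not the OpenMP runtime and ignores scheduling overhead entirely: the function names simulate and imbalance, the iteration costs, and the thread count are all hypothetical. It models a static schedule as dealing chunks to threads round-robin, and a dynamic schedule as handing the next chunk to whichever thread becomes free first.

```python
import heapq


def simulate(costs, n_threads, chunk, dynamic):
    """Return the busy time of each thread for a loop with the given costs.

    Toy model (assumption): no scheduling overhead, all threads start at 0.
    """
    # Cost of each chunk of consecutive iterations.
    chunks = [sum(costs[i:i + chunk]) for i in range(0, len(costs), chunk)]
    busy = [0.0] * n_threads

    if not dynamic:
        # Static: chunk k always goes to thread k mod n_threads.
        for k, cost in enumerate(chunks):
            busy[k % n_threads] += cost
        return busy

    # Dynamic: the earliest-free thread takes the next chunk.
    free = [(0.0, tid) for tid in range(n_threads)]
    heapq.heapify(free)
    for cost in chunks:
        free_at, tid = heapq.heappop(free)
        busy[tid] = free_at + cost
        heapq.heappush(free, (busy[tid], tid))
    return busy


def imbalance(busy):
    """Average time a thread waits at the closing barrier."""
    return max(busy) - sum(busy) / len(busy)


# Hypothetical loop: 16 iterations whose cost grows from 1 to 16 time units.
costs = list(range(1, 17))
static_blocks = simulate(costs, n_threads=4, chunk=4, dynamic=False)
dynamic_small = simulate(costs, n_threads=4, chunk=1, dynamic=True)
print(static_blocks, imbalance(static_blocks))  # [10.0, 26.0, 42.0, 58.0] 24.0
print(dynamic_small, imbalance(dynamic_small))  # [28.0, 32.0, 36.0, 40.0] 6.0
```

In this model the dynamic schedule cuts the average barrier wait from 24 to 6 time units, but it also dispatches 16 chunks instead of 4. In a real runtime each dispatch has a cost, which is the trade-off behind the advice to use a bigger chunk size when Scheduling time is significant.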

Limitations

VTune Profiler supports the analysis of parallel OpenMP regions with the following limitations:
  • Maximum number of supported lexical parallel regions is 512, which means that no region annotations will be emitted for regions whose scope is reached after 512 other parallel regions are encountered.
  • Regions from nested parallelism are not supported. Only top-level items emit regions.
  • VTune Profiler does not support static linkage of OpenMP libraries.

Product and Performance Information

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804