Experimental feature: Viewing load imbalance in OpenMP* applications with Intel® VTune™ Amplifier XE

In Intel® VTune™ Amplifier XE 2013 Update 12 and earlier, OpenMP applications could be profiled at the level of parallel regions, as described in the article "Profiling OpenMP* applications with Intel® VTune™ Amplifier XE" by Kirill Rogozhin.

Intel® VTune™ Amplifier XE 2013 Update 13 adds per-barrier load imbalance information for profiling OpenMP applications. The feature requires Intel® Composer XE 2013 SP1 Update 1 or newer, which includes the Intel® OpenMP runtime.

Please note that this new functionality is not documented yet and may change slightly in future releases.

Consider code like this:

#pragma omp parallel
{
    // Code 1
    #pragma omp single
    { ... } // Implicit barrier        (barrier 1)
    // Code 2
} // Implicit join barrier             (barrier 2)
// Code 3
#pragma omp parallel
{
    // Code 4
    #pragma omp barrier //             (barrier 3)
    // Code 5
    #pragma omp for
    for (...) {
        // Code 6
    } // Implicit for barrier          (barrier 4)
    // Code 7
} // Implicit join barrier             (barrier 5)

Because parallel regions can contain barriers, per-region frame information is not enough to locate an imbalance. Any piece of code between two barriers can be a source of load imbalance. Parallel region frames let you estimate the load balance of the region as a whole, but they do not help you analyze the load balance between barriers.
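To make the problem concrete, here is a minimal sketch of a region whose imbalance sits between two barriers. The function names `spin` and `imbalanced_region` and the work distribution are hypothetical, chosen only for illustration. The code before the explicit barrier gives each thread a different amount of work, while the loop after it is evenly divided, so a per-region view would blend the two phases together.

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial fallback so the sketch also builds without OpenMP support. */
static int omp_get_thread_num(void) { return 0; }
#endif

/* Busy loop standing in for real work; volatile keeps the loop from
 * being optimized away. */
static void spin(long iters) {
    volatile long acc = 0;
    for (long i = 0; i < iters; ++i) {
        acc += i & 1;
    }
}

/* Hypothetical region: an imbalanced phase, an explicit barrier, then a
 * balanced work-shared loop. Returns the sum 0 + 1 + ... + (n - 1). */
long imbalanced_region(long unit, int n) {
    long total = 0;
    assert(unit >= 0 && n >= 0);

    #pragma omp parallel
    {
        /* Thread t spins for (t + 1) * unit iterations, so threads with
         * low numbers finish this phase early. */
        spin((long)(omp_get_thread_num() + 1) * unit);

        #pragma omp barrier /* early threads wait here for the slowest one */

        #pragma omp for reduction(+:total)
        for (int i = 0; i < n; ++i) {
            total += i;
        }
        /* implicit barrier at the end of the for construct */
    } /* implicit join barrier */

    return total;
}
```

Calling `imbalanced_region(100000000, 1000)` from a program built with OpenMP enabled should leave most threads waiting at the explicit barrier, while the loop that follows stays balanced.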

The new functionality provides detailed information about per-barrier balance. VTune Amplifier XE emits the following OpenMP frames, where each barrier ends a frame and starts another:

28:  #pragma omp parallel               <- Frame 1 begin
     {
         // Code 1
31:      #pragma omp single
         { ... } // Implicit barrier    <- Frame 1 end (barrier 1)
         // Code 2                      <- Frame 2 begin
34:  } // Implicit join barrier         <- Frame 2 end (barrier 2)
     // Code 3
37:  #pragma omp parallel               <- Frame 3 begin
     {
         // Code 4
44:      #pragma omp barrier            <- Frame 3 end (barrier 3)
         // Code 5                      <- Frame 4 begin
50:      #pragma omp for
         for (...) {
             // Code 6
55:      } // Implicit for barrier      <- Frame 4 end (barrier 4)
         // Code 7                      <- Frame 5 begin
59:  } // Implicit join barrier         <- Frame 5 end (barrier 5)

To control the new feature, the OpenMP runtime introduces the environment variable KMP_FORKJOIN_FRAMES_MODE, which accepts values from 0 to 3.
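The variable is normally set in the environment before the profiled application starts. As a sketch of that step, the small launcher below validates the mode, exports the variable, and then replaces itself with the command it is given. The helper names `set_frames_mode` and `launch_with_frames_mode` are hypothetical, and the code assumes a POSIX system (`setenv`, `execvp`). Exporting the variable before the process starts is the safe path; whether a value set from inside an already running OpenMP program is picked up depends on when the runtime reads its settings.

```c
#define _POSIX_C_SOURCE 200112L /* exposes setenv() under strict C modes */
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Export KMP_FORKJOIN_FRAMES_MODE so that child processes inherit it.
 * Returns 0 on success, -1 if mode is outside 0..3 or setenv() fails. */
int set_frames_mode(int mode) {
    char value[2]; /* one digit plus the terminating NUL */
    if (mode < 0 || mode > 3) {
        return -1;
    }
    snprintf(value, sizeof value, "%d", mode);
    return setenv("KMP_FORKJOIN_FRAMES_MODE", value, 1) == 0 ? 0 : -1;
}

/* Hypothetical launcher: set the mode, then replace this process with
 * the command in argv (for example, the profiler command line).
 * Returns -1 only if the mode is invalid or the exec fails. */
int launch_with_frames_mode(int mode, char *const argv[]) {
    assert(argv != NULL && argv[0] != NULL);
    if (set_frames_mode(mode) != 0) {
        return -1;
    }
    execvp(argv[0], argv); /* does not return on success */
    return -1;
}
```

In a shell, setting the variable on the command line before starting the collection achieves the same thing without any code.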

Value 0 disables per-barrier frames, so only the existing per-region frames are available.

KMP_FORKJOIN_FRAMES_MODE=1 enables frames for all barriers – explicit and implicit. In VTune Amplifier XE you will see something like this:

Note that the line number specifies the frame end point because the frame is associated with the barrier that ends it.

The corresponding Tasks and Frames view looks like this:

Setting KMP_FORKJOIN_FRAMES_MODE=2 gives even more information about thread activity. In this mode VTune Amplifier XE displays barrier-imbalance frame domains, as in the picture below. The frame Duration shows how much time passes between the moment the first thread arrives at a barrier and the moment the last thread leaves it.
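The arithmetic behind that Duration can be illustrated with a few lines of C. This is only a sketch of the definition given above, not how VTune Amplifier XE or the OpenMP runtime computes it. The function names and timestamps are hypothetical, and it assumes that all times share one unit and that the moment the last thread leaves the barrier is known as a single value, `release`.

```c
#include <assert.h>
#include <stddef.h>

/* Time from the first thread's arrival at the barrier until the last
 * thread leaves it (release), following the definition in the text. */
double barrier_imbalance_duration(const double *arrival, size_t n,
                                  double release) {
    assert(arrival != NULL && n > 0);
    double first = arrival[0];
    for (size_t i = 1; i < n; ++i) {
        if (arrival[i] < first) {
            first = arrival[i];
        }
    }
    return release - first;
}

/* Time that thread i spent at the barrier before it was released. */
double barrier_wait(const double *arrival, size_t i, double release) {
    assert(arrival != NULL);
    return release - arrival[i];
}
```

For example, with arrival times of 1.0, 2.5 and 4.0 and a release at 4.0, the imbalance duration is 3.0, and the second thread waited 1.5.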

Setting KMP_FORKJOIN_FRAMES_MODE=3 combines per-barrier frames with barrier imbalance information. In this mode, both the timing of the whole frame and its imbalance part are displayed:


The information presented for OpenMP programs that use OpenMP tasking may be hard to interpret, because threads that are “waiting at a barrier” may actually be executing OpenMP tasks.


Per-barrier frame information gives a better understanding of the behavior of OpenMP applications that have implicit or explicit barriers inside parallel regions. It shows where load imbalance is present, which makes it easier to improve performance.