Locks and Waits analysis seems to lose data

I'm new to VTune Amplifier XE 2011. I attempted to do a Locks and Waits analysis on my Fortran OpenMP code and found that the worker threads seem to be missing a lot of events, as in this screenshot.

The OpenMP parallel sections are indicated by the blue arrows in the master thread. The other threads should have all the same parallel sections, but as can be seen, each one stops showing them at some point, with the rest of the time being reported as waiting for the barrier at its last blue arrow. The same is true if I do a Concurrency analysis, but if I do a Hotspots analysis it shows all of the OpenMP parallel sections correctly in all threads. Does anyone have any idea what is going on here?

Intel VTune Amplifier XE 2011.4 build 176374
no kernel driver, installed as non-root
ifort 11.1 20100806
Linux CentOS 5.5 kernel 2.6.18-238.19.1.el5
Dell Precision M4500 with quad core Core i7 and hyperthreading (8 virtual cores)

After you upload the JPEG file, please insert it into the text.

You can also use "Add Files" to add the JPEG file as an attachment.

It would be better if you could provide test code, so that others can reproduce this problem and investigate the cause.

Here I have an example program (computing primes in OMP parallel...). See the attached files (built with ifort 12.0 and run with VTune Amplifier XE Update 4).

[root@NHM02 peter]# source /opt/intel/compilerpro- intel64

[root@NHM02 peter]# ifort -g -openmp -openmp-report -fpp openmp_sample.f90 -o openmp_sample.ifort
openmp_sample.f90(82) (col. 7): remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
openmp_sample.f90(73) (col. 7): remark: OpenMP DEFINED REGION WAS PARALLELIZED.

[root@NHM02 peter]# source /opt/intel/vtune_amplifier_xe_2011/amplxe-vars.sh
Copyright (C) 2009-2011 Intel Corporation. All rights reserved.
Intel VTune Amplifier XE 2011 (build 176374)

[root@NHM02 peter]# amplxe-cl -collect locksandwaits -- ./openmp_sample.ifort
Warning: Symbol file is not found. The call stack passing through the module /opt/intel/composerxe-2011.0.048/compiler/lib/intel64/libiomp5.so may be incorrect
Range to check for Primes: 1 10000000
We are using 8 thread(s)
Number of primes found: 664579
Number of 4n+1 primes found: 332181
Number of 4n-1 primes found: 332398
Using result path `/home/peter/r000lw'
Executing actions 74 % Generating a report


Average Concurrency: 6.841
Elapsed Time: 0.540
CPU Time: 3.200
Wait Time: 0.686
Executing actions 99 % done

It seems that all OpenMP* parallel sections are displayed, and the wait time and wait counts of barriers and joins (sync objects) are displayed as well.

PMU event counts are not part of the Locks and Waits analysis.

Is this an issue specific to your application?

Attachment: openmp_sample.f90 (application/octet-stream, 4.03 KB)

With openmp_sample I get similar results to what you show, except that the master thread shows up as Running. I also get the expected results with the Locks and Waits tutorial. So it is probably application specific. One difference between my code and both your sample and the tutorial is that I have many OpenMP parallel regions interspersed with sequential code rather than a single parallel region.

The adaptive finite element code is over 100,000 lines, so I can't just post it here, and it will take a while to reduce it down to something reasonable. But if you want to try the full code, you can download it from http://math.nist.gov/phaml and build it as follows:

tar -xzf phaml-1.9.1.tar.gz
cd phaml-1.9.1
./mkmkfile.sh F90 intel PARALLEL openmp PARLIB none GRAPHICS none
edit src/Makefile to add -g to FFLAGS and CFLAGS
cd examples/simple
edit Makefile to add -g to FFLAGS

then use phaml as the target for the VTune analysis
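
The two Makefile edits above could also be scripted. This is only a minimal sketch: it assumes that each Makefile assigns the variable on a single line of the form "FFLAGS = ...", which may not match the generated Makefiles exactly, and the helper name add_debug_flag is hypothetical.

```shell
# Hypothetical helper: insert -g at the front of a make variable's value.
# Assumes the variable is assigned on one line of the form "NAME = value".
add_debug_flag() {
    makefile=$1
    var=$2
    # Leave the file alone if -g is already present on the assignment line.
    if grep -Eq "^${var}[[:space:]]*=(.*[[:space:]])?-g([[:space:]]|$)" "$makefile"; then
        return 0
    fi
    # Keep a .bak copy of the original Makefile next to the edited one.
    sed -i.bak "s/^\(${var}[[:space:]]*=\)/\1 -g/" "$makefile"
}
```

From the phaml-1.9.1 directory this would be called as add_debug_flag src/Makefile FFLAGS and add_debug_flag src/Makefile CFLAGS, and then as add_debug_flag Makefile FFLAGS inside examples/simple.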


Thanks for the URL of the example and the build instructions.

I still had to go to the directory examples/simple and run "make" to generate phaml for a simple test.

Then I ran: amplxe-cl -collect locksandwaits -- ./phaml, and everything seemed OK.

The difference in OMP parallel regions between my example and your phaml does make sense: the OMP parallel code in my example works continuously, whereas the OMP parallel code in your phaml works intermittently.

Thanks for running my program and showing the result. That looks like what I expect. Wish I could show you what mine looks like :( All the worker threads stop showing "running" or "OpenMP Regions" pretty early, say around .15s to .2s on the image above, and show it in a "wait" state for the rest of the timeline.

Given that it works correctly on your computer, I must have something wrong in my installation or environment or something. Any clues on how to track that down? How closely can you approximate my actual environment, given in the original post? For the Locks and Waits analysis, it shouldn't matter that I don't have a kernel driver, right?


For Locks and Waits analysis and other user-mode sampling collections, no VTune driver is required.

If you have concerns about your results (all the worker threads stop showing "running" or "OpenMP regions" pretty early, at around .15s to .2s), please zip and attach the result directory; I would like to look into it.

To collect additional information about your system and environment, run the feedback tool:
amplxe-feedback -create-bug-report

Hi Guys,

I think I'm experiencing the same issue.

I'm running VTune XE 2011 (build 186533) to try and determine why I'm not getting the speedup I'd hope for with OpenMP.

I've tried using both STATIC and DYNAMIC schedulers on my parallel loops, but with both, VTune shows OMP Worker Thread #1 (I'm only using 2 threads) as only occasionally taking part in the work. When it's not shown as working, it's marked as Waiting.

It would be useful to know if this is:

a) A problem with my application (great - let's fix it and my code runs faster) or...
b) A bug in VTune (not so great)

Kind regards,


It doesn't make much sense to use only 2 threads with the STATIC and DYNAMIC schedulers; please use more threads, for example "export OMP_NUM_THREADS=8".

Please use the latest Update 5, which fixes some issues with OMP.

Hi Peter,

Thanks for your response.

What scheduler would you recommend to use on a Core 2 Duo processor?

I will update my compiler and VTune to the latest versions and re-try.

Hi Pete,

The OMP scheduler you use, STATIC or DYNAMIC, doesn't depend on the processor type.

As far as I know, STATIC is meant for loops where you can estimate the workload of each iteration: the iterations are divided among the threads in advance, so the OMP scheduler executes them the way you expect. DYNAMIC means that if a thread finishes its share of work quickly, the OMP scheduler assigns it other work to run.
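
One way to compare the two schedules without recompiling might look like the sketch below. It assumes the parallel loops are compiled with schedule(runtime), in which case the OpenMP runtime takes the schedule from the OMP_SCHEDULE environment variable. The helper name run_with_schedule and the program name ./my_app are hypothetical.

```shell
# Hypothetical helper: run a command with a given thread count and
# OpenMP runtime schedule. Only loops declared with schedule(runtime)
# are affected by OMP_SCHEDULE.
run_with_schedule() {
    threads=$1
    schedule=$2
    shift 2
    # The variables are set only for this one command, not for the shell.
    OMP_NUM_THREADS="$threads" OMP_SCHEDULE="$schedule" "$@"
}
```

For example, run_with_schedule 2 static ./my_app followed by run_with_schedule 2 "dynamic,4" ./my_app runs the same binary under both schedules, so the two timelines can be compared.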

