We are converting a stochastic simulation Fortran program to OpenMP, since the outputs of the program can be summed. In the simplest mode, we have just made the main loop a parallel region with firstprivate. No matter how many threads we launch, the wall time consumed is roughly the time for a single thread times the number of threads. The problem seems to be __kmp_launch_monitor, which shows 200 ms waits on ManualResetEvents. Eliminating atomic and critical sections has little effect on the outcome, and neither does using OMP DO.
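For comparison, here is a minimal sketch of the pattern described above, assuming the per-iteration results can be combined with a plain sum. The program name, the lcg_uniform helper with its constants, and the trial count are all hypothetical stand-ins for the real simulation. Each thread keeps its own generator state so that threads do not contend on a shared random-number generator, which is one possible (unconfirmed) reason for threads that use CPU time without producing a speedup.

```fortran
program mc_reduction_sketch
  use omp_lib
  implicit none
  integer, parameter :: i8 = selected_int_kind(18)   ! 64-bit integers
  integer, parameter :: dp = kind(1.0d0)
  integer(i8), parameter :: n_trials = 20000000_i8   ! hypothetical workload size
  integer(i8) :: i, state
  real(dp) :: x, y, hits, t0, t1

  hits = 0.0_dp
  t0 = omp_get_wtime()

  !$omp parallel private(state, x, y)
  ! Every thread gets its own generator state with a distinct seed.
  state = 12345_i8 + 7919_i8 * int(omp_get_thread_num(), i8)

  ! The DO construct splits the iterations among the threads;
  ! the reduction clause sums the per-thread partial results.
  !$omp do reduction(+:hits)
  do i = 1, n_trials
     x = lcg_uniform(state)
     y = lcg_uniform(state)
     if (x*x + y*y <= 1.0_dp) hits = hits + 1.0_dp
  end do
  !$omp end do
  !$omp end parallel

  t1 = omp_get_wtime()
  print *, 'threads      =', omp_get_max_threads()
  print *, 'pi estimate  =', 4.0_dp * hits / real(n_trials, dp)
  print *, 'wall seconds =', t1 - t0

contains

  ! Toy 31-bit linear congruential generator (illustration only,
  ! not suitable for a production simulation).
  function lcg_uniform(s) result(u)
    integer(i8), intent(inout) :: s
    real(dp) :: u
    s = mod(1103515245_i8 * s + 12345_i8, 2147483648_i8)
    u = real(s, dp) / 2147483648.0_dp
  end function lcg_uniform

end program mc_reduction_sketch
```

Running it with OMP_NUM_THREADS set to 1, 2 and 4 and comparing the reported wall time is a quick way to check whether a loop of this shape scales at all before looking at the monitor thread. It uses only standard OpenMP, so it should build with OpenMP enabled in the compiler (for example /Qopenmp with Intel Fortran on Windows, or -fopenmp with gfortran).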

Reading a bit on ManualResetEvents has not helped.  Where should we be looking for the cause of the ManualResetEvents?  Can we make the wait time shorter?  Make them go away? 

I gather that the launch monitor will always be there in an Intel OpenMP solution?  Otherwise the code is working as desired.

Thanks for any suggestions.


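You cannot change the parameters of the ManualResetEvent, because it is not exposed by OpenMP.
200 ms is the default value of KMP_BLOCKTIME, the time a thread waits after finishing a parallel region before it goes to sleep (the value can also be set to infinite). You can export a new value to change the wait time.
Refer to the documentation for details; you can also find more by searching the internet.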
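A minimal sketch of setting the block time from inside the program rather than through the environment. It assumes the Intel Fortran compiler, whose omp_lib module declares the kmp_set_blocktime and kmp_get_blocktime extensions. These routines are Intel-specific and not part of the OpenMP standard, so this is not expected to build with other compilers. The program name, the value 1 and the loop body are illustrative only.

```fortran
program blocktime_sketch
  use omp_lib          ! Intel's omp_lib also declares the kmp_* extensions
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer :: i
  real(dp) :: total

  ! Request a block time of 1 ms (Intel extension, value in milliseconds),
  ! then read the setting back to confirm what the runtime reports.
  call kmp_set_blocktime(1)
  print *, 'KMP block time (ms) = ', kmp_get_blocktime()

  ! Placeholder parallel work so that the runtime actually starts its threads.
  total = 0.0_dp
  !$omp parallel do reduction(+:total)
  do i = 1, 1000000
     total = total + sqrt(real(i, dp))
  end do
  !$omp end parallel do

  print *, 'total = ', total
end program blocktime_sketch
```

The environment-variable route does the same thing without recompiling. In a Windows command prompt the counterpart of export is set KMP_BLOCKTIME=1, issued before the program is launched from that prompt.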

I changed the block time to 1 ms, and it reported 1 ms when I read it back, but VTune still shows 200 ms. I've been running some small test programs, and auto-parallelization is much faster than the OpenMP PARALLEL DO and very much faster than a bare parallel region. All the threads are using CPU time, but they seem very slow at their task: the code runs in parallel but does not speed up. This may be more of an OpenMP problem than an Intel problem, although I would still like VTune to show what KMP_GET_BLOCKTIME reports.

I used a small OpenMP example and found that the wait time can be reduced by changing the KMP_BLOCKTIME value.

# export KMP_BLOCKTIME=200
# time ./matrix

real 0m3.806s
user 0m38.232s
sys 0m0.449s

# export KMP_BLOCKTIME=20
# time ./matrix

real 0m3.135s
user 0m33.910s
sys 0m0.082s

You can use the Locks and Waits analysis to compare the wait time of __kmp_launch_monitor() in the two runs. The run with KMP_BLOCKTIME=200 showed a lot of wait time.


Attachments: kmp200.png (75.65 KB), kmp20.png (89.64 KB)

I've been using Locks and Waits, but when I examine __kmp_launch_monitor (as in your attachments), I still get waits of 200 ms after inserting:


which reports 1 ms. I'm using MS Visual Studio and Fortran. I assume that
# export KMP_BLOCKTIME=20
# time ./matrix
has to do with C and Linux?

Do I have to set the environment variable elsewhere?

You are right. I used C/C++ example code on Linux.

You could attach your Fortran code that runs on Windows, and I can try to test it on my side.


All I have in this regard is:


which, as I said, returns 1. After that, there are about 2000 lines of code and comments, most of which is OpenMP. I don't explicitly touch environment variables after these first two lines. In Locks and Waits, the __kmp_launch_monitor line still shows 200 ms waits.

I just used a simple OpenMP Fortran example, the openmp_samples project from Intel Composer XE 2013 SP1.
I inserted the following:
call KMP_SET_BLOCKTIME(20)    (at the beginning)
print *, ' KMP BLOCK TIME= ',KMP_GET_BLOCKTIME()    (at the end)

Here are the results:
Range to check for Primes: 1 10000000
We are using 4 thread(s)
Number of primes found: 664579
Number of 4n+1 primes found: 332181
Number of 4n-1 primes found: 332398
Press any key to continue . . .

Hi Peter,

Right; that is what I get too. Now, what does VTune tell you?

1. With "call KMP_SET_BLOCKTIME(2)":
Thread / Function / Call Stack    Wait Time by Utilization    Wait Count    Spin Time    Module    Function (Full)
__kmp_launch_monitor (0xbd4)      3.246s                      17            0s

2. With "call KMP_SET_BLOCKTIME(200)":
Thread / Function / Call Stack    Wait Time by Utilization    Wait Count    Spin Time    Module    Function (Full)
__kmp_launch_monitor (0x16e4)     3.319s                      17            0s
