Sleeping Threads in MKL

In a simple test program I have measured the performance
of MKL 6.0 DGEMM on a dual Xeon (2.66 GHz, 533 MHz FSB) for
different matrix sizes.
When OMP_NUM_THREADS is greater than 1, I encounter
program stalls, i.e. the threads just start sleeping
and do not do any more work. The matrix size for
which this happens differs from run to run with the
same binary.
Has anybody else seen this effect yet? Any ideas?



Hi Georg,
I have some ideas about what might be happening but I need a little more information about your program. DGEMM scales very well for large matrices. Are you seeing any parallel speedup when DGEMM is executed with two threads? Is OMP_NUM_THREADS greater than the number of CPUs? Do the threads start sleeping as the matrix sizes get smaller? Is DGEMM called inside an OpenMP parallel region?


Hi Henry,

this is my simple test program:

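do i=10,200
   ! (timer start and computation of jend omitted in this post)
   do j=1,jend
      call dgemm('N','N',i,i,i,1.d0,a,i,b,i,0.d0,c,i)
   end do
   ! st is presumably the elapsed time of the j loop (MPI_WTIME())
   write (*,*) i,(jend*2.d0*dble(i)*dble(i)*dble(i))/st
end do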

I see some parallel speedup, although with matrix sizes
as small as in this program the speedup is moderate
(this was written in order to investigate performance
of MKL for a larger application program that tends to
use rather small matrices with DGEMM).
OMP_NUM_THREADS was set to 2. The program works fine
up to i=40 or so and then hangs, but not always at
the same i - sometimes it gets as high as 100, another
time i=50 is the limit. As you can see, the code does
not use any OpenMP by itself (I have left out
the variable declarations etc.). The MPI_WTIME()
is there for convenience, one can of course use any
other timing mechanism.

We have seen this effect also in "real" OpenMP
applications that were compiled with the Intel
compilers, on IA32 as well as on IA64 systems.
Starting with MKL6 though, it became very pronounced.

Kind Regards,

Hi Georg,
I compiled the following program with the Intel 7.1 Fortran compiler and ran it on a dual-processor Windows 2000 Pro workstation with OMP_NUM_THREADS set to one or two threads:

      program mklomp
      double precision a(200,200), b(200,200), c(200,200)
      integer start, finish, rate
      real seconds
      call system_clock (COUNT_RATE = rate)
      do i = 10, 200
!        Repeat count: fewer repetitions as the matrix size grows
         jend = (dble(1000)**3 + 1.d0) / (5 * (dble(i)**3)) + 1
         call system_clock (COUNT = start)
         do j = 1, jend
            call dgemm('N', 'N', i, i, i, 1.d0, a, i, b, i, 0.d0, c, i)
         end do
         call system_clock (COUNT = finish)
         seconds = float (finish - start) / float (rate)
!        Print size, repetitions, elapsed time, and flops per second
         write(*,*) i, jend, seconds,
     +        (jend * 2.d0 * dble(i) * dble(i) * dble(i)) / seconds
      end do
      end

The program did not hang and showed reasonable parallel speedup going from one to two threads.

Please check that my test program is an accurate representation of yours. What operating system are you using?

Best regards,

Hi Henry,

I'm using Linux (Debian, Redhat, SuSE, it happens on all
of them, with different compiler and libc versions).

I have compiled your program with ifc 7.1 and linked to
MKL 6.0:

ifc -parallel -static momptest.f -L/opt/intel/mkl/lib/32 -lmkl_ia32

When setting OMP_NUM_THREADS=2 it hangs sometimes after some iterations, as described.
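
To put a number on "sometimes", something like the following could be used. This is only a rough sketch: it assumes GNU coreutils `timeout` is available, `count_stalls` is a made-up helper name, and the 60-second limit in the usage line is an arbitrary guess at what counts as a stall.

```shell
# Hypothetical helper: run a command several times and count the runs
# that exceed a time limit (timeout exits with status 124 in that case).
# Usage: count_stalls <runs> <limit_seconds> <command> [args...]
count_stalls() {
    runs=$1
    limit=$2
    shift 2
    stalls=0
    n=0
    while [ "$n" -lt "$runs" ]; do
        status=0
        timeout "$limit" "$@" > /dev/null 2>&1 || status=$?
        if [ "$status" -eq 124 ]; then
            stalls=$((stalls + 1))
        fi
        n=$((n + 1))
    done
    echo "$stalls"
}
```

With `export OMP_NUM_THREADS=2` set beforehand, `count_stalls 10 60 ./a.out` would print how many of ten runs had to be killed after 60 seconds.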

A little side note: I had to insert something like
a=0 before the main loop so that the compiler
generates an (auto-)parallel region. I have observed
that this is necessary because, if the program runs into
MKL (DGEMM) without having executed at least one parallel
region first, I get runtime errors about stack size
problems (shell limit is 4 GBytes!), reproducibly at
16 48829 0.2635000 1518053739.65626
Unable to set worker thread stacksize to 4194304
Perhaps try reducing KMP_STACKSIZE or increasing your shell stack limit.

Setting KMP_STACKSIZE to anything doesn't help. But
maybe I'm doing something seriously wrong here...

Kind regards,

Hi Georg,
The -parallel option should not be necessary to use MKL nor should it be necessary to execute a parallel region before calling an MKL function. This could be an MKL bug but I'm not able to reproduce it locally. Please submit this issue to Intel Premier Support. The MKL experts can probably explain what's happening.

What error message is given about stack limits? You shouldn't have to adjust the KMP_STACKSIZE environment variable because MKL functions should not overflow the thread stacks.
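
It would also help to capture the settings the run actually sees. A minimal sketch, assuming a Bourne-style shell; `show_env` is just a hypothetical helper name, not anything provided by MKL or the compiler:

```shell
# Hypothetical helper: print the thread-related settings of the current shell.
show_env() {
    echo "OMP_NUM_THREADS=${OMP_NUM_THREADS:-unset}"
    echo "KMP_STACKSIZE=${KMP_STACKSIZE:-unset}"
    # ulimit -s reports the soft stack limit (in kB in bash) or "unlimited"
    echo "shell stack limit: $(ulimit -s)"
}

export OMP_NUM_THREADS=2
show_env
```

Posting this output together with the exact error text makes it easier to tell whether the limit in the message comes from the shell or from the OpenMP runtime.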

Best regards,

Hi Henry,

ok so this time I've done it by the book. That's my shell log:
~/loopkernels > ifc momptest.f -L/opt/intel/mkl/lib/32 -lmkl_ia32 -lguide -lpthread
   program MKLOMP

29 Lines Compiled
~/loopkernels > ./a.out
10 200001 0.4974000 804185788.957131
11 150263 0.4789000 835247688.594273
12 115741 0.3382000 1182734750.32458
13 91034 0.3794000 1054305166.88130
14 72887 0.4212000 949676754.895770
15 59260 0.3890000 1028290492.21332
16 48829 0.2761000 1448776363.54996
OMP abort: Unable to set worker thread stack size to 4195328 bytes
Try reducing KMP_STACKSIZE or increasing the shell stack limit.

No KMP_STACKSIZE was set here, and OMP_NUM_THREADS was 2.
There is no problem with OMP_NUM_THREADS=1.
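
As a sanity check on the numbers above, the repetition count and flop count for a given matrix size can be recomputed outside the program. A minimal sketch, assuming a POSIX awk; `dgemm_counts` is a hypothetical helper name, and the formulas are the ones from the test program:

```shell
# Hypothetical helper: recompute jend and the flop count for matrix size $1.
dgemm_counts() {
    awk -v i="$1" 'BEGIN {
        jend  = int((1000^3 + 1) / (5 * i^3) + 1)   # repetitions for this size
        flops = jend * 2 * i^3                      # 2*i^3 flops per DGEMM call
        printf "%d %d %d\n", i, jend, flops
    }'
}

dgemm_counts 16   # prints: 16 48829 400007168
```

Dividing the last number by the seconds column (0.2761 for i=16) gives roughly 1.45e9 flops per second, in line with the rate printed in the log.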

As I had said, my shell stack limit is at 4GBytes. If I add -parallel
to the compiler command, the stacksize problem goes away because
of the additional parallel region in the initialization loop(s). If I
prevent those loops from being parallelized, the stacksize problem returns.

I think I will now submit both issues (stacksize and sleeping threads)
to premier support. Thank you nevertheless for your help.

Kind regards,

Hi Georg,
When the MKL team gives you a solution to this problem, please post it here.

