Help on OpenMP and MKL on Linux

According to the manual, I should be able to call MKL routines inside an OpenMP parallel region, and the MKL
routine should not create new threads.
I tested one matrix/vector multiplication routine from MKL's BLAS with the following code:

========================================================= starting main.cpp
#include <iostream>
#include <cstdlib>
#include "mkl_cblas.h"

using namespace std;

int main(int argc, char** argv) {
    int m, n, i, j;
    double *a, *b, *c, sum = 0.0;
    if (argc > 2) { m = atoi(argv[1]); n = atoi(argv[2]); } else { m = 3; n = 3; }
    a = new double[m];
    b = new double[m*n];
    c = new double[n];

#pragma omp parallel default(none) shared(n,m,a,b,c) private(i,j)
    {
        /* initialize matrix b and vector c (the loop bodies were garbled in
           the original post; values here are chosen so the default 3x3 case
           gives result=300) */
#pragma omp for
        for (j = 0; j < n; j++)
            c[j] = 10.0;
#pragma omp for
        for (i = 0; i < m; i++)
            for (j = 0; j < n; j++)
                b[i*n + j] = 10.0 / n;

        /* Matrix*Vector calculation a = b*c */
        cblas_dgemv(CblasRowMajor, CblasNoTrans, m, n, 1.0, b, n, c, 1, 0.0, a, 1);
    }

    /* print out the sum of vector a */
#pragma omp parallel for reduction(+:sum)
    for (i = 0; i < m; i++)
        sum += a[i];
    cout << "result=" << sum << endl;

    /* clean up */
    delete[] a;
    delete[] b;
    delete[] c;
    return 0;
}
=============================== end main.cpp

And then compiled it with:
icpc -openmp -o main main.cpp -L/opt/intel/mkl/current/lib/32 /opt/intel/mkl/current/lib/32/libmkl_intel.a /opt/intel/mkl/current/lib/32/libmkl_intel_thread.a /opt/intel/mkl/current/lib/32/libmkl_core.a -lpthread -lm

When I run it without arguments (3x3 matrix), it should output "result=300". However, I occasionally get
results like 600 or 900 on a machine with two dual-core CPUs (4 threads), and the problem comes from the
matrix/vector multiplication. When I set the thread count to one with OMP_NUM_THREADS=1, or if I move the
cblas_dgemv line out of the parallel region, the result is always correct, so it seems MKL's cblas_dgemv()
creates its own threads, which is not supposed to happen.
I tried MKL with both icpc 10.1.011 and 10.1.015, with the same results.

Am I doing anything wrong?


In principle, MKL should not start additional threads inside a parallel region unless OMP_NESTED is set. For a problem as small as yours, it should not be threaded anyway, and threading should not change the result drastically in any case. You are taking a chance that icpc links against a different copy of the Intel OpenMP library than MKL does. All bets are off if you let that happen, or if you don't make certain that the right OpenMP library is active at run time to match your link. If you never want MKL to start its own threads, you should link mkl_sequential in place of mkl_intel_thread. Otherwise, you should not give both the -L path to an MKL copy of the Intel OpenMP library and (implicitly, via -openmp) the path to icpc's own copy of that library.

Thanks for your comments. Certainly, this is just a small test program before applying the approach to larger
ones where threading will really help. The linked binary uses the following dynamic libraries:
$ ldd main
        => (0xffffe000)
        => /lib/ (0x0076b000)
        => /opt/intel/fc/10.1.015/lib/ (0xf7dab000)
        => /lib/ (0x00897000)
        => /opt/intel/mkl/ (0xf7d49000)
        => /usr/lib/ (0x00971000)
        => /lib/ (0x00a6f000)
        => /lib/ (0x00629000)
        => /lib/ (0x00784000)
        /lib/ (0x0060c000)

I tried removing the -L... option in compiling/linking, so that:
icpc -openmp -o main main.cpp
-lpthread -lm
and I also tried using the OpenMP library from the icc/icpc folder (by adjusting LD_LIBRARY_PATH),
but none of them helped. I have also tried the -L option with -lguide or -liomp5 ... that didn't help either.

Instead of the BLAS routine, I also tried one of VML's routines (vdLog10, to be exact), and
I found that it works correctly whether it is in the parallel region or not.
Am I doing something wrong?

Actually, if you put something under #pragma omp parallel and don't control it with another pragma, it is executed as many times as there are threads in the region. No doubt it is computed four times in four threads, but since these calls run almost simultaneously and write to the same array, the result is unpredictable. Of course, when you set OMP_NUM_THREADS=1 or move it out, the result is what you want.

Please try #pragma omp single before cblas_dgemv to force it to run once.


According to the MKL user guide (page 6.3):

"Intel MKL tries to determine if it is in a parallel region in the program, and if it is, it does not
spread its operations over multiple threads unless the user specifically requests Intel MKL
to do so via the MKL_DYNAMIC functionality (see Using Additional Threading Control for
details). However, Intel MKL can be aware that it is in a parallel region only if the threaded
program and Intel MKL are using the same threading library."

So, ideally, using cblas_dgemv() within an OpenMP parallel region should work. But my test doesn't
seem to work. I thought it could be a data-race issue, so I also added #pragma omp barrier
before cblas_dgemv(), but it doesn't solve the problem.
Yes, moving cblas_dgemv() outside the parallel region or using #pragma omp single does work.
It however defeats the goal of maximizing the parallel region and minimizing barriers ... maybe I'm
being picky.

Even if your intention is to have dgemv parallelize itself, should it be capable of doing so, there is probably no advantage to including it under one big parallel region. The Intel OpenMP runtime keeps its threads alive, so consecutive parallel regions incur no extra thread-creation overhead. The only way to cut down on barrier time would be to take the risk of adding a nowait, allowing dgemv to proceed before initialization is complete.

The last I knew, DGEMV had not been threaded in Intel MKL. The User Guide lists the parts of MKL that are threaded, and only the level 3 BLAS (matrix-matrix operations) are mentioned.

Can you be sure the discrepancies aren't coming from your sum? I'm afraid I'm not an OpenMP* expert. Maybe you could try keeping MKL in the parallel region but removing your last pragma, the one above the loop where you sum the elements of a[].


Thanks for your comments.
Yes, I'm 100% sure the problem is not from the sum, as I printed out the resulting vector right after dgemv
and the answers are already (sometimes) wrong at that point. If dgemv were not threaded, I shouldn't have
this problem at all.

There's some misunderstanding here. Do you expect dgemv to be executed once, or as many times as the number of threads?

It doesn't matter whether a called routine is parallel or not - it will be called by every thread from the parallel region - that's by OpenMP design. If you want dgemv executed once, the best idea is probably to pull the call out of the parallel region. Another way is to put it under #pragma omp single.



You're right. In my original code example, dgemv will be called by all threads; that's why I was
confused by the sentences I quoted from the MKL user guide.
When I read the user guide, I thought MKL routines in a parallel region would automatically split
their calculation over multiple threads and still get the same correct results as a single thread.

Why am I trying to do this?
I'd like to use MKL within a loop with a large number of iterations. With fine-grained
parallelization within the loop, I certainly want to avoid the overhead of thread creation,
and that's why I was trying MKL routines inside the parallel region. Certainly a coarse-grained
approach should be considered as well.

So I tried to put dgemv in the parallel region so it would simply use the multiple threads already created by
OpenMP, without creating threads again. If dgemv is outside the parallel region, my understanding is that
dgemv needs to create threads again (is this right?) and there will be overhead.
I'm not sure exactly what happens when using "#pragma omp single" around dgemv inside a
parallel region. Does it simply set MKL_NUM_THREADS to 1 for dgemv, so dgemv only
uses one thread? The nice thing about it is no thread-creation overhead for dgemv, but then dgemv
is using only one thread?

Your parallel application starts several instances of dgemv, and each dgemv doesn't know whether another instance of the code has started or not. That's why it's impossible for dgemv to behave as you expected, that is, to be executed only once inside a parallel region.

dgemv is not threaded, so if you have a long parallel region where dgemv should be called, just use #pragma omp single above the call to dgemv. It doesn't set MKL_NUM_THREADS to 1 - it's just a standard OpenMP directive; please refer to the OpenMP documentation.

If dgemv were threaded, you would need to split the parallel region and call dgemv outside it to make it run in parallel, and fortunately that wouldn't cause overhead, because the OpenMP run-time library is wise: it doesn't shut its threads down immediately, but lets the next parallel region reuse them instead of creating them again.

Practically, if you don't know whether a routine is parallel or not, you can always split the parallel region and call the routine outside it; this won't cause overhead. #pragma omp single would force any routine to run sequentially.

