According to the manual, I should be able to use MKL routine within OpenMP parallel region, and MKL

routine should not create new threads.

I tested one matrix/vector multiplication routine in MKL's BLAS with following code:

========================================================= starting main.cpp

#include

#include

#include "mkl_cblas.h"

using namespace std;

int main(int argc, char** argv) {

int m, n, i,j;

double *a, *b, *c, sum=0.0;

if (argc>2) { m=atoi(argv[1]); n=atoi(argv[2]); } else { m=3; n=3; }

a=new double[m];

b=new double[m*n];

c=new double[n];

#pragma omp parallel default(none) shared(n,m,a,b,c) private(i,j)

{

/* initialize with values for matrix b and vector c */

#pragma omp for

for (j=0; j #pragma omp for

for (i=0; i for (j=0; j

/* Matrix*Vector Calculation a=b*c */

cblas_dgemv(CblasRowMajor, CblasNoTrans, m, n, 1.0, b, n, c, 1, 0.0, a, 1);

}

/* print out vector a and its sum*/

for (i=0; i #pragma omp parallel for reduction(+:sum)

for (i=0; i cout << "result="<

/* clean up */

delete[] a,b,c;

return (0);

}

=============================== end main.cpp

And then compiled it with:

icpc -openmp -o main main.cpp -L/opt/intel/mkl/current/lib/32 /opt/intel/mkl/current/lib/32/libmkl_intel.a /opt/intel/mkl/current/lib/32/libmkl_intel_thread.a /opt/intel/mkl/current/lib/32/libmkl_core.a -lpthread -lm

When I run it without argument (3x3 matrix), it should output "result=300". However, occasionally I got

results like 600 or 900 on a two dual-CPU machine (4 threads), and the problem is from matrix vector

multiplication. When I set thread to one with

OMP_NUM_THREADS=1, or if I move cblas_dgemv line out of parallel region, the result is always correct,

so it seems MKL's cblas_dgemv() creates its own threads, which is not supposed to happen.

I tried MKL 10.0.1.014 and 10.0.3.020 and icpc 10.1.011 and 10.1.015, same results.

Am I doing anything wrong?