The last serial part of my application is a call to DSYEVR. My attempts to parallelize it resulted in very strange behavior, and I hope someone can help me understand it.
Depending on the data, I run DSYEVR alone or two or three calls at once inside an OMP PARALLEL SECTIONS construct. My application is compiled with icc on a Cray and linked with MKL 10.3 update 3 (the parallel version). The matrices are small, 61x61.
As suggested elsewhere, I call omp_set_nested(1), mkl_set_dynamic(0) and mkl_set_num_threads(n) (n: 1-8) at the beginning of the code. I then run my application on a varying number of threads (1-16).
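For reference, the setup described above looks roughly like the following. This is only a minimal sketch, not my actual code: it assumes the LAPACKE C interface to DSYEVR, and the names HAVE_MKL, fill_matrix, solve_one and run_three_solves are hypothetical. The diagonal test matrix stands in for the real data, and without MKL the solver falls back to a trivial placeholder so the structure can still be compiled.

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
#endif
#ifdef HAVE_MKL   /* hypothetical macro: define it when building against MKL */
#include <mkl.h>
#endif

#define N 61      /* matrix order, as in the post */

/* Fill a symmetric N x N matrix with diagonal 1..N (illustrative data only). */
static void fill_matrix(double *a) {
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            a[i * N + j] = (i == j) ? (double)(i + 1) : 0.0;
}

/* Solve one symmetric eigenproblem; returns 0 on success. */
static int solve_one(double *a, double *w) {
#ifdef HAVE_MKL
    double z[N * N];          /* eigenvectors */
    lapack_int isuppz[2 * N]; /* support of the eigenvectors */
    lapack_int m = 0;         /* number of eigenvalues found */
    return (int)LAPACKE_dsyevr(LAPACK_ROW_MAJOR, 'V', 'A', 'U', N, a, N,
                               0.0, 0.0, 0, 0, 0.0, &m, w, z, N, isuppz);
#else
    /* Placeholder without MKL: the test matrix is diagonal, so its
       eigenvalues are simply the diagonal entries. */
    for (int i = 0; i < N; ++i)
        w[i] = a[i * N + i];
    return 0;
#endif
}

/* Configure nesting and MKL threading, then run three solves in
   OMP PARALLEL SECTIONS. Returns the number of successful solves. */
int run_three_solves(int mkl_threads) {
    assert(mkl_threads >= 1);
#ifdef _OPENMP
    omp_set_nested(1);                 /* allow MKL to open nested regions */
#endif
#ifdef HAVE_MKL
    mkl_set_dynamic(0);                /* do not let MKL reduce the thread count */
    mkl_set_num_threads(mkl_threads);  /* threads reserved for each MKL call */
#endif

    double a[3][N * N], w[3][N];
    int status[3] = { -1, -1, -1 };
    for (int k = 0; k < 3; ++k)
        fill_matrix(a[k]);

    #pragma omp parallel sections
    {
        #pragma omp section
        status[0] = solve_one(a[0], w[0]);
        #pragma omp section
        status[1] = solve_one(a[1], w[1]);
        #pragma omp section
        status[2] = solve_one(a[2], w[2]);
    }

    int done = 0;
    for (int k = 0; k < 3; ++k)
        if (status[k] == 0)
            ++done;
    return done;
}
```

Calling run_three_solves(n) with n between 1 and 8 while varying OMP_NUM_THREADS corresponds to the runs I describe.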
With the above setup, performance drops dramatically above 2 threads, whatever number of threads I reserve for MKL.
To check my code I linked with --mkl=sequential and the scaling is what I expected. So I presume the culprit is MKL and its interactions with omp_set_nested.
I also implemented the "fake nesting" suggested in this forum (I cannot find the reference anymore, but it was about starting more threads than requested by OMP_NUM_THREADS). There is a small speed advantage running on 4 nodes, but overall the scaling does not change. I interpret this as the DSYEVR calls not being parallelized.
Any ideas? This call is clearly limiting the scalability of my code, as also seen with profilers such as Vampir.