I am seeing a performance regression in zgemm3m with MKL 2017/2018.
In some cases zgemm3m appears to use only 1 thread, with a negative impact on elapsed time, even though the matrices are 'large'.
This behaviour appears in MKL 2017 and MKL 2018 but not in MKL 2015.
The call to zgemm3m takes two 4122x4122 double complex matrices. The machine is a 4-core Xeon with HT, running Windows 7.
transa = transb = 'N', m = n = k = 4122, lda = ldb = ldc = 4122, alpha = 1, beta = 0
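To put 'large' in numbers: with 16 bytes per double complex element, one 4122x4122 matrix occupies 271,854,144 bytes (roughly 259 MiB), so A, B and C together come to roughly 778 MiB. A minimal sketch of that arithmetic follows; `matrix_bytes` is a hypothetical helper introduced here for illustration and is not part of MKL.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>

// Hypothetical helper (not an MKL function): bytes needed to store one
// dense rows x cols matrix of double complex values.
unsigned long long matrix_bytes(std::size_t rows, std::size_t cols) {
    assert(rows > 0 && cols > 0);
    // std::complex<double> holds two doubles, i.e. 16 bytes on common platforms.
    return static_cast<unsigned long long>(rows) * cols *
           sizeof(std::complex<double>);
}
```

Calling `matrix_bytes(4122, 4122)` gives the per-matrix figure quoted above.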
We are essentially looping and calling zgemm3m with the same dimensions and matrix structure each time through the loop.
The loop is not OpenMP-parallelized; it runs in the "main" thread.
The first time through the loop, zgemm3m uses all cores.
The second time through the loop, zgemm3m uses only one core (and runs MUCH slower than the first call).
It is very obvious in the debugger that zgemm3m is not using multiple threads the second time it is called. I tried to 'force' the correct number of threads before the call, with no change in behaviour:
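To quantify the difference between the first and later calls, each iteration could be timed separately. This is only a sketch: `time_each_call` is a hypothetical helper, and the callable passed to it stands in for the actual zgemm3m call, which is not reproduced here.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical helper: runs `call` `iterations` times and returns the
// wall-clock duration of each individual run, in seconds.
std::vector<double> time_each_call(int iterations,
                                   const std::function<void()>& call) {
    assert(iterations >= 0);
    std::vector<double> seconds;
    seconds.reserve(static_cast<std::size_t>(iterations));
    for (int i = 0; i < iterations; ++i) {
        const auto start = std::chrono::steady_clock::now();
        call();  // in the real program this would be the zgemm3m call
        const auto stop = std::chrono::steady_clock::now();
        seconds.push_back(std::chrono::duration<double>(stop - start).count());
    }
    return seconds;
}
```

Wrapping the loop body in a lambda and printing the returned values would show whether only the second call is slow or every call after the first.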
int numThreads = MKL_Get_Max_Threads();
cout << "MKL Threads " << numThreads << endl;
MKL_Set_Num_Threads(numThreads);
int numOMPThreads = omp_get_max_threads();
cout << "OMP Threads " << numOMPThreads << endl;
omp_set_num_threads(numOMPThreads);
mkl_set_dynamic(false);
zgemm3m(....)
The output of the code above, which tries to force the expected behaviour, is always:
MKL Threads 4
OMP Threads 8
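The 4 versus 8 presumably reflects physical cores versus logical (HT) processors. One thing that could be ruled out cheaply is an environment variable overriding the thread settings. The sketch below only reads the environment and does not call MKL; `describe_env` and `threading_env_report` are hypothetical helpers, while `MKL_NUM_THREADS`, `MKL_DYNAMIC`, `MKL_DOMAIN_NUM_THREADS` and `OMP_NUM_THREADS` are the threading variables that MKL and OpenMP document.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Hypothetical helper: returns "NAME=value", or "NAME=<unset>" when the
// environment variable is not defined for this process.
std::string describe_env(const char* name) {
    assert(name != nullptr);
    const char* value = std::getenv(name);
    return std::string(name) + "=" + (value ? value : "<unset>");
}

// Hypothetical helper: one line per threading-related variable.
std::string threading_env_report() {
    const char* const names[] = {"MKL_NUM_THREADS", "MKL_DYNAMIC",
                                 "MKL_DOMAIN_NUM_THREADS", "OMP_NUM_THREADS"};
    std::string report;
    for (const char* name : names) {
        report += describe_env(name);
        report += '\n';
    }
    return report;
}
```

Printing `threading_env_report()` before the loop would show whether any of these variables is set. It would not by itself explain why the first call is threaded and the second is not.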
What would cause zgemm3m to "turn off" threading?