I am running a benchmark that calls an MKL subroutine, and I see a large performance difference between two ways of running it natively on the MIC card. The first method is to log on to the card directly, set the OpenMP environment as follows, and run the code there. I have already copied over the required libraries.
$ ssh root@mic0
$ export OMP_NUM_THREADS=120
$ export KMP_AFFINITY=scatter
This method gives about 480 GFLOPS. However, when I launch the code from the host with the micnativeloadex utility, performance is much worse.
$ /opt/intel/mic/bin/micnativeloadex a.out -d 0 -e "OMP_NUM_THREADS=120 KMP_AFFINITY=scatter"
Run this way, I get only 340 GFLOPS, which is what I would get if I set neither environment variable and used the defaults. I thought the -e option was supposed to pass these variables to the card. What am I doing wrong?
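One way to check whether the variables actually reach the process might be to inspect its environment on the card while it runs. The sketch below is only a diagnostic idea, not something I have confirmed on the card: it assumes the card's Linux exposes /proc/<pid>/environ, that the shell there has tr and grep, and that I am allowed to read that file (as root or as the user owning the process). The function name show_omp_env is hypothetical.

```shell
# Print the OMP_* and KMP_* variables that a running process was started with.
# /proc/<pid>/environ holds the initial environment as NUL-separated entries,
# so convert the NULs to newlines and keep only the OpenMP-related names.
show_omp_env() {
    pid="$1"
    tr '\0' '\n' < "/proc/$pid/environ" | grep -E '^(OMP|KMP)_'
}
```

While the benchmark is running under micnativeloadex, something like `show_omp_env "$(pidof a.out)"` in a second ssh session on the card should list OMP_NUM_THREADS and KMP_AFFINITY if -e passed them (this assumes pidof is available and the process is named a.out). No output would suggest the variables never arrived.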