Can't get correct benchmark results on Intel Xeon Phi


I am trying to run Cholesky factorization (dpotrf) natively on an Intel MIC (Xeon Phi) card, but I am not getting the expected performance.

This is how I run the code:

[root@bunsen-mic0 /tmp]# env USE_2MB_BUFFERS=3000 MKL_NUM_THREADS=240 KMP_AFFINITY=proclist=[1-240],granularity=fine,explicit ./testing_native_dpotrf -N 9600 -L 5
time 1.201130, gflops 245.567177    <-- warm-up run
time 0.865080, gflops 340.960515
time 0.865288, gflops 340.878499
time 0.864819, gflops 341.063349
time 0.864337, gflops 341.253577
time 0.863623, gflops 341.535639

The expected performance is about 500 gflops for a 9600 x 9600 matrix.
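For reference, the gflops column can be reproduced from the time column. The sketch below assumes the usual operation count for Cholesky factorization, N^3/3 + N^2/2 + N/6. That formula and the helper names potrf_flops and potrf_gflops are my own illustration, not taken from the attached file.

```c
#include <assert.h>

/* Approximate floating-point operation count for an N x N Cholesky
 * factorization (dpotrf): N^3/3 + N^2/2 + N/6.
 * Assumption: this is the count the benchmark program uses. */
double potrf_flops(int n)
{
    double d = (double)n;
    return d * d * d / 3.0 + d * d / 2.0 + d / 6.0;
}

/* Convert a measured wall-clock time (seconds) into GFLOPS. */
double potrf_gflops(int n, double seconds)
{
    assert(seconds > 0.0);
    return potrf_flops(n) / seconds / 1e9;
}
```

With N = 9600 and a time of 0.865080 this gives about 340.96, which agrees with the output above. Reaching 500 gflops at this size would need a time of roughly 0.59 seconds.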

Here are the strange results when I use only 1 core (4 threads):

[root@bunsen-mic0 /tmp]# env USE_2MB_BUFFERS=3000 MKL_NUM_THREADS=4 KMP_AFFINITY=proclist=[1-4],granularity=fine,explicit ./testing_native_dpotrf -N 9600 -L 5
time 0.902745, gflops 326.734658
time 0.871131, gflops 338.592037
time 0.870778, gflops 338.729428
time 0.868808, gflops 339.497416
time 0.866140, gflops 340.543143
time 0.864064, gflops 341.361391

This is clearly not correct; it looks like the program is messing up the core assignment. Can anyone help me figure out where the problem is?
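A rough sanity check on why the 1-core number cannot be right. The sketch below assumes each core can do 16 double-precision flops per cycle (one 512-bit fused multiply-add) and a clock of about 1.1 GHz. Both values are assumptions, since I have not checked the exact card model, and the helper names are hypothetical.

```c
#include <assert.h>

/* Theoretical peak of one core in GFLOPS.
 * ghz and flops_per_cycle are assumed values, not measured ones. */
double core_peak_gflops(double ghz, double flops_per_cycle)
{
    assert(ghz > 0.0 && flops_per_cycle > 0.0);
    return ghz * flops_per_cycle;
}

/* Smallest number of cores that could deliver the measured rate
 * if every core ran at its theoretical peak (integer ceiling). */
int min_cores_needed(double measured_gflops, double ghz, double flops_per_cycle)
{
    double ratio = measured_gflops / core_peak_gflops(ghz, flops_per_cycle);
    int cores = (int)ratio;
    if ((double)cores < ratio)
        cores++;
    return cores;
}
```

Under those assumptions one core peaks at about 17.6 gflops, so 341 gflops needs at least 20 cores even at perfect efficiency. That suggests the 4-thread run is not actually confined to one core.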

My code is attached. There is really nothing in it; it just calls LAPACKE_dpotrf.

BTW, this is how I compile my code:

testing_native_dpotrf: testing_native_dpotrf.c
	icc -O3 -mmic -mkl $< -o $@

Attachment: testing-native-dpotrf.c (text/x-csrc, 1.74 KB)