I am on a Core i7 quad-core machine with an ASUS P9X79 WS motherboard and a Xeon Phi 3120A card installed.
The operating system is RHEL 6.4, with MPSS 3.1 for the Phi and Parallel Studio 2013 SP1 installed.
For reference, the Phi card has 57 cores, with a double-precision peak of about 1003 GFlops.
I am seeing some performance issues that I don't understand.
When I time MKL's parallel DGEMM on the Phi card, it achieves about 300 GFlops, which is roughly 30% of peak.
Note that I am doing native execution.
This performance does not match what is posted at http://software.intel.com/en-us/intel-mkl/ (about 80% of peak).
So, my first question is: is this difference solely because I am using a low-end Phi card, so there are hardware limitations?
After seeing this, I wrote a test program that tries to achieve peak with assembly language.
The function is simple: it runs a loop of 25,000,000 iterations, and each iteration executes 30 independent FMA instructions, unrolled 8 times. Each vector FMA performs 2 flops on each of 8 double-precision lanes, so the total flop count per iteration is 30 x 8 x 2 x 8 = 3840. Note that this means I am doing 25,000,000 x 3840 floating-point operations without accessing any memory.
Now, if I run this code serially, I get 8.74 GFlops, which is essentially the serial peak (8.8 GFlops).
If I run it in parallel with 2 threads on 1 core, I get 17.4 GFlops, which is essentially the peak for one core (17.6 GFlops).
Now the problem: if I run the same code in parallel with 2 threads per core on 56 cores (112 threads), I only get 89% of peak.
But if I run it with 4 threads per core, i.e. 224 threads in total, I get 99% of peak, which is what I expect.
So, my second question is: even with no memory accesses at all, why do I need 4 threads per core to reach peak?
Is there some other latency, one we don't know about, that gets hidden by running 4 threads per core?
Can someone please clarify?
Sorry for the long post, and thank you for reading.