According to http://www.intel.com/content/dam/www/public/us/en/documents/performance-... the Phi 5110P is capable of a theoretical double precision performance of 1011 GFlop/s, and can achieve a practical 833 GFlop/s with DGEMM. The slides indicate that this was measured with 7680x7680 matrices.
Using the methodology described at http://software.intel.com/en-us/articles/a-simple-example-to-measure-the... I've attempted to duplicate these results using MKL 2013_sp1.0.080. My standalone (i.e. native, not offload) test program is unable to achieve more than 527 GFlop/s, which is about 52% of the theoretical maximum and only 63% of what Intel advertises.
I've tried using huge pages, but that reduced throughput by about 1%.
What can I do to achieve better throughput with DGEMM? Does Intel have a sample program that demonstrates the claimed 833 GFlop/s?
I've attached the program I used to measure the performance.
(I would have posted this on the Premier Support forum, but our support account is still not working!)