I am trying to use the Larrabee intrinsics to mimic MKL's ZGEMM performance on the MIC and my main bottleneck is this:

For C=A*B, if I assume 'A' is in row major, 'B' is in column major and 'C' is in column major, I can use register tiling by taking 4 rows of 'A' and mutliply them with 1 column of 'B' i.e read 4 blocks of 'A' and 'B' into two 512-bit vectors(with each block containing two doubles, the real and imaginary parts), multiply them using a combination of swizzles and FMA/S and get the result in a 512-bit vector.

So this will be of the form: |im40|re40|im30|re30|im20|re20|im10|re10| /*A[0]*B[0]*/. Similarly I'll get 3 more vectors for A[1]*B[0], A[2]*B[0], A[3]*B[0].

So, I'll have the following 4 vectors:

|im40|re40|im30|re30|im20|re20|im10|re10| /*A[0]*B[0]*/

|im41|re41|im31|re31|im21|re21|im11|re11| /*A[1]*B[0]*/

|im42|re42|im32|re32|im22|re22|im21|re21| /*A[2]*B[0]*/

|im43|re43|im33|re33|im23|re23|im13|re13| /*A[3]*B[0]*/

which need to be permuted into:

|im13|re13|im21|re21|im11|re11|im10|re10|

|im23|re23|im22|re22|im21|re21|im20|re20|

|im33|re33|im32|re32|im31|re31|im30|re30|

|im43|re43|im42|re42|im41|re41|im40|re40|

I can now add these 4 vectors to get one 512-bit vector, which I can then store as 4 contiguous blocks of C[0]. However, the above permutation causes a significant overhead and my performance is just ~180 Gflops.

If however, I assume 'A' also as column major, I can simply broadcast the first block of 'B' into one vector, and multiply this with the first column of 'A' i.e.

|im3|re3|im2|re2|im1|re1im0|re0| /*|A[3][0]|A[2][0]|A[1][0]|A[0][0]|*/

im0|re0|im0|re0|im0|re0|im0|re0| /*B[0][0] broadcasted*/

When these two get multiplied I get the first 4 blocks of C[0] and I continue in a similar manner.

This gives a much better performance of ~360 Gflops.

Assuming that MKL's ZGEMM gives performance > 800 Gflops, how can I improve my code algorithmically to mimic MKL's performance? Any ideas?