I am trying to use the Larrabee intrinsics to match MKL's ZGEMM performance on the MIC, and my main bottleneck is the following:
For C = A*B, if I assume 'A' is in row major, 'B' is in column major and 'C' is in column major, I can use register tiling by taking 4 rows of 'A' and multiplying them with 1 column of 'B', i.e. I read 4 blocks of 'A' and 4 blocks of 'B' into two 512-bit vectors (each block containing two doubles, the real and imaginary parts), multiply them element-wise using a combination of swizzles and FMA/FMS instructions, and get the result in a 512-bit vector.
So this will be of the form: |im40|re40|im30|re30|im20|re20|im10|re10| /*A0*B*/, where re_k0/im_k0 is the k-th partial product for row 0 of the current column of 'C'. Similarly I'll get 3 more vectors, for A1*B, A2*B and A3*B.
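To make the multiply step concrete, here is roughly what one element-wise complex product of two such vectors looks like (a minimal sketch, not my exact code; the helper name is mine, and since KNC has no per-pair re/im-duplicating swizzle for doubles, the cross terms take a few CDAB swizzles):

    #include <immintrin.h>

    /* Element-wise complex product c = a*b on 4 complex doubles per vector,
     * packed as |im3|re3|im2|re2|im1|re1|im0|re0| (re in the even slot). */
    static inline __m512d zmul4(__m512d a, __m512d b)
    {
        __m512d t  = _mm512_mul_pd(a, b);             /* |ia*ib|ra*rb| per block */
        __m512d u  = _mm512_mul_pd(_mm512_swizzle_pd(a, _MM_SWIZ_REG_CDAB), b);
                                                      /* |ra*ib|ia*rb| per block */
        __m512d re = _mm512_sub_pd(t, _mm512_swizzle_pd(t, _MM_SWIZ_REG_CDAB));
                                                      /* even slots: ra*rb - ia*ib */
        __m512d im = _mm512_add_pd(u, _mm512_swizzle_pd(u, _MM_SWIZ_REG_CDAB));
                                                      /* odd slots:  ra*ib + ia*rb */
        return _mm512_mask_mov_pd(re, 0xAA, im);      /* merge real and imag slots */
    }

So the multiply alone is already eight vector instructions for four complex products, before any permutation.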
So, I'll have the following 4 vectors:

    |im40|re40|im30|re30|im20|re20|im10|re10| /*A0*B*/
    |im41|re41|im31|re31|im21|re21|im11|re11| /*A1*B*/
    |im42|re42|im32|re32|im22|re22|im12|re12| /*A2*B*/
    |im43|re43|im33|re33|im23|re23|im13|re13| /*A3*B*/
which need to be permuted (effectively a 4x4 transpose of the complex blocks) into:

    |im13|re13|im12|re12|im11|re11|im10|re10|
    |im23|re23|im22|re22|im21|re21|im20|re20|
    |im33|re33|im32|re32|im31|re31|im30|re30|
    |im43|re43|im42|re42|im41|re41|im40|re40|
I can now add these 4 vectors to get one 512-bit vector, which I can then store as 4 contiguous blocks of C. However, the above permutation causes a significant overhead and my performance is just ~180 Gflops.
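For what it's worth, the permutation amounts to a 4x4 transpose of 128-bit blocks across four vectors. A sketch, assuming `_mm512_permute4f128_epi32`, its masked form and the `_mm512_castpd_si512`/`_mm512_castsi512_pd` casts are available in the KNC headers (the helper name is mine):

    /* Transpose a 4x4 matrix of complex doubles (one 128-bit block each)
     * held in four 512-bit vectors; after this, adding v[0..3] yields four
     * contiguous blocks of one column of C. */
    static inline void ztranspose4x4(__m512d v[4])
    {
        __m512i t0 = _mm512_castpd_si512(v[0]);
        __m512i t1 = _mm512_castpd_si512(v[1]);
        __m512i t2 = _mm512_castpd_si512(v[2]);
        __m512i t3 = _mm512_castpd_si512(v[3]);
        /* w0 gathers block 0 of each input, w1 block 1, and so on */
        __m512i w0 = _mm512_permute4f128_epi32(t0, _MM_PERM_AAAA);
        w0 = _mm512_mask_permute4f128_epi32(w0, 0x00F0, t1, _MM_PERM_AAAA);
        w0 = _mm512_mask_permute4f128_epi32(w0, 0x0F00, t2, _MM_PERM_AAAA);
        w0 = _mm512_mask_permute4f128_epi32(w0, 0xF000, t3, _MM_PERM_AAAA);
        __m512i w1 = _mm512_permute4f128_epi32(t0, _MM_PERM_BBBB);
        w1 = _mm512_mask_permute4f128_epi32(w1, 0x00F0, t1, _MM_PERM_BBBB);
        w1 = _mm512_mask_permute4f128_epi32(w1, 0x0F00, t2, _MM_PERM_BBBB);
        w1 = _mm512_mask_permute4f128_epi32(w1, 0xF000, t3, _MM_PERM_BBBB);
        __m512i w2 = _mm512_permute4f128_epi32(t0, _MM_PERM_CCCC);
        w2 = _mm512_mask_permute4f128_epi32(w2, 0x00F0, t1, _MM_PERM_CCCC);
        w2 = _mm512_mask_permute4f128_epi32(w2, 0x0F00, t2, _MM_PERM_CCCC);
        w2 = _mm512_mask_permute4f128_epi32(w2, 0xF000, t3, _MM_PERM_CCCC);
        __m512i w3 = _mm512_permute4f128_epi32(t0, _MM_PERM_DDDD);
        w3 = _mm512_mask_permute4f128_epi32(w3, 0x00F0, t1, _MM_PERM_DDDD);
        w3 = _mm512_mask_permute4f128_epi32(w3, 0x0F00, t2, _MM_PERM_DDDD);
        w3 = _mm512_mask_permute4f128_epi32(w3, 0xF000, t3, _MM_PERM_DDDD);
        v[0] = _mm512_castsi512_pd(w0);
        v[1] = _mm512_castsi512_pd(w1);
        v[2] = _mm512_castsi512_pd(w2);
        v[3] = _mm512_castsi512_pd(w3);
    }

That is 16 cross-lane permutes (plus the casts) per 4 stored elements of 'C', which is pure overhead on top of the arithmetic.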
If, however, I assume 'A' is also column major, I can simply broadcast the first block of 'B' into one vector and multiply it with the first column of 'A', i.e.:

    |im3|re3|im2|re2|im1|re1|im0|re0| /*A's first column*/
    |im0|re0|im0|re0|im0|re0|im0|re0| /*B broadcasted*/
When these two are multiplied, I get the first 4 blocks of 'C' (or rather, one partial sum of them), and I continue accumulating in a similar manner over the remaining blocks of the column.
This gives a much better performance of ~360 Gflops.
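The inner loop of this variant looks roughly as follows (a sketch under my own naming, not the literal kernel: `zgemm_strip` and `lda` are illustrative, 'A' and its column stride are assumed 64-byte aligned since KNC has no unaligned vector loads, and folding the sign into `bi` with an extra multiply is the simplest formulation rather than the fastest; the sign could instead be folded in when packing 'B'):

    #include <immintrin.h>

    /* One 4x1 strip of C (A, B, C column major); lda is A's leading
     * dimension in complex elements, B points at the current column. */
    void zgemm_strip(const double *A, const double *B, double *C,
                     long K, long lda)
    {
        /* -1 in the real (even) slots: re -= ia*ib while im += ra*ib */
        const __m512d sign = _mm512_set_pd(1.0, -1.0, 1.0, -1.0,
                                           1.0, -1.0, 1.0, -1.0);
        __m512d acc = _mm512_setzero_pd();
        for (long k = 0; k < K; ++k) {
            __m512d a  = _mm512_load_pd(A + 2 * k * lda); /* 4 blocks of column k */
            __m512d as = _mm512_swizzle_pd(a, _MM_SWIZ_REG_CDAB);
            /* broadcast re(b_k) and im(b_k) to all 8 slots */
            __m512d br = _mm512_extload_pd(B + 2 * k,     _MM_UPCONV_PD_NONE,
                                           _MM_BROADCAST_1X8, _MM_HINT_NONE);
            __m512d bi = _mm512_extload_pd(B + 2 * k + 1, _MM_UPCONV_PD_NONE,
                                           _MM_BROADCAST_1X8, _MM_HINT_NONE);
            acc = _mm512_fmadd_pd(a, br, acc);                       /* re += ra*rb, im += ia*rb */
            acc = _mm512_fmadd_pd(as, _mm512_mul_pd(bi, sign), acc); /* re -= ia*ib, im += ra*ib */
        }
        _mm512_store_pd(C, acc); /* 4 contiguous blocks of one column of C */
    }

A real kernel would keep several such accumulators live (more rows of 'A' against several columns of 'B') so that each load and broadcast is reused across more FMAs.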
Assuming that MKL's ZGEMM achieves > 800 Gflops, how can I improve my code algorithmically to get closer to MKL's performance? Any ideas?
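(For scale: assuming a 60-core 5110P at 1.053 GHz, the double-precision peak is 60 × 1.053 × 16 ≈ 1011 Gflops, so my ~360 Gflops is roughly 36% of peak while 800+ Gflops is roughly 80%.)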