I am trying to implement ZGEMM on Intel's MIC and have run into the following problem; any help would be appreciated. Say I need to compute OUT = M*IN, where OUT, M, and IN contain complex doubles. I have multiplied 4 rows of M with a single column of IN, which gives the following 4 vectors:
O1 -> |a8|a7|a6|a5|a4|a3|a2|a1|  // M1*IN0
O2 -> |b8|b7|b6|b5|b4|b3|b2|b1|  // M2*IN0
O3 -> |c8|c7|c6|c5|c4|c3|c2|c1|  // M3*IN0
O4 -> |d8|d7|d6|d5|d4|d3|d2|d1|  // M4*IN0
I need to rearrange these into:
O1_new -> |d2|d1|c2|c1|b2|b1|a2|a1|
O2_new -> |d4|d3|c4|c3|b4|b3|a4|a3|
O3_new -> |d6|d5|c6|c5|b6|b5|a6|a5|
O4_new -> |d8|d7|c8|c7|b8|b7|a8|a7|
This is similar to a 4x4 transpose in which each entry is a pair of adjacent elements (a1 and a2, a3 and a4, and so on). How can I achieve this with the fewest cycles using the C intrinsics for Larrabee?
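
To pin down the exact mapping, here is a scalar reference sketch of the rearrangement I am after. It is not the intrinsic version I am asking for, only a way to check one. The function name transpose_pairs_4x4 and the plain double[4][8] layout are made up for illustration, and I am assuming index 0 corresponds to the rightmost lane in the diagrams above (a1, b1, ...).

```c
#include <assert.h>

/* Scalar reference for the pairwise 4x4 transpose (illustrative only).
 * in[r][0..7]  : input vector r (r = 0 is O1 = a, r = 1 is O2 = b, ...),
 *                with index 0 assumed to be the rightmost lane (a1).
 * out[k][0..7] : output vector k (k = 0 is O1_new, ...).
 * Each adjacent pair of elements moves together as one block. */
static void transpose_pairs_4x4(double in[4][8], double out[4][8])
{
    assert(in != out); /* not safe to run in place */

    for (int r = 0; r < 4; ++r) {
        for (int k = 0; k < 4; ++k) {
            out[k][2 * r + 0] = in[r][2 * k + 0]; /* first element of the pair  */
            out[k][2 * r + 1] = in[r][2 * k + 1]; /* second element of the pair */
        }
    }
}
```

For example, with in[0] holding a1..a8 and in[1] holding b1..b8, out[0] starts with a1, a2, b1, b2, which matches O1_new read from right to left. Any intrinsic-based shuffle sequence should produce the same result as this loop.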




