I am trying to implement ZGEMM on Intel's MIC and have run into the following problem; any help would be appreciated. Say I need to compute OUT = M*IN, where OUT, M, and IN hold complex doubles. I have multiplied 4 rows of M by a single column of IN, which gives the following 4 vectors:
I have to rearrange them into:
which is similar to a transpose. How can I achieve this using the C intrinsics for Larrabee in the fewest cycles?
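
To make the intent concrete, here is a plain-C scalar sketch of the arithmetic; it is not MIC intrinsics code and says nothing about cycle counts. It assumes a 4x4 block, that each of the 4 vectors holds the 4 complex partial products of one row of M with IN, and that the rearrangement is a plain 4x4 transpose of complex elements. The names `N`, `partial_products`, `transpose4`, and `sum_vectors` are hypothetical and exist only for illustration.

```c
#include <complex.h>

#define N 4 /* assumed block size: 4 rows of M, 4 complex doubles per vector */

/* Hypothetical helper: P[r][k] = M[r][k] * IN[k].
 * P[r] plays the role of the vector produced for row r of M. */
void partial_products(double complex M[N][N], double complex IN[N],
                      double complex P[N][N])
{
    for (int r = 0; r < N; ++r)
        for (int k = 0; k < N; ++k)
            P[r][k] = M[r][k] * IN[k];
}

/* Hypothetical helper: the assumed rearrangement, T[k][r] = P[r][k]. */
void transpose4(double complex P[N][N], double complex T[N][N])
{
    for (int r = 0; r < N; ++r)
        for (int k = 0; k < N; ++k)
            T[k][r] = P[r][k];
}

/* After the transpose, OUT is the element-wise sum of the vectors
 * T[0]..T[N-1], so no reduction across the lanes of one vector is needed. */
void sum_vectors(double complex T[N][N], double complex OUT[N])
{
    for (int r = 0; r < N; ++r)
        OUT[r] = 0.0;
    for (int k = 0; k < N; ++k)
        for (int r = 0; r < N; ++r)
            OUT[r] += T[k][r];
}
```

Under these assumptions, the point of the rearrangement is that the final sums become vertical adds between vectors rather than horizontal reductions inside each vector.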