In a CFD code for the MIC architecture that I was experimenting with, a significant amount of time was spent on 3D array transposition. To optimize that operation, I first looked into 2D transposition. The literature suggests that in-core transposition can be improved with loop tiling or with recursive divide-and-conquer (also known as the cache-oblivious method). I implemented both methods in Cilk Plus and in OpenMP to find the best strategy. My results are described in a white paper that I have just posted: http://research.colfaxinternational.com/post/2013/04/25/Transposition-Xe...
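To make the discussion concrete, here is a minimal sketch of the tiled approach (this is my own illustration, not the code from the white paper; the tile size of 32 and the `collapse(2)` scheduling are assumptions you would tune for your cache sizes and thread count):

```c
#include <stddef.h>

/* TILE is a tunable tile edge: 32 doubles gives an 8 KB source tile,
   small enough that a tile pair stays resident in L1/L2 on most cores
   (an assumption to tune per architecture). */
#define TILE 32

/* Out-of-place transpose of an n x m row-major matrix A into the
   m x n row-major matrix B. The loops over tiles are parallelized;
   within a tile, each element is read and written while the tile is
   still cache-resident, which is the point of tiling. */
void transpose_tiled(const double *A, double *B, size_t n, size_t m)
{
#pragma omp parallel for collapse(2) schedule(static)
    for (size_t ii = 0; ii < n; ii += TILE)
        for (size_t jj = 0; jj < m; jj += TILE) {
            const size_t imax = (ii + TILE < n) ? ii + TILE : n;
            const size_t jmax = (jj + TILE < m) ? jj + TILE : m;
            for (size_t i = ii; i < imax; i++)
                for (size_t j = jj; j < jmax; j++)
                    B[j * n + i] = A[i * m + j];
        }
}
```

Without the tiling, the inner loop would stride through B by n doubles per iteration and miss the cache on nearly every write once the matrix exceeds the cache size.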
Here are the key findings of this research, some of which are also unsolved problems:
1) I have found that 2D transposition of very large matrices (much bigger than the L2 cache) is more efficient on the Xeon Phi than on a two-socket Xeon host system. In that optimal case, the transposition rate reaches 25-30% of the streaming memory bandwidth.
2) However, for smaller matrices that fit in the L2 cache, the host beats the Xeon Phi by a large factor. I traced this issue to thread-scheduling overhead in Cilk Plus (no surprise here: 240 threads take longer to schedule than 32 threads).
3) OpenMP has lower overhead in this case. For small matrices, I can squeeze more performance out of the Xeon Phi with OpenMP than with Cilk Plus; however, the Xeon Phi is still slower than the host.
4) At the same time, OpenMP is not simply better: it loses to Cilk Plus at large matrix sizes.
5) OpenMP also loses to Cilk Plus on the host at all matrix sizes (except when the data never leaves the L2 cache).
So the bottom line is that no single framework or method among those I tested is optimal across the board. Perhaps someone could suggest a better way to transpose small matrices on the Xeon Phi.
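For reference, the cache-oblivious (recursive divide-and-conquer) method mentioned above can be sketched roughly as follows. This is my own serial illustration, not the white paper's code: the `CUTOFF` base case is an assumption, and the actual benchmarks parallelize the recursion with `cilk_spawn` or OpenMP tasks, which I omit here for brevity:

```c
#include <stddef.h>

/* Base-case edge below which a plain double loop runs; 64 is a guess
   at a subproblem that fits comfortably in L1. */
#define CUTOFF 64

/* Transpose the n x m submatrix of a row-major source with leading
   dimension lda into a destination with leading dimension ldb.
   The longer dimension is halved at each step, so at some recursion
   depth every subproblem fits in each level of the cache hierarchy
   without the code ever knowing the cache sizes. */
static void transpose_rec(const double *A, double *B,
                          size_t n, size_t m, size_t lda, size_t ldb)
{
    if (n <= CUTOFF && m <= CUTOFF) {
        for (size_t i = 0; i < n; i++)
            for (size_t j = 0; j < m; j++)
                B[j * ldb + i] = A[i * lda + j];
    } else if (n >= m) {          /* split the rows of A */
        const size_t h = n / 2;
        transpose_rec(A,           B,     h,     m, lda, ldb);
        transpose_rec(A + h * lda, B + h, n - h, m, lda, ldb);
    } else {                      /* split the columns of A */
        const size_t h = m / 2;
        transpose_rec(A,     B,           n, h,     lda, ldb);
        transpose_rec(A + h, B + h * ldb, n, m - h, lda, ldb);
    }
}

/* Entry point: transpose an n x m row-major A into an m x n B. */
void transpose_oblivious(const double *A, double *B, size_t n, size_t m)
{
    transpose_rec(A, B, n, m, m, n);
}
```

The two recursive calls at each level are independent, which is what makes `cilk_spawn` (or an OpenMP task per call) a natural fit; the scheduling-overhead differences described in points 2-5 show up precisely in how cheaply those spawns are handled.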