After some effort, I have managed to port my simulation code to the Xeon Phi. It usually runs with MPI on a cluster of PCs, but I wanted to see what kind of performance I would get from the Phi. The code relies quite heavily on MKL routines (ScaLAPACK, the MKL implementation of FFTW), plus a few GSL routines and so on.
I first ran the code as single-threaded MPI (as I do on my cluster) with 10-60 MPI processes. I then tried to improve performance by adding threading through a compiler option (my thinking was that although my code itself isn't threaded, the MKL routines might benefit from running multiple threads inside each MPI process, and that the compiler could also add some parallelism). This brought some performance gain. A stripped-down sketch of what I mean by this hybrid setup is below.
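To be concrete, here is a simplified sketch (not my actual code) of how the hybrid runs are set up: each rank calls MPI_Init_thread and lets MKL use a few threads internally via mkl_set_num_threads. The rank and thread counts are placeholders; in the real runs I also set OMP_NUM_THREADS / MKL_NUM_THREADS and KMP_AFFINITY on the launch line.

/* Stripped-down sketch (not my real code) of the hybrid MPI + threaded MKL setup */
#include <mpi.h>
#include <mkl.h>

int main(int argc, char **argv)
{
    int provided, rank;

    /* MKL is only called from the main thread of each rank, so FUNNELED is enough */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Let the MKL routines (ScaLAPACK, FFTW wrappers, ...) called by this rank
       run multithreaded; 4 is a placeholder, tuned together with the rank count */
    mkl_set_num_threads(4);

    /* ... the actual simulation: ScaLAPACK solves, FFTs, GSL calls, MPI exchanges ... */

    MPI_Finalize();
    return 0;
}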
I am, however, still far from the speed I get on the cluster. I then ran VTune Amplifier to see if it could help me find where the bottlenecks are. I would have expected some of my own functions to show up as the culprits, so I could start improving from there. But no, the main bottlenecks are the MPI and threading libraries, vmlinux and mkl_core (see attached screen capture). I have tried playing with affinity and such, but it hasn't brought me much. I did optimize the ratio of threads to MPI processes, and that improved the performance somewhat.
So what does this mean? Are the MPI calls not very efficient, and should I try to replace MPI calls with threading, or rethink my parallelization scheme? How do I figure out which MPI calls are the most time consuming (see the sketch below for the kind of manual timing I had in mind)? Or should I just concentrate on the functions that appear further down the list (like phase2psfcube_float_function), and will improving them also have an impact on the library calls above?
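For the MPI timing question, is manually instrumenting the calls the way to go? Something like the toy program below (not from my code; the buffer sizes and the choice of MPI_Alltoall are arbitrary), wrapped around each major communication step, or is there a better way to get this directly from the profiler?

/* Toy example (not my actual code) of timing one MPI call with MPI_Wtime;
   the idea would be to wrap each major MPI call like this and compare the
   per-call maxima across ranks */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int count = 1 << 16;   /* elements sent to each rank; placeholder size */
    float *sendbuf = calloc((size_t)count * size, sizeof *sendbuf);
    float *recvbuf = calloc((size_t)count * size, sizeof *recvbuf);

    double t0 = MPI_Wtime();
    MPI_Alltoall(sendbuf, count, MPI_FLOAT, recvbuf, count, MPI_FLOAT, MPI_COMM_WORLD);
    double dt = MPI_Wtime() - t0, dt_max;

    /* for a collective, the slowest rank determines the cost */
    MPI_Reduce(&dt, &dt_max, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("MPI_Alltoall over %d ranks: %.6f s (max over ranks)\n", size, dt_max);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}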
Thanks in advance for your help and ideas!