I have a piece of code that uses both OpenMP and MPI, and I wish to profile it in different configurations.
For example, on one Haswell node with 20 cores, I run the following configurations:
1. 20 MPI tasks with no OpenMP parallelization (or, equivalently, 1 OpenMP thread per task)
2. 10 MPI tasks and 2 OpenMP threads per task
3. 4 MPI tasks and 5 OpenMP threads per task
4. 2 MPI tasks and 10 OpenMP threads per task
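For reference, a minimal sketch of how the four configurations above could be enumerated and launched. This is not my actual launch script: the executable name `./solver` is a placeholder, and the generic `mpirun -np` plus `OMP_NUM_THREADS` form is an assumption that leaves out any core-pinning options.

```python
CORES_PER_NODE = 20  # from the post: one Haswell node with 20 cores

# (MPI tasks, OpenMP threads per task) for the four configurations listed above
CONFIGS = [(20, 1), (10, 2), (4, 5), (2, 10)]


def launch_line(ranks, threads, exe="./solver"):
    """Build a generic launch command; './solver' is a hypothetical name."""
    return f"OMP_NUM_THREADS={threads} mpirun -np {ranks} {exe}"


for ranks, threads in CONFIGS:
    # Every configuration should occupy exactly all cores of the node.
    assert ranks * threads == CORES_PER_NODE
    print(launch_line(ranks, threads))
```

Each printed line only shows the task and thread counts; the process-to-core mapping options would be added on top of this.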
I am running completely independent tasks (linear solvers) with different data sets, so there is NO communication among the MPI tasks. The reason I have MPI tasks is that in future I would like to have a more fine-grained parallel task that can also use cores from nodes on the InfiniBand network.
I had expected that, for the matrix I am using in the linear solver, I would see more than a 10-times improvement between variants 1 and 4. To make this concrete: in variant 1, all 20 tasks finish in 130 seconds (the time of the slowest task). A single run of variant 4 finishes in 13 seconds, but in order to complete the same amount of work I must run variant 4 ten times. That again adds up to 130 seconds, so there is no net gain from parallelizing with OpenMP.
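The arithmetic behind this comparison can be sketched as follows. The timings (130 s and 13 s) and the 20 solves are the numbers quoted above; treating all solves as taking roughly the same time is a simplifying assumption, and `wall_time` is just an illustrative helper.

```python
import math

TOTAL_SOLVES = 20  # independent linear solves, as in variant 1


def wall_time(concurrent_tasks, seconds_per_batch, total_solves=TOTAL_SOLVES):
    """Total time when solves run in batches of `concurrent_tasks` at a time."""
    batches = math.ceil(total_solves / concurrent_tasks)
    return batches * seconds_per_batch


t_variant1 = wall_time(20, 130.0)  # 1 batch of 20 single-threaded solves
t_variant4 = wall_time(2, 13.0)    # 10 batches of 2 ten-threaded solves

# Speedup of one solve going from 1 thread to 10 threads.
per_solve_speedup = 130.0 / 13.0

print(t_variant1, t_variant4, per_solve_speedup)  # 130.0 130.0 10.0
```

Under these assumptions, a 10-times speedup per solve on 10 threads is exactly the break-even point for total throughput on the node; variant 4 only wins overall if a single solve speeds up by more than 10 times.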
This is what I wish to understand with a tool or a set of tools. I was advised by my cluster administrator to use VTune for OpenMP analysis and ITAC for MPI analysis.
Is there an integrated way of looking at the possible issues with my test? Kindly advise.
With kind regards, and thanks in advance for reading my message.
P.S. Please note that in order to get these numbers I used the knowledge provided in articles listed
So in my code I use the options listed on this page to map processes to cores.