I have a Fortran code that uses both MPI and OpenMP. I have done some profiling on an 8-core Windows laptop, varying the number of MPI tasks vs. OpenMP threads, and have some understanding of where performance bottlenecks for each parallel method might surface. The problem arises when I port over to a Linux cluster with several 8-core nodes: my OpenMP thread performance there is very poor. Running 8 MPI tasks per node is significantly faster than 8 OpenMP threads per node (1 MPI task), but even runs with 2 OpenMP threads + 4 MPI tasks were very slow, more so than I could attribute solely to thread starvation. I saw a few related posts in this area and am hoping for further insight and recommendations on this issue. What I have tried so far ...
1. setenv OMP_WAIT_POLICY active ## seems to make sense
2. setenv KMP_BLOCKTIME 1 ## this is counter to what I have read, but when I set it to a large number (25000) the code is very slow
3. Removed some old "unlimited" limit settings (viz., stacksize, coresize) that I have had since the dawn of time. This also helped OpenMP thread performance significantly.
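For reference, here is the relevant csh fragment after the changes above (the OMP_NUM_THREADS value is just one of the task/thread splits I tested):

```shell
# csh environment settings currently in use
setenv OMP_WAIT_POLICY active   # keep worker threads spinning between parallel regions
setenv KMP_BLOCKTIME 1          # ms an idle thread spins before sleeping; 25000 made the code very slow
setenv OMP_NUM_THREADS 2        # e.g. 2 OpenMP threads x 4 MPI tasks on an 8-core node
# old "limit stacksize unlimited" / "limit coredumpsize unlimited" lines
# have been removed from my login scripts
```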
It seems I am looking for ways to reasonably ensure my OpenMP threads don't go dormant between the parallel regions in the code, and to keep those threads as lightweight as possible system-wise. The corrections above do not seem to affect MPI task performance. Are there any other recommendations? By the way, the MPI tasks use an MVAPICH library on a cluster with InfiniBand. The code is compiled with "-openmp" ("/Qopenmp" on Windows).
Thank you in advance.