I have a code that I have tested, and it works perfectly with MPI. Since I am using only a 1D decomposition (I decompose the domain into strips), I want to use OpenMP in the other direction. The reason is that I do not want to end up working with cubes, because that is a headache for I/O. In some instances each rank has a unique size in the direction in which I use MPI to send and receive messages.
Therefore, I implemented:
    !$OMP PARALLEL PRIVATE (i,j)
    !$OMP DO
    .....
    !$OMP END DO
    !$OMP END PARALLEL
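A minimal, self-contained sketch of this kind of loop on one strip may make the setup easier to discuss. Everything specific in it is an assumption made for illustration and not the actual code: the array names `u` and `unew`, the layout (first index along the MPI direction, second along the OpenMP direction), the 4-point averaging stencil, and the omission of the MPI halo exchange.

```fortran
program strip_omp_sketch
  use omp_lib
  implicit none
  ! Hypothetical sizes: one strip of a 4001 x 4001 grid split over 20 ranks.
  integer, parameter :: nx_local = 200   ! points in the MPI direction on this rank
  integer, parameter :: ny = 4001        ! points in the OpenMP direction
  double precision, allocatable :: u(:,:), unew(:,:)
  double precision :: t0, t1
  integer :: i, j

  allocate(u(nx_local, ny), unew(nx_local, ny))
  u = 1.0d0
  unew = 0.0d0

  t0 = omp_get_wtime()

  ! The outer loop over j (the 4001-point direction) is shared among threads.
  ! The inner loop over i runs along the first index, which is contiguous
  ! in memory for Fortran arrays.
  !$OMP PARALLEL PRIVATE(i, j)
  !$OMP DO
  do j = 2, ny - 1
     do i = 2, nx_local - 1
        unew(i, j) = 0.25d0 * (u(i-1, j) + u(i+1, j) + u(i, j-1) + u(i, j+1))
     end do
  end do
  !$OMP END DO
  !$OMP END PARALLEL

  t1 = omp_get_wtime()

  print '(a,es12.5)', 'sum of interior values: ', sum(unew(2:nx_local-1, 2:ny-1))
  print '(a,f10.6,a)', 'loop time: ', t1 - t0, ' s'

  deallocate(u, unew)
end program strip_omp_sketch
```

With gfortran, for example, the directives are only honoured when the program is compiled with `-fopenmp`.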
Since I am submitting my job on a cluster, I set the variable with "export OMP_NUM_THREAD=N", where N is the number of threads.
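A small check such as the following sketch could confirm how many threads each process really receives from the environment on the compute nodes. It uses only standard `omp_lib` routines; the program name and the variable `nthreads` are hypothetical.

```fortran
program thread_check
  use omp_lib
  implicit none
  integer :: nthreads

  nthreads = 0

  ! Values the OpenMP runtime reports before any parallel region starts.
  print '(a,i0)', 'processors visible to this process: ', omp_get_num_procs()
  print '(a,i0)', 'max threads for the next parallel region: ', omp_get_max_threads()

  ! Inside a parallel region, one thread records the actual team size.
  !$OMP PARALLEL
  !$OMP SINGLE
  nthreads = omp_get_num_threads()
  !$OMP END SINGLE
  !$OMP END PARALLEL

  print '(a,i0)', 'threads actually used: ', nthreads
end program thread_check
```

Running it through the same job script as the real code shows what the runtime sees there, which may differ from an interactive login shell.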
The OpenMP version was also tested on its own; it worked perfectly and sped things up as I wanted. However, when I combine the two I get very strange results. In this particular case my arrays are 4001x4001. If I spread the problem over 20 processes, I have close to 200 nodes in the MPI direction and 4001 in the OpenMP direction. In other words, I have about 200 nodes in the horizontal direction of my computation and 4001 in the vertical direction.
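The strip sizes can be worked out with a short sketch like the one below. It assumes a plain block distribution in which the remainder points go to the lowest ranks; that scheme and all variable names are assumptions for illustration, not necessarily what the actual code does. For 4001 points over 20 processes, integer division gives 200 with a remainder of 1, so this scheme produces one strip of 201 points and nineteen of 200.

```fortran
program strip_sizes
  implicit none
  integer, parameter :: n = 4001      ! global points in the MPI direction
  integer, parameter :: nprocs = 20   ! number of MPI processes
  integer :: rank, base, rem, nlocal, istart, iend, total

  base = n / nprocs        ! integer division: minimum strip width
  rem  = mod(n, nprocs)    ! leftover points to hand out one per rank
  total = 0

  do rank = 0, nprocs - 1
     nlocal = base
     if (rank < rem) nlocal = nlocal + 1          ! lowest ranks take the extras
     istart = rank * base + min(rank, rem) + 1    ! first global index of the strip
     iend   = istart + nlocal - 1                 ! last global index of the strip
     total  = total + nlocal
     print '(a,i3,a,i5,a,i5,a,i4)', 'rank ', rank, ': ', istart, ' to ', iend, &
           '  size ', nlocal
  end do

  ! Sanity check: the strips must cover all n points exactly once.
  print '(a,i5)', 'total points: ', total
end program strip_sizes
```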
It turns out that the MPI-only version takes 4.37 seconds to run, while the version with 2 threads takes 469.83 seconds. Both runs use the same number of MPI processes (20). If I set OMP_NUM_THREAD=1 the time is still high. In that case I expected it to be the same as, or close to, the pure MPI run.
Both runs have the same optimization flags and so forth.
I would greatly appreciate your ideas and suggestions.