I am trying to run an atmospheric code on an Itanium-2 cluster (2 x 4-way SMP) running Rocks. The code is mixed F77/F90 and uses domain decomposition for parallelisation; it contains only point-to-point MPI communications.
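For reference, the communication is essentially a halo exchange between neighbouring subdomains. A minimal sketch of the pattern (not the actual code; the array names and sizes below are made up) looks like this:

    ! Illustrative halo exchange between left/right neighbours
    ! (hypothetical names and sizes; the real code does the
    !  equivalent per subdomain boundary).
    program halo_sketch
      implicit none
      include 'mpif.h'
      integer, parameter :: n = 100
      real :: field(n), halo_left(n), halo_right(n)
      integer :: rank, nprocs, left, right, ierr
      integer :: status(MPI_STATUS_SIZE)

      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)

      field = real(rank)
      left  = rank - 1
      right = rank + 1
      if (left  < 0)       left  = MPI_PROC_NULL   ! no neighbour at the edge
      if (right >= nprocs) right = MPI_PROC_NULL

      ! send boundary data to the right neighbour, receive from the left
      call MPI_SENDRECV(field,     n, MPI_REAL, right, 1, &
                        halo_left, n, MPI_REAL, left,  1, &
                        MPI_COMM_WORLD, status, ierr)
      ! send to the left neighbour, receive from the right
      call MPI_SENDRECV(field,      n, MPI_REAL, left,  2, &
                        halo_right, n, MPI_REAL, right, 2, &
                        MPI_COMM_WORLD, status, ierr)

      call MPI_FINALIZE(ierr)
    end program halo_sketch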
I have tried the following MPI implementations, all with the same results:
a. Intel MPI 1.0 (IFC 8.0 & IFC 9.0)
b. Intel MPI 2.0 (beta) (IFC 9.0)
c. mpich-1.2.7 (IFC 8.0 & IFC 9.0)
1. Default settings (0-8 CPUs): the code initialises MPI but exits with a segmentation fault ("Rank # of Task # on cluster.hpc.org_XXX caused collective abort of all processes. Code exits with signal 11").
This occurs when calling a subroutine that has a long list of variable declarations (F77).
2. After setting the stack size (ulimit -s) to unlimited:
a. Up to 4 CPUs: the root process enters the above subroutine and waits at the first MPI communication; the other processes never enter it, so the code hangs.
b. Above 4 CPUs (across nodes): all processes enter the above subroutine and complete the MPI communications normally, but the same situation as in 2a occurs when calling another, similar subroutine (again with numerous variable declarations).
I have carried out some tests, reducing the number of variable declarations in these subroutines, and found that the error appears only once the number of declarations exceeds a certain limit.
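To illustrate, a stripped-down example of the kind of routine that triggers the problem (hypothetical names; the real routine has far more declarations) would be:

    ! Hypothetical example: many large local arrays declared in one routine.
    ! With ifort, automatic arrays like these are placed on the stack, and the
    ! whole frame is allocated on entry, so such a routine can exceed the
    ! default 'ulimit -s' limit the moment it is called.
    subroutine physics_step(nx, ny, nz)
      implicit none
      integer, intent(in) :: nx, ny, nz
      real :: t(nx, ny, nz), q(nx, ny, nz), u(nx, ny, nz), v(nx, ny, nz)
      real :: w(nx, ny, nz), p(nx, ny, nz), rho(nx, ny, nz)
      ! ... dozens more declarations like the above ...
      t = 0.0
      q = 0.0
      ! ... actual computation ...
    end subroutine physics_step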
I used to get a similar error on an SGI Altix (Itanium-2, 16-way SMP, IFC 8.0) with LAM MPI; that one was solved by switching to SGI's own MPI (MPT).
Can anyone suggest a solution, or tell me what additional details are needed?