Dear forum users,
my MPI codes are linked with Intel MKL and make intensive use of the PARDISO sparse direct solver. These MPI codes are actually MPMD, in the sense that the last MPI task in the communicator does not execute the same code as the rest. This last task is devoted to a rather time- and memory-hungry computation using Intel MKL PARDISO (in multi-threaded mode). As a consequence, our MPI codes need a quite particular placement of MPI processes/threads onto the nodes/cores of the underlying distributed-memory computer (with 16 cores per node). In particular, assuming 16*N + 1 MPI tasks are spawned, the first 16*N tasks have to be mapped to N nodes with 16 tasks per node and one thread per process, and the last remaining task to one dedicated node with a single MPI task and 16 threads, so that the last task has access to all the memory and cores of that node. This particular mapping is achieved with the help of an OpenMPI hostfile + rankfile, and a wrapper launch script that sets MKL_NUM_THREADS depending on the MPI task identifier.
When executing the parallel codes with 16001 MPI tasks under the aforementioned mapping, a segmentation fault was produced within MPI task id 16000:
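To make the thread-count part of this setup concrete, here is a minimal sketch of such a wrapper. It is not the actual script: it assumes that the launcher is OpenMPI, which exports OMPI_COMM_WORLD_RANK and OMPI_COMM_WORLD_SIZE to each process, and the helper name threads_for_rank is purely illustrative.

```shell
#!/bin/sh
# Illustrative wrapper (hypothetical, not the original script): pick the MKL
# thread count from the MPI rank. Assumes OpenMPI, which sets
# OMPI_COMM_WORLD_RANK and OMPI_COMM_WORLD_SIZE for every launched process.

# Print the thread count for rank $1 in a job of $2 tasks.
# The last rank gets all cores of its node ($3, default 16);
# every other rank gets a single thread.
threads_for_rank() {
    r=$1
    n=$2
    cores=${3:-16}
    if [ "$r" -eq $((n - 1)) ]; then
        echo "$cores"
    else
        echo 1
    fi
}

# Fall back to a single-task job when not started through mpirun.
my_rank=${OMPI_COMM_WORLD_RANK:-0}
job_size=${OMPI_COMM_WORLD_SIZE:-1}

MKL_NUM_THREADS=$(threads_for_rank "$my_rank" "$job_size")
export MKL_NUM_THREADS

# Replace the wrapper with the real program, if one was given.
if [ "$#" -gt 0 ]; then
    exec "$@"
fi
```

It would be used along the lines of `mpirun -np 16001 --hostfile hosts --rankfile ranks ./wrapper.sh ./program`, where hosts, ranks, wrapper.sh and program are placeholder names.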
[1,16000]<stderr>:forrtl: severe (174): SIGSEGV, segmentation fault occurred
[1,16000]<stderr>:Image PC Routine Line Source
[1,16000]<stderr>:libmpi.so.1 00002B547BF5CA8A Unknown Unknown Unknown
[1,16000]<stderr>:libmpi.so.1 00002B547BF5D917 Unknown Unknown Unknown
[1,16000]<stderr>:libmkl_core.so 00002B547AAA1FD1 Unknown Unknown Unknown
[1,16000]<stderr>:libmkl_core.so 00002B547B520484 Unknown Unknown Unknown
[1,16000]<stderr>:libmkl_core.so 00002B547B53A0A7 Unknown Unknown Unknown
[1,16000]<stderr>:libmkl_core.so 00002B547B40E144 Unknown Unknown Unknown
[1,16000]<stderr>:libmkl_core.so 00002B547B40E27E Unknown Unknown Unknown
[1,16000]<stderr>:libmkl_intel_lp64 00002B5479531596 Unknown Unknown Unknown
[1,16000]<stderr>:par_test_dd_metho 00000000006FD95A Unknown Unknown Unknown
[1,16000]<stderr>:par_test_dd_metho 00000000006FBD4A Unknown Unknown Unknown
[1,16000]<stderr>:par_test_dd_metho 00000000006E02EE Unknown Unknown Unknown
[1,16000]<stderr>:par_test_dd_metho 00000000005C514B Unknown Unknown Unknown
[1,16000]<stderr>:par_test_dd_metho 0000000000575777 Unknown Unknown Unknown
[1,16000]<stderr>:par_test_dd_metho 000000000043E61C Unknown Unknown Unknown
[1,16000]<stderr>:par_test_dd_metho 000000000043B1FC Unknown Unknown Unknown
[1,16000]<stderr>:libc.so.6 0000003B77E1ECDD Unknown Unknown Unknown
[1,16000]<stderr>:par_test_dd_metho 000000000043B0F9 Unknown Unknown Unknown
MPI task id 16000 was allocated one entire node (16 cores + 64 GBytes). I have thoroughly tried to find the cause of this segmentation fault, without success so far. I know that it is produced during the sparse direct factorization of a matrix within Intel MKL (PARDISO). I extracted the matrix from the parallel program into a file and factorized it in isolation with a sequential program linked against Intel MKL (PARDISO); that program consumed 5.1 GBytes and no segmentation fault was produced, so a bug in the Intel MKL (PARDISO) code can be ruled out. It should be something related to the parallel environment. I guess that some kind of limit is being exceeded (e.g. stack size?), but I cannot confirm it. The stack size is unlimited (i.e., ulimit -s unlimited). I have also tried OMP_STACKSIZE=32M and MKL_DYNAMIC=FALSE, without success.
Do you have any idea what could be the cause of this SIGSEGV? I could reproduce it on two different machines. As additional info, this segmentation fault does not arise at all when the matrix that task id 16000 has to factorize is smaller.
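One caveat with the limits: the values set in the login shell are not necessarily the ones inherited by a process that the MPI launcher starts on a remote node. A minimal sketch of a check that prints the limits from inside the launched process is given below. It is an illustration only, not something from the runs above; it assumes bash and OpenMPI's OMPI_COMM_WORLD_RANK variable, and report_limits is a made-up helper name.

```shell
#!/bin/bash
# Illustrative diagnostic (hypothetical helper, not part of any library):
# print the resource limits and thread settings as seen by this process.
report_limits() {
    echo "rank=${OMPI_COMM_WORLD_RANK:-unknown} host=$(uname -n)"
    echo "stack_kb=$(ulimit -s)"            # soft stack limit of this process
    echo "virtual_memory_kb=$(ulimit -v)"   # address-space limit
    echo "locked_memory_kb=$(ulimit -l)"    # locked-memory limit
    echo "OMP_STACKSIZE=${OMP_STACKSIZE:-unset}"
    echo "MKL_NUM_THREADS=${MKL_NUM_THREADS:-unset}"
}

# Send the report to stderr so it does not mix with the program's stdout.
report_limits >&2

# Hand over to the real program, if one was given on the command line.
if [ "$#" -gt 0 ]; then
    exec "$@"
fi
```

Placed in front of the executable on the mpirun command line, this shows for each rank whether the "unlimited" stack setting actually reached the node where task id 16000 runs.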
Thanks in advance.
Best regards,
Alberto.




