I am trying to use the MKL PBLAS/ScaLAPACK routine pdgetrf to do an LU decomposition. I wrote a simple Fortran test program, and it worked well with 2*2 processes on the cluster. However, when I tried to use more processes, e.g. 'mpiexec -n 16', the program got stuck.
One possible reason might be that BLAS spawns too many threads, which leads to a performance disaster (for reference: https://icl.cs.utk.edu/lapack-forum/viewtopic.php?f=6&t=3371 ). So I tried exporting OMP_NUM_THREADS=1 or MKL_NUM_THREADS=1, and I tried different combinations of pbs -l select=:ncpu:mpiprocs: when submitting the job, but none of these solved the problem.
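A minimal sketch of the kind of thread-limiting setup described above, in case it helps to reproduce the run. The executable name ./lu_test is a placeholder (not the real program name), and the launch is skipped if mpiexec or the binary is not available:

```shell
#!/bin/sh
# Limit each MPI rank to one thread so that 16 ranks do not
# oversubscribe the cores through threaded BLAS calls.
export OMP_NUM_THREADS=1
export MKL_NUM_THREADS=1

NPROCS=16          # 4*4 process grid
EXE=./lu_test      # hypothetical executable name (assumption)

echo "OMP_NUM_THREADS=$OMP_NUM_THREADS MKL_NUM_THREADS=$MKL_NUM_THREADS"

# Launch only if both the launcher and the binary are present.
if command -v mpiexec >/dev/null 2>&1 && [ -x "$EXE" ]; then
    mpiexec -n "$NPROCS" "$EXE"
else
    echo "skipping launch: mpiexec or $EXE not found"
fi
```

Exporting the variables before mpiexec assumes the launcher forwards the environment to every rank; whether that holds on a given cluster is worth checking by printing the variables from inside the job.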
I have no idea why it works with 2*2 processes but gets stuck with 4*4 or more. I hope someone here can help; any suggestion would be greatly appreciated.
Cluster compiler info:
Intel® Fortran Composer 13.0.1 and MPICH 3.0.