I'm facing a performance issue with a scientific application written in Fortran: it runs fast on a single node but much slower across multiple nodes. For example, a 16-core job on a single node finishes in 1 hr 2 min, but the same job spread over two nodes (8 cores per node, with the remaining 8 cores on each node left idle) takes 3 hr 20 min, more than three times as long. The code is compiled with ifort-13.1.1 and openmpi-1.4.5, and linked against the Intel MKL libraries: LAPACK, BLAS, ScaLAPACK, BLACS and FFTW. What could be causing this?
I suspect the problem lies with the Intel MKL libraries (ScaLAPACK and BLACS), because the HPL benchmark compiled with the Intel compilers and Open MPI produces equivalent, accurate results on both single-node and multi-node runs.