I need to compute products of large distributed matrices, and for that I use a hybrid MPI/OpenMP strategy. As a check, I ran some timing tests with DGEMM and PDGEMM on a single node of a cluster, using 48 cores with OpenMP. I assumed that the two functions should be essentially equivalent on a single node. However, I observed a dramatic performance difference. DGEMM scales almost perfectly with the number of OpenMP threads, and computing the product of two dense 10000 x 10000 matrices takes about 7 s on 48 cores. PDGEMM scales much worse, and the same operation on the same matrices and the same machine takes about 60 s. Is this behavior understood, and is there a way to fix it?
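To put the two timings in perspective, here is a small sketch that converts a measured runtime into a sustained rate. It assumes the conventional 2n^3 operation count for a dense n x n matrix product. The function names `gemm_flops` and `gemm_gflops` are made up for this illustration and are not part of MKL.

```cpp
#include <cassert>

// Floating-point operations for C = A * B with square n x n matrices,
// using the conventional 2 * n^3 count (one multiply and one add per
// inner-product step). This is an assumed cost model, not a measurement.
double gemm_flops(double n) {
    return 2.0 * n * n * n;
}

// Sustained rate in GFLOP/s for a measured wall-clock time in seconds.
double gemm_gflops(double n, double seconds) {
    assert(seconds > 0.0);  // a zero or negative time makes no sense here
    return gemm_flops(n) / seconds / 1e9;
}
```

With the numbers above, `gemm_gflops(10000, 7.0)` gives roughly 286 GFLOP/s for DGEMM, while `gemm_gflops(10000, 60.0)` gives roughly 33 GFLOP/s for PDGEMM, so the gap is close to a factor of nine.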
I use the Intel C++ compiler 12.1 and the corresponding MKL. The link line looks like this:
/usr/local/intel-lcs-2012.0.032/mkl/lib/intel64/libmkl_scalapack_lp64.a /usr/local/intel-lcs-2012.0.032/mkl/lib/intel64/libmkl_blacs_openmpi_lp64.a -Wl,--start-group /usr/local/intel-lcs-2012.0.032/mkl/lib/intel64/libmkl_intel_lp64.a /usr/local/intel-lcs-2012.0.032/mkl/lib/intel64/libmkl_intel_thread.a /usr/local/intel-lcs-2012.0.032/mkl/lib/intel64/libmkl_core.a -Wl,--end-group