I am currently benchmarking a small cluster on a vendor's system and noticed a very severe slowdown with 64 cores, and a slight one even with 32 cores. The slowdown is specific to these two routines, which take twice as long with 64 cores as they do with 32. And, yes, I really do mean twice as long.
The vendor has composer_xe_2011_sp1.11.339, which I used for the tests. The MKL from composerxe-2011.3.174 (which I had access to) is slightly better, but not by much. From /proc/cpuinfo, these are Intel Xeon CPU E5-2660 0 @ 2.20GHz machines, 16 cores per node with IB, openmpi-1.4.5. Any suggestions? (It is not a coding issue or anything else; the code being used is a standard DFT code.)