I have a reproducible slowdown in calculations whose cause is going to be hard to track down, so I welcome suggestions. I do not know whether this is a memory leak, impi related, mkl related or something else.
I first noticed it when one newish E5-2660 node was systematically and reproducibly running mpi calculations at ~1/2 the speed of the others. After a reboot it went back to running at the same speed as the others; nothing else helped.
I have now reproduced a systematic slowdown over a long calculation. It is shown in the attached plot, with iteration on the x axis and time in minutes on the y axis. Starting from ~7 minutes, the slowdown is approximately 6.25 seconds/iteration. This is a fairly big calculation with 64 cores and 835m/core (virtual). There is no indication that this is communications (InfiniBand) related: the slowdown is in CPU time, and wall time remains very close to it.
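To make the slowdown figure checkable, here is a minimal Python sketch of how a per-iteration slowdown can be estimated from a list of iteration times with a least-squares line. The function name `fit_slowdown` is hypothetical, and the sample series is synthetic, built only from the ~7 minute start and 6.25 seconds/iteration quoted above; it is not my measured data.

```python
def fit_slowdown(times_min):
    """Fit a least-squares line to (iteration index, time in minutes).

    Returns (intercept in minutes, slope in seconds per iteration).
    """
    n = len(times_min)
    if n < 2:
        raise ValueError("need at least two iterations to fit a line")
    mean_x = (n - 1) / 2.0            # mean of 0, 1, ..., n-1
    mean_y = sum(times_min) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(times_min))
    slope_min = sxy / sxx             # minutes per iteration
    intercept = mean_y - slope_min * mean_x
    return intercept, slope_min * 60.0


if __name__ == "__main__":
    # Synthetic series (assumption): 7 min start, +6.25 s each iteration.
    synthetic = [7.0 + (6.25 / 60.0) * i for i in range(50)]
    start, slowdown = fit_slowdown(synthetic)
    print("start %.2f min, slowdown %.2f s/iteration" % (start, slowdown))
```

A roughly constant slope over the whole run would point to something accumulating steadily with each iteration rather than a one-off event.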
Obviously recompiling with debugging on is not going to be a viable approach. A scattershot debugging strategy, for instance adding calls to release memory after mkl calls, is also going to be very painful, as we are talking about ~1 day per test and >27000 lines of code written by others, although I have written parts. Ideal would be innovative ideas for tracing why it has gone slow.
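One low-cost thing I could do without recompiling is record the node's memory state once per iteration and see whether anything drifts in step with the timings. Below is a minimal Python sketch, assuming Linux nodes that expose /proc/meminfo. The function names and the choice of fields (MemFree, Cached, Slab, AnonPages) are my own assumptions for illustration, not anything from the code or from Intel's tools.

```python
import time


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field name: integer value}."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue                      # skip lines without "name: value"
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])   # kB for most fields
    return fields


def snapshot_line(text, keys=("MemFree", "Cached", "Slab", "AnonPages"), now=None):
    """Format one log line: timestamp followed by the selected fields."""
    info = parse_meminfo(text)
    stamp = time.time() if now is None else now
    # -1 marks a field that is absent on this kernel.
    values = " ".join("%s=%d" % (key, info.get(key, -1)) for key in keys)
    return "%.0f %s" % (stamp, values)


if __name__ == "__main__":
    with open("/proc/meminfo") as handle:
        print(snapshot_line(handle.read()))
```

Saved under a hypothetical name such as meminfo_snapshot.py, this could be called from the tcsh script after each mpirun step with its output appended to a log file, one line per step on each node.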
For reference, I am using composer_xe_2013.2.146 and Intel impi. I don't see this on older E5410 nodes, but I have not yet run enough iterations there to notice (this is ongoing). The calculations are density-functional theory (DFT) calculations controlled by a tcsh script which repeatedly calls a sequence of different mpi tasks, using mpirun (mpiexec.hydra) for each. This is a stable code (www.wien2k.at) used by ~2000 groups, so while I cannot completely rule out programming errors, it is unlikely to be something simple. The code heavily uses scalapack/blas/lapack calls for everything. In addition to static linking of the mkl scalapack etc., the compilation options (ifort on an E5410) are (edited to remove some local options):
-FR -axAVX -mp1 -I$(MKLINC) -prec_div -r8 -pc80 -fpconstant -traceback -pad -align -O2 \
-ipo1 -i-static -fminshared -xHost -thread -assu buff -no-complex-limited-range -no-fast-transcendentals
I can provide more information if needed, but I cannot provide a simple code to reproduce the problem.