Memory continuously increases when I run CCSM with Intel MPI. Why?


I run CCSM with Intel MPI. Memory usage is small at startup, about 1 GB, but it increases continuously. About 1 hour later it reaches 24 GB, the system hangs, and I have to restart it.

I used:
l_mpi_pu_4.0.0.027
l_cproc_p_11.1.069
l_cprof_p_11.1.069
l_mkl_p_10.2.4.032.tar

I use a QLogic InfiniBand switch and HCA adapter. My command line is:

mpirun -nolocal -machinefile mpd.hosts -genv I_MPI_FABRICS tmi \
    -np $NTASKS[1] $EXEROOT/all/$COMPONENTS[1] : \
    -np $NTASKS[2] $EXEROOT/all/$COMPONENTS[2] : \
    -np $NTASKS[3] $EXEROOT/all/$COMPONENTS[3] : \
    -np $NTASKS[4] $EXEROOT/all/$COMPONENTS[4] : \
    -np $NTASKS[5] $EXEROOT/all/$COMPONENTS[5]


My complete run script:
#!/bin/csh -f
#===============================================================================
# This is a CCSM batch job script for latecomer
#===============================================================================

## BATCH INFO
#$ -S /bin/csh -cwd
#$ -N ccsm_latecomer
#$ -pe climate 208
#$ -q climateq

#-----------------------------------------------------------------------
# Determine necessary environment variables
#-----------------------------------------------------------------------

cd /lustrefs/soa02/home/ccsm/model/ccsm3/roytest/test1

setenv MACH latecomer

source env_conf || "problem sourcing env_conf" && exit -1
source env_run || "problem sourcing env_run" && exit -1
source env_mach.latecomer || "problem sourcing env_mach.latecomer" && exit -1

## Warning: SCRATCH not defined in system environment. Set SCRATCH to be /lustrefs/soa02/home/ccsm/model/ccsm3/

#-----------------------------------------------------------------------
# Resolved task/thread counts
# This is provider as user information only
# These are csh comments, DO NOT UNCOMMENT
#-----------------------------------------------------------------------

### COMPONENTS = ( cpl csim clm pop cam )
### NTASKS_CPL=16 NTHRDS_CPL=1
### NTASKS_ICE=16 NTHRDS_ICE=1
### NTASKS_LND=16 NTHRDS_LND=1
### NTASKS_OCN=96 NTHRDS_OCN=1
### NTASKS_ATM=64 NTHRDS_ATM=1

#-----------------------------------------------------------------------
# Determine time-stamp/file-ID string
#-----------------------------------------------------------------------

setenv LID "`date +%y%m%d-%H%M%S`"

# -------------------------------------------------------------------------
# Run machine dependent module commands
# -------------------------------------------------------------------------

if (-f modules.$MACH) then
  echo sourcing modules.$MACH
  source modules.$MACH || exit 1
endif

# -------------------------------------------------------------------------
# Build the models
# -------------------------------------------------------------------------

./${CASE}.${MACH}.build || exit 1

# -------------------------------------------------------------------------
# Create processor count input files
# -------------------------------------------------------------------------

cp mpd.hosts $EXEROOT/all
cd $EXEROOT/all
@ PROC = 0   # counts total number of tasks
echo "0" >! mpirun.pgfile1;
foreach n (1 2 3 4 5)
  set comp = $COMPONENTS[$n]
  set model = $MODELS[$n]
  set nthrd = $NTHRDS[$n]
  set ntask = $NTASKS[$n]
  @ M = 0
  while ( $M < $ntask )
#   @ M++
#   @ PROC++
    if (($n == 1) && ($M == 0)) then
      echo "skipping first model"
    else
      echo "1 $EXEROOT/all/$comp" >>! mpirun.pgfile1;
    endif
    @ M++
    @ PROC++
  end
  ln -s $EXEROOT/$model/$comp $EXEROOT/all/.   # link binaries into all dir
end

# -------------------------------------------------------------------------
# Run the model
# -------------------------------------------------------------------------

env | egrep '(MP_|LOADL|XLS|FPE|DSM|OMP|MPC)'   # document env vars
cd $EXEROOT/all
echo "`date` -- CSM EXECUTION BEGINS HERE"
#mpdboot -n 50 -r rsh -f /lustrefs/soa02/home/ccsm/model/ccsm3/roytest/test1/mpd.hosts
#mpiexec -nolocal -genv I_MPI_FABRICS tmi -genv I_MPI_DEBUG 5 \
mpirun -nolocal -machinefile mpd.hosts -genv I_MPI_FABRICS tmi -genv TMI_DEBUG 1 \
#mpirun -machinefile mpd.hosts -genv -I_MPI_TMI_PROVIDER psm \
#mpirun -machinefile mpd.hosts \
  -np $NTASKS[1] $EXEROOT/all/$COMPONENTS[1] : \
  -np $NTASKS[2] $EXEROOT/all/$COMPONENTS[2] : \
  -np $NTASKS[3] $EXEROOT/all/$COMPONENTS[3] : \
  -np $NTASKS[4] $EXEROOT/all/$COMPONENTS[4] : \
  -np $NTASKS[5] $EXEROOT/all/$COMPONENTS[5]
wait
#mpdallexit
echo "`date` -- CSM EXECUTION HAS FINISHED"

# -------------------------------------------------------------------------
# Save model output stdout and stderr
# -------------------------------------------------------------------------

cd $EXEROOT/cpl
set CplLogFile = `ls -1t cpl.log* | head -1`
grep 'end of main program' $CplLogFile || echo "Model did not complete - see $CplLogFile" && exit -1
cd $EXEROOT
gzip */*.$LID
if ($LOGDIR != "") then
  if (! -d $LOGDIR/bld) mkdir -p $LOGDIR/bld || echo " problem in creating $LOGDIR/bld" && exit -1
  cp -p */*buildexe*$LID.* $LOGDIR/bld || echo "Error in copy of logs " && exit -1
  cp -p */*log*$LID.* $LOGDIR || echo "Error in copy of logs " && exit -1
endif

# -------------------------------------------------------------------------
# Perform short term archiving of output
# -------------------------------------------------------------------------

if ($DOUT_S == 'TRUE') then
  echo "Archiving ccsm output to $DOUT_S_ROOT"
  echo "In $CASEROOT directory using the short term archiving script ccsm_s_archive.csh"
  cd $CASEROOT; $UTILROOT/Tools/ccsm_s_archive.csh
endif

# -------------------------------------------------------------------------
# Submit longer term archiver if appropriate
# -------------------------------------------------------------------------

if ($DOUT_L_MS == 'TRUE' && $DOUT_S == 'TRUE') then
  echo "Long term archiving ccsm output using the script $CASE.$MACH.l_archive"
  qsub $CASE.$MACH.l_archive
endif

# -------------------------------------------------------------------------
# Resubmit another run script
# -------------------------------------------------------------------------

set echo
cd $CASEROOT
source env_run
if ($RESUBMIT > 0) then
  echo RESUBMIT is $RESUBMIT
  @ RESUBMIT = $RESUBMIT - 1
  echo RESUBMIT is $RESUBMIT
  sed '1,/^ *setenv *CONTINUE_RUN .*/s//setenv CONTINUE_RUN TRUE/' \
    env_run > env_run.tmp; mv env_run.tmp env_run
  sed "s/^ *setenv *RESUBMIT .*/setenv RESUBMIT $RESUBMIT/;" \
    env_run > env_run.tmp; mv env_run.tmp env_run
  qsub $CASE.$MACH.run
endif
endif

Hi Zhang,

Just to narrow down the problem, could you please try to use another provider?

I'll contact the author of the tmi provider and let him know about a potential memory leak.

Regards!
Dmitry

Thank you!
Do you mean the provider is the problem? The provider is tmi; did tmi cause this error? If I use rdma and shm:tmi, I get an error at the beginning.

Zhang,

I don't know the real cause of that error. It might be that the application itself consumes the memory; we cannot tell yet.

What error do you get? Could you provide details? Your command line and the output with I_MPI_DEBUG set to 9 could help us understand the reason for these failures.

Regards!
Dmitry
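
To help tell whether the growth is in one model component or in every MPI process, resident memory can be logged per executable while the job runs. The following is a minimal bash sketch, not part of CCSM or Intel MPI. It assumes a Linux node with /proc mounted and that the processes are named after the component executables from the run script (cpl, csim, clm, pop, cam). The function names rss_kib, total_rss_kib, and log_components are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch only: per-component resident memory logger (Linux /proc assumed).

# Print the resident set size (kB) of one process, read from /proc/PID/status.
rss_kib() {
    local key value _
    [ -r "/proc/$1/status" ] || return 1
    while read -r key value _; do
        if [ "$key" = "VmRSS:" ]; then
            echo "$value"
            return 0
        fi
    done < "/proc/$1/status"
    return 1
}

# Sum the RSS of all processes on this node whose command name equals $1.
total_rss_kib() {
    local name=$1 total=0 comm_file comm pid rss
    for comm_file in /proc/[0-9]*/comm; do
        # A process may exit while we scan; skip it silently.
        comm=$(cat "$comm_file" 2>/dev/null) || continue
        [ "$comm" = "$name" ] || continue
        pid=${comm_file#/proc/}
        pid=${pid%/comm}
        rss=$(rss_kib "$pid")
        total=$(( total + ${rss:-0} ))
    done
    echo "$total"
}

# Print "time component total_kB" for each name, COUNT times, INTERVAL s apart.
log_components() {
    local interval=$1 count=$2 i name
    shift 2
    for (( i = 0; i < count; i++ )); do
        for name in "$@"; do
            printf '%s %s %s\n' "$(date +%H:%M:%S)" "$name" "$(total_rss_kib "$name")"
        done
        sleep "$interval"
    done
}
```

For example, after sourcing the file on a compute node, `log_components 60 90 cpl csim clm pop cam > rss.log` would take one sample per minute for 90 minutes. If every component grows at a similar rate, that points toward the communication layer rather than one model's own allocations, although it does not prove it.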

Hi Zhang,

I've got an answer from the developer of the TMI module: "A memory leak was recently discovered in the tmi module with non-contiguous messages. It was fixed."
Unfortunately I don't know when the updated library will be available. If you need the new library, you need to create a tracker at http://premier.intel.com.

BTW: It would be better to use "-env I_MPI_FABRICS shm:tmi", so that shared memory is used as well.

Regards!
Dmitry
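
As an illustration of the two suggestions in this thread (the shm:tmi fabric setting and I_MPI_DEBUG set to 9), here is a small bash sketch that only prints the resulting MPMD command line instead of running it. The helper name ccsm_mpirun_cmd is hypothetical. The task counts are the ones listed in the run script comments, and -genv is used as in the original command so that the settings apply to all five executables; that choice is an assumption about the intent of the "-env" suggestion.

```bash
#!/usr/bin/env bash
# Sketch only: print (do not run) an mpirun command for the CCSM components.
# Usage: ccsm_mpirun_cmd FABRICS DEBUG_LEVEL EXEROOT NTASKS:COMPONENT...
# Pass an empty DEBUG_LEVEL ("") to leave I_MPI_DEBUG unset.
ccsm_mpirun_cmd() {
    local fabrics=$1 debug=$2 exeroot=$3 spec sep=""
    shift 3
    local cmd="mpirun -nolocal -machinefile mpd.hosts -genv I_MPI_FABRICS $fabrics"
    if [ -n "$debug" ]; then
        cmd+=" -genv I_MPI_DEBUG $debug"
    fi
    for spec in "$@"; do
        # Each spec is "<number of tasks>:<component executable>".
        cmd+="$sep -np ${spec%%:*} $exeroot/all/${spec#*:}"
        sep=" :"
    done
    echo "$cmd"
}

# Task counts and component order taken from the run script comments above.
ccsm_mpirun_cmd shm:tmi 9 '$EXEROOT' 16:cpl 16:csim 16:clm 96:pop 64:cam
```

The printed line can be compared against the mpirun invocation in the run script before changing it; the per-process debug output would then appear in the job's standard output.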

Thank you!

I want to know where I can get the updated library, even if it is a beta version.

Can you get the updated library from the developer of the TMI module?

Zhang,

Unfortunately we cannot provide it on the ISN forum. You need to submit a tracker via premier.intel.com, and we will be able to attach the new library to that tracker.
The library still needs to be built. The issue has just been fixed and the new library is not ready yet.

Regards!
Dmitry

I can't log in to premier.intel.com. Why? How can I log in? I get this message:

Welcome to Intel Premier Support.

We were unable to authenticate your access to the Intel Premier Support web site. Please check that your login ID and password were entered correctly and that the URL used was "https://premier.intel.com".

If you have forgotten your login or password, the fastest method to gain access to the system is to use the automated login and password links Forgot your password or Forgot your Login ID on the login page.

If you continue to have problems, please contact Intel Customer Support via email at quadsupport@mailbox.intel.com.

Zhang,

If you bought an Intel product, you can register it at http://registrationcenter.intel.com
and you'll get a login ID and password. After that you can submit a request at Premier.

Have you registered your product?

Regards!
Dmitry

Yes, I can log in now. Thank you!
