tracejob says 'Job is starving'. What does that mean?

Hello,

I tried tracejob as follows:

[sels@acano01 Debug]$ tracejob 14022
[sels@acano01 Debug]$ tracejob 14023
[sels@acano01 Debug]$ tracejob 14024
Job: 14024.acaad01

08/19/2011 00:03:31  L    Job is starving
08/19/2011 00:08:31  L    Job is starving
08/19/2011 00:13:31  L    Job is starving
08/19/2011 00:18:31  L    Job is starving
08/19/2011 00:23:31  L    Job is starving
08/19/2011 00:28:31  L    Job is starving
08/19/2011 00:33:31  L    Job is starving
08/19/2011 00:38:31  L    Job is starving
08/19/2011 00:43:31  L    Job is starving
08/19/2011 00:48:31  L    Job is starving

[sels@acano01 Debug]$
[sels@acano01 Debug]$ qstat -a 14024

acaad01:
                                                            Req'd  Req'd   Elap
Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
14024.acaad01   sels     workq    retime2log  30040   1  40     -- 10:00 R 02:59

[sels@acano01 Debug]$

This raises two questions:

1) Jobs 14022 and 14023 are already finished, respectively unsuccessfully (terminated at walltime) and successfully (completed within walltime). But I would expect that I could still see some information about these jobs using tracejob. (I think tracejob used to show info on completed jobs before...)

2) What does 'Job is starving' mean? Starving for memory? I suppose that one can set a memory requirement in PBS qsub?... Indeed, as in:

qsub -l mem=200mb /home/user/script.sh

But does MTL allow minimal mem requirements? Since I requested all 40 cpus for this job, how can it be that, with 'only' 2301448kb of memory in use (certainly more is available; how much does acano02 have?), it is starving? Is it starving for resources other than memory?

qstat -f 14024 gives:

[sels@acano01 Debug]$ qstat -f 14024
Job Id: 14024.acaad01
    Job_Name = retime2log
    Job_Owner = sels@acano01
    resources_used.cpupercent = 197
    resources_used.cput = 06:17:52
    resources_used.mem = 2301448kb
    resources_used.ncpus = 40
    resources_used.vmem = 2980716kb
    resources_used.walltime = 03:23:30
    job_state = R
    queue = workq
    server = acaad01
    Checkpoint = u
    ctime = Wed Aug 17 14:05:27 2011
    Error_Path = acano01:/home/sels/projects/KUL/RhinoCeros/retime/Debug/retime2log.e14024
    exec_host = acano02/0*40
    exec_vnode = (acano02:ncpus=40)
    Hold_Types = n
    Join_Path = oe
    Keep_Files = n
    Mail_Points = a
    mtime = Thu Aug 18 21:53:30 2011
    Output_Path = acano01:/home/sels/projects/KUL/RhinoCeros/retime/Debug/retime2log.o14024
    Priority = 0
    qtime = Wed Aug 17 14:05:27 2011
    Rerunable = True
    Resource_List.host = acano02
    Resource_List.ncpus = 40
    Resource_List.nodect = 1
    Resource_List.place = pack
    Resource_List.select = 1:host=acano02:ncpus=40
    Resource_List.walltime = 10:00:00
    stime = Thu Aug 18 21:53:33 2011
    session_id = 30040
    jobdir = /home/sels
    substate = 42
    Variable_List = PBS_O_SYSTEM=Linux,PBS_O_SHELL=/bin/bash,nThreads=2,
        PBS_O_HOME=/home/sels,PBS_O_HOST=acano01,PBS_O_LOGNAME=sels,
        PBS_O_WORKDIR=/home/sels/projects/KUL/RhinoCeros/retime/Debug,
        PBS_O_LANG=en_US.UTF-8,
        PBS_O_PATH=/home/sels/xpressmp/bin:/:/opt/pbs/default/bin:/opt/pbs/default/sbin:/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/home/sels/gurobi451/linux64/bin:/home/sels/cmake-2.8.4/bin:/home/sels/bin,
        PBS_O_MAIL=/var/spool/mail/sels,PBS_O_QUEUE=routeq
    comment = Job run at Thu Aug 18 at 21:53 on (acano02:ncpus=40)
    etime = Wed Aug 17 14:05:27 2011
    Submit_arguments = -lncpus=40 -vnThreads=2 retime2log.sh

I have now qdel-ed 14024 since I did not expect it to go well... Now, apparently, I can still see tracejob info on this job, even (seconds) after its termination
(see below):

[sels@acano01 Debug]$ tracejob 14024
Job: 14024.acaad01

08/19/2011 01:26:04  L    Job is starving
08/19/2011 01:30:08  S    delete job request received
08/19/2011 01:30:08  S    Job sent signal TermJob on delete
08/19/2011 01:30:08  S    Job to be deleted at request of sels@acano01
08/19/2011 01:30:27  S    delete job request received
08/19/2011 01:30:27  S    Job sent signal TermJob on delete
08/19/2011 01:30:28  S    Job to be deleted at request of sels@acano01
08/19/2011 01:30:38  S    Obit received momhop:1 serverhop:1 state:4 substate:42
08/19/2011 01:30:38  S    Obit received momhop:1 serverhop:1 state:5 substate:51
08/19/2011 01:30:40  S    Post job file processing error
08/19/2011 01:30:40  S    Released 40 cpu licenses, float avail global 456, float avail local 40, used locally 80
08/19/2011 01:30:40  S    Exit_status=0 resources_used.cpupercent=197 resources_used.cput=06:44:06 resources_used.mem=2301448kb resources_used.ncpus=40 resources_used.vmem=2980716kb resources_used.walltime=03:37:07

[sels@acano01 Debug]$

Does tracejob forget info on (very) old jobs then?

cheers,
Peter
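P.S. To make the two questions more concrete, here is a minimal sketch of the commands I have in mind. It assumes the standard PBS Pro behaviour of tracejob (its -n flag sets how many days of log files to search, default 1) and the standard -l resource syntax of qsub; I have not checked how MTL is configured, so the numbers are placeholders, not recommendations.

# Question 1: search further back than the default one day of logs
# for an already-finished job (here: the last 7 days)
tracejob -n 7 14022

# Question 2: submit with an explicit memory request alongside the cpu request
# (4gb is just an example value)
qsub -l select=1:ncpus=40:mem=4gb -l walltime=10:00:00 retime2log.sh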
