Why does my job not start running?

Hello,

I have a job that requires ncpus=38 and acano02 to run. qstat gives me:

$ qstat
Job id            Name             User      Time Use S Queue
----------------  ---------------- --------  -------- - -----
14193.acaad01     boost-100000-80  aumn-s01  02:02:18 R workq
14194.acaad01     boost-200000-80  aumn-s01  00:55:39 R workq
14195.acaad01     boost-200000-4-  aumn-s01  00:23:17 R workq
14196.acaad01     boost-100000-4-  aumn-s01  0        Q workq
14197.acaad01     ppp-100-4--n     aumn-s01  0        Q workq
14198.acaad01     ppp-100-16--n    aumn-s01  0        Q workq
14201.acaad01     ppp-40000-40--n  aumn-s01  0        Q workq
14202.acaad01     ppp-40000-25--n  aumn-s01  0        Q workq
14203.acaad01     app_v5           cfc-s01   0        Q workq
14204.acaad01     app_v5           cfc-s01   0        Q workq
14205.acaad01     app_v5           cfc-s01   0        Q workq
14208.acaad01     TGScaling        ckrieger  0        Q workq
14209.acaad01     TGScaling2       ckrieger  0        Q workq
14210.acaad01     retime2log       sels      0        Q workq
14211.acaad01     retime2log       sels      0        Q workq
14212.acaad01     retime2log       sels      0        Q workq
14213.acaad01     retime2log       sels      0        Q workq
14214.acaad01     retime2log       sels      0        Q workq
14215.acaad01     retime2log       sels      0        Q workq
$

Only job 14195 is running on acano02, and it needs only 2 CPUs.

From qstat -f 14195:

Resource_List.host = acano02
Resource_List.ncpus = 38
Resource_List.nodect = 1
Resource_List.place = pack
Resource_List.select = 1:host=acano02:ncpus=38
comment = Not Running: No available resources on nodes

From qstat -f 14193:
comment = Job run at Thu Aug 18 at 23:53 on (acano04:ncpus=40)

From qstat -f 14194:
comment = Job run at Fri Aug 19 at 00:58 on (acano03:ncpus=40)

From qstat -f 14195:
comment = Job run at Fri Aug 19 at 01:30 on (acano02:ncpus=2)

So I am thinking: if I only need 38 CPUs, my job should start, right? Instead it remains queued. Or is there no out-of-order execution, i.e. strictly first come, first served, even if later jobs have lower requirements that can be satisfied earlier? If so, why does qstat then say 'No available resources on nodes' (instead of just 'have to queue' or so)?

Thanks and best regards,
Peter
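For reference, a Resource_List like the one shown above would typically come from a submission of roughly this shape. This is a hypothetical reconstruction: the script and application names are assumed, and only the select/place lines are taken from the qstat -f output above.

```shell
# Hypothetical job script producing Resource_List.select = 1:host=acano02:ncpus=38
# and Resource_List.place = pack (script and program names are placeholders).
cat > boost.pbs <<'EOF'
#PBS -l select=1:host=acano02:ncpus=38
#PBS -l place=pack
cd "$PBS_O_WORKDIR"
./boost-100000-4        # placeholder for the actual application
EOF
echo "Submit with: qsub boost.pbs"
```

Pinning the job to a specific host (host=acano02) means the scheduler cannot consider any other node, so the job waits until that one node can satisfy the request.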

Best Reply

Currently all batch nodes are configured in exclusive mode. This means that once a job is scheduled on a node, no matter how many CPUs it requests, it locks out any other job from using the remaining "free" CPUs until it finishes.

The reason for this is that many users don't specify the ncpus that their jobs require, but create (perhaps automatically, with say OpenMP) the maximum number of threads on the batch node. Thus we need to ensure that all jobs have exclusive access to a batch node, no matter how many CPUs they specify when running qsub.

This can be a problem when users use only 1 or 2 CPUs on a batch node and tie up the node for many hours. We monitor the nodes and will inform users that their use of a node is not the most efficient usage model when other users are standing in line wanting to use the majority of the CPUs.

Based on user feedback, we're in the process of trialing a change to our batch system here on the MTL.

We will modify one or more of our batch nodes to run as 'default_shared'. This will allow multiple users to request as many CPU resources as they need to run their jobs, while making the remaining unused CPUs on that particular system available to other users, up to 40 cores.

As most MTL users are running threaded jobs, we need to tie the number of CPUs to the number of threads, typically one thread per CPU. To do this we will (soon) make the ncpus argument to the qsub command a requirement, rather than defaulting to 1. Thus if a user needs only a single-threaded/single-CPU job, they must specify ncpus=1; if a user wants to run a job with multiple threads/cores, then that number of ncpus needs to be specified, currently up to 40.
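Under the trial policy described above, submissions would carry an explicit ncpus value. A minimal sketch, assuming the new requirement (script and program names are placeholders):

```shell
# Single-threaded job: request exactly one CPU.
cat > serial.pbs <<'EOF'
#PBS -l select=1:ncpus=1
cd "$PBS_O_WORKDIR"
./my_serial_app         # placeholder program name
EOF

# Multi-threaded job: request one CPU per thread, here 8.
cat > threads8.pbs <<'EOF'
#PBS -l select=1:ncpus=8
cd "$PBS_O_WORKDIR"
export OMP_NUM_THREADS=8   # keep thread count in step with ncpus
./my_threaded_app          # placeholder program name
EOF
echo "Submit with: qsub serial.pbs (or qsub threads8.pbs)"
```

Keeping OMP_NUM_THREADS equal to ncpus avoids oversubscribing the CPUs the scheduler actually granted on a shared node.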

For users who run jobs that use TBB, OpenMP, or some other library (including Java) that automatically handles thread creation, it is suggested that they specify ncpus=40 so as to allocate all the CPU resources on a particular batch node.
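The whole-node case above can be sketched as follows; this is an illustrative job script under the assumptions stated in the preceding paragraph, with placeholder names:

```shell
# Whole-node request for a runtime that spawns its own threads (TBB, OpenMP, Java).
cat > fullnode.pbs <<'EOF'
#PBS -l select=1:ncpus=40
cd "$PBS_O_WORKDIR"
export OMP_NUM_THREADS=40   # match threads to the 40 allocated CPUs
./my_threaded_app           # placeholder; the library creates the threads itself
EOF
echo "Submit with: qsub fullnode.pbs"
```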

As I said, we're implementing this as a trial. We welcome your feedback on this approach, and we hope it will provide a fairer resource model for our users going forward.
