Intel MPI and torque integration: the "elapsed time" displayed under torque is always 00:00

Hi,

We have a small cluster running OSCAR on CentOS 5.5, with Torque and the Intel Cluster Toolkit (ICT). Both are configured and jobs currently run without problems, but the "elapsed time" that Torque displays with "qstat -a" is always 0. :'(

If we switch to OpenMPI, the elapsed time of the running jobs is updated correctly.

Is this a known issue? Is there a solution?

Best regards,
Guillaume


It's a known issue that comes up from time to time (e.g., elsewhere on this forum: http://software.intel.com/en-us/forums/showthread.php?t=76537 ). The issue is that Intel MPI isn't yet tightly integrated with Torque, so information like CPU time doesn't get propagated back: Torque doesn't know which processes running on the node belong to the job. OpenMPI, on the other hand, can be compiled with explicit Torque support (but if you don't build it that way, you'll see the same issues).
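
To check whether a given OpenMPI build has the Torque support mentioned above, one option is to look for the "tm" components in the output of ompi_info. A minimal sketch, assuming ompi_info is on the PATH; the helper name has_tm_support is made up for illustration:

```shell
#!/bin/sh
# Succeeds (exit status 0) if the ompi_info output read from stdin
# lists a Torque "tm" launcher or resource-allocation component.
has_tm_support() {
    grep -Eq 'MCA (plm|pls|ras): tm '
}

# Only query Open MPI if it is actually installed on this machine.
if command -v ompi_info >/dev/null 2>&1; then
    if ompi_info | has_tm_support; then
        echo "Open MPI reports Torque (tm) components"
    else
        echo "no tm components found; expect the same accounting issues"
    fi
fi
```

If no tm component is listed, the build was most likely configured without Torque support.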

Issues like elapsed CPU time are a nuisance, but this lack of integration can mean bigger problems if jobs fail: their processes won't be cleaned up properly when the job ends. Suspend/resume becomes impossible, too.

Rumour has it that the next version of Intel MPI, due out for SC10 in November, will have better Torque integration. Until then, using OSC's mpiexec launcher ( http://www.osc.edu/~djohnson/mpiexec/index.php ) instead of the launchers that come with Intel MPI is supposed to work.
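
As a rough illustration of that workaround, a Torque job script could call OSC mpiexec instead of Intel MPI's own launcher. The install path, application name, and resource requests below are placeholders, and depending on how mpiexec was built it may need extra options to select the MPI startup protocol, so treat this as a sketch rather than a tested recipe:

```shell
#!/bin/sh
#PBS -N impi_job
#PBS -l nodes=2:ppn=4
#PBS -l walltime=01:00:00

# Placeholder values (assumptions, not from this thread): adjust for your site.
OSC_MPIEXEC=${OSC_MPIEXEC:-/opt/osc-mpiexec/bin/mpiexec}
APP=${APP:-./my_mpi_app}

# Print the command line that the job would run.
launch_cmd() {
    printf '%s %s\n' "$OSC_MPIEXEC" "$APP"
}

if [ -n "${PBS_JOBID:-}" ]; then
    # Inside a Torque job: start from the submission directory and let
    # OSC mpiexec spawn the ranks through Torque's TM interface, so that
    # pbs_mom can account for them.
    cd "$PBS_O_WORKDIR" || exit 1
    exec "$OSC_MPIEXEC" "$APP"
else
    # Outside Torque: only show what would be run.
    echo "not inside a Torque job; would run: $(launch_cmd)"
fi
```

Submitted with qsub, the script takes the first branch; run by hand, it just prints the command.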

--
Jonathan Dursi

OK, thanks for the reply. So I will wait a little bit.

Have a nice day

Hi,

the original problem was with this "initial" torque configuration:

# config of TORQUE:
create queue batch
set queue batch queue_type = Execution
set queue batch resources_max.cput = 168:00:00
set queue batch resources_default.nodes = 1
set queue batch resources_default.walltime = 01:00:00
set queue batch enabled = True
set queue batch started = True
#
# Set server attributes.
#
set server scheduling = True
set server acl_hosts = 1.
set server acl_roots = root@*
set server managers = root@*.
set server managers += sysgen@*.
set server operators = root@*.
set server operators += sysgen@*.
set server default_queue = batch
set server log_events = 511
set server mail_from = adm
set server scheduler_iteration = 600
set server node_check_rate = 150
set server tcp_timeout = 6
set server pbs_version = 2.1.10
set server allow_node_submit = True

I replaced it with the following Torque configuration, and the problem disappeared:
#
# Create queues and set their attributes.
#
#
# Create and define queue long
#
create queue long
set queue long queue_type = Execution
set queue long Priority = 50
set queue long resources_max.walltime = 72:00:00
set queue long max_user_run = 10
set queue long enabled = True
set queue long started = True
#
# Create and define queue default
#
create queue default
set queue default queue_type = Route
set queue default Priority = 50
set queue default max_running = 48
set queue default route_destinations = small
set queue default route_destinations += long
set queue default enabled = True
set queue default started = True
#
# Create and define queue small
#
create queue small
set queue small queue_type = Execution
set queue small Priority = 100
set queue small resources_max.walltime = 02:00:00
set queue small max_user_run = 10
set queue small enabled = True
set queue small started = True
#
# Set server attributes.
#
set server scheduling = True
set server acl_hosts = 1.
set server acl_roots = root@*
set server managers = root@*.
set server managers += sysgen@*.
set server operators = root@*.
set server operators += sysgen@*.
set server default_queue = default
set server log_events = 511
set server mail_from = adm
set server query_other_jobs = True
set server scheduler_iteration = 600
set server node_check_rate = 150
set server tcp_timeout = 6
set server pbs_version = 2.1.10
set server allow_node_submit = True
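
For anyone comparing the two configurations, a quick way to see which settings changed is to diff the two dumps while ignoring comment lines. A small sketch, assuming each configuration was saved to a file (for example with qmgr -c 'print server'); the file names and the helper config_diff are hypothetical:

```shell
#!/bin/sh
# Print the settings that differ between two saved qmgr dumps.
# "<" marks lines found only in the first file, ">" only in the second.
config_diff() {
    old_sorted=$(mktemp) || return 1
    new_sorted=$(mktemp) || return 1
    # Drop comment lines and blank lines, then sort so that the order
    # of the statements does not show up as a difference.
    grep -v '^#' "$1" | grep -v '^[[:space:]]*$' | sort > "$old_sorted"
    grep -v '^#' "$2" | grep -v '^[[:space:]]*$' | sort > "$new_sorted"
    diff "$old_sorted" "$new_sorted" | grep '^[<>]'
    rm -f "$old_sorted" "$new_sorted"
}

# Hypothetical file names; the comparison only runs if both files exist.
if [ -f old.qmgr ] && [ -f new.qmgr ]; then
    config_diff old.qmgr new.qmgr
fi
```

Sorting loses the original statement order, which is fine for spotting changed values but means the output should not be fed back into qmgr.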

Best regards
