Intel MPI 4.0 and Torque problem

Hello

When I use the qdel command with Torque and Intel MPI 4.0, the job disappears from the Torque queue, but on the node the processes keep running. Only the parent process (mpdboot) goes away, not the executable.

Is this a problem with MPI or with Torque?

Can somebody help me, please?

Regards
Joaquin

Hi Joaquin,

Unfortunately, the Intel MPI Library does not offer tight integration with the Torque scheduler at this time (but some is coming in the next version). The following discussion seems to relate to your problem, although the customer there is using MPICH.

Starting with Intel MPI Library 4.0, we've added support for the Hydra process manager (as part of the new MPICH2 Nemesis architecture), which might help in your case. I'd suggest taking a look at the Hydra section of the Reference Manual (located in the /doc directory on your cluster) and trying out the new mpiexec.hydra launcher. As I mentioned, we'll be looking to improve the scheduler integration even further with our next release in the fall.

Alternatively, you can take a look at OSC's mpiexec tool. It has support for Intel MPI Library binaries as well.

Regards,
~Gergana

Gergana Slavova
Technical Consulting Engineer
Intel® Cluster Tools
E-mail: gergana.s.slavova_at_intel.com

Thank you so much for the answer. It's really bad news for me, because with Intel MPI I get very good run times in simulations with the Quantum ESPRESSO software.

By the way, two additional questions:

1.- What is the best scheduler to use with Intel MPI?
2.- Does the next release have a planned date? Is it certain that it will have better integration with Torque?

Regards
Joaquin

Hi Joaquin,

I'm glad to hear that you're getting really good performance results with the Intel MPI Library, at least :)

1. There's really no best scheduler for the Intel MPI Library. We support all of them fairly consistently. We've had pretty good experiences with some of the more popular ones (PBS Pro, LSF, etc.), but in general, support should be uniform.

2. Tentatively, our next release will be in time for the SC10 conference in New Orleans, LA. So, late October or early November.

In the meantime, I would still suggest you give the new mpiexec.hydra a try. That's included in your current 4.0 version of the library.

Regards,
~Gergana

Dear Gergana

Unfortunately, mpiexec.hydra has the same problem with Torque: when a job does not finish cleanly, the processes remain behind as zombies or sleeping processes that still use CPU.

I will try mpiexec next week.

Thanks for everything
Joaquin Peralta

Joaquin,

We have the same problems as you. These are common issues, because the Intel MPI layer does not use the TM protocol, which is the native way Torque/PBS starts remote tasks in a cluster. The lack of integration with the batch scheduler has several ramifications. The scheduler cannot track resource usage, so jobs that abuse memory, for instance, cannot be killed automatically by the batch system. Another, more subtle issue is that a job cannot be preempted/suspended and resumed, because the scheduler does not know which processes to apply job control to.
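To make the tracking problem concrete, here is a minimal Python sketch of the general POSIX mechanism. It is not Torque's or Intel MPI's actual code, and it is Linux-only because it reads /proc. It assumes that the batch system cleans up a job by killing the job's session/process group. A worker that leaves that group (here via `setsid()`, i.e. `start_new_session=True`) is then missed, just as an executable started by an external launcher daemon such as mpd is missed. In the real setup that executable is usually not even a descendant of the job; the sketch simply uses a new session to put it outside the job's group. The names `JOB_SCRIPT`, `is_running`, `wait_for` and `demo` are hypothetical, made up for this illustration.

```python
import os
import signal
import subprocess
import sys
import time

# Hypothetical stand-in for a batch job: it starts one worker inside its own
# session (visible to the batch system) and one worker in a new session
# (invisible to it), prints both PIDs, then idles.
JOB_SCRIPT = r"""
import subprocess, sys, time
sleeper = [sys.executable, "-c", "import time; time.sleep(60)"]
tracked = subprocess.Popen(sleeper)                            # stays in the job's group
untracked = subprocess.Popen(sleeper, start_new_session=True)  # setsid(): leaves it
print(tracked.pid, untracked.pid, flush=True)
time.sleep(60)
"""


def is_running(pid):
    """Linux-only: True if pid exists and is not a zombie (reads /proc/<pid>/stat)."""
    try:
        with open(f"/proc/{pid}/stat") as f:
            # The state field follows the parenthesised command name.
            fields = f.read().rsplit(")", 1)[1].split()
    except (OSError, IndexError):
        return False
    return bool(fields) and fields[0] not in ("Z", "X")


def wait_for(predicate, timeout=5.0):
    """Poll predicate until it returns True or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(0.05)
    return predicate()


def demo():
    # start_new_session=True makes the job its own session and group leader,
    # so job.pid is also the process-group ID we kill below.
    job = subprocess.Popen([sys.executable, "-c", JOB_SCRIPT],
                           stdout=subprocess.PIPE, text=True,
                           start_new_session=True)
    tracked_pid, untracked_pid = map(int, job.stdout.readline().split())
    job.stdout.close()

    # Assumed cleanup model: kill everything in the job's process group.
    os.killpg(job.pid, signal.SIGKILL)
    job.wait()

    tracked_gone = wait_for(lambda: not is_running(tracked_pid))
    untracked_alive = is_running(untracked_pid)
    if untracked_alive:
        # The leftover has to be cleaned up by hand, like the orphaned executables.
        os.kill(untracked_pid, signal.SIGKILL)
    return tracked_gone, untracked_alive


if __name__ == "__main__":
    tracked_gone, untracked_alive = demo()
    print("tracked worker killed:", tracked_gone)
    print("untracked worker survived the group kill:", untracked_alive)
```

The expected result on Linux is that the tracked worker dies with its job while the untracked one keeps running until it is killed separately, which is the same pattern as the processes left behind after qdel. A TM-based launcher avoids this by having the batch system itself start (and therefore know about) every remote task.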

This lack of integration is a show stopper for HPC centers that rely on heavy scheduling to make better use of scarce cluster resources.

As Gergana mentioned, you can download the Ohio Supercomputer Center's PBS mpiexec command, which can replace Intel's mpirun. We have used it to successfully launch and track Intel MPI jobs with Torque/Maui. However, it is external to both MPI and the scheduler, so we are a little reluctant to recommend it wholesale to our users.

I hope that we will see native Torque/Intel MPI integration, and I am sure a long list of centers wish for the same. In my opinion, the PMI/Python "protocol" should be banned...

Regards,

Michael

R/D High-Performance Computing and Engineering