Multiple MPI Jobs On A Single Node


hiewnh@ihpc.a-star.edu.sg

I have a cluster of 8-socket quad-core systems running Red Hat 5.2. Whenever I run multiple MPI jobs on a single node, all the jobs end up on the same processors. For example, if I submit 4 8-way jobs to a single box, they all land on CPUs 0 to 7, leaving CPUs 8 to 31 idle.

I then tried all sorts of I_MPI_PIN_PROCESSOR_LIST combinations, but short of explicitly listing out the processors for each run, the jobs still end up stuck on CPUs 0-7. Browsing through the mpiexec script, I realise that it runs taskset on each launch.
Since my jobs are all submitted through a scheduler (PBS in this case), I cannot know at submission time which CPUs are in use. Is there a simple way to tell mpiexec to set the taskset affinity correctly on each run so that it chooses only the idle processors?
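For reference, the manual workaround described above (explicitly listing processors per run) can be sketched as shell arithmetic. JOB_INDEX and PPN here are hypothetical values that a wrapper script would have to supply; the scheduler does not provide them, which is exactly the problem:

```shell
#!/bin/sh
# Hypothetical helper: map a per-node job index to an explicit core range.
# job_index and ppn (processes per node) are assumptions, not scheduler output.
pin_list() {
    job_index=$1
    ppn=$2
    first=$(( job_index * ppn ))
    last=$(( first + ppn - 1 ))
    echo "${first}-${last}"
}

# The second 8-way job on a 32-core node would get cores 8-15:
pin_list 1 8

# ...which could then be passed to mpiexec, e.g.:
# mpiexec -genv I_MPI_PIN_PROCESSOR_LIST "$(pin_list 1 8)" -n 8 ./a.out
```

This only works if something tracks which job index each run gets on the node, which is precisely the bookkeeping the question is trying to avoid.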
Thanks.

draceswbell.net
Quoting - hiewnh@ihpc.a-star.edu.sg (original question quoted above)

Use "-genv I_MPI_PIN disable" to resolve the immediate problem of multiple jobs pinning to the same cores. We use SGE at our site, but the root issue is the same. The interaction between the scheduler and MPI could use better definition on current systems. If performance is a big concern, you might want to consider allocating only whole compute nodes (all cores) to jobs.

At the very least, you should probably make this a site-wide default if your scheduler will keep assigning partial nodes to jobs.
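A minimal sketch of the suggested fix; the commented mpiexec line is illustrative, and the binary name and process count are placeholders:

```shell
#!/bin/sh
# Disable Intel MPI's process pinning so the OS scheduler can spread
# concurrent jobs across free cores instead of stacking them all on 0-7.
export I_MPI_PIN=disable

# Per-job form, passed on the command line (illustrative placeholders):
# mpiexec -genv I_MPI_PIN disable -n 8 ./a.out

echo "$I_MPI_PIN"
```

Setting the variable in a site-wide environment file (rather than per job) is what the "site-wide default" suggestion above amounts to.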

Gergana Slavova (Intel)
Hi,

Certainly, disabling process pinning altogether (by setting I_MPI_PIN=off) is a viable option.

Another workaround we recommend is to let the Intel MPI Library define processor domains for your system but let the OS handle placement onto available "free" cores. To do so, simply set I_MPI_PIN_DOMAIN=auto. You can set it either for all jobs on the node or just for each subsequent job (job 1 will still be pinned to cores 0-7).

What's really going on behind the scenes is that, since domains are defined as #cores/#procs, we're setting the #cores here to be equal to the #procs (so you have 1 core per domain).

Note that you can only use this if you have Intel MPI Library 3.1 Build 038 or newer.
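A sketch of this workaround, assuming Intel MPI Library 3.1 Build 038 or newer; the commented mpiexec line is illustrative and ./a.out and -n 8 are placeholders:

```shell
#!/bin/sh
# Let Intel MPI define one-core domains (#cores == #procs, as explained
# above) but leave placement to the OS, so later jobs can land on idle cores.
export I_MPI_PIN_DOMAIN=auto

# Equivalent per-job form (illustrative):
# mpiexec -genv I_MPI_PIN_DOMAIN auto -n 8 ./a.out

echo "$I_MPI_PIN_DOMAIN"
```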

I hope this helps. Let me know if this improves the situation.

Regards,
~Gergana

Gergana Slavova
Technical Consulting Engineer
Intel® Cluster Tools
E-mail: gergana.s.slavova_at_intel.com
zhubq

Hi Gergana,

Could you please look at my post: http://software.intel.com/en-us/forums/topic/365457

Thank you.

Benqiang

Quote:

Gergana Slavova (Intel) wrote: (full reply quoted above)
