Multiple IntelMPI Jobs On A Single Node using Slurm

David Touati:

I have a single-node cluster with 4 sockets and 32 cores in total. The system runs Red Hat 6.3 with Intel MPI 4 Update 3, and I use Slurm to start MPI jobs. Whenever I run multiple MPI jobs on the node, they all end up on the same processors, and each job uses all of the cores in the node. For example, I started the first MPI job through Slurm on the node with 8 cores and noticed that the first MPI task ran on CPUs 0-3, the second MPI task on CPUs 4-7, and so on, with the last task on CPUs 28-31. Each MPI task used 4 cores instead of 1. I then started a second job with 8 cores and saw the same behavior: it ran on the same 32 CPUs as the first job.

Is there a way to tell mpirun, when launched through Slurm, to set the task affinity correctly on each run so that it uses only the processors that Slurm considers idle?
Thanks.

Tim Prince:

As far as I know, you must set I_MPI_PIN_DOMAIN=off in order for this to work at all. If you can demonstrate the value of receiving CPU assignments from Slurm, you might file a feature request. I don't think splitting it down to the core level for separate jobs is likely to work well; splitting down to the socket level might be useful. You could make a case that clusters with nodes of 4 or more CPUs would be more valuable with such a feature.

If your request turns out to be outside the mainstream, you may have to script it yourself, using KMP_AFFINITY or the OpenMP 4.0 equivalent to assign cores to each job.
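A rough sketch of what that could look like in a Slurm batch script, under stated assumptions: the #SBATCH options, the core range 0-7, and ./my_mpi_app are placeholders, and which cores are actually idle would still have to be looked up from Slurm by hand.

    #!/bin/bash
    #SBATCH --ntasks=8
    # Turn off Intel MPI's own domain pinning so it stops claiming the whole node
    export I_MPI_PIN_DOMAIN=off
    # Confine this job to a core set believed to be idle (here 0-7); taskset
    # sets the affinity mask that each launched rank then inherits
    mpirun -n 8 taskset -c 0-7 ./my_mpi_app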

James Tullos (Intel):

Hi David,

If you are using cpuset, the current version of the Intel® MPI Library does not support it.  The next release will, so if that is the case, just sit tight for a bit longer.

If not, let me know and we'll work from there.

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools

David Touati:

Thanks James,

I am not using cpuset. I assumed that Slurm would take care of that.

James Tullos (Intel):

Hi David,

I misunderstood your original question, so let's change the approach. We do not currently check resource utilization from the job manager. Internally, we typically assume an entire node is dedicated to a single job while it runs, since two different MPI jobs do not communicate with each other.

At present, the only way to do this is manually. You'll need to get the list of available cores from SLURM* (a couple of ways to check are sketched below). Is your application single-threaded or multi-threaded?
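For reference, a hedged sketch of two ways to see which cores Slurm has actually handed to a job; both commands exist in standard installations, but whether they show a restricted list depends on how Slurm's task affinity/cgroup plugins are configured.

    # Detailed job view; with task affinity enabled it lists CPU_IDs per node
    scontrol -d show job $SLURM_JOB_ID | grep -i cpu_ids
    # Kernel's view of the affinity mask the current shell is allowed to use
    grep Cpus_allowed_list /proc/self/status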

If single-threaded, set I_MPI_PIN_PROCESSOR_LIST to the available (and desired) cores, with one rank going to each core; that gives every rank a single core of its own.
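A minimal sketch of the single-threaded case, assuming cores 8-15 turned out to be free for a second 8-rank job; the core numbers and ./my_mpi_app are illustrative only.

    # Pin 8 single-threaded ranks, one per core, onto cores 8-15
    export I_MPI_PIN_PROCESSOR_LIST=8-15
    mpirun -n 8 ./my_mpi_app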

If multi-threaded, set I_MPI_PIN_DOMAIN instead. That gives each rank a group of cores (a domain), and KMP_AFFINITY then controls thread placement within that domain.
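And a hedged sketch of the multi-threaded case, assuming two ranks with four OpenMP threads each; the domain size, thread count, KMP_AFFINITY value, and ./my_hybrid_app are illustrative assumptions.

    # Give each rank a 4-core domain ...
    export I_MPI_PIN_DOMAIN=4
    export OMP_NUM_THREADS=4
    # ... and keep each rank's OpenMP threads packed inside its own domain
    export KMP_AFFINITY=compact
    mpirun -n 2 ./my_hybrid_app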

There are quite a few syntax options for each of these variables, so please check the Reference Manual for full details.

As Tim said, if you're interested, I can file a feature request for this capability.

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools
