Bug in Intel MPI 4.1.0.024 with slurm-2.5.4

It looks like there is a bug in the way Intel MPI interacts with SLURM.  I had the following hostlist in SLURM_JOB_NODELIST:

itc[011-012,021,101]

Other MPI implementations, such as OpenMPI, have had no problems interpreting this.  Intel MPI, however, tried to find itc017 when it used that node list.  That isn't even a valid hostname, let alone one in that hostlist.

I wrote a script to bypass this and generate the correct host list and explicitly pass it to Intel MPI.  However, it would be better to fix this inside of Intel MPI itself.
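
A workaround of this kind could look roughly like the bash sketch below. It is not the script mentioned above; expand_hostlist is a hypothetical helper, and it only handles a single bracket group such as itc[011-012,021,101], not the full SLURM hostlist syntax.

```shell
# Hypothetical helper: expand a simple hostlist such as "itc[011-012,021,101]"
# into one hostname per line. Handles one bracket group only.
expand_hostlist() {
    local list=$1
    # No bracket group: the list is already a plain hostname.
    if [[ $list != *"["*"]" ]]; then
        printf '%s\n' "$list"
        return
    fi
    local prefix=${list%%\[*}   # text before the first "["
    local body=${list#*\[}      # text after the first "["
    body=${body%\]}             # drop the trailing "]"
    local part lo hi width i
    local IFS=,
    for part in $body; do
        lo=${part%-*}           # range start (or the single number)
        hi=${part#*-}           # range end (or the single number)
        width=${#lo}            # keep the original zero padding
        # 10# forces base 10, so "021" is 21 rather than octal 17.
        for ((i = 10#$lo; i <= 10#$hi; i++)); do
            printf '%s%0*d\n' "$prefix" "$width" "$i"
        done
    done
}

expand_hostlist 'itc[011-012,021,101]'
```

On a system where SLURM is installed, `scontrol show hostnames "$SLURM_JOB_NODELIST"` produces the same kind of one-host-per-line list and covers the full hostlist syntax, so it is probably the better choice in a real job script.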

Hi Paul,

Thank you for this report.  Can you please run a test program (the provided MPI test programs will work perfectly) with I_MPI_DEBUG=5 and send the output?

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools

Sure, here is the output.  The host list was: itc[011-012,092,101]

/n/sw/intel_cluster_studio-2013/impi-4.1.0.024/bin64/mpirun: line 262: printf: 092: invalid octal number
srun: error: Unable to create job step: Requested node configuration is not available
[mpiexec@itc011.rc.fas.harvard.edu] HYD_pmcd_pmiserv_send_signal (./pm/pmiserv/pmiserv_cb.c:221): assert (!closed) failed
[mpiexec@itc011.rc.fas.harvard.edu] ui_cmd_cb (./pm/pmiserv/pmiserv_pmci.c:128): unable to send SIGUSR1 downstream
[mpiexec@itc011.rc.fas.harvard.edu] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[mpiexec@itc011.rc.fas.harvard.edu] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:388): error waiting for event
[mpiexec@itc011.rc.fas.harvard.edu] main (./ui/mpich/mpiexec.c:718): process manager error waiting for completion

Hi,

This is a bug in mpirun (6000024691). The printf command is in the mpirun script, where host numbers beginning with 0 are interpreted as octal.

Lines 262 and 626 of the mpirun script:

${base_name}%0${first_node_length}d" ${host}`

should be replaced by

${base_name}%0${first_node_length}d" ${host#0}`

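The ${host#0} expansion removes a leading '0' from the host number, so that it is no longer interpreted as octal. Note that it removes only one zero, so a number with two or more leading zeros would still be read as octal.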
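
The octal behaviour and the effect of the fix can be reproduced outside mpirun with a few lines of bash. This is only an illustrative sketch: pad_node is a hypothetical helper, not something in the mpirun script, and the 10# prefix assumes the script runs under bash.

```shell
# Hypothetical helper: zero-pad a node number to a given width.
pad_node() {
    local width=$1 host=$2
    # 10# forces base 10, so any number of leading zeros is safe.
    printf '%0*d\n' "$width" "$((10#$host))"
}

host=021
printf '%03d\n' "$host"        # 021 is read as octal (decimal 17): prints 017
printf '%03d\n' "${host#0}"    # one leading zero stripped: prints 021

pad_node 3 021                 # prints 021
pad_node 3 092                 # prints 092
pad_node 3 008                 # prints 008 (two leading zeros are handled too)
```

The first printf shows where itc017 came from in the original report: 021 is a valid octal number, so it is silently converted, whereas 092 is not valid octal and produces the error in the log above.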

Regards,

Bruno

Please try using Version 4.1 Update 3.  This problem should be corrected.

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools
