Intel MPI with JMI and Slurm: Requested node configuration is not available

Intel MPI version 4.1.3, Slurm version 2.6.9-1

I am trying to follow the Intel MPI documentation to run a job under Slurm with -bootstrap jmi, but I get the error below:

 

salloc -N 1, then:

export I_MPI_HYDRA_JMI_LIBRARY=/opt/intel/impi/4.1.3/lib/intel64/lib/libjmi_slurm.so

mpiexec.hydra -bootstrap slurm -n 2 hostname ## << this works

mpiexec.hydra -bootstrap jmi -n 2 hostname ## << this does not work

srun: error: Unable to create job step: Requested node configuration is not available

srun: error: Unable to create job step: Requested node configuration is not available

If I look at the Slurm logs, it is trying to get a node assignment for the FQDN of the node, even though I only use short names in slurm.conf. Not sure if this has anything to do with the JMI/Slurm interaction.

If I use

I_MPI_PMI_LIBRARY=/opt/slurm/14.03.1-2/lib64/libpmi.so

srun -n 2 mympiprog

it works too.


Can you provide the output from the -bootstrap jmi version with I_MPI_HYDRA_DEBUG=1?
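
For reference, producing that output just means rerunning the failing case with the debug variable set, e.g. (same allocation and environment as in the first post):

export I_MPI_HYDRA_DEBUG=1
mpiexec.hydra -bootstrap jmi -n 2 hostname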

Hi James

 

Attaching the debug output; the key part seems to be the node request by FQDN:

 

[jmi-slurm@builder] Launch arguments: srun --nodelist builder.hpc8888.com -N 1 -n 1 ./hello

- requesting by FQDN

- error message is:

srun: error: Unable to create job step: Requested node configuration is not available

Slurm log

[2014-05-08T21:29:02.760] sched: job_complete for JobId=133 successful, exit code=0
[2014-05-08T21:29:09.647] sched: _slurm_rpc_allocate_resources JobId=135 NodeList=builder,ruchba usec=13039
[2014-05-08T21:29:31.848] sched: _slurm_rpc_job_step_create: StepId=135.0 builder,ruchba usec=7566
[2014-05-08T21:29:31.974] sched: _slurm_rpc_step_complete StepId=135.0 usec=11970
[2014-05-08T21:29:47.668] sched: _slurm_rpc_job_step_create: StepId=135.1 builder,ruchba usec=14242
[2014-05-08T21:29:47.750] sched: _slurm_rpc_step_complete StepId=135.1 usec=13498
[2014-05-08T21:29:52.374] sched: _slurm_rpc_job_step_create: StepId=135.2 builder,ruchba usec=15330
[2014-05-08T21:29:52.416] error: find_node_record: lookup failure for builder.hpc8888.com
[2014-05-08T21:29:52.416] error: node_name2bitmap: invalid node specified builder.hpc8888.com
[2014-05-08T21:29:52.416] _pick_step_nodes: invalid node list builder.hpc8888.com
[2014-05-08T21:29:52.416] _slurm_rpc_job_step_create for job 135: Requested node configuration is not available
[2014-05-08T21:29:52.419] error: find_node_record: lookup failure for builder.hpc8888.com
[2014-05-08T21:29:52.419] error: node_name2bitmap: invalid node specified builder.hpc8888.com
[2014-05-08T21:29:52.419] _pick_step_nodes: invalid node list builder.hpc8888.com
[2014-05-08T21:29:52.419] _slurm_rpc_job_step_create for job 135: Requested node configuration is not available
[2014-05-08T21:29:52.446] sched: _slurm_rpc_step_complete StepId=135.2 usec=11696
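
One way to narrow this down would be to launch the same step by hand inside the allocation; going by the log above, the short name should be accepted while the FQDN form should reproduce the lookup failure (a hypothetical check, not taken from the original posts):

srun --nodelist builder -N 1 -n 1 hostname                ## short name as in slurm.conf; expected to work
srun --nodelist builder.hpc8888.com -N 1 -n 1 hostname    ## FQDN; expected to fail with the same error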

 

 

Attachment: hydradebug.txt (21.19 KB)

If you are logged in on the nodes, what does hostname return?  If it returns the FQDN, can you change it to return only the short name?
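
A quick way to compare the two forms on each node (standard Linux commands, not from the original posts):

hostname       ## short name expected here, e.g. builder
hostname -f    ## fully qualified name, e.g. builder.hpc8888.com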

Both nodes return the short name using hostname.

slurm.conf refers to the nodes using short names as well.

Can you try this with the Intel® MPI Library 5.0 Beta?  If you're not already registered, go to http://bit.ly/sw-dev-tools-2015-beta for details.

Updated to Slurm 14.03.3-2 and Intel Cluster Studio XE beta; IMPI is v5.0.0.016.

Unfortunately, I get exactly the same error messages with JMI.

srun is being invoked with the FQDN of the node, and Slurm responds with "invalid node specified".

[jmi-slurm@builder] Launch arguments: srun --nodelist builder.hpc8888.com -N 1 -n 1 ./hello

[jmi-slurm@ruchba] Launch arguments: srun --nodelist ruchba.hpc8888.com -N 1 -n 1 ./hello

JMI chooses to use the FQDN form, yet Slurm's allocation shows the short names:

SLURM_JOB_NODELIST=builder,ruchba
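
For completeness, the names Slurm itself knows can be cross-checked from inside the allocation (standard Slurm commands, not part of the original posts):

echo $SLURM_JOB_NODELIST                      ## builder,ruchba
scontrol show hostnames $SLURM_JOB_NODELIST   ## expands the list to one short name per line
sinfo -N -h -o '%N'                           ## node names as defined in slurm.conf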

The other two methods of invocation still work, i.e.,

I_MPI_PMI_LIBRARY=/opt/slurm/slurm/lib64/libpmi.so srun -n 4 hello

mpiexec.hydra -bootstrap slurm -n 4 ./hello

 

 

We're submitting this to our developers for further investigation.

Best Reply

I have been informed by support that this will be fixed in Intel MPI 5.0.2.
