Problem running MPI on two nodes: host and Xeon Phi

Hello,

I am having trouble running a simple hello world test program on two nodes. I was hoping someone would be able to help.

OS: CentOS 6

Here is the error:

[phi@localhost ~]$ mpirun -n 2 -host mic0 -iface mic0 ./hello.MIC : -n 2 -host localhost ./hello.XEON
CMA: unable to get RDMA device list
librdmacm: couldn't read ABI version.
librdmacm: assuming: 4
[... the two messages above repeat dozens of times, interleaved ...]
Hello World from rank 2 running on localhost.localdomain!
Hello World from rank 3 running on localhost.localdomain!
Hello World from rank 0 running on mic0.local!
MPI World size = 4 processes
Hello World from rank 1 running on mic0.local!

The code still runs, but it takes a long time, and the more processes I use, the worse it becomes (obviously).

When I run the code on either the host or mic0 alone, everything works fine.

If anyone has any idea how to fix this, please help me out.

Thanks,

Charlie


I had the same issue today. I managed to remove the warning by using:

$ export I_MPI_FABRICS=shm:tcp

or directly using

$ mpirun -machinefile hosts -env I_MPI_FABRICS shm:tcp <executable>

Not sure it's the proper way to fix the problem. Other suggestions are most welcome.

Referring to https://software.intel.com/sites/products/documentation/hpc/ics/impi/41/lin/Reference_Manual/Communication_Fabrics_Control.htm, I would guess that this warning is due to the absence of the file /etc/dat.conf on my system.

I did try other values for I_MPI_FABRICS (shm:dapl, shm:ofa, shm:tmi), but, as I expected, these are not available on my (single-node) system.
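Applied to the command from the first post, the workaround would look something like this (a sketch, assuming Intel MPI and the same mic0/hello.MIC/hello.XEON setup described above):

```shell
# Force shared memory within a node and TCP between host and coprocessor,
# instead of letting Intel MPI probe for (absent) DAPL/RDMA devices.
export I_MPI_FABRICS=shm:tcp

# Same heterogeneous launch as in the first post, now with the fabric pinned.
mpirun -n 2 -host mic0 -iface mic0 ./hello.MIC : -n 2 -host localhost ./hello.XEON
```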

Hope it works for you.

 

Malek

Thank you, I will definitely try it out and leave a comment afterwards.

Charlie

Charlie,

Did anything come of this?

Regards
--
Taylor
 

Hey Taylor,

Sorry for the late response. This issue is still unresolved. My thinking is that these warnings are a red herring, and the real problem lies elsewhere.

I am trying to run MPI on the host and the MIC at the same time. Sure, I get these warnings, but the NPB benchmarks I am testing still run with up to 8 processes. Something happens between 8 and 16 processes that I don't quite understand, and the job refuses to execute.

I am looking at ways to debug this issue, and I was going to post another topic once I know what kind of question I need to ask.
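One thing I plan to try first (assuming Intel MPI here; the benchmark names are placeholders, not the actual binaries) is rerunning the failing case with Intel MPI's I_MPI_DEBUG output enabled, which prints which fabric each rank actually selected and how ranks are pinned:

```shell
# Rerun the failing 16-process case with Intel MPI debug output.
# I_MPI_DEBUG=5 reports the fabric chosen per rank and the process
# pinning, which should narrow down where the launch gets stuck.
export I_MPI_DEBUG=5
mpirun -n 8 -host mic0 ./<benchmark>.MIC : -n 8 -host localhost ./<benchmark>.XEON
```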

Thank you,

Charlie

Charlie,

Posting another topic is a great idea. It allows the community to more easily find and follow the issue.

Thanks
--
Taylor
 
