MIC internal bridge not working now

Hi All,

I recently made several changes to a small workstation running four MIC (Xeon Phi) cards. My use is largely MPI, so I need all cards to be able to communicate with one another. This was working fine before, but now when I try to execute test code, I run into the following problem.

Simple example, just trying to run MPI test code on a single MIC card:

mpirun -n 60 -hosts mic0 ./testMPI+openMP

which generates the following error:
[proxy:0:0@Axial-mic0.localdomain] HYDU_sock_connect (./utils/sock/sock.c:264): unable to connect from "Axial-mic0.localdomain" to "10.50.6.239" (Network is unreachable)
[proxy:0:0@Axial-mic0.localdomain] main (./pm/pmiserv/pmip.c:396): unable to connect to server 10.50.6.239 at port 50415 (check for firewalls!)

The internal bridge (br0) is on 10.10.10.x and the host is at 10.10.10.254. I can ssh to the cards from the host, and from the cards back to 10.10.10.254, but the cards cannot connect to the host's eth0 IP (10.50.6.239). What I don't understand is why mic0 is trying to connect to 10.50.6.239 instead of 10.10.10.254.
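In case it helps, here is roughly how I have been checking things from the card (assuming the usual BusyBox networking applets are present on the coprocessor image):

# log into the first card from the host
ssh mic0
# show the card's interfaces and routing table
ifconfig
route -n
# the bridge address answers...
ping -c 1 10.10.10.254
# ...but the host's eth0 address does not
ping -c 1 10.50.6.239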

Any ideas?

Thanks!


Here is some follow-up info on the above problem (maybe relevant). When I reset MPSS to defaults using:

micctrl --resetdefaults

I get the warnings:

[Warning] mic0: Generating compatibility network config file /opt/intel/mic/filesystem/mic0/etc/sysconfig/network/ifcfg-mic0 for IDB.
[Warning]       This may be problamatic at best and will be removed in a future release, Check with the IDB release.

I had been running MPSS 3.2.1, which I uninstalled and replaced with 3.2.3. I don't recall getting these warnings before. Since my issues are network related, perhaps this is useful info.

Thanks.

The warning message is a red herring (a misleading clue) - or at least it should be. It means that micctrl is still creating an old-style ifcfg-mic0 file under /opt/intel/mic/filesystem for use by IDB (the Intel® Debugger). The real interface file - the only one that is copied over to the coprocessor - is /var/mpss/mic0/etc/network/interfaces. The network interface described by the two files should be the same, just in different formats. Of course, when things are going wrong, it never hurts to check just to be sure.
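A quick way to eyeball both files from the host (paths as on a default MPSS 3.x install):

# old-style file generated for IDB compatibility
cat /opt/intel/mic/filesystem/mic0/etc/sysconfig/network/ifcfg-mic0
# the file that is actually copied onto the coprocessor
cat /var/mpss/mic0/etc/network/interfaces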

As for MPI trying to use the wrong interface to reach the coprocessor: did you set the environment variable I_MPI_MIC=on? Since you had been running MPI jobs before you set up your bridge, you are probably already doing this, but I thought I would ask.
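For example, a minimal host-side setup before launching would look something like this (if I remember right, I_MPI_MIC accepts enable/yes/on/1 interchangeably; the run line reuses the one from your post):

# tell Intel MPI that MIC targets will take part in the job
export I_MPI_MIC=on
mpirun -n 60 -hosts mic0 ./testMPI+openMP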

Hi Frances,

Yes, it is set as I_MPI_MIC=enable. I haven't had a chance to look at the interface files yet, but that's the next thing I'll check (as soon as I fix some other problems I've caused with ssh :)

Thanks for your input!  If you have any other ideas please let me know,

-joe

Hi Frances,

You were correct! The problem was in the interfaces file, which for some reason was generated as:

# /etc/network/interfaces -- configuration file for ifup(8), ifdown(8)

# The loopback interface
auto lo
iface lo inet loopback

# MIC virtual interface
auto mic0
iface mic0 inet static
    address 10.10.10.1
    gateway  (null)
    netmask 255.255.255.0

I changed the gateway to 10.10.10.254 for each card, and that solved the above problem. Thanks so much for your help!
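For reference, the fixed stanza for mic0 now reads as follows (the other cards differ only in the address line):

# MIC virtual interface
auto mic0
iface mic0 inet static
    address 10.10.10.1
    gateway 10.10.10.254
    netmask 255.255.255.0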

-joe

Argh, I thought this had fixed the problem. The above correction did allow me to run MPI code on any single MIC card, but when I try to run MPI code on the CPU and a MIC card, or on multiple MIC cards, I get the following errors (repeated many times):

CMA: unable to get RDMA device list
librdmacm: couldn't read ABI version.
librdmacm: assuming: 4

I'm guessing this is a different problem. Interestingly, after generating this long list of errors (its length depends on the number of CPUs requested), the code does seem to run (though I'm not sure it runs correctly).

I've found some help online, but I'm not sure yet what the provider list in the I_MPI_DAPL_PROVIDER_LIST variable should be.

Any thoughts?

Found the answer here: https://software.intel.com/en-us/forums/topic/516287

Just need to set: export I_MPI_FABRICS=shm:tcp

Or I suppose I could set up something similar in /etc/dat.conf (I think).
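For the record, the working invocation now looks something like this (shm:tcp uses shared memory within a node and TCP between host and cards, which bypasses the DAPL/RDMA path; I'm assuming my host name is Axial, going by the Axial-mic0 name in the logs above):

export I_MPI_MIC=enable
export I_MPI_FABRICS=shm:tcp
# host plus two cards; adjust -n as needed
mpirun -n 16 -hosts Axial,mic0,mic1 ./testMPI+openMP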

Finally back up....
