Cannot ssh into Phi Coprocessor

Whenever I run "ssh mic0" I get the error "No route to host" on port 22. I am currently connected over ssh to the host computer that holds the coprocessor I'm trying to program. I can run "service mpss start" and that works. I really need help with this because I've had this problem for a month and I still can't get it to work.

In my experience, an intermittently working coprocessor gave this error, but only with a stretched interpretation of "service mpss start works."  Until it failed completely, that coprocessor required a power-on host reset and mpss start; then it would stay active for a few hours.  Intel requested that it be returned to the repair depot, but there has been no further word.

So, if it appears to be a hardware-related issue, you would file a ticket on your support account, carry out any diagnostics requested by the support engineer, and determine the next step.

Hi,
Use the nmap utility on the IP address of the mic to determine whether the ssh service (port 22) is loaded correctly.
It is difficult to help without having this card in my hands (there are several possible network problems).

I think the address of the mic is probably 172.31.1.1, so for the mic run
nmap 172.31.1.1
and wait for its answer.
Normally it should list port 22 among the other open ports.
If there is no answer, use ping to check ICMP (maybe the mic uses another address, or gets one from a DHCP range?):
ping 172.31.1.1
Regards
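
If nmap is not installed on the host, a rough equivalent of the port check can be sketched with bash's built-in /dev/tcp redirection. This is only a sketch: it swaps nmap for /dev/tcp, the function names port_open and report are made up for illustration, and 172.31.1.1 is the guessed address from the post above, not a confirmed one.

```shell
#!/usr/bin/env bash
# Sketch: test whether a TCP port answers, using bash's /dev/tcp instead of nmap.
# port_open and report are illustrative names, not part of MPSS.

# port_open HOST PORT [TIMEOUT_SECONDS]
# Returns 0 if a TCP connection could be opened, non-zero otherwise.
port_open() {
    local host=$1 port=$2 wait=${3:-3}
    timeout "$wait" bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null
}

# report HOST PORT: print a one-line verdict for the given host and port.
report() {
    if port_open "$1" "$2"; then
        echo "$1 port $2: open"
    else
        echo "$1 port $2: closed or unreachable"
    fi
}
```

For example, "report 172.31.1.1 22" checks the ssh port on the assumed card address. A "closed or unreachable" verdict does not distinguish between a card with no address and a card whose sshd is not running, so ping is still worth trying.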

Hi there,

I have the same issue here: CentOS 6.4 with an updated kernel. Installation went smoothly (I had to compile the kernel part from sources) and everything seems to work OK. I activated the mpss service with chkconfig and it seems to run fine. But after each host reboot, I cannot ssh to the mics; I get no response at all. The only way to make things work is to stop mpss, run micctrl --resetconfig and start mpss again. Then I can successfully make ssh connections to the mic cards.

I attach the micdebug.sh output as requested on the workflow posted on one of the sticky posts.

Thanks in advance,

Jose
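
The manual workaround Jose describes (stop, reset the configuration, start) can be sketched as a small shell function. The commands are the ones named in the post; the helper names run and reset_mpss and the DRY_RUN switch are made up here for illustration, and the sketch assumes it is run as root on the host.

```shell
# Sketch of the manual workaround: stop MPSS, reset the configuration, restart.
# Set DRY_RUN=1 to print the commands instead of running them.

# run CMD ARGS...: execute the command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# reset_mpss: the three steps in order, stopping at the first failure.
reset_mpss() {
    run service mpss stop &&
    run micctrl --resetconfig &&
    run service mpss start
}
```

Running "DRY_RUN=1 reset_mpss" first shows what would be executed without touching the service.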

Hi again,

I have seen this on the messages log file:

Jul 14 15:25:29 localhost NetworkManager[2156]: <warn> /sys/devices/virtual/net/mic0: couldn't determine device driver; ignoring...
Jul 14 15:25:29 localhost NetworkManager[2156]: <warn> /sys/devices/virtual/net/mic1: couldn't determine device driver; ignoring...
Jul 14 15:25:29 localhost NetworkManager[2156]: <warn> /sys/devices/virtual/net/mic0: couldn't determine device driver; ignoring...
Jul 14 15:25:29 localhost NetworkManager[2156]: <warn> /sys/devices/virtual/net/mic1: couldn't determine device driver; ignoring...

Any hints?

Let's back up a bit

Cameron - If I am reading your problem correctly, you have never been able to ssh to the coprocessor (not after a reboot of the host, not a case of it working for a while and then quitting). Is this the case? Could you send more information: what OS are you using on the host? Is it one of the default supported ones? (And if not, did you rebuild the mic.ko kernel module before using it?) Did you go through the micctrl --initdefaults and micctrl --resetconfig sequence from the MPSS installation instructions? (This should be done even when it is not the first install.) Can you run jobs with the offload programming model?

Jose, your problem seems different. You should not need to rerun the --resetconfig each time. I haven't had a chance to look at the tar file you sent yet, but the log message looks like the kernel module isn't loading. From what you wrote, I think you are saying that you have set the mpss service to restart each time the host reboots. Is this correct? The next time this happens could you check to see if the module loaded and if it didn't, try going through the steps in http://software.intel.com/en-us/forums/topic/393956 to try to determine if it might be a hardware problem, as Tim suspects? Given that your card appeared to be working before the MPSS update, hopefully not.
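
As a rough way to do the "check to see if the module loaded" step, the sketch below looks for the mic module in /proc/modules, which is the file lsmod reads. The function name module_loaded is made up for illustration; the optional second argument only exists so the check can be tried against a saved copy of the file.

```shell
# module_loaded NAME [FILE]: succeed if kernel module NAME is listed in FILE
# (default /proc/modules, the file that lsmod reads).
# The trailing space in the pattern avoids matching longer module names.
module_loaded() {
    local name=$1 file=${2:-/proc/modules}
    grep -q "^$name " "$file" 2>/dev/null
}
```

For example: module_loaded mic && echo "mic module loaded" || echo "mic module NOT loaded". A loaded module only shows that the driver is present; it says nothing about whether the mic network interfaces were configured.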

Thanks a lot Frances,

I have followed the chart but no luck, everything seems to be right. Yes I did configure the service to be loaded on startup and everything seems OK. I have checked if the module is loaded and it seems so:

[jrbcast@localhost ~]$ lsmod | grep mic
mic                   581919  4

But if I run ifconfig -a, the mic interfaces do not have any IP assigned. After stopping the service, running micctrl --resetconfig and starting mpss again, they are correctly assigned IPs and I can ssh to the cards. By the way, this is the first time I have installed the system, so I don't know whether the cards worked before, as your last sentence seems to assume.

Any clues? If not, I will create a cron task and do everything there...

Cheers,

Jose
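
The "mic interfaces have no IP" symptom can be checked with a short script instead of reading ifconfig -a by eye. This is a minimal sketch: has_ipv4 and check_mic_ifaces are made-up names, and the pattern assumes ifconfig prints IPv4 addresses either as "inet addr:a.b.c.d" (older net-tools) or "inet a.b.c.d" (newer net-tools).

```shell
# has_ipv4: read the output of "ifconfig IFACE" on stdin and succeed if it
# contains an IPv4 address line. IPv6 ("inet6 ...") lines do not match.
has_ipv4() {
    grep -Eq 'inet (addr:)?[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+'
}

# check_mic_ifaces IFACE...: print one status line per interface name.
check_mic_ifaces() {
    local ifc
    for ifc in "$@"; do
        if ifconfig "$ifc" 2>/dev/null | has_ipv4; then
            echo "$ifc: has an IPv4 address"
        else
            echo "$ifc: no IPv4 address"
        fi
    done
}
```

For example, "check_mic_ifaces mic0 mic1" right after a host reboot would show whether the interfaces came up unconfigured.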

Well, it seems I somehow solved my problem. I realized that the only problem was that the virtual network interfaces were not being assigned their configuration. So I added two lines to the mpss service:

ifup mic0

ifup mic1

And everything seems to be running fine now.

Hope this helps to find the "real" issue.

Regards,

Jose.
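
Jose does not say where exactly in the mpss service the two lines were added. One possible reading, sketched below, is to run ifup for each mic interface right after the service has started. The function name bring_up_mics and the DRY_RUN switch are made up for illustration, and whether the lines belong inside the init script or in a separate wrapper is an assumption.

```shell
# Sketch: bring up the virtual mic interfaces after MPSS has started.
# Set DRY_RUN=1 to print the commands instead of running them.

# bring_up_mics IFACE...: run ifup for each interface name given.
bring_up_mics() {
    local ifc
    for ifc in "$@"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "ifup $ifc"
        else
            ifup "$ifc"
        fi
    done
}
```

Used as "service mpss start && bring_up_mics mic0 mic1", this keeps the workaround outside the packaged init script, so an MPSS update would not overwrite it.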

Dear Jose,

I am having the same issue, also with CentOS 6.4.

Can I ask a stupid question:

how do you "add" those lines to the mpss service?

Thanks

Unfortunately I have let this issue hang around for a while. Let's see if we can't get Jofre running without a workaround.

You say you are running CentOS 6.4. What version of MPSS are you running?

Hi Frances,

I thought I would get an alert if somebody bothered to answer, so sorry for not replying to your post sooner.  I have been busy with other projects, but I am back trying to get my Xeon Phi up and running. When I left it, I was trying to make a mount point for the coprocessor, so it has its "hard drive", so to speak. The Xeon Phi is far from easy to set up...

Sorry for the blah, blah. To answer your question: I installed MPSS 3.1.

Jofre

Hi Jofre,

Being the end of the year, things are a little chaotic here (as everywhere).

I'm contacting Frances and we'll get back to you.

Regards
--
Taylor
 

Hi Jofre,

      "I was trying to make a mount point for the coprocessor, so it has its "hard drive""

Please help me understand. The OS on the host and the coprocessor is standard Linux. Mounting an NFS directory tree should be no different than on any other Linux system.

Regards
--
Taylor
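
For reference, a generic Linux NFS mount of the kind Taylor describes might look like the sketch below. Everything specific in it is an assumption: the function name mount_nfs, the DRY_RUN switch, and any server name or paths passed to it are hypothetical, and the directory must already be exported by the NFS server (for example via /etc/exports on the host).

```shell
# Sketch: mount an NFS export on a mount point, as on any Linux system.
# Set DRY_RUN=1 to print the commands instead of running them.

# mount_nfs SERVER EXPORT_DIR MOUNTPOINT
mount_nfs() {
    local server=$1 export_dir=$2 mountpoint=$3
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "mkdir -p $mountpoint"
        echo "mount -t nfs $server:$export_dir $mountpoint"
    else
        mkdir -p "$mountpoint" &&
        mount -t nfs "$server:$export_dir" "$mountpoint"
    fi
}
```

Run on the coprocessor as root, a call such as "mount_nfs host /srv/micshare /mnt/micshare" (hypothetical names) would attach the host directory as the card's "hard drive".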
 

I am also experiencing these same symptoms - but with a slight difference:

I have two Xeon Phis in each of five compute nodes, part of an HPC cluster.  What we do is fully install and integrate MPSS on node 1 and then use a provisioning tool (Cluster Management Utility) to back up node 1 and provision the rest of the servers.

I have node1 working perfectly, and it survives reboots as well:  I can ssh into mic0 and mic1 without problems and the cores are fully available.  HOWEVER, after cloning from node1 to nodes 2 - 5, those mics are not available until *after* I run the following command:

micctrl --resetconfig

I'm confused about this behavior, because all IPs on all five systems are exactly the same, and all mic devices on each system also have the same hostname, but something in that command "fixes" whatever breaks during the cloning process.

Given that these coprocessors are used heavily in the HPC world, it's a safe bet that they should be expected to work even if the OS is cloned from another system.

More background:  yes, all five systems were flashed with MPSS firmware individually and all internals are working fine.  But it is a pain to have to reconfigure these mics every time we provision our cluster.  I don't have to do this at all with Nvidia drivers and CUDA.
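
Until the underlying cause is found, the per-node fix could be scripted as a post-provisioning step. The sketch below is only an illustration: the node names, the root login, the function name reset_nodes and the DRY_RUN switch are all assumptions, and stopping and restarting mpss around micctrl --resetconfig follows the sequence described earlier in this thread rather than anything confirmed for cloned nodes.

```shell
# Sketch: run the MPSS reset sequence on each freshly cloned node over ssh.
# Set DRY_RUN=1 to print the ssh commands instead of running them.

# reset_nodes NODE...: apply the reset to every node name given.
reset_nodes() {
    local node cmd
    cmd='service mpss stop; micctrl --resetconfig; service mpss start'
    for node in "$@"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "ssh root@$node '$cmd'"
        else
            ssh "root@$node" "$cmd"
        fi
    done
}
```

For example, "reset_nodes node2 node3 node4 node5" (hypothetical host names) could be called from the provisioning tool's post-install hook, if it has one.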
