weird error of Xeon Phi cards on running IMB

Hi,

I am using the Intel MPI Benchmarks (IMB) to evaluate my Xeon Phi cards (5110P). In the PingPong test, all of my cards but one work well. That card fails and automatically reboots when the PingPong message size reaches 1G, which crashes the benchmark. Checking the card with micctrl -s, I saw it first become "lost", then "rebooting", and finally "online" again.
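To capture the "lost" / "rebooting" / "online" sequence with timestamps, something like the following sketch could be left running during the benchmark. The helper name log_changes is made up for illustration, and the usage line assumes that micctrl -s mic1 prints one status line per call.

```shell
#!/bin/sh
# log_changes (hypothetical helper): read status lines on stdin and print a
# timestamped line only when the status differs from the previous one.
log_changes() {
    prev="__none__"
    while IFS= read -r line; do
        if [ "$line" != "$prev" ]; then
            printf '%s %s\n' "$(date '+%H:%M:%S')" "$line"
            prev="$line"
        fi
    done
}

# Usage on the host (assumption: one status line per micctrl call):
#   while sleep 1; do micctrl -s mic1; done | log_changes | tee mic1_state.log
```

The resulting log shows when each state change happened, which can then be lined up against the message size IMB was running at that moment.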

Any ideas?

Thanks!


Hi JS,

What MPSS version, Intel compiler version, and Intel MPI libraries are you using? Also, could you attach the output from micinfo command for the co-processor that has problem? Thank you.

In reply to loc-nguyen (Intel):

Thanks for replying. I am using MPSS 3.1.2, icc 14.0.0, impi 4.1.1.036.

The micinfo output is attached. There are two MICs per node; the problematic card is mic1.

Attachment: mic1.log (2.49 KB)
jimdempseyatthecove wrote:

The only differences I observe (other than the serial numbers) are the core voltages (mic0: 1013000 uV, mic1: 949000 uV) and the temperatures (mic0: 61C, mic1: 51C).

You may have a thermal condition.

What happens to the temperatures of each card as you run the test?

What happens, once the temperatures settle, if you run the ping-pong test starting at 1G?
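One way to answer the temperature question is to sample a status command at a fixed interval while the benchmark runs. This is only a sketch: sample is a made-up helper name, and micinfo is used in the usage line because it is the tool already mentioned in this thread; substitute whatever command reports temperatures on your system.

```shell
#!/bin/sh
# sample (hypothetical helper): run a command COUNT times, INTERVAL seconds
# apart, printing a timestamp header before each run.
sample() {
    interval=$1
    count=$2
    shift 2
    i=0
    while [ "$i" -lt "$count" ]; do
        printf '=== %s ===\n' "$(date '+%H:%M:%S')"
        "$@"
        i=$((i + 1))
        if [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
    done
}

# Usage (assumption: micinfo reports the temperatures, as in the attached log):
#   sample 5 120 micinfo > temps_during_pingpong.log
```

Running it in a second terminal while the test runs gives a temperature history for both cards up to the point of failure.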

Jim Dempsey

www.quickthreadprogramming.com

In reply to jimdempseyatthecove:

Thanks for the reply. The temperatures of the cards increase a little as I run the test. The settled temperatures while running the PingPong test are roughly mic0: 65C, mic1: 54C. I observed similar temperatures on other nodes (where the MIC cards behave normally in the PingPong test), so I am not sure this is a thermal condition.

jimdempseyatthecove wrote:

And when you ran the test starting at 1G what happened?

Crash right away, or some time after running?

Can you Ping-Pong between host and each mic individually? Does the problem show up there as well?

I haven't run the status reports, but table 6-23 of the Xeon Phi Data Sheet shows various status bits. One of the MIC utilities should keep a log of errors and/or status, assuming mic1 can report its status before resetting. Of particular interest is P12V_UVOV.

You may have a power supply issue.

Try swapping the power cables between the cards. If the problem shifts to the other card, suspect insufficient capacity on the cables connected to the failing card. What do you have for a video card? Are the video card and the failing MIC on the same set of power cables?

Jim Dempsey


In reply to jimdempseyatthecove:

It fails after the 1G test has run for a while. The PingPong test between the host and the problematic MIC fails in a similar fashion.

I tried swapping the power cables, but the problem stayed with mic1. According to micsmc-gui, the memory usage of mic1 does not seem able to go beyond 2000 MB. The power cable is used exclusively by the MIC.

jimdempseyatthecove wrote:

The next step is to power off the system, remove MIC1, and reinstall it. On my system box and motherboard, I noticed that seating the lower-mounted MIC was problematic; you may be seeing a similar issue. After reseating, verify that the PCIe latch/lock fully closes, and examine the seating of the MIC in the PCIe slot on the non-latch side too. (I have dual 5120Ps installed in an ASUS P9X79-WS.)

I had an additional issue: the motherboard was mounted slightly too deep in the case. This caused the ear bracket with the screw slot to hold the card slightly short of being fully seated in the slot. To fix this, I added a washer under the standoffs on the bottom of the board. I have a tower, so the "bottom" is the side away from the CPU.

Note: if some of the PCIe contacts are not making contact (but most are), you may get addressing errors while everything else seems OK.

Jim Dempsey


In reply to jimdempseyatthecove:

Thanks for the continued help! I re-plugged the cards and even swapped them. The problematic card (now detected as mic0) still failed, so I no longer think it is a PCIe seating issue.

jimdempseyatthecove wrote:

I am sorry we haven't resolved your issue. Have you run the micras service? I haven't used it myself, so you will have to read the manual. BTW, the Intel Xeon Phi software configuration user's guide lists both micras and micrasd; I do not know if this is a typo, two utilities, or a name change. Combined with micras/micrasd is crashmgr.

Running micras, or micras plus crashmgr, while running the Ping-Pong test that fails may yield some insight into what is happening inside the failing MIC.

Additional note:
micras (micrasd) has a maintenance mode. The user guide has skimpy documentation on what this does and how to use it; it says only that this forces the card into maintenance test and repair mode. Before you run it, I suggest you collect what reports you can and search the intel.com site for additional information on micras/micrasd and the maint option.

Jim Dempsey

Frances Roth (Intel) wrote:

micras is the name of the service; micrasd is the name of the daemon started by the service. So the command:

service micras start

starts the daemon logging messages to /var/log/micras.log on the host. When started like this, the daemon starts in maintenance mode. To start the daemon without maintenance mode, you will need to start it by hand. I'm afraid I don't know much about maintenance mode. There is additional information on the RAS system in section 3.3 of the Intel® Xeon Phi™ Coprocessor System Software Developers Guide https://software.intel.com/en-us/articles/intel-xeon-phi-coprocessor-sys.... It's a very long section, I'm afraid. 
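To see only what micras logs during a failing run, one option is to record the length of the log beforehand and print whatever was appended afterwards. This is a sketch; lines_in and new_lines are made-up helper names, and only the log path comes from the description above.

```shell
#!/bin/sh
# lines_in (hypothetical helper): number of lines currently in a file.
lines_in() {
    wc -l < "$1" | tr -d ' '
}

# new_lines (hypothetical helper): print the lines of file $1 that come
# after the first $2 lines, i.e. everything appended since the count was taken.
new_lines() {
    tail -n "+$(( $2 + 1 ))" "$1"
}

# Usage on the host:
#   service micras start
#   before=$(lines_in /var/log/micras.log)
#   ... run the failing PingPong test ...
#   new_lines /var/log/micras.log "$before" > micras_during_test.log
```

This keeps the file you attach to a support request limited to entries from the failing run.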

The lower core voltage on the card that is having problems is kind of suspicious. It would have been really nice if reseating the card had upped the voltage to the same level as the other cards. If looking at the micras log doesn't tell you anything (and even if it does), it may be time to submit the problem to support.

jimdempseyatthecove wrote:

The lower voltage isn't completely suspicious, but it is worth looking into. The Xeon Phi Configuration Users Guide shows a micinfo report with a core voltage of 999000 uV; JS's report showed 949000 uV (~5% lower). I haven't been able to (quickly) locate the operating range. In the Xeon Phi Coprocessor Datasheet, Figure 5-8 shows the coprocessor P-states and Turbo, but the legends of the VID voltage chart are not listed, so it is unclear what the voltage is for state Pn.
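For reference, the percentages can be checked with a one-line calculation. The helper name pct_lower is made up for illustration; the voltages are the ones quoted in this thread.

```shell
#!/bin/sh
# pct_lower (hypothetical helper): how far VAL sits below REF, in percent.
pct_lower() {
    awk -v ref="$1" -v val="$2" 'BEGIN { printf "%.1f\n", (ref - val) / ref * 100 }'
}

pct_lower 999000 949000     # guide example vs. JS's card: prints 5.0
pct_lower 1013000 949000    # the two cards in JS's node:  prints 6.3
```

So the failing card sits about 5% below the guide's example and a little over 6% below its neighbour in the same node.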

Because the Xeon Phi can control the VID voltage (to change states), I would venture to guess that each P-state has a programmable value. I would further venture to guess that micras -maint may have the ability to tune or fix up these set points. I therefore suggest that JS try to find documentation on the maintenance mode before running it. Frances, if you can ask your colleagues for a link to documentation on the micras maintenance mode, please post it here for JS and others to find.

Jim Dempsey


In reply to Frances Roth (Intel):

Thanks for your reply. Reseating MICs didn't bring the voltage of the problematic MIC to a higher level.

In reply to jimdempseyatthecove:

Thanks for the suggestion! I will take a shot at micras.

As Frances previously mentioned, this may be something that ultimately needs to be reported to the OEM who provided you the coprocessors and the system housing them.

Since you are able to reproduce the problem with the problematic card, I suggest collecting the following data:

- As mentioned previously, the micras service may be able to detect potential issues, which would get logged in /var/log/micras.log.

- Before you start your PingPong test, I recommend also capturing the buffer log for the problematic coprocessor, as described in this article: https://software.intel.com/en-us/blogs/2013/06/05/collecting-debug-data-...

To reiterate those steps (assuming the problematic coprocessor is mic1):

echo 0 > /sys/class/mic/scif/watchdog_enabled

Then, use the following steps to show the micro-OS kernel log buffer

Mount debugfs on the host: mount -t debugfs none /sys/kernel/debug

Dump the buffer:

cat /sys/kernel/debug/mic_debug/mic1/log_buf > <some file of your choice> (shows contents of the buffer up until now)

sudo tail -f /sys/kernel/debug/mic_debug/mic1/log_buf | tee -a <some file of your choice> (collects any recent and new data as things run; also outputs contents to STDOUT)

Also, collecting the 'micdebug.sh' output would be useful.
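The log buffer steps above could be wrapped in a small helper that prints the commands for a given coprocessor so they can be reviewed before being run as root. This is a sketch only: collect_cmds and the output file name are made up for illustration, and it assumes the mic1/log_buf form of the debugfs path (the form used in the tail command) is the correct one.

```shell
#!/bin/sh
# collect_cmds (hypothetical helper): print the collection commands for
# coprocessor $1, writing the log buffer to file $2. Nothing is executed here.
# Assumption: the log buffer lives at mic_debug/<mic>/log_buf under debugfs.
collect_cmds() {
    mic=$1
    out=$2
    cat <<EOF
echo 0 > /sys/class/mic/scif/watchdog_enabled
mount -t debugfs none /sys/kernel/debug
cat /sys/kernel/debug/mic_debug/$mic/log_buf > $out
tail -f /sys/kernel/debug/mic_debug/$mic/log_buf | tee -a $out
EOF
}

# Review the commands first:
#   collect_cmds mic1 /tmp/mic1_log_buf.txt
# Then run them as root (the final tail -f keeps running until interrupted):
#   collect_cmds mic1 /tmp/mic1_log_buf.txt | sh
```

Start the PingPong test in another terminal once the tail is running, so the buffer contents leading up to the reset end up in the output file.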

Once you have all of these, please feel free to attach them to this forum thread and we'll look through them for clues about what's happening.

 
