MIC reset failed

Hi,

It seems like I can't get my MIC card working. I followed the instructions in the readme to install MPSS on a fresh CentOS 6.3 (which is pretty much the same as RHEL 6.3, I think). The errors I get are not consistent, which makes debugging quite hard, but right now, starting mpss using

# service mpss start

fails with this in /var/log/mpssd:

Mon Mar 25 12:10:48 2013: mic0: log_buf_addr: ffffffff832332d0
Mon Mar 25 12:10:48 2013: mic0: log_buf_len: ffffffff81724c70
Mon Mar 25 12:10:48 2013: mic0: Current state "reset failed" cannot boot card
Mon Mar 25 12:10:50 2013: Wait for download requests

The output of miccheck doesn't look good either:

[root@semperphi ~]# /opt/intel/mic/bin/miccheck

miccheck 2.1.5889-14, created 18:10:54 Feb 28 2013
Copyright 2011-2013 Intel Corporation All rights reserved

Test 1 Ensure installation matches manifest : OK
Test 2 Ensure host driver is loaded : OK
Test 3 Ensure driver matches manifest : OK
Test 4 Detect all listed devices : OK
MIC 0 Test 1 Find the device : OK
MIC 0 Test 2 Check the POST code via PCI : FAILED
MIC 0 Test 2> Current POST code is �� (not FF) for MIC 0
MIC 0 Test 3 Connect to the device : SKIPPED
MIC 0 Test 3> Prerequisite 'Ensure the device is online' failed:
MIC 0 Test 3> The device is not online
MIC 0 Test 4 Check for normal mode : SKIPPED
MIC 0 Test 4> Prerequisite 'Ensure the device is online' failed:
MIC 0 Test 4> The device is not online
MIC 0 Test 5 Check the POST code via SCIF : SKIPPED
MIC 0 Test 5> Prerequisite 'Ensure the device is online' failed:
MIC 0 Test 5> The device is not online
MIC 0 Test 6 Send data to the device : SKIPPED
MIC 0 Test 6> Prerequisite 'Check for normal mode' failed:
MIC 0 Test 6> The device is not in normal mode
MIC 0 Test 7 Compare the PCI configuration : OK
MIC 0 Test 8 Ensure Flash version matches manifest : SKIPPED
MIC 0 Test 8> Prerequisite 'Check for normal mode' failed:
MIC 0 Test 8> The device is not in normal mode
Status: The POST code was not "FF"

The output of micinfo:

[root@semperphi ~]# /opt/intel/mic/bin/micinfo
MicInfo Utility Log

Created Mon Mar 25 12:13:53 2013

System Info
HOST OS : Linux
OS Version : 2.6.32-279.el6.x86_64
Driver Version : 5889-14
MPSS Version : 2.1.5889-14
Host Physical Memory : 16300 MB

Device No: 0, Device Name: mic0

Version
Flash Version : NotAvailable
SMC Boot Loader Version : NotAvailable
uOS Version : NotAvailable
Device Serial Number : NotAvailable

Board
Vendor ID : ffff
Device ID : ffff
Subsystem ID : ffff
Coprocessor Stepping ID : f
PCIe Width : x63
PCIe Speed : Unknown
PCIe Max payload size : 16384 bytes
PCIe Max read req size : 16384 bytes
Coprocessor Model : 0x0f
Coprocessor Model Ext : 0x0f
Coprocessor Type : 0x03
Coprocessor Family : 0x0f
Coprocessor Family Ext : 0x0ff
Coprocessor Stepping : B1
Board SKU : NotAvailable
ECC Mode : NotAvailable
SMC HW Revision : NotAvailable

Cores
Total No of Active Cores : NotAvailable
Voltage : NotAvailable
Frequency : NotAvailable

Thermal
Fan Speed Control : NotAvailable
SMC Firmware Version : NotAvailable
FSC Strap : NotAvailable
Fan RPM : NotAvailable
Fan PWM : NotAvailable
Die Temp : NotAvailable

GDDR
GDDR Vendor : NotAvailable
GDDR Version : NotAvailable
GDDR Density : NotAvailable
GDDR Size : NotAvailable
GDDR Technology : NotAvailable
GDDR Speed : NotAvailable
GDDR Frequency : NotAvailable
GDDR Voltage : NotAvailable

Do you have any ideas what steps I can take to start debugging this ?

Simon

Just a quick thought to start with. Do you have large BAR (Base Address Registers) support enabled in your BIOS? It must be greater than 4 gigabytes.

Good question, I forgot to mention it. I enabled the following option in the PCIe section of the BIOS: "Above 4G Decoding (Available if the system supports 64-bit PCI decoding)". It looks like the same thing said differently.

Could you try rebooting your system? Or power cycling it if you are using any kind of virtual machine? I suspect that your problem will go away.

Sometimes when a new MPSS is installed (particularly if everything isn't cleanly shut down before the old MPSS is deinstalled and the new one installed), it takes a couple of reboots to shake things out. Looking at your information again, I think that this is the best fit to your symptoms.

Not only does 5889 require host reboots at additional points beyond those mentioned in the instructions; after host reboot on my box it is necessary to restart network service even though it is shown as running.

Not sure if the OP got it resolved. I faced a similar issue, and it was due to the mpssflash/mpssinfo apps not being installed properly. Installing them and making sure they were in the path got me beyond this error.

I rebooted the host several times; it doesn't help.

I am not sure how the mpss tools could be installed improperly. I followed the installation procedure to the letter, and they are in /opt/intel/mic/bin, which is in my PATH.

We can submit this to our internal teams for investigation, but we need a couple more things from you:

1) the host kernel log after mpss shows “reset failed”

2) Can you confirm that you have successfully updated the flash/smc of the card during installation?

> 1) the host kernel log after mpss shows “reset failed”

After a failed "service mpss start", relevant entries in the system log:

Apr 3 07:25:58 semperphi kernel: mic0: Transition from state reset failed to resetting
Apr 3 07:26:00 semperphi kernel: mic0: Resetting (Post Code ��)
Apr 3 07:26:00 semperphi kernel: mic0: Transition from state resetting to reset failed
Apr 3 07:26:00 semperphi kernel: MIC 0 RESETFAIL postcode �� -1

I guess the ?? characters that show up are a value that got written directly as binary and not converted to text before getting into the log. In hex, their value is "ef bf bd ef bf bd". Here are the relevant entries in /var/log/mpssd:

Wed Apr 3 07:26:42 2013: MPSS Daemon start
Wed Apr 3 07:26:42 2013: Configuration version 0.4
Wed Apr 3 07:26:42 2013: mic0: Command line: "quiet root=ramfs console=hvc0 highres=off clocksource=micetc micpm=cpufreq_on;corec6_off;pc3_on;pc6_on"
Wed Apr 3 07:26:42 2013: mic0: log_buf_addr: ffffffff832332d0
Wed Apr 3 07:26:42 2013: mic0: log_buf_len: ffffffff81724c70
Wed Apr 3 07:26:42 2013: mic0: Current state "reset failed" cannot boot card
Wed Apr 3 07:26:44 2013: Wait for download requests

> 2) Can you confirm that you have successfully updated the flash/smc of the card during installation?

I did it, although it took me a few tries before getting a "success" message. Before that, I got some errors that I can't remember.

Thanks,

Simon
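A side note on the "ef bf bd" bytes quoted above: that three-byte sequence is the UTF-8 encoding of U+FFFD, the Unicode replacement character. So the pasted text probably no longer holds the raw POST code bytes; somewhere between the driver and the paste, bytes that were not valid text were replaced. Below is a minimal shell sketch for counting such sequences in a saved log. The function name count_replacement_chars is made up for illustration, and it assumes a grep that supports -o.

```shell
# Count occurrences of the UTF-8 replacement character (bytes ef bf bd)
# in the text read from stdin. Hypothetical helper, not part of MPSS.
count_replacement_chars() {
    # Build the three-byte pattern with octal escapes so the script
    # itself stays plain ASCII.
    pat=$(printf '\357\277\275')
    # LC_ALL=C makes grep compare raw bytes; -o prints one match per line.
    LC_ALL=C grep -o "$pat" | wc -l | tr -d ' '
}
```

Usage would be something like "count_replacement_chars < saved_log.txt", where saved_log.txt is whatever file the log lines were saved to. To see the bytes the kernel actually logged, dumping the original log file with a hex viewer such as od is likely more reliable than copying from a terminal.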

Simon, thanks for the info. Another piece needed:

What's your system board BIOS version? At the # prompt, type "dmidecode -s bios-version". (And can you say what kind of system/model you are running?)

The system comes from Supermicro. The board is a Supermicro X9DRG-QF (http://www.supermicro.com/manuals/motherboard/C606_602/MNL-1309.pdf).

[root@semperphi ~]# dmidecode -s bios-version
1.1

Here is the full dmidecode output if it can be useful: http://pastebin.com/vN9r0xgU

Thanks !

Simon

Additionally, it would be helpful if you could collect some data to help us understand what's going on. The MPSS team has offered the attached debug script to collect the necessary info. Could you run it after a reset failure and send us the resulting zip file? (You should be able to attach it to this forum thread.)

(note:  I gave the micdebug script a txt extension so that this forum software would allow me to attach it here )

# chmod +x micdebug.txt

# sudo sh ./micdebug.txt

Attachment: micdebug.txt (text/plain, 1.96 KB)

The system comes from Supermicro. The motherboard is a Supermicro X9DRG-QF (http://www.supermicro.com/manuals/motherboard/C606_602/MNL-1309.pdf).

The results are here: http://nova.polymtl.ca/~simark/micdebug-03042013-105411.zip

Attaching them directly to the post was triggering the spam filter somehow.

Simon

Simon, thank you for all of the information -- team is researching now.   Interestingly, this is the second report of this problem, against the same platform/OEM; now trying to reproduce in our labs.  

Oh, interesting. I look forward to seeing the results.

Hi Simon:  our internal team has not been able to reproduce your issue, so back to asking you lots of questions to see if we can figure this out.   Can you run minicom on /dev/ttyMIC<n> and send the output from that?

Hi Belinda,

I can't seem to connect to /dev/ttyMIC0.

[root@semperphi ~]# minicom /dev/ttyMIC0
minicom: WARNING: configuration file not found, using defaults
Device /dev/modem access failed: No such file or directory.

Am I doing something wrong ?

Simon, seems that you need to configure minicom to make this happen...here is the list of steps (these I tested on a SLES11 system).

     sudo minicom -s
     Go to "Serial Port Setup"

     Choose option: A - Serial Device
     Edit Serial Device to /dev/ttyMIC0
     Hit <Enter> twice

     Go to "Save setup as.."
     When the input prompt 'Give name to save this configuration?' shows up,
         save the <ConfigName> to the name you prefer.  For example: mic0 <Enter>

     Select "Exit from Minicom"

Then, set up a typescript session to capture what comes next:    script /tmp/minicom.out

Start minicom:      minicom mic0

<stuff scrolls on screen>

Terminate the minicom session: <CTRL>-A-X
Terminate the typescript session:   <CTRL>-D

and send that newly created output file

 

Oops, I assumed the usage was simply "minicom <device>". Nope.

Now minicom hangs on "Initializing modem" for ever...

OK, thank you (for trying minicom) - let's try something else:

echo 0 > /sys/class/mic/scif/watchdog_enabled

 Then, use the following steps to show the micro-OS kernel log buffer

Mount debugfs on the host:    mount -t debugfs none /sys/kernel/debug   

Dump the buffer:   sudo cat /sys/kernel/debug/mic_debug/mic0/log_buf > <some file of your choice>

This log appears to be empty.

[root@semperphi mic0]# cat /sys/class/mic/scif/watchdog_enabled
0
[root@semperphi mic0]# mount | grep debugfs
/sys/kernel/debug on /debugfs type debugfs (rw)
[root@semperphi mic0]# pwd
/debugfs/mic_debug/mic0
[root@semperphi mic0]# cat log_buf > /tmp/mic0_log
[root@semperphi mic0]# ls -l /tmp/mic0_log
-rw-r--r-- 1 root root 0 Apr 12 10:18 /tmp/mic0_log

 

this is probably because the card was already in the 'hung' state, right? We need to actually introduce a step 0 and 0.5 to the above instructions, as what we are trying to do here is log what the card says as it reaches the point where it stops responding.

i.e.
restart mpss service
disable watchdog_enabled
mount debugfs filesystem
capture log_buf -- for best results, it might make sense to capture it this way
script /tmp/micbuf
tail -f /sys/kernel/debug/mic_debug/mic0/log_buf

Ctrl-D to terminate the script session
&& send us what went into /tmp/micbuf

In the meantime, I'm probing around to see if there are debug options to ofed and/or relevant daemons that we should also consider turning on to capture the activities/errors leading up to the hang.
THANK YOU for your patience and for being our eyes on this problem
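The capture sequence above could be sketched as a small shell script, shown here under some assumptions. The function names and the default output path are made up for illustration; the commands and sysfs/debugfs paths are the ones quoted in this thread. Since debugfs turned out to be mounted on /debugfs rather than /sys/kernel/debug on Simon's host, the sketch looks up the mount point first instead of assuming it.

```shell
# Print the directory where debugfs is mounted, or nothing if it is not
# mounted. Reads a mounts table in /proc/mounts format.
# Hypothetical helper; the optional argument exists only to ease testing.
debugfs_mountpoint() {
    awk '$3 == "debugfs" { print $2; exit }' "${1:-/proc/mounts}"
}

# Sketch of the capture sequence described in the thread (run as root).
capture_mic_log() {
    card=${1:-mic0}
    out=${2:-/tmp/micbuf}

    service mpss restart
    echo 0 > /sys/class/mic/scif/watchdog_enabled

    # Reuse an existing debugfs mount, otherwise mount it at the usual place.
    dbg=$(debugfs_mountpoint)
    if [ -z "$dbg" ]; then
        dbg=/sys/kernel/debug
        mount -t debugfs none "$dbg" || return 1
    fi

    # Follow the card's kernel log buffer until interrupted with Ctrl-C,
    # keeping a copy in $out.
    tail -f "$dbg/mic_debug/$card/log_buf" | tee "$out"
}
```

Calling "capture_mic_log mic0 /tmp/micbuf" would then stand in for the manual steps. Using tee instead of a typescript session is a substitution made here for simplicity, not something the MPSS team recommended.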

Hmm still no result.

If you prefer, I could find a way to give you access to the system itself and diagnose it directly. Alternatively, if we could have a chat on IRC or something, it could be more efficient.

Thank you for your help !

Hi.

I have the same problem on this system:

dmidecode -s system-product-name
X9DRG-HF

With a heavy load the system reports

HPL[mic-server] : MIC No. 0 exceeds allowed temperature ( 96 degree).
HPL Warning [mic-server] : Memory load is too heavy. Performance may be low.

After a couple of minutes the MIC card "hangs" and it's necessary to power cycle the system to recover.

Simon, please send us lspci -vvv output and I will try to arrange a call to discuss this case (and have Intel access your system) offline in the meantime.   is it safe to assume you've not seen any indication of temperature issues?

Pedro:  do you also have a Supermicro system?

I haven't seen any temperature related messages.

Here is the lspci -vvv output: http://paste.ubuntu.com/5711230/

Yes  X9DRG-HF   is a Supermicro  system.

These are the last lines of /sys/kernel/debug/mic_debug/mic1/log_buf. When the card hangs, there is no information about what happened.

<4>[    5.034067] Loading RAS module ver 0.9c. Build date: Mar  8 2013
<4>[    5.036794] RAS: card 22508086:25008086:11 SKU is "B1 SKU2" (60 cores, 16 memch, 0 txs)
<4>[    5.039416] RAS.elog: rev 1, size 3276, head 3275, tail 3275
<4>[    5.039472] RAS.elog: init complete
<4>[    5.039479] RAS.core: init complete
<4>[    5.039523] RAS.uncore: init complete
<6>[    5.049389] mic_pm: micpm: Freq/volt table returned to RAS
<6>[    5.049398] mic_pm: index       freq       voltage
<6>[    5.049408] mic_pm: 0      842104       1040000
<6>[    5.049418] mic_pm: 1      947367       1045000
<6>[    5.049427] mic_pm: 2      1052630       1050000
<4>[    5.049434] micpm: RAS module registered
<4>[    5.049447] RAS module load completed
<4>[    5.540110] RAS.init: module operational
<4>[   21.707071] S01fileperms used greatest stack depth: 5752 bytes left
<4>[   21.761012] ip used greatest stack depth: 5504 bytes left
<4>[   21.908814] ip used greatest stack depth: 5208 bytes left
<4>[   22.321287] Module pm_scif loaded at 0xffffffffa001c000
<1>[   22.328387] [ pm_scif_init : 343 ]:==> pm_scif_init
<1>[   22.328420] [ pm_scif_init : 344 ]:pm_scif insmoded
<1>[   22.328472] [ pm_scif_init : 372 ]: scif_bind successfull. Local port number = 1088, ep = 
<1>[   22.329328] [ pm_recv_from_host : 191 ]:==> pm_recv_from_host
<1>[   22.329425] [ pm_handle_get_latencies : 99 ]:==> pm_handle_get_latencies
<1>[   22.329504] [ pm_recv_from_host : 191 ]:==> pm_recv_from_host
<7>[   32.052823] mic0: no IPv6 routers present
<4>[11073.684542] sshd used greatest stack depth: 5152 bytes left
<4>[11158.724079] mount used greatest stack depth: 4128 bytes left

If the problem is a temperature issue, what is the behavior of the card in that situation? And how can I check the temperature of the card?

I just upgraded the BIOS and IPMI to the latest versions, but the command

ipmitool sensor shows

CPU1 Temp        | 01h | ok  |  3.1 | 36 degrees C  
CPU2 Temp        | 02h | ok  |  3.2 | 30 degrees C  
System Temp      | 11h | ok  |  7.1 | 20 degrees C  
Peripheral Temp  | 12h | ok  |  7.2 | 39 degrees C  
PCH Temp         | 0Ah | ok  |  7.3 | 49 degrees C  
10G Temp         | 0Bh | ok  |  7.4 | 59 degrees C  
P1-DIMMA TEMP    | B0h | ok  | 32.64 | 25 degrees C 
P1-DIMMB TEMP    | B4h | ok  | 32.68 | 27 degrees C 
P1-DIMMC TEMP    | B8h | ok  | 32.72 | 29 degrees C 
P1-DIMMD TEMP    | BCh | ok  | 32.76 | 30 degrees C 
P2-DIMME TEMP    | D0h | ok  | 32.80 | 19 degrees C 
P2-DIMMF TEMP    | D4h | ok  | 32.84 | 19 degrees C 
P2-DIMMG TEMP    | D8h | ok  | 32.88 | 21 degrees C 
P2-DIMMH TEMP    | DCh | ok  | 32.92 | 19 degrees C 
GPU1 Temp        | 71h | ns  | 11.1 | No Reading    
GPU2 Temp        | 72h | ns  | 11.2 | No Reading    
GPU3 Temp        | 73h | ns  | 11.3 | No Reading    
GPU4 Temp        | 74h | ns  | 11.4 | No Reading    
                                                    
I'm waiting for an answer about that from SM.

-Pedro

The problem was solved by upgrading the BIOS and IPMI versions of the system.

It works well now.

I just checked, the BIOS is up to date on my server (X9DRGQF2_C21). IPMI as well (although I don't use it).

Maybe it's a compatibility problem.

Hi Simon,

The output of your lspci -vvv was reviewed by a PCIe expert, and the recommendation at this point is for you to contact the OEM provider from whom you acquired the platform and card, as this appears to be a hardware problem. Here I will include some rough notes about what it could be and what one could try (to narrow down the problem) if so inclined, but at this point it could be anything: an electrical problem, a dead coprocessor, issues with the PCI slot, or issues with cooling. Ultimately, any of these needs to result in a conversation with your OEM for them to fix (e.g. by sending you a replacement).

(I am logging this below for posterity (others, including OEMs who may be watching this thread), but not expecting you to actually do this :) )

Here are debug notes from looking at the “lspci -vvv” dump

  1. LnkSta from the bridge shows the following, which indicates the PCIe link is not trained
    1. Width is x0
    2. DlActive is low
    3. From the AERCap register, the first error pointer is 5. Nothing is flagged in the uncorrectable error status register (most likely because an AER interrupt was generated and the handler cleared the error after logging it somewhere, I hope), but bit 5 of the uncorrectable error status is “Surprise Down”, which indicates the link was trained at some point and then dropped.
    4. Item 2 also explains why lspci shows the card but can't get any more information from it
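
For anyone who wants to repeat these checks on their own dump, here is a rough shell sketch. Both function names are hypothetical. The first reads "lspci -vvv" text on stdin and prints the address of each device whose LnkSta line reports Width x0; it deliberately ignores DLActive, since that flag is only meaningful on ports that support link-active reporting. The second maps a bit number of the AER uncorrectable error status register to the name the PCIe specification gives it; it expects the bit number in decimal, so a First Error Pointer value read from lspci may need converting first.

```shell
# Print the address of every device whose LnkSta reports "Width x0".
# Reads `lspci -vvv` text on stdin. Hypothetical helper for illustration.
find_untrained_links() {
    awk '
        /^[0-9a-f]/ { dev = $1 }   # device header, e.g. "00:02.0 PCI bridge: ..."
        /LnkSta:/ && /Width x0([^0-9]|$)/ { print dev }
    '
}

# Map a bit number (decimal) of the AER Uncorrectable Error Status
# register to its name in the PCIe specification.
aer_uncorrectable_name() {
    case "$1" in
        4)  echo "Data Link Protocol Error" ;;
        5)  echo "Surprise Down" ;;
        12) echo "Poisoned TLP" ;;
        13) echo "Flow Control Protocol Error" ;;
        14) echo "Completion Timeout" ;;
        15) echo "Completer Abort" ;;
        16) echo "Unexpected Completion" ;;
        17) echo "Receiver Overflow" ;;
        18) echo "Malformed TLP" ;;
        19) echo "ECRC Error" ;;
        20) echo "Unsupported Request" ;;
        *)  echo "unknown or reserved" ;;
    esac
}
```

A typical invocation would be "lspci -vvv | find_untrained_links" as root, followed by "aer_uncorrectable_name 5" for the pointer value mentioned above.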

 

Usual suspects for behavior like this

  1. Bad PCIe training or electrical problems with PCIe
  2. Thermtrip
  3. VR Fault

Possible things to try

  1. Limit the link to gen1 and see if  we train
  2. Force de-emphasis to -6 dB and try gen2
  3. Try slot 8 or slot 10 on the board which have shorter trace lengths
  4. Ensure the card has sufficient cooling
  5. Ensure the power rails are not glitchy. Try to read the VR status using the IPMI tool.

 

 
