Phi cards crash and won't reset shortly after booting.

I have a host with two mic cards that boot properly when the host boots. Within 10-15 minutes of the mpss service starting, both mic cards go offline and fail to reset. This occurs even with an initconfig setup. In the messages file:

Jul 22 13:29:49 gb49 kernel: micscif_handle_lostnode 1387 node 2
Jul 22 13:29:49 gb49 kernel: Warning: Core image elf header not found
Jul 22 13:29:49 gb49 kernel: Kdump: vmcore not initialized
Jul 22 13:29:49 gb49 kernel: micscif_handle_lostnode 1399 node 2 crash dump failed status -22
Jul 22 13:29:49 gb49 kernel: mic1: Transition from state online to lost
Jul 22 13:29:49 gb49 kernel: micscif_handle_lostnode 1414 stopping node 2 to recover lost node!
Jul 22 13:29:53 gb49 kernel: micvnet_execute_stop: timeout waiting for link down message response
Jul 22 13:30:28 gb49 kernel: dma_mark_wait 1080 TO chan 0x0
Jul 22 13:30:28 gb49 kernel: drain_dma_intr 1151 err -16
Jul 22 13:30:33 gb49 kernel: dma_mark_wait 1080 TO chan 0x0
Jul 22 13:30:33 gb49 kernel: drain_dma_intr 1151 err -16
Jul 22 13:30:38 gb49 kernel: dma_mark_wait 1080 TO chan 0x1
Jul 22 13:30:38 gb49 kernel: drain_dma_intr 1151 err -16
Jul 22 13:30:43 gb49 kernel: dma_mark_wait 1080 TO chan 0x2
Jul 22 13:30:43 gb49 kernel: drain_dma_intr 1151 err -16
Jul 22 13:30:48 gb49 kernel: dma_mark_wait 1080 TO chan 0x3
Jul 22 13:30:48 gb49 kernel: drain_dma_intr 1151 err -16
Jul 22 13:30:53 gb49 kernel: mic1: Transition from state lost to resetting
Jul 22 13:30:55 gb49 kernel: mic1: Resetting (Post Code ??)
Jul 22 13:30:55 gb49 kernel: mic1: Transition from state resetting to reset failed
Jul 22 13:30:55 gb49 kernel: MIC 1 RESETFAIL postcode ?? -1

Any suggestions?
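
For anyone reading a similar log, the card state changes are easier to follow when pulled out from the surrounding DMA timeout noise. The following is a minimal sketch in Python, assuming the log lines have the same syslog layout as the excerpt above; the function name `mic_transitions` is made up for illustration and is not part of MPSS.

```python
import re

# Matches lines such as "... kernel: mic1: Transition from state online to lost"
TRANSITION = re.compile(r"kernel: (mic\d+): Transition from state (.+?) to (.+)$")

def mic_transitions(lines):
    """Return (timestamp, card, old_state, new_state) for each transition line."""
    found = []
    for line in lines:
        line = line.rstrip()
        m = TRANSITION.search(line)
        if m:
            timestamp = line[:15]  # syslog prefix, e.g. "Jul 22 13:29:49"
            found.append((timestamp, m.group(1), m.group(2), m.group(3)))
    return found

# Three lines copied from the log excerpt in this post.
sample = [
    "Jul 22 13:29:49 gb49 kernel: mic1: Transition from state online to lost",
    "Jul 22 13:30:53 gb49 kernel: mic1: Transition from state lost to resetting",
    "Jul 22 13:30:55 gb49 kernel: mic1: Transition from state resetting to reset failed",
]
for row in mic_transitions(sample):
    print(row)
```

On the excerpt above this would show mic1 going from online to lost, then to resetting, then to reset failed, all within about a minute.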

From micinfo (after boot):

Created Mon Jul 22 13:20:11 2013

System Info
HOST OS : Linux
OS Version : 2.6.32-358.6.2.el6.x86_64
Driver Version : 6720-15
MPSS Version : 2.1.6720-15
Host Physical Memory : 32851 MB

Device No: 0, Device Name: mic0

Version
Flash Version : 2.1.02.0386
SMC Firmware Version : 1.14.4616
SMC Boot Loader Version : 1.8.4326
uOS Version : 2.6.38.8-g2593b11
Device Serial Number : ADKC25202031

Board
Vendor ID : 0x8086
Device ID : 0x2250
Subsystem ID : 0x2500
Coprocessor Stepping ID : 3
PCIe Width : x16
PCIe Speed : 5 GT/s
PCIe Max payload size : 256 bytes
PCIe Max read req size : 512 bytes
Coprocessor Model : 0x01
Coprocessor Model Ext : 0x00
Coprocessor Type : 0x00
Coprocessor Family : 0x0b
Coprocessor Family Ext : 0x00
Coprocessor Stepping : B1
Board SKU : B1PRQ-5110P/5120D
ECC Mode : Enabled
SMC HW Revision : Product 225W Passive CS

Cores
Total No of Active Cores : 60
Voltage : 1019000 uV
Frequency : 1052631 kHz

Thermal
Fan Speed Control : N/A
Fan RPM : N/A
Fan PWM : N/A
Die Temp : 85 C

GDDR
GDDR Vendor : Elpida
GDDR Version : 0x1
GDDR Density : 2048 Mb
GDDR Size : 7936 MB
GDDR Technology : GDDR5
GDDR Speed : 5.000000 GT/s
GDDR Frequency : 2500000 kHz
GDDR Voltage : 1501000 uV

mic1 is identical (except for the serial number).

From the version numbers, it appears you are running the latest version of the MPSS. Was this an update, or is this a new system you are installing? What OS release are you running on the host? Are you running anything when the system goes down, or is it idle?

This is a new system, and the issue has occurred since purchase about 1 month ago. I installed the latest version in an attempt to correct the issue.

The host OS is CentOS 6.4 running kernel 2.6.32-358.6.2.el6.x86_64. In all occurrences, the host and mic cards are idle.

Just to be sure, I removed the current mpss install, then booted into the 2.6.32-358.el6.x86_64 kernel and reran the complete install.

Installation worked as expected, and the mic cards booted.  I noticed that the Die temps seemed very high.

After 3 minutes, at 138C, I lost mic0. A few minutes later, I lost mic1 at 136C.

If this is an overheating issue, what are the expected normal die temps for the mics?

They should be much cooler than that - I would consider anything over 80C too hot.  Do your cards have fans on them?  If not, they are passively cooled and require a specially designed chassis to properly remove the heat from them (like you tend to find in rack-mount systems), which you may not have if you are hitting over 100C!
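
To keep an eye on this, the die temperature can be read out of saved micinfo output and compared against a limit. This is only a rough sketch, assuming the text layout shown in the micinfo listing earlier in the thread ("Device No: ..., Device Name: ..." followed by a "Die Temp : NN C" line). The function names and the 80C default are illustrative, taken from the rule of thumb above rather than from any official specification.

```python
import re

DEVICE = re.compile(r"Device No:\s*\d+,\s*Device Name:\s*(\S+)")
DIE_TEMP = re.compile(r"Die Temp\s*:\s*(\d+)\s*C")

def die_temps(micinfo_text):
    """Map each device name to its reported die temperature in Celsius."""
    temps = {}
    current = None
    for line in micinfo_text.splitlines():
        device = DEVICE.search(line)
        if device:
            current = device.group(1)
            continue
        temp = DIE_TEMP.search(line)
        if temp and current is not None:
            temps[current] = int(temp.group(1))
    return temps

def too_hot(temps, limit_c=80):
    """Return the names of devices whose die temperature exceeds limit_c."""
    return sorted(name for name, temp in temps.items() if temp > limit_c)

# Trimmed sample using the values posted at the top of the thread.
sample = """Device No: 0, Device Name: mic0
Thermal
Fan RPM : N/A
Die Temp : 85 C
"""
print(too_hot(die_temps(sample)))
```

With the first micinfo listing in this thread, mic0 at 85 C would already be flagged.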

This could be an overheating issue. What is the maximum threshold temperature?

Just to follow up, we flashed the firmware on the system, and die temps are now a reasonable 45C and 68C.

I'm glad the system is behaving itself now. If you get a chance, could you rerun micinfo and post the current system information? I am curious whether the Flash or SMC versions changed in any way. I thought what you had was already the latest version.

System has been very stable since the upgrade.  Here's the latest micinfo output:

MicInfo Utility Log

Created Tue Aug 13 14:43:32 2013

System Info
HOST OS : Linux
OS Version : 2.6.32-358.el6.x86_64
Driver Version : 6720-15
MPSS Version : 2.1.6720-15
Host Physical Memory : 32844 MB

Device No: 0, Device Name: mic0

Version
Flash Version : 2.1.03.0386
SMC Firmware Version : 1.15.4830
SMC Boot Loader Version : 1.8.4326
uOS Version : 2.6.38.8-g2593b11
Device Serial Number : ADKC25202031

Board
Vendor ID : 0x8086
Device ID : 0x2250
Subsystem ID : 0x2500
Coprocessor Stepping ID : 3
PCIe Width : x16
PCIe Speed : 5 GT/s
PCIe Max payload size : 256 bytes
PCIe Max read req size : 512 bytes
Coprocessor Model : 0x01
Coprocessor Model Ext : 0x00
Coprocessor Type : 0x00
Coprocessor Family : 0x0b
Coprocessor Family Ext : 0x00
Coprocessor Stepping : B1
Board SKU : B1PRQ-5110P/5120D
ECC Mode : Enabled
SMC HW Revision : Product 225W Passive CS

Cores
Total No of Active Cores : 60
Voltage : 1028000 uV
Frequency : 1052631 kHz

Thermal
Fan Speed Control : N/A
Fan RPM : N/A
Fan PWM : N/A
Die Temp : 40 C

GDDR
GDDR Vendor : Elpida
GDDR Version : 0x1
GDDR Density : 2048 Mb
GDDR Size : 7936 MB
GDDR Technology : GDDR5
GDDR Speed : 5.000000 GT/s
GDDR Frequency : 2500000 kHz
GDDR Voltage : 1501000 uV

Device No: 1, Device Name: mic1

Version
Flash Version : 2.1.03.0386
SMC Firmware Version : 1.15.4830
SMC Boot Loader Version : 1.8.4326
uOS Version : 2.6.38.8-g2593b11
Device Serial Number : ADKC25202064

Board
Vendor ID : 0x8086
Device ID : 0x2250
Subsystem ID : 0x2500
Coprocessor Stepping ID : 3
PCIe Width : x16
PCIe Speed : 5 GT/s
PCIe Max payload size : 256 bytes
PCIe Max read req size : 512 bytes
Coprocessor Model : 0x01
Coprocessor Model Ext : 0x00
Coprocessor Type : 0x00
Coprocessor Family : 0x0b
Coprocessor Family Ext : 0x00
Coprocessor Stepping : B1
Board SKU : B1PRQ-5110P/5120D
ECC Mode : Enabled
SMC HW Revision : Product 225W Passive CS

Cores
Total No of Active Cores : 60
Voltage : 1019000 uV
Frequency : 1052631 kHz

Thermal
Fan Speed Control : N/A
Fan RPM : N/A
Fan PWM : N/A
Die Temp : 48 C

GDDR
GDDR Vendor : Elpida
GDDR Version : 0x1
GDDR Density : 2048 Mb
GDDR Size : 7936 MB
GDDR Technology : GDDR5
GDDR Speed : 5.000000 GT/s
GDDR Frequency : 2500000 kHz
GDDR Voltage : 1501000 uV
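
To answer the question about which versions changed, the two micinfo listings in this thread can be compared field by field. Below is a minimal sketch, assuming each input is the text for a single device and that fields are written as "name : value" as in the listings; `parse_fields` and `changed_fields` are hypothetical helper names, not part of micinfo.

```python
def parse_fields(micinfo_text):
    """Collect "name : value" lines into a dict, skipping headings."""
    fields = {}
    for line in micinfo_text.splitlines():
        key, sep, value = line.partition(" : ")
        if sep:
            fields[key.strip()] = value.strip()
    return fields

def changed_fields(before, after):
    """Return {field: (old, new)} for fields present in both with different values."""
    old, new = parse_fields(before), parse_fields(after)
    return {k: (old[k], new[k]) for k in old if k in new and old[k] != new[k]}

# Version section for mic0, copied from the two listings in this thread.
before = """Version
Flash Version : 2.1.02.0386
SMC Firmware Version : 1.14.4616
SMC Boot Loader Version : 1.8.4326
"""
after = """Version
Flash Version : 2.1.03.0386
SMC Firmware Version : 1.15.4830
SMC Boot Loader Version : 1.8.4326
"""
for name, (old, new) in changed_fields(before, after).items():
    print(f"{name}: {old} -> {new}")
```

For mic0, the Flash Version moved from 2.1.02.0386 to 2.1.03.0386 and the SMC Firmware Version from 1.14.4616 to 1.15.4830, while the SMC Boot Loader Version stayed at 1.8.4326.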
