Kernel oops with Intel Xeon Phi 5110P

On one of our cluster nodes with the Intel Xeon Phi 5110P stepping B01, we are noticing that the node is in an oops reboot loop. If we remove the Phi cards, the node boots fine. We are running CentOS 6.3 on these nodes. With the Phi cards installed we get the following oops every single time.

EDAC sbridge: Seeking for: dev 0d.6 PCI ID 8086:3cf5
EDAC MC0: Giving out device to 'sbridge_edac.c' 'Sandy Bridge Socket#0': DEV 0000:7f:0e.0
EDAC MC1: Giving out device to 'sbridge_edac.c' 'Sandy Bridge Socket#1': DEV 0000:ff:0e.0
EDAC sbridge: Driver loaded.
[ OK ]
vnet: mode: dma, buffers: 62
mic 0000:03:00.0: PCI INT A -> GSI 40 (level, low) -> IRQ 40
mic 0000:03:00.0: PCI INT A -> GSI 40 (level, low) -> IRQ 40
[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
[Hardware Error]: APEI generic hardware error status
[Hardware Error]: severity: 1, fatal
[Hardware Error]: section: 0, severity: 1, fatal
[Hardware Error]: flags: 0x01
[Hardware Error]: primary
[Hardware Error]: section_type: PCIe error
[Hardware Error]: port_type: 4, root port
[Hardware Error]: version: 1.16
[Hardware Error]: command: 0x4010, status: 0x0547
[Hardware Error]: device_id: 0000:00:03.0
[Hardware Error]: slot: 0
[Hardware Error]: secondary_bus: 0x03
[Hardware Error]: vendor_id: 0x8086, device_id: 0x3c08
[Hardware Error]: class_code: 000406
[Hardware Error]: bridge: secondary_status: 0x0000, control: 0x0003
[Hardware Error]: aer_status: 0x00004000, aer_mask: 0x00100000
[Hardware Error]: Completion Timeout
[Hardware Error]: aer_layer=Transaction Layer, aer_agent=Requester ID
[Hardware Error]: aer_uncor_severity: 0x0037f030
Kernel panic - not syncing: Fatal hardware error!
Pid: 1671, comm: work_for_cpu Not tainted 2.6.32-279.el6.x86_64 #1
Call Trace:
<NMI> [<ffffffff814fd11a>] ? panic+0xa0/0x168
[<ffffffff8130012c>] ? ghes_notify_nmi+0x17c/0x180
[<ffffffff81503325>] ? notifier_call_chain+0x55/0x80
[<ffffffff8150338a>] ? atomic_notifier_call_chain+0x1a/0x20
[<ffffffff810980ae>] ? notify_die+0x2e/0x30
[<ffffffff81500fd1>] ? do_nmi+0x1a1/0x2b0
[<ffffffff815008b0>] ? nmi+0x20/0x30
[<ffffffffa0151c90>] ? mic_irq_isr+0x0/0x30 [mic]
[<ffffffff812a5cad>] ? msi_set_mask_bit+0x4d/0x90
<<EOE>> [<ffffffff812a5d00>] ? unmask_msi_irq+0x10/0x20
[<ffffffff810ddc09>] ? default_enable+0x29/0x40
[<ffffffff810ddbce>] ? default_startup+0x1e/0x30
[<ffffffff810dc89a>] ? __setup_irq+0x32a/0x3c0
[<ffffffff810dd024>] ? request_threaded_irq+0x154/0x2f0
[<ffffffffa0151c90>] ? mic_irq_isr+0x0/0x30 [mic]
[<ffffffffa015238d>] ? mic_probe+0x3fd/0x5d0 [mic]
[<ffffffff81060262>] ? default_wake_function+0x12/0x20
[<ffffffff8108cc00>] ? do_work_for_cpu+0x0/0x30
[<ffffffff81292037>] ? local_pci_probe+0x17/0x20
[<ffffffff8108cc18>] ? do_work_for_cpu+0x18/0x30
[<ffffffff81091d66>] ? kthread+0x96/0xa0
[<ffffffff8100c14a>] ? child_rip+0xa/0x20
[<ffffffff81091cd0>] ? kthread+0x0/0xa0
[<ffffffff8100c140>] ? child_rip+0x0/0x20
Rebooting in 30 seconds..
ACPI MEMORY or I/O RESET_REG.

Any help in resolving this issue is appreciated.


I'm a bit out of my league on this one, but just in case a knowledgeable person doesn't come along with an answer, I am going to try to dig into this. Was the node working with the coprocessor cards in it before this, or is this the first time you tried to put cards into the node? If it was working before, what changed? (For example, a new MPSS?)

This node was working, and nothing has changed on it since then: no software or hardware changes. Initial tests were done, and after the configuration was finalized the node was restarted.

Hi Bharath,

is it possible for you to uninstall the MPSS stack and then capture and send the output of "lspci -vvvv"?

The crash is happening very early during driver probe, hence the request for the lspci output.

 

Hi Belinda,

Since we are in a time crunch, we went ahead and reinstalled the node using our system image, and for whatever reason the system is booting fine now. The image is no different from what was on it previously. I am attaching the output of "lspci -vvvv".

Attachments: 

lspci-br316.txt (55.94 KB)

I am not sure if this is related to the problem we faced earlier before the reinstall, but the node is noticeably less stable than the others. The node often ends up rebooting by itself when under user load. The error that we constantly see over the serial console before the reboot is "Critical Interrupt #0x04". I am going to get the node reinstalled without MPSS and grab the output of "lspci -vvvv". Is there anything else that might be useful to further help in debugging this issue?

If you could collect 'dmesg' output (before you reinstall), that would give a bit more detail about the failure.

Is this system an OEM-provided box?

 

The system is an OEM-provided box; the motherboards are based on the Intel S2600JF family server board, customized for our vendor. The dmesg output from before is attached. This is before reinstallation; the system still has MPSS installed. I have asked our admins to create an install image for this specific node without MPSS so that I can get the "lspci -vvvv" output.

Attachments: 

br316-dmesg.txt (85.69 KB)

Judging by the function names in the call stack, I suppose the crash occurred while the MIC driver was in an interrupt handler routine, probably executing an ISR. I suppose the interrupt being processed was a non-maskable interrupt (NMI), with do_nmi as the handler; next there is probably some issue related to atomically setting/unsetting some values (possibly register bitfield accesses), and the panic routine is called from ghes_notify_nmi+0x17c/0x180.

I did a little research on notifier routines and found that these functions are used to inform registered notifiers about events. Thus, when the kernel wants to report an event, notifier_call_chain is called with a pointer to the event passed as a parameter. Regarding the ghes_notify_nmi routine, it will call the oops_begin() routine when the sev_global value is >= SEV_GLOBAL_PANIC.

We are currently still facing a stability issue with one of our MIC nodes. The issue comes up when ofed-mic is enabled. When the oops occurs is non-deterministic, but most of the time the system is idle when it happens. Here is the last oops that we were able to capture over the serial console:

[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
[Hardware Error]: APEI generic hardware error status
[Hardware Error]: severity: 1, fatal
[Hardware Error]: section: 0, severity: 1, fatal
[Hardware Error]: flags: 0x01
[Hardware Error]: primary
[Hardware Error]: section_type: PCIe error
[Hardware Error]: port_type: 4, root port
[Hardware Error]: version: 1.16
[Hardware Error]: command: 0x4010, status: 0x0547
[Hardware Error]: device_id: 0000:80:03.0
[Hardware Error]: slot: 0
[Hardware Error]: secondary_bus: 0x81
[Hardware Error]: vendor_id: 0x8086, device_id: 0x3c08
[Hardware Error]: class_code: 000406
[Hardware Error]: bridge: secondary_status: 0x0000, control: 0x0003
[Hardware Error]: aer_status: 0x00004000, aer_mask: 0x00100000
[Hardware Error]: Completion Timeout
[Hardware Error]: aer_layer=Transaction Layer, aer_agent=Requester ID
[Hardware Error]: aer_uncor_severity: 0x0037f030
Kernel panic - not syncing: Fatal hardware error!
Pid: 0, comm: swapper Not tainted 2.6.32-279.el6.x86_64 #1
Call Trace:
 <NMI>  [<ffffffff814fd11a>] ? panic+0xa0/0x168
 [<ffffffff8130012c>] ? ghes_notify_nmi+0x17c/0x180
 [<ffffffff81503325>] ? notifier_call_chain+0x55/0x80
 [<ffffffff8150338a>] ? atomic_notifier_call_chain+0x1a/0x20
 [<ffffffff810980ae>] ? notify_die+0x2e/0x30
 [<ffffffff81500fd1>] ? do_nmi+0x1a1/0x2b0
 [<ffffffff815008b0>] ? nmi+0x20/0x30
 [<ffffffff812cd8a1>] ? intel_idle+0xb1/0x170
 <<EOE>>  [<ffffffff81408798>] ? menu_select+0x178/0x390
 [<ffffffff81407627>] ? cpuidle_idle_call+0xa7/0x140
 [<ffffffff81009e06>] ? cpu_idle+0xb6/0x110
 [<ffffffff814e433a>] ? rest_init+0x7a/0x80
 [<ffffffff81c21f7b>] ? start_kernel+0x424/0x430
 [<ffffffff81c2133a>] ? x86_64_start_reservations+0x125/0x129
 [<ffffffff81c21438>] ? x86_64_start_kernel+0xfa/0x109
Dumping qib trace buffer from panic
Done dumping qib trace buffer
Rebooting in 30 seconds..

The node is stable without ofed-mic turned on.

Hi Bharath,

I suppose that the culprit can be found in some hard error which sets the sev_global variable. It is probably related to PCIe bus functionality. There is plenty of information related to the error, but interpreting it requires reading the PCIe specification, which is restricted to members only.

I have found the description of the command register at offset 0x04, and it seems that the value 10 refers to the bit location which controls the generation of INTx interrupts. If this bit is set, the function is not allowed to generate interrupts. In your case I think an NMI interrupt is being processed. Next there is the status value 0x0547; I do not know if this is the value of the status register or of the command register. I think the problem is related to the NMI interrupt.
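As a sanity check on the register values quoted in the oops, the relevant bits can be decoded directly in the shell. This is just a sketch using the values from the log above; per the PCI and PCIe specifications, command-register bit 10 is Interrupt Disable and AER uncorrectable-status bit 14 is Completion Timeout:

```shell
#!/bin/bash
# Values copied from the oops log in this thread.
command=0x4010         # PCI command register
aer_status=0x00004000  # AER uncorrectable error status

# Bit 14 of the AER uncorrectable status is Completion Timeout,
# matching the "Completion Timeout" line printed by the kernel.
echo "Completion Timeout (aer_status bit 14): $(( (aer_status >> 14) & 1 ))"
# prints 1

# Bit 10 of the command register is Interrupt Disable; here it is
# clear, so INTx generation was not actually disabled on this port.
echo "Interrupt Disable (command bit 10): $(( (command >> 10) & 1 ))"
# prints 0
```

So the Completion Timeout reported by AER is consistent with the raw status bits, while the command register does not actually show INTx generation disabled.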

I am filing an internal bug report in Intel(R) Premier support to see what our MPSS experts say.

Besides the dmesg output, is it possible for you to run the attached script and share with us the output?

  1. Rename 'micdebug.txt' to 'micdebug.sh' and make it executable: chmod +x micdebug.sh
  2. Run it: sudo ./micdebug.sh, then reply to this thread, attaching the resulting zip file

If you are not able to upload the zip file to this forum thread or don't have a means to share the data (through some URL), can you at the very least share the output of /opt/intel/mic/bin/micinfo and /opt/intel/mic/bin/miccheck?

thank you

Attachments: 

micdebug.txt (1.96 KB)

I am attaching two zip files: micdebug-08042013-142720.zip (mpss stopped) and micdebug-08042013-144411.zip (mpss running). I didn't clean up the dmesg output, so they include our IPTables logs as well. If you want, I can remove them and re-upload.

Attachments: 

The attached file micdebug-08042013-165159.zip is the output of the script with ofed-mic running. The instability occurs only when the ofed-mic service has been started.

Attachments: 

micdebug-08042013-165159.zip (67.64 KB)

One more thing: can you provide lspci -vvv output, and could you provide any more details about the platform you are using (OEM, model, BIOS, etc.)?

I have attached the output of "lspci -vvv" as requested. With regards to the platform:
OEM manufacturer: Appro (now Cray Inc)
BIOS version: SE5C600.86B.01.06.0001.090720121056
Motherboard family: Intel S2600JF

Sorry, I forgot to attach the output of lspci; it is attached to this post.

Attachments: 

br316-lspci.txt (55.94 KB)

Hi,

we want to isolate whether the power management on the card might be triggering this. Can you turn off power management and retry to see if this can be reproduced?

 

1) Modify mic0.conf and mic1.conf (in the /etc/sysconfig/mic/ directory) with the following change:

PowerManagement "cpufreq_on;corec6_off;pc3_off;pc6_off"

 

2) Restart the mpss service:

user_prompt> service mpss restart
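For what it's worth, the two steps above can be scripted so the change is applied to every MIC configuration file at once. This is only a sketch: the sed pattern assumes each file already contains a PowerManagement line, so adjust it if your configs differ.

```shell
#!/bin/bash
# Replace any existing PowerManagement line in each mic*.conf with the
# setting suggested above (most power management features turned off).
apply_pm_setting() {
    local dir="$1"
    local conf
    for conf in "$dir"/mic*.conf; do
        [ -e "$conf" ] || continue  # no matching config files, nothing to do
        sed -i 's/^PowerManagement.*/PowerManagement "cpufreq_on;corec6_off;pc3_off;pc6_off"/' "$conf"
    done
}

# Default MPSS config location from the instructions above.
apply_pm_setting /etc/sysconfig/mic
```

After running it, restart the service with `service mpss restart` as in step 2.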

 

 If there is still an issue, will it be possible to access your system remotely for further debug?

 

Hi Belinda,

Could the PCIe interface be to blame for the kernel panic?


I will get the config changes done and let you know if the issue persists. With regards to remote access, I will talk to the admins and the other people concerned about setting up remote access for troubleshooting.

The system hasn't oopsed in the last 72 hours, as opposed to earlier. It seems like turning off the power management does help. If power management is triggering this oops, is it an issue with the hardware or the software side of power management?
