"cat /proc/scif/resume" on card crashes host system

A simple:

# cat /proc/scif/resume

on the card crashes the host system instantly.

Before digging into the cause, I just need a simple "confirmed" so I know
that the fault is not in my host-side mpss-modules port to Linux-3.11.6
or in my software or hardware settings.

I am still able to see this message from the card:

# cat /proc/scif/resume
# Resuming/Waking up node

"uname -a" on the card gives the following output:
Linux sauron-mic0 2.6.38.8+mpss3.1 #1 SMP Tue Oct 15 11:49:30 PDT 2013 k1om GNU/Linux

This installation is more or less the result of "micctrl --initdefaults" after
unpacking the following SUSE RPMs to "/":

glibc2.12.2pkg-libmicmgmt0-3.1-0.1.build0.glibc2.12.2.x86_64.rpm
libscif0-3.1-0.1.build0.glibc2.12.2.x86_64.rpm
mpss-boot-files-3.1-0.1.build0.glibc2.12.2.x86_64.rpm
mpss-daemon-3.1-0.1.build0.glibc2.12.2.x86_64.rpm
mpss-miccheck-bin-3.1-r1.glibc2.12.2.x86_64.rpm
mpss-micmgmt-3.1-0.1.build0.glibc2.12.2.x86_64.rpm

I was able to salvage the following kernel messages from the log server:

Nov 6 17:15:25 sauron kernel: [16796.780345] dmar: DRHD: handling fault status reg 2
Nov 6 17:15:25 sauron kernel: [16796.780355] dmar: DMAR:[DMA Read] Request device [02:00.0] fault addr fdbcf000
Nov 6 17:15:25 sauron kernel: [16796.780355] DMAR:[fault reason 06] PTE Read access is not set
Nov 6 17:15:25 sauron kernel: [16797.071390] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 32993
Nov 6 17:15:25 sauron kernel: [16797.071669] {1}[Hardware Error]: APEI generic hardware error status
Nov 6 17:15:25 sauron kernel: [16797.071862] {1}[Hardware Error]: severity: 1, fatal
Nov 6 17:15:25 sauron kernel: [16797.072012] {1}[Hardware Error]: section: 0, severity: 1, fatal
Nov 6 17:15:25 sauron kernel: [16797.072197] {1}[Hardware Error]: flags: 0x01
Nov 6 17:15:25 sauron kernel: [16797.072331] {1}[Hardware Error]: primary
Nov 6 17:15:25 sauron kernel: [16797.072452] {1}[Hardware Error]: section_type: PCIe error
Nov 6 17:15:25 sauron kernel: [16797.072619] {1}[Hardware Error]: port_type: 0, PCIe end point
Nov 6 17:15:25 sauron kernel: [16797.072797] {1}[Hardware Error]: version: 1.0
Nov 6 17:15:25 sauron kernel: [16797.072931] {1}[Hardware Error]: command: 0x0407, status: 0x2810
Nov 6 17:15:25 sauron kernel: [16797.073118] {1}[Hardware Error]: device_id: 0000:02:00.0
Nov 6 17:15:25 sauron kernel: [16797.073283] {1}[Hardware Error]: slot: 4
Nov 6 17:15:25 sauron kernel: [16797.073406] {1}[Hardware Error]: secondary_bus: 0x00
Nov 6 17:15:25 sauron kernel: [16797.073559] {1}[Hardware Error]: vendor_id: 0x8086, device_id: 0x225d
Nov 6 17:15:25 sauron kernel: [16797.073758] {1}[Hardware Error]: class_code: 00400b
Nov 6 17:15:25 sauron kernel: [16797.073908] Kernel panic - not syncing: Fatal hardware error!
Nov 6 17:15:25 sauron kernel: [16797.076283] ------------[ cut here ]------------
Nov 6 17:15:25 sauron kernel: [16797.076439] WARNING: CPU: 0 PID: 38489 at arch/x86/kernel/smp.c:124 native_smp_send_reschedule+0x54/0x60()
Nov 6 17:15:25 sauron kernel: [16797.076746] Modules linked in: mic(O) dm_mod msr fuse x86_pkg_temp_thermal joydev kvm_intel kvm iTCO_wdt lpc_ich mfd_core [last unloaded: mic]
Nov 6 17:15:25 sauron kernel: [16797.077247] CPU: 0 PID: 38489 Comm: sshd Tainted: G W O 3.11.6-ainan #16
Nov 6 17:15:25 sauron kernel: [16797.077482] Hardware name: Dell Inc. PowerEdge T620/07HNGV, BIOS 2.0.19 09/02/2013
Nov 6 17:15:25 sauron kernel: [16797.077722] 000000000000007c ffff880ff490f4b0 ffffffff81c9c556 0000000000000007
Nov 6 17:15:25 sauron kernel: [16797.077985] 0000000000000000 ffff880ff490f4f0 ffffffff8106eec2 ffff880ff490f4e0
Nov 6 17:15:25 sauron kernel: [16797.078247] 0000000000000001 ffff88203f211d40 0000000000000001 ffff88103fa11d40
Nov 6 17:15:25 sauron kernel: [16797.078510] Call Trace:
Nov 6 17:15:25 sauron kernel: [16797.078594] [<ffffffff81c9c556>] dump_stack+0x46/0x58
Nov 6 17:15:25 sauron kernel: [16797.078761] [<ffffffff8106eec2>] warn_slowpath_common+0x82/0xb0
Nov 6 17:15:25 sauron kernel: [16797.078953] [<ffffffff8106ef05>] warn_slowpath_null+0x15/0x20
Nov 6 17:15:25 sauron kernel: [16797.079140] [<ffffffff81029a14>] native_smp_send_reschedule+0x54/0x60
Nov 6 17:15:25 sauron kernel: [16797.079355] [<ffffffff810a004b>] trigger_load_balance+0x18b/0x250
Nov 6 17:15:25 sauron kernel: [16797.084642] [<ffffffff810967a9>] scheduler_tick+0xa9/0xe0
Nov 6 17:15:25 sauron kernel: [16797.090027] [<ffffffff810799d4>] update_process_times+0x64/0x80
Nov 6 17:15:25 sauron kernel: [16797.095341] [<ffffffff810b4461>] tick_sched_handle.isra.11+0x31/0x40
Nov 6 17:15:25 sauron kernel: [16797.100578] [<ffffffff810b4554>] tick_sched_timer+0x44/0x70
Nov 6 17:15:25 sauron kernel: [16797.105639] [<ffffffff8108eb0a>] __run_hrtimer.isra.31+0x4a/0xd0
Nov 6 17:15:25 sauron kernel: [16797.110570] [<ffffffff8108f363>] hrtimer_interrupt+0x103/0x240
Nov 6 17:15:25 sauron kernel: [16797.115423] [<ffffffff8109f381>] ? load_balance+0xf1/0x740
Nov 6 17:15:25 sauron kernel: [16797.120271] [<ffffffff8102bbd6>] local_apic_timer_interrupt+0x36/0x60
Nov 6 17:15:25 sauron kernel: [16797.125197] [<ffffffff8102bfde>] smp_apic_timer_interrupt+0x3e/0x60
Nov 6 17:15:25 sauron kernel: [16797.130160] [<ffffffff81cad14a>] apic_timer_interrupt+0x6a/0x70
Nov 6 17:15:25 sauron kernel: [16797.135188] [<ffffffff81094a4e>] ? finish_task_switch+0x4e/0xe0
Nov 6 17:15:25 sauron kernel: [16797.140235] [<ffffffff81ca424b>] __schedule+0x41b/0x990
Nov 6 17:15:25 sauron kernel: [16797.145307] [<ffffffff81bbbbf9>] ? __intel_map_single+0x159/0x1c0
Nov 6 17:15:25 sauron kernel: [16797.150441] [<ffffffff810841a3>] ? start_flush_work+0x103/0x140
Nov 6 17:15:25 sauron kernel: [16797.155594] [<ffffffff81ca4a84>] schedule+0x24/0x70
Nov 6 17:15:25 sauron kernel: [16797.160831] [<ffffffff81ca3975>] schedule_hrtimeout_range_clock+0x115/0x130
Nov 6 17:15:25 sauron kernel: [16797.166087] [<ffffffff8147eecb>] ? tty_ldisc_try+0x4b/0x60
Nov 6 17:15:25 sauron kernel: [16797.171378] [<ffffffff8147e2c8>] ? tty_write_room+0x18/0x20
Nov 6 17:15:25 sauron kernel: [16797.176643] [<ffffffff8147a7db>] ? n_tty_poll+0x1eb/0x200
Nov 6 17:15:25 sauron kernel: [16797.181893] [<ffffffff81ca399e>] schedule_hrtimeout_range+0xe/0x10
Nov 6 17:15:25 sauron kernel: [16797.187156] [<ffffffff811392aa>] poll_schedule_timeout+0x5a/0xc0
Nov 6 17:15:25 sauron kernel: [16797.192382] [<ffffffff81139deb>] do_select+0x70b/0x7b0
Nov 6 17:15:25 sauron kernel: [16797.197452] [<ffffffff811394d0>] ? __pollwait+0xf0/0xf0
Nov 6 17:15:25 sauron kernel: [16797.202342] [<ffffffff811394d0>] ? __pollwait+0xf0/0xf0
Nov 6 17:15:25 sauron kernel: [16797.206999] [<ffffffff811394d0>] ? __pollwait+0xf0/0xf0
Nov 6 17:15:25 sauron kernel: [16797.211413] [<ffffffff811394d0>] ? __pollwait+0xf0/0xf0
Nov 6 17:15:25 sauron kernel: [16797.215650] [<ffffffff81095a64>] ? check_preempt_curr+0x84/0xa0
Nov 6 17:15:25 sauron kernel: [16797.219814] [<ffffffff81095a92>] ? ttwu_do_wakeup+0x12/0x90
Nov 6 17:15:25 sauron kernel: [16797.223875] [<ffffffff81095a64>] ? check_preempt_curr+0x84/0xa0
Nov 6 17:15:25 sauron kernel: [16797.227967] [<ffffffff81095a92>] ? ttwu_do_wakeup+0x12/0x90
Nov 6 17:15:25 sauron kernel: [16797.232036] [<ffffffff81097b8e>] ? try_to_wake_up+0x22e/0x2a0
Nov 6 17:15:25 sauron kernel: [16797.236002] [<ffffffff81097c0d>] ? default_wake_function+0xd/0x10
Nov 6 17:15:25 sauron kernel: [16797.239961] [<ffffffff810939e8>] ? __wake_up_common+0x58/0x90
Nov 6 17:15:25 sauron kernel: [16797.243901] [<ffffffff8113a08d>] core_sys_select+0x1fd/0x2f0
Nov 6 17:15:25 sauron kernel: [16797.247877] [<ffffffff8109af3a>] ? set_next_entity+0x7a/0xe0
Nov 6 17:15:25 sauron kernel: [16797.251827] [<ffffffff81ca424b>] ? __schedule+0x41b/0x990
Nov 6 17:15:25 sauron kernel: [16797.255745] [<ffffffff81bccb39>] ? sock_getsockopt+0xd9/0x740
Nov 6 17:15:25 sauron kernel: [16797.259651] [<ffffffff814777e0>] ? tty_write+0x1d0/0x2a0
Nov 6 17:15:25 sauron kernel: [16797.263529] [<ffffffff8147a9e0>] ? n_tty_ioctl+0xd0/0xd0
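
One detail in the log above: the DMAR fault and the fatal APEI PCIe error both name the same PCI address, 02:00.0. Below is a minimal Python sketch that checks this mechanically. The function names are made up for illustration, and the sample lines are copied from the log with the syslog prefix trimmed.

```python
import re

# Sample lines copied from the host log above (syslog prefix trimmed).
LOG = """\
dmar: DMAR:[DMA Read] Request device [02:00.0] fault addr fdbcf000
DMAR:[fault reason 06] PTE Read access is not set
{1}[Hardware Error]: device_id: 0000:02:00.0
{1}[Hardware Error]: vendor_id: 0x8086, device_id: 0x225d
"""

# "Request device [bus:dev.fn] fault addr <hex>" as printed by the DMAR code.
DMAR_RE = re.compile(
    r"Request device \[([0-9a-f]{2}:[0-9a-f]{2}\.[0-7])\] fault addr ([0-9a-f]+)"
)
# "device_id: domain:bus:dev.fn" as printed in the APEI PCIe error section.
# The numeric "device_id: 0x225d" on the vendor_id line does not match.
APEI_RE = re.compile(
    r"\[Hardware Error\]:\s*device_id: ([0-9a-f]{4}):([0-9a-f]{2}:[0-9a-f]{2}\.[0-7])"
)


def dmar_fault_devices(text):
    """Return the bus:dev.fn strings named in DMAR fault lines."""
    return {m.group(1) for m in DMAR_RE.finditer(text)}


def apei_error_devices(text):
    """Return the bus:dev.fn strings named in APEI PCIe error lines."""
    return {m.group(2) for m in APEI_RE.finditer(text)}


def shared_devices(text):
    """Devices that appear in both a DMAR fault and an APEI error."""
    return dmar_fault_devices(text) & apei_error_devices(text)


if __name__ == "__main__":
    print(sorted(shared_devices(LOG)))
```

A match only shows that both messages point at the same device (here, the coprocessor at 02:00.0). It says nothing about which side, host driver or card, issued the bad DMA read.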


Hi Ahmet,

Sorry for the delay. I can confirm that issuing only the following command on the coprocessor causes the host to stop communicating with the coprocessor:

# cat /proc/scif/resume

However, I think you need to issue the "suspend" command before you "resume" the SCIF service. In other words, issuing the following commands in this order should not cause any problem at all:

# cat /proc/scif/suspend

# cat /proc/scif/resume 

Hope this helps. Thank you.

I've updated to mpss-3.1.1 and applied a flash update to see whether this is still an issue:

[root@sauron-mic0 ~]# cat /proc/scif/suspend 
[root@sauron-mic0 ~]# cat /proc/scif/resume

And again, the host system crashes hard and has to wait for the watchdog to reset it.
It really shouldn't crash the host system in any case.
At least only the root user on the mic card has access to "/proc/scif", which keeps ordinary users from triggering this:

[ainan@sauron-mic0 ~]$ cat /proc/scif/resume
cat: can't open '/proc/scif/resume': Permission denied
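
For anyone who wants to check those permissions without risking a read, here is a small Python sketch. It relies on os.access(), which only asks the kernel about permissions via access(2) and never opens the file, so the proc read handler is not invoked. The helper name is made up for illustration, and whether a Python interpreter is available on the card is an assumption.

```python
import os


def readable_without_opening(path):
    """True if the current user may read `path`.

    Uses access(2) through os.access(), so the file is never opened or read.
    """
    return os.access(path, os.R_OK)


if __name__ == "__main__":
    # Paths taken from this thread; they exist only on the coprocessor.
    for entry in ("/proc/scif/suspend", "/proc/scif/resume"):
        print(entry, readable_without_opening(entry))
```

For the unprivileged user shown above this should report False for both entries. Run as root it will report True, so it is not a safety guard, just a way to look without touching.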

So I will listen to the doctor and not touch it if it hurts :)

Ahmet
