Porting Linux kernel 2.6.38.8 (Xeon Phi) to a newer version

Dear all,

As part of a research project, I am porting the Linux kernel of the Xeon Phi from version 2.6.38.8 to 3.2.14.

I can successfully boot the kernel (i.e. I can see "<6>[   45.740227] Freeing unused kernel memory: 548k freed" in the dmesg buffer) and init has been called (anything that I had before switch_root gets printed on the dmesg buffer).

Now I am stuck, because I may have missed something during the port. The error happens either during switch_root or while unpacking the cpio archive onto tmpfs. I modified the kernel to enable send_IPI_all(NMI_VECTOR) in arch/x86/kernel/apic/hw_nmi.c:arch_trigger_all_cpu_backtrace(..), so that I can get a kernel stack dump on the locked core. In your kernel, NMIs (i.e. IRQ vector 0x2) are sent through arch/x86/kernel/apic/apic_phys_fla64.c:default_send_IPI_mask_sequence_phys(..). I ran several experiments, and no NMIs are received after calling default_send_IPI_mask_sequence_phys(..). I did not run exactly the same experiment on 2.6.38.8, but I did see that the NMI count is always zero there (in /proc/interrupts).
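For anyone who wants to repeat the /proc/interrupts check, here is a minimal user-space sketch in C. It sums the per-CPU counters on one line of that file, so the NMI total can be compared before and after triggering a backtrace. The helper name sum_irq_line() is made up for this example and is not a kernel or libc API; the line layout assumed is the usual "LABEL: count count ... description".

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Sum the per-CPU counters on one line of /proc/interrupts, e.g.
 *   "NMI:    0    0    3   Non-maskable interrupts"
 * Returns -1 if the line does not start with "<label>:".
 * sum_irq_line() is a hypothetical helper name for this sketch.
 */
long sum_irq_line(const char *line, const char *label)
{
    assert(line != NULL && label != NULL);

    /* Skip the padding that precedes short labels. */
    while (*line == ' ' || *line == '\t')
        line++;

    size_t n = strlen(label);
    if (strncmp(line, label, n) != 0 || line[n] != ':')
        return -1;

    const char *p = line + n + 1;
    long total = 0;
    for (;;) {
        char *end;
        long v = strtol(p, &end, 10);
        if (end == p)   /* reached the textual description: stop */
            break;
        total += v;
        p = end;
    }
    return total;
}
```

In practice one would read /proc/interrupts line by line with fgets(), call sum_irq_line(line, "NMI") on each line until it returns something other than -1, and compare the totals taken before and after the NMI is sent. A total that stays at zero matches what is described above.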

1. Am I missing something about NMIs? Is the code in 2.6.38.8 fully updated to support NMIs?

2. Is anyone else working on porting the Linux kernel for the Xeon Phi to another kernel version?

Many thanks,
Antonio

 


I solved question 1. I think the 2.6.38.8 kernel should be fixed accordingly.

I had a look at your KDB code in arch/x86/kdb/kdba_support.c:kdba_wait_for_cpus(..), where NMI_VECTOR is used in this way:

apic->send_IPI_mask(cpumask_of(c), NMI_VECTOR);

I simply applied the same logic in arch/x86/kernel/apic/hw_nmi.c:arch_trigger_all_cpu_backtrace(..). The body of the function now looks like this:

  printk(KERN_ERR "sending NMI to all CPUs:\n");
//  apic->send_IPI_all(NMI_VECTOR);
  apic->send_IPI_mask(to_cpumask(backtrace_mask),NMI_VECTOR);

With this change I am now able to debug the stalls detected by tree RCU, and I am getting the following:

<4>[   46.430117] CPU 87
<4>[   46.430117] Modules linked in:
<4>[   46.430117]
<4>[   46.430117] Pid: 1133, comm: kworker/87:1 Tainted: G        W    3.2.14+mpss3.2 #48  
<4>[   46.430117] RIP: 0010:[<ffffffff8102d204>]  [<ffffffff8102d204>] try_to_wake_up+0x94/0x1cf
<4>[   46.430117] RSP: 0018:ffff88015fc6b768  EFLAGS: 00000002
<4>[   46.430117] RAX: 000000000000000a RBX: ffff8801690e2d00 RCX: 000000000000000c
<4>[   46.430117] RDX: 000000000000000c RSI: 000000000000000a RDI: ffff8801690e32c0
<4>[   46.430117] RBP: ffff88015fc6b7b8 R08: 0000000000000000 R09: 000000000000001c
<4>[   46.430117] R10: 000000000000001c R11: ffff88016c4f1a40 R12: 0000000000000000
<4>[   46.430117] R13: 0000000000000000 R14: 0000000000000001 R15: ffff8801690e32c0
<4>[   46.430117] FS:  00007ff5d786b700(0000) GS:ffff88016c4e0000(0000) knlGS:0000000000000000
<4>[   46.430117] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[   46.430117] CR2: 0000000000000a40 CR3: 0000000161109000 CR4: 0000000000000060
<4>[   46.430117] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>[   46.430117] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
<4>[   46.430117] Process kworker/87:1 (pid: 1133, threadinfo ffff880169342000, task ffff8801690e2d00)
<0>[   46.430117] Stack:
<4>[   46.430117]  ffff88016ac06c00 0000002800000000 00000057815b5900 0000000000000046
<4>[   46.430117]  ffff88016ac060c0 ffff8801690e2d00 0000000000000000 000000000000000b
<4>[   46.430117]  ffff880169111980 0000000000000000 ffff88015fc6b7c8 ffffffff8102d4d4
<0>[   46.430117] Call Trace:
<0>[   46.430117] Code: 89 c7 41 b4 01 e8 06 b3 ff ff 4c 89 f7 e8 b3 46 ff ff 45 85 e4 74 0f 44 8b 65 c4 e9 14 01 00 00 c5 fa ae f0 eb 05 b8 0a 00 00 00 <83> 7b 28 00 75 ef 48 8b 13 31 c0 80 e2 02 74 09 31 c0 f6 43 15

I am still looking for help with porting the Xeon Phi kernel (question 2).

Antonio

        while (p->on_cpu) {
#ifdef __ARCH_WANT_INTERRUPTS_ON_CTXSW
                /*
                 * In case the architecture enables interrupts in
                 * context_switch(), we cannot busy wait, since that
                 * would lead to deadlocks when an interrupt hits and
                 * tries to wake up @prev. So bail and do a complete
                 * remote wakeup.
                 */
                if (ttwu_activate_remote(p, wake_flags))
                        goto stat;
#else
                cpu_relax();
#endif
        }

The processor is essentially spinning forever in this while loop (in kernel/sched.c:try_to_wake_up(..)).

However, the only code on this path that is touched by the Xeon Phi patch is arch/x86/include/asm/processor.h:cpu_relax() (because the pause instruction is not supported on Xeon Phi).

Note that the offending process, kworker, is affine to a single CPU only, and that the task is marked as running here: p->on_cpu is 1 (have a look at kernel/sched.c:task_running(..)).

Any advice?!

Antonio

Frances Roth (Intel)

I know there are other users out there who have tried porting other kernels to the coprocessor; hopefully one of them will step forward with ideas. Personally, I am not much use when it comes to OS internals but I am asking around to see if I can find someone who has some words of wisdom to offer.

All,

I just released my initial port of the MPSS 3.2 Linux kernel for the Xeon Phi to Linux kernel 3.2.14. You can now run Linux 3.2.14 on the Xeon Phi and (very soon) on the Xeon. The project is on SourceForge at the following URL:

The port consists of two parts:

I added a small installation guide at:

Moreover, the cpio image and the /var/mpss/mic0/ directory I used are available for download at:

Note that not all functionality has been tested or is supported.

Antonio

Taylor Kidd (Intel)

Antonio,

Thanks for doing the work and sharing it with the community.

Regards
--
Taylor
