Troubleshooting HOWTO: Bad hardware? MPSS? Configuration?

Are you having problems with your hardware (cannot see your Intel(R) Xeon Phi(TM) coprocessor? sporadic accessibility?) or with getting the Intel(R) Manycore Platform Software Stack (Intel(R) MPSS) to run reliably?

Attached to this post are PDF "flowcharts" that explain how to troubleshoot the problem (both Linux and Windows flowcharts are available) and show what information to collect if you need to escalate the issue to your OEM provider or Intel.

We hope this is useful to you! Please let us know if you find a boundary condition that this "flow" does not handle properly.

I have experienced a problem that looks like a bug in the 64-bit memory-to-stack push instruction.

I am porting the Glasgow Pascal compiler to the MIC and have run into an error that looks very much like a bad implementation of PUSH.

It appears that a push instruction  of the form:

 push QWORD[   r8* 8+ label]

actually pushes the quadword at

 push QWORD[   r8* 8+ label140ba08d9aadf+8]

Here are the relevant source lines along with the assembler lines that they translate into.

First we have a call to a run-time library function written in C, using C parameter passing.

;writeln( shiftindex[d,0]);
; note that shiftindex is declared as array[0..4,0..1] of integer
 mov  rcx,         5                          ; field width info
 mov  rdx,         12                         ; field width info
 mov bl,BYTE  ptr  [  rbp+         -49]
 movsx r8,     bl
 imul   r8,        8
 movsx rsi,  dword   ptr  [ r8+ label140ba08d9aadf] ; the parameter for the value to be printed
 movsx rdi,  dword   ptr  [  unit$system$base+         -24] ; the file it will be sent to
.ifndef definedprintint
definedprintint=1
.extern                  printint
.endif
 call printint;#imported
;--------
; this correctly prints out the 0th element of the row of the array shiftindex

Now we call a Pascal function, passing a row of the array by value and using a push instruction to place the row on the stack:

 

;compareImagePair (shiftindex[d])
; d is a byte
; #297
 mov bl,BYTE  ptr  [  rbp+         -49]
 movsx r8,     bl
 push QWORD[   r8* 8+ label140ba08d9aadf]
 call label140ba08d9abe3
; this passes to the function the d+1 th element of the array shiftindex;
; in other words the push instruction fetches the wrong element from the array,
; as compared to the mov instruction used earlier

Printout from programme

First, the contents of the shiftindex array:

           0           0
           0          -1
           0           1
          -1           0
           1           0

           d          shiftindex[d,0]
           3          -1

What we get inside the function compareImagePair when we print the parameter:

dirvec =           1           0

I have now concluded that this is a bug in the assembler distributed with the MIC. If you replace the line

 push QWORD[   r8* 8+ label140ba08d9aadf]

with

 push QWORD ptr [   r8* 8+ label140ba08d9aadf]

it fetches the correct value, not the value 8 bytes past the correct address.
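
The address arithmetic in this report can be checked with a small model. The Python sketch below lays out the shiftindex array from the printout above as it would sit in memory, assuming 4-byte little-endian integers stored row by row (an assumption; the post only says "array[0..4,0..1] of integer"), and reads the quadword at the intended address and at 8 bytes further on. One possible explanation for the extra 8, which this thread does not confirm, is that the assembler treats a bare QWORD as the constant 8 and adds it to the displacement.

```python
import struct

# shiftindex as printed in the post: array[0..4, 0..1] of integer.
SHIFTINDEX = [(0, 0), (0, -1), (0, 1), (-1, 0), (1, 0)]

# Assumption: each integer is 4 bytes, little-endian, rows stored contiguously,
# so one row occupies exactly 8 bytes (one quadword).
MEMORY = b"".join(struct.pack("<ii", a, b) for a, b in SHIFTINDEX)


def read_row(index, extra_offset=0):
    """Read the quadword at label + index*8 + extra_offset as two 32-bit ints."""
    address = index * 8 + extra_offset
    return struct.unpack_from("<ii", MEMORY, address)


d = 3
intended = read_row(d)                  # the row that shiftindex[d] refers to
observed = read_row(d, extra_offset=8)  # the row found 8 bytes further on

print("intended:", intended)  # (-1, 0)
print("observed:", observed)  # (1, 0)
```

With d = 3 the intended row is (-1, 0), which matches the value printed by writeln, and the row 8 bytes further on is (1, 0), which matches the dirvec printed inside compareImagePair.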

Thanks for isolating this bug. It has been reported to the team that owns the assembler.

It appears that the x86_64 assembler does the same thing.

The error is that "QWORD ptr" must be used here, as Paul realized. The fact that QWORD alone is allowed may be a bug, which we need to discuss internally. What happens if AT&T syntax is used?


I am trying to install a Xeon Phi card in a Supermicro server (http://www.supermicro.com/products/superblade/module/sbi-7127rg.cfm). According to the flowchart, I need to "Enable support for mapping >4GB MMIO in the host BIOS". However, I cannot see the MMIO setup option in the BIOS, even after upgrading to the latest version. Could anyone please give me some suggestions?

Hi Yue,

From a quick look at your link, it appears to be an old server which might not support Xeon Phi at all. You might check with Supermicro whether this server can host a Xeon Phi.

Otherwise, it might be that your BIOS has this option enabled by default. Is your card detected at all?

Belinda,

I've installed MPSS for Windows on a Windows 7 Pro x64 system. I can get the two Xeon Phi 5510P cards up and running: firmware updates work, the cards boot, micinfo shows both cards, the cards are ready, and I can ping both cards. MicSmc-gui.exe shows both cards idling along, ...

I can compile my first project selected in the tutorials coi folder "hello_world". The project compiles and runs fine up to the point where it wants to launch the native side app hello_world_sink_mic, which is not built by the solution (as separate project).

After launching an Intel Parallel Studio XE 2013 command prompt for use with Visual Studio 2012, navigating to the demo folder (under C:\Program Files\...) and issuing

icl -Qmic hello_world_sink.cpp -o hello_world_sink_mic

I receive an error stating that stdio.h cannot be found and that I should check the MPSS environment variables.

If I remove -Qmic (not what I want, as this compiles a host app), I get an error writing the .obj file (because the folder is under C:\Program Files\...).

If I copy the MPSS folder elsewhere (not under a protected folder):

compile with -Qmic fails with stdio.h not found

compile without -Qmic succeeds.

In other words, -Qmic expects a different set of environment variables (with respect to INCLUDE).

How do I properly set the environment variable(s) for compiling the coprocessor side (-Qmic) of the demo programs under Windows?

Jim Dempsey

www.quickthreadprogramming.com

We installed the 3120A card in a Windows 7 box; the card is blinking blue. We installed MPSS 3.1.2. The card is not displayed in Device Manager (?). The "micctrl -s" command results in an error:

Error manipulating coprocessor: Intel(R) Xeon Phi(TM) coprocessor driver is not loaded or you have insufficient access

Hi Alex, can you check the following? (This is based on similar forum posts reported earlier this month.)

  1. Physically inspect the card installation: is the card inserted properly, and are all power connectors on the card plugged in properly?

  2. If you are working with a NUMA machine where some of the PCI slots are enabled or disabled, make sure the coprocessor is installed in an enabled PCI slot.

Thank you, BELINDA! Switching to another PCI slot worked!

 

Thanks for this.  Really helpful.  Unfortunately the problem we are seeing is the driver crashing when the machine boots.  I've attached the stacktrace from the logs.  This is from the latest MPSS on RHEL 6.

 

Attachment: mic-stacktrace.txt (9.25 KB)

The other problem that we are having is with NFS exporting GPFS shares. Since the GPFS drivers and client software do not support MIC, we NFS export the drives from each host to its MICs. It is very unreliable, though: the MICs sometimes will not mount the drives, citing "stale NFS filehandle" as the cause, which is untrue. It seems related to the order of the mounts in /etc/fstab, as the first one will mount and the second won't.

Ideally we'd like GPFS binaries for MIC, as this is a kludge anyway.  In the current state we can't really say to users that the systems are ready to use.

(We'd also really like MPSS to support OFED 2.x, since that is what the rest of the machine is using.  Only the nodes with MICs in are on 1.5.x, and that's entirely due to needing it to support the IPoIB software provided with MPSS.)

Hi Zaniyah, 

Is your mic stack trace from a consistently failing coprocessor? And can you send the tarball that gets created by the micdebug.sh script?

As for GPFS: are you in a position to ask IBM about their plans to support GPFS with Intel Xeon Phi coprocessors? You can even tell them that there is now a Lustre client (it was recently released; we'll provide a writeup on that soon).

I will make sure to pass on your comments about wanting OFED 2.x support.

 

Hi Belinda,

I have installed mpss-3.2 for my first use of the Xeon Phi, but I cannot find out which flash version is installed and I cannot update it.

Nor can I start the mpss service.

Here are the results of several commands:

sudo micinfo
MicInfo Utility Log
Copyright 2011-2013 Intel Corporation All Rights Reserved.

Created Thu Mar 27 08:18:47 2014


	System Info
		HOST OS			: Linux
		OS Version		: 2.6.32-431.5.1.el6.x86_64
		Driver Version		: 3.2-1
		MPSS Version		: 3.2
		Host Physical Memory	: 65918 MB

Device No: 0, Device Name: mic0

	Version
		Flash Version 		 : NotAvailable
		SMC Firmware Version	 : NotAvailable
		SMC Boot Loader Version	 : NotAvailable
		uOS Version 		 : NotAvailable
		Device Serial Number 	 : NotAvailable
...
sudo micflash -update -device all -smcbootloader
No image path specified - Searching: /usr/share/mpss/flash
mic0: No valid image found
micsmc
DEBUG: ***** MicSettings(parent)::fileName():  "/home/vivi/.config/Intel Corp/MicSmcGUI.ini" 
DEBUG: ***** SessionSettings(parent)::fileName():  "/home/vivi/.config/Intel Corp/MicSmcGUI.ini" 
Avertissement : mic0 : Connexion avec le périphérique perdue !
Infos Web mic0 : Connexion avec le périphérique rétablie.
Avertissement : mic0 : Connexion avec le périphérique perdue !
Infos Web mic0 : Connexion avec le périphérique rétablie.
Avertissement : mic0 : Connexion avec le périphérique perdue !
sudo service mpss start
Starting Intel(R) MPSS:                                    [ÉCHOUÉ]
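
When micinfo output is long, it can help to pull out just the fields the tool could not read. The Python sketch below is a minimal illustration, not part of MPSS: the function name unavailable_fields and the SAMPLE text (a shortened copy of the output above) are made up for this example, and it only assumes the "name : value" layout seen in this post.

```python
def unavailable_fields(micinfo_text):
    """Return the names of micinfo fields whose value is 'NotAvailable'."""
    missing = []
    for line in micinfo_text.splitlines():
        name, separator, value = line.partition(":")
        if separator and value.strip() == "NotAvailable":
            missing.append(name.strip())
    return missing


# Shortened copy of the micinfo output posted above.
SAMPLE = """\
        System Info
                MPSS Version            : 3.2
Device No: 0, Device Name: mic0
        Version
                Flash Version            : NotAvailable
                SMC Firmware Version     : NotAvailable
                SMC Boot Loader Version  : NotAvailable
                uOS Version              : NotAvailable
                Device Serial Number     : NotAvailable
"""

for field in unavailable_fields(SAMPLE):
    print("not available:", field)
```

Every version field reading NotAvailable, as here, suggests the host cannot talk to the card at all rather than a problem with one particular firmware component.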

Could you help me find out what the trouble is?

Thanks in advance.

Virginie

Virginie (France) CentOS 6.5 - MPSS-3.2 - Xeon Phi 7120P

I thought that the problem might be caused by the kernel version.

So I rebooted with the original kernel version and reinstalled MPSS. But here is the result of micctrl --initdefaults:

micctrl --initdefaults
micctrl(segv_handler+0x18) [0x4070c8]
/lib64/libpthread.so.0() [0x34cb40f710]
/usr/lib64/libmpssconfig.so.0.0.1(_add_miclist_not_present+0xb8) [0x7f5c84a35b98]
/usr/lib64/libmpssconfig.so.0.0.1(mpss_get_miclist+0x4d) [0x7f5c84a35e7d]
micctrl(create_miclist+0x1cd) [0x42123d]
micctrl(parse_config_args+0x370) [0x40db60]
micctrl(main+0x236) [0x40df56]
/lib64/libc.so.6(__libc_start_main+0xfd) [0x34cb01ed1d]
micctrl() [0x406d29]

 

Virginie (France) CentOS 6.5 - MPSS-3.2 - Xeon Phi 7120P

Hi Virginie, can you send us the output of /usr/bin/micdebug.sh (just attach the tarball to this thread).   That would be most helpful.

Hi Belinda !

I send you the last one, but if you want I have 2 other ones (made on Tuesday and Wednesday).

The usual commands I try as root:

  1. lspci | grep proc
  2. setenforce 0
  3. modprobe mic
  4. service mpss start

Thanks in advance.

Attachment: micdebug_20140328_095540utc.tgz (799.35 KB)

Virginie (France) CentOS 6.5 - MPSS-3.2 - Xeon Phi 7120P

Hi Virginie,

the 'lspci -vvv' output on your host shows some weird things for the coprocessor (look for Co-processor in the output).   

Here is what it shows for you:

 

------

04:00.0 Co-processor: Intel Corporation Xeon Phi coprocessor SE10/7120 series (rev ff) (prog-if ff)
    !!! Unknown header type 7f
    Kernel driver in use: mic

------

Here is what it should normally show: (as an example)

-----

01:00.0 Co-processor: Intel Corporation Device 2250 (rev 11)
    Subsystem: Intel Corporation Device 2500
    Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    Latency: 0, Cache Line Size: 64 bytes
    Interrupt: pin A routed to IRQ 32
    Region 0: Memory at 380c00000000 (64-bit, prefetchable) [size=8G]
    Region 4: Memory at fb700000 (64-bit, non-prefetchable) [size=128K]
    Capabilities: [44] Power Management version 3
        Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot-,D3cold-)
        Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
    Capabilities: [4c] Express (v2) Endpoint, MSI 00
        DevCap:    MaxPayload 256 bytes, PhantFunc 0, Latency L0s <4us, L1 <64us
            ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
        DevCtl:    Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
            RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+
            MaxPayload 256 bytes, MaxReadReq 512 bytes
        DevSta:    CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
        LnkCap:    Port #0, Speed 5GT/s, Width x16, ASPM L0s L1, Latency L0 <4us, L1 unlimited
            ClockPM- Surprise- LLActRep- BwNot-
        LnkCtl:    ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
            ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
        LnkSta:    Speed 5GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
        DevCap2: Completion Timeout: Range AB, TimeoutDis+, LTR-, OBFF Not Supported
        DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
        LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
             Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
             Compliance De-emphasis: -6dB
        LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
             EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
    Capabilities: [88] MSI: Enable- Count=1/16 Maskable- 64bit+
        Address: 0000000000000000  Data: 0000
    Capabilities: [98] MSI-X: Enable+ Count=16 Masked-
        Vector table: BAR=4 offset=00017000
        PBA: BAR=4 offset=00018000
    Capabilities: [100 v1] Advanced Error Reporting
        UESta:    DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UEMsk:    DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UESvrt:    DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
        CESta:    RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
        CEMsk:    RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
        AERCap:    First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
    Kernel driver in use: mic

----
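
The difference between the two listings can be spotted mechanically. The Python sketch below is a minimal illustration that scans `lspci -vvv` text for Co-processor entries showing the two symptoms seen in the first listing: a revision of ff and an "Unknown header type" line. The function name suspicious_coprocessors and the SAMPLE text are made up for this example. Values of all ones like these commonly mean that reads from the device's configuration space are not being answered, but treat that interpretation as an assumption rather than a diagnosis.

```python
def suspicious_coprocessors(lspci_text):
    """Return (slot, reasons) for Co-processor entries that look unreadable."""
    findings = []
    current = None
    for line in lspci_text.splitlines():
        if line and not line[0].isspace():
            # An unindented line starts a new device entry (or is a separator).
            current = None
            if "Co-processor" in line:
                reasons = []
                if "(rev ff)" in line:
                    reasons.append("revision reads as ff")
                current = (line.split()[0], reasons)
                findings.append(current)
        elif current is not None and "Unknown header type" in line:
            current[1].append(line.strip().lstrip("! "))
    return [(slot, reasons) for slot, reasons in findings if reasons]


# Shortened copies of the two listings shown above.
SAMPLE = """\
04:00.0 Co-processor: Intel Corporation Xeon Phi coprocessor SE10/7120 series (rev ff) (prog-if ff)
    !!! Unknown header type 7f
    Kernel driver in use: mic
01:00.0 Co-processor: Intel Corporation Device 2250 (rev 11)
    Subsystem: Intel Corporation Device 2500
    Kernel driver in use: mic
"""

for slot, reasons in suspicious_coprocessors(SAMPLE):
    print(slot, "->", "; ".join(reasons))
```

On a real host the input would come from running `lspci -vvv` as root and passing its output to the function.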

 

 

I found someone else in this forum who has similar hardware to yours:

Manufacturer: ASUSTeK COMPUTER INC.
    Product Name: P9X79 WS

He uses CentOS 6.4 (you have 6.5) with an older MPSS, 3.1.x (you have 3.2).

Let me ask a couple of questions:

   - Is this the first time you've installed this coprocessor? (That seems to be the case based on what you've said before.)

   - Have you tried plugging the coprocessor into any other slot in your system?

   - Did you change anything in your system's BIOS? (I.e., you need to enable BIOS support for memory-mapped I/O address ranges above 4GB.)

   - We may have to look further into the BIOS. I have some BIOS update files from someone who, as I said before, had his ASUS functioning. I could forward these to you. The version he has working is P9x79-WS-ASUS-4306.CA. What is yours?

 

Hi Belinda !

I managed to start the MPSS service and to update the flash with micflash.

The main problem was a thermal one. We have installed an additional fan just for the Xeon Phi coprocessor.

But I still get an error message when I try to read the flash version with micflash.

Here it is :

micflash -getversion -device 0
mic0: Flash read started
mic0: Read done
mic0: Version: 2.1.02.0390
mic0: Transitioning to ready state
micflash: mic0: Failed to read post code: read: /sys/class/mic/mic0/post_code: No such device or address

and then :

micctrl -s
mic0: reset failed

Could you help me, please?

Edit:

After rebooting, the output of miccheck:

miccheck
MicCheck 3.2-r1
Copyright 2013 Intel Corporation All Rights Reserved

Executing default tests for host
  Test 0: Check number of devices the OS sees in the system ... pass
  Test 1: Check mic driver is loaded ... pass
  Test 2: Check number of devices driver sees in the system ... pass
  Test 3: Check mpssd daemon is running ... pass
Executing default tests for device: 0
  Test 4 (mic0): Check device is in online state and its postcode is FF ... pass
  Test 5 (mic0): Check ras daemon is available in device ... pass
  Test 6 (mic0): Check running flash version is correct ... pass

Status: OK

 

Virginie (France) CentOS 6.5 - MPSS-3.2 - Xeon Phi 7120P

Quote:

BELINDA L. (Intel) wrote:

the 'lspci -vvv' output on your host shows some weird things for the coprocessor (look for Co-processor in the output).

Here is the new one:

08:00.0 Co-processor: Intel Corporation Xeon Phi coprocessor SE10/7120 series (rev 20)
        Subsystem: Intel Corporation Device 7d95
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 11
        Region 0: Memory at 380800000000 (64-bit, prefetchable) [size=16G]
        Region 4: Memory at d3200000 (64-bit, non-prefetchable) [size=128K]
        Capabilities: [44] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot-,D3cold-)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [4c] Express (v2) Endpoint, MSI 00
                DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <4us, L1 <64us
                        ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
                DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
                        RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+
                        MaxPayload 128 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
                LnkCap: Port #0, Speed 5GT/s, Width x16, ASPM L0s L1, Latency L0 <4us, L1 unlimited
                        ClockPM- Surprise- LLActRep- BwNot-
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk-
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 5GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Range AB, TimeoutDis+, LTR-, OBFF Not Supported
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
                LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance De-emphasis: -6dB
                LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
                         EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
        Capabilities: [88] MSI: Enable- Count=1/16 Maskable- 64bit+
                Address: 0000000000000000  Data: 0000
        Capabilities: [98] MSI-X: Enable- Count=16 Masked-
                Vector table: BAR=4 offset=00017000
                PBA: BAR=4 offset=00018000
        Capabilities: [100 v1] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
                AERCap: First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-

 

Quote:

BELINDA L. (Intel) wrote:

I found someone else in this forum who has similar hardware to yours:

Manufacturer: ASUSTeK COMPUTER INC.

    Product Name: P9X79 WS

he uses CentOS (6.4) vs. yours (6.5), using an older MPSS (3.1.x) vs yours (3.2).

Is it possible to know who he is, so that I can ask him questions if necessary?

Quote:

BELINDA L. (Intel) wrote:

Let me ask a couple of questions:

   - is this the first time you've installed this coprocessor? (that seems to be the case based on what you've said before)

I have made several attempts, but this is the first time I have installed this coprocessor.

Quote:

BELINDA L. (Intel) wrote:

   - have you tried plugging the co-processor into any other slot in your system

I tried 3 of the 7 slots. In the end I chose the one in the middle, to avoid too much heat.

Quote:

BELINDA L. (Intel) wrote:

   - did you  change anything in your system's BIOS? (i.e. you need to enable BIOS support for memory mapped I/O address ranges above 4GB? )

I enabled addresses above 4 GB as soon as I installed the new motherboard.

Moreover, I have just increased the speed of the fan for the Xeon Phi coprocessor.

Quote:

BELINDA L. (Intel) wrote:

    - we may have to look further into the BIOS -- I have some BIOS update files from someone who, like I said before, had his ASUS functioning. I could forward these to you. The version he has working is P9x79-WS-ASUS-4306.CA. what is yours?

I do not know the BIOS version. I will report it later.

Thanks for your help.

As you can read above, I still need help!

Virginie (France) CentOS 6.5 - MPSS-3.2 - Xeon Phi 7120P

Hi Virginie -

1. What are the results of 'ls /sys/class/mic/mic0'?

2. You indicated that you managed to start mpss. Does the process table show mpssd running (ps auxw | grep mpss)? Were there any errors resulting from the startup (service mpss start)?

3. Can you obtain and send another capture of micdebug.sh (now that you have, hopefully, corrected the thermal issue)? We're specifically interested in what the micinfo and dmesg commands say. dmesg or /var/log/messages may give some indication of what is happening. micdebug.sh collects all of this data in one shot.

4. Are you reasonably sure that there aren't any lingering thermal issues, even after your changes in fan speeds and coprocessor/slot positioning?
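
For a quick look at the card between full micdebug.sh captures, a few of the sysfs entries can be read directly. The Python sketch below is a minimal illustration: the attribute names state, post_code and flashversion are taken from the /sys/class/mic/mic0 directory listing posted in this thread, while the function name read_mic_attributes is made up for this example and the meaning of each value is not documented here. An attribute that cannot be read is reported with the operating system's error text instead of raising, since an earlier post shows post_code failing with "No such device or address".

```python
from pathlib import Path


def read_mic_attributes(device_dir, names=("state", "post_code", "flashversion")):
    """Read selected sysfs attributes; unreadable ones map to the error text."""
    result = {}
    for name in names:
        try:
            result[name] = (Path(device_dir) / name).read_text().strip()
        except OSError as exc:
            result[name] = f"<unreadable: {exc.strerror}>"
    return result


if __name__ == "__main__":
    # On a host without the mic driver loaded, every entry prints as unreadable.
    for key, value in read_mic_attributes("/sys/class/mic/mic0").items():
        print(f"{key}: {value}")
```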

Hi Belinda !

1.

ls /sys/class/mic/mic0
active_cores      flash_update       memoryvoltage  scif_status
boot_count        flashversion       memsize        serialnumber
cmdline           fuse_config_rev    mode           sku
crash_count       image              model          state
dev               initramfs          pc3_enabled    stepping
device            interface_version  pc6_enabled    stepping_data
extended_family   kernel_cmdline     pc6_timeout    substepping_data
extended_model    log_buf_addr       platform       subsystem
fail_safe_offset  log_buf_len        post_code      uevent
family            meminfo            power          virtblk_file
family_data       memoryfrequency    processor

2. There are no more errors when I start the mpss service.

ps auxw | grep mpss
root      3736  0.0  0.0 194864   932 pts/0    Sl   08:14   0:00 /usr/sbin/mpssd
root      3866  0.0  0.0 105320   912 pts/0    S+   08:21   0:00 grep mpss

3. The output of micdebug.sh is attached.

4. I cannot be sure, but yesterday I was logged in to mic0 via ssh all morning without any interruption. It was the first time I managed to run the mpss service for longer than 1 minute. Moreover, I touched the coprocessor many times before shutting down my computer yesterday morning without burning myself, which had happened every time I booted the computer before. It was still hot, but not burning. Previously it was burning hot for a few minutes every time, and when the mic status became "reset failed" it became cold again. I am sending you an image of the coprocessor and the fan, taken before I installed it.

Thanks and have a nice day.

Attachments:

Virginie (France) CentOS 6.5 - MPSS-3.2 - Xeon Phi 7120P

Hi Virginie,

If your coprocessor came up and was accessible for a short amount of time, then it is quite possibly a cooling or power management issue.

Do you know whether the ASUS system you are using is qualified to run this coprocessor? Is this something you can check with your hardware provider?

One thing to try temporarily (I would not recommend leaving power management off for extended periods of time) is this:

sudo service mpss stop
sudo micctrl --pm=off
sudo micctrl --resetconfig  (shouldn’t be necessary but won’t hurt anything)
sudo service mpss start

Then see whether the coprocessor manages to stay up for a while. Let me know how that goes.
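
If this sequence has to be repeated on several hosts, it can be wrapped in a small script. The Python sketch below is only an illustration: the four commands are the ones listed above, while the names PM_OFF_SEQUENCE and run_sequence are made up for this example. It defaults to a dry run that only prints the commands; with dry_run=False it must be run as root, and it stops at the first command that fails.

```python
import subprocess

# The four commands from the post above, in order (run as root, so no sudo here).
PM_OFF_SEQUENCE = [
    ["service", "mpss", "stop"],
    ["micctrl", "--pm=off"],
    ["micctrl", "--resetconfig"],
    ["service", "mpss", "start"],
]


def run_sequence(commands, dry_run=True):
    """Print each command and execute it only when dry_run is False."""
    done = []
    for command in commands:
        print("+", " ".join(command))
        if not dry_run:
            # check=True raises CalledProcessError and stops at the first failure.
            subprocess.run(command, check=True)
        done.append(command)
    return done


if __name__ == "__main__":
    run_sequence(PM_OFF_SEQUENCE, dry_run=True)
```

Remember that the advice above is to leave power management off only temporarily.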

Hi Belinda !

I can now run MPSS for an entire day without any problem, although for the moment I am not putting much load on it.

I did not try what you explained last time, because when I stop the MPSS service I cannot restart it without rebooting my computer.

Now that the hardware problems seem to have been resolved, I have a new problem.

When I try to run an MPI program on the Xeon Phi coprocessor, I get this message:

mpirun -n 20 -host mic0 /tmp/myprog.mic
pmi_proxy: /bin/pmi_proxy: cannot execute binary file
pmi_proxy: /bin/pmi_proxy: Success

I can run myprog.host on my computer without any problem, but I cannot run the program on the MIC, either from my PC or from mic0 after an ssh.

Thank you for your help.

(I am not sure this is the right place for this new question, but I did not find a better one.)

Virginie (France) CentOS 6.5 - MPSS-3.2 - Xeon Phi 7120P

Hi Belinda !

Today I succeeded in running several programs on the Xeon Phi coprocessor, but only after connecting to it with ssh.

When I try from the host, I get this error:

mpirun -n 2 -host mic0 /Essais_MPI/myprog.mic
[proxy:0:0@mic0] HYDU_sock_connect (./utils/sock/sock.c:264): unable to connect from "mic0" to "myIPaddress" (No route to host)
[proxy:0:0@mic0] main (./pm/pmiserv/pmip.c:396): unable to connect to server myIPaddress at port 48973 (check for firewalls!)

The port is different each time.

Thanks.

Virginie (France) CentOS 6.5 - MPSS-3.2 - Xeon Phi 7120P

Virginie,

I wonder if there is a problem with the configuration on the coprocessor. When you installed the MPSS, did you configure the network using micctrl or did you edit the mic0.conf directly? If you edited mic0.conf directly, did you use micctrl to push changes out afterward? ssh sets up a tunnel from the host to the coprocessor when it connects, which may be why it works and mpi does not.

By the way, it might be easier to make sure your problem doesn't get lost if you start a new thread. Even though your problems are troubleshooting problems, when we scan back through forum posts to see if there are issues that never got addressed, it is easier to find these issues when each thread deals with a separate issue and has its own title.

Frances

Hi Belinda !

Yesterday I was able to run MPI programs in symmetric mode (on both host and coprocessor) and directly on mic0 after an ssh.

But today I have trouble again.

MPSS starts correctly, but 3 minutes later the status of mic0 is "lost" and I cannot reset it.

# micctrl -s
mic0: lost
# service mpss status
mpss is running
# micctrl -rw
          mic0: resetting
          mic0: reset failed

Even when everything seems to run fine I am not able to reboot or reset with micctrl.
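
To catch the moment the card drops out, the micctrl status lines can be checked from a script. The Python sketch below is a minimal illustration that parses lines of the form "mic0: lost" and flags the two failure states reported in this thread, "lost" and "reset failed". The function names parse_micctrl_status and failed_devices are made up for this example, and the list of failure states is an assumption limited to what appears in these posts.

```python
# Assumption: only the two failure states reported in this thread are flagged.
FAILURE_STATES = ("lost", "reset failed")


def parse_micctrl_status(text):
    """Map device name to status from lines like 'mic0: lost'."""
    status = {}
    for line in text.splitlines():
        name, separator, state = line.partition(":")
        name = name.strip()
        if separator and name.startswith("mic") and name[3:].isdigit():
            # A later line for the same device overwrites an earlier one.
            status[name] = state.strip()
    return status


def failed_devices(text):
    """Return the sorted names of devices whose last status is a failure state."""
    status = parse_micctrl_status(text)
    return sorted(name for name, state in status.items() if state in FAILURE_STATES)


# Copy of the `micctrl -rw` output posted above.
SAMPLE = """\
          mic0: resetting
          mic0: reset failed
"""

print(parse_micctrl_status(SAMPLE))
print(failed_devices(SAMPLE))
```

On a real host the text would come from running `micctrl -s` and capturing its output.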

I tried the steps you gave in your message #26, but they failed.

Virginie (France) CentOS 6.5 - MPSS-3.2 - Xeon Phi 7120P

Hi Frances !

Quote:

Frances Roth (Intel) wrote:

I wonder if there is a problem with the configuration on the coprocessor. When you installed the MPSS, did you configure the network using micctrl or did you edit the mic0.conf directly? If you edited mic0.conf directly, did you use micctrl to push changes out afterward? ssh sets up a tunnel from the host to the coprocessor when it connects, which may be why it works and mpi does not.

When I installed the MPSS I used micctrl to configure the network.

Quote:

Frances Roth (Intel) wrote:

By the way, it might be easier to make sure your problem doesn't get lost if you start a new thread. Even though your problems are troubleshooting problems, when we scan back through forum posts to see if there are issues that never got addressed, it is easier to find these issues when each thread deals with a separate issue and has its own title.

OK. Next time I will start a new thread.

Thanks.

Virginie (France) CentOS 6.5 - MPSS-3.2 - Xeon Phi 7120P

 

 

Hi,

I cannot access MPSS 3.2.1 for Windows (update of 10 April 2014):

https://software.intel.com/en-us/articles/intel-manycore-platform-softwa...

The link for version 3.2.1 for Windows does not seem to work.

Thanks,

bertrand

 

 

 

Hi,

The download works now; it was my mistake.

Sorry for my useless post(!)

Regards,

bertrand

We've added a Windows troubleshooting flow to this post, for those who are getting Intel Xeon Phi coprocessors working on Microsoft* Windows.

Happy computing!

This comment has been moved to its own thread

Sorry, I just came to listen and learn from this discussion.

Hi,

I have :

                Flash Version            : 2.1.02.0390
                SMC Firmware Version     : 1.16.5078
                SMC Boot Loader Version  : 1.8.4326
                uOS Version              : 2.6.38.8+mpss3.4.1
                Device Serial Number     : ADKC32100318

 

Where do I find the new firmware? I have looked and searched for it with little success.

Can you point me to a web page with the firmware?

 

Regards,

Ole

 

The firmware is delivered with the MPSS. When you install a new MPSS, one of the instructions in the readme.txt file tells you to update the flash with the micflash command. This will update Flash, SMC Firmware and SMC Boot Loader.
