Problems with new Xeon Phi

Posted by kankamuso

Dear all,

I just received my brand new computer with an Intel Xeon Phi. I followed the instructions in the MPSS readme and booting files, but the card behaves erratically. Sometimes it enters the "reset failed" state, sometimes not. It takes a long time (if it finishes at all) to reset using:

[jrbcast@localhost ~]$ sudo micctrl -r
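
To put a number on how long a reset takes, the card state can be polled with a timeout. The following is only a sketch: it assumes the host driver exposes the state as text in `/sys/class/mic/mic0/state` (a path not mentioned in this thread, so check it on your system), and `wait_for_state` is a hypothetical helper, not part of MPSS.

```python
import time
from pathlib import Path

# Assumption: the MPSS host driver publishes the card state in this sysfs file.
STATE_FILE = Path("/sys/class/mic/mic0/state")


def read_state(path=STATE_FILE):
    """Return the card state string as reported by the driver."""
    return path.read_text().strip()


def wait_for_state(wanted, timeout_s=300.0, poll_s=2.0, reader=read_state,
                   clock=time.monotonic, sleep=time.sleep):
    """Poll until the state is one of `wanted` or the timeout expires.

    Returns (reached, last_state, elapsed_seconds).
    """
    start = clock()
    while True:
        state = reader()
        elapsed = clock() - start
        if state in wanted:
            return True, state, elapsed
        if elapsed >= timeout_s:
            return False, state, elapsed
        sleep(poll_s)


# Example use after issuing `sudo micctrl -r` (the state name is an assumption):
#   ok, state, elapsed = wait_for_state({"ready"})
#   print(f"state={state!r} after {elapsed:.0f}s", "ok" if ok else "timed out")
```

The reader, clock and sleep functions are parameters so the loop can be tried without a card installed.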

Also, micinfo reports almost nothing while the card is in the READY state, even when the MPSS service has been started, as shown:

[jrbcast@localhost ~]$ /opt/intel/mic/bin/micinfo
MicInfo Utility Log

Created Sun Mar 3 18:49:54 2013

System Info
Host OS : Linux
OS Version : 2.6.32-279.el6.x86_64
Driver Version : NotAvailable
MPSS Version : 2.1.4982-15
Host Physical Memory : 32829 MB
CPU Family : GenuineIntel Family 6 Model 45 Stepping 7
CPU Speed : 2001.000
Threads per Core : 2

*************************** The information below is not complete **************************
****** Please start the MPSS service and run MicInfo again to view the entire output *******

Device No: 0, Device Name: Intel(R) Xeon Phi(TM) Coprocessor

Version
Flash Version : NotAvailable
UOS Version : NotAvailable
Device Serial Number : NotAvailable

Board
Vendor ID : 8086
Device ID : 2250
SubSystem ID : 2500
Coprocessor Stepping ID : 3
PCIe Width : Insufficient Privileges
PCIe Speed : Insufficient Privileges
PCIe Max payload size : Insufficient Privileges
PCIe Max read req size : Insufficient Privileges
Coprocessor Model : 0x01
Coprocessor Model Ext : 0x00
Coprocessor Type : 0x00
Coprocessor Family : 0x0b
Coprocessor Family Ext : 0x00
Coprocessor Stepping : B1
Board SKU : NotAvailable
ECC Mode : NotAvailable
SMC HW Revision : NotAvailable

Core
Voltage : NotAvailable
Frequency : NotAvailable

Thermal
Fan Speed Control : NotAvailable
SMC Firmware Version : NotAvailable
FSC Strap : NotAvailable
Fan RPM : NotAvailable
Fan PWM : NotAvailable
Die Temp : NotAvailable

GDDR
GDDR Vendor : NotAvailable
GDDR Version : NotAvailable
GDDR Density : NotAvailable
GDDR Size : NotAvailable
GDDR Technology : NotAvailable
GDDR Speed : NotAvailable
GDDR Frequency : NotAvailable
GDDR Voltage : NotAvailable

I only see the details when the card is in the ONLINE state, after the MPSS service has been started:

[jrbcast@localhost ~]$ sudo service mpss start
Starting MPSS Stack: [ OK ]
mic0: online (mode: linux image: /lib/firmware/mic/uos.img)
[jrbcast@localhost ~]$ /opt/intel/mic/bin/micinfo
MicInfo Utility Log

Created Sun Mar 3 18:50:47 2013

System Info
Host OS : Linux
OS Version : 2.6.32-279.el6.x86_64
Driver Version : 4982-15
MPSS Version : 2.1.4982-15
Host Physical Memory : 32829 MB
CPU Family : GenuineIntel Family 6 Model 45 Stepping 7
CPU Speed : 1200.000
Threads per Core : 2

Device No: 0, Device Name: Intel(R) Xeon Phi(TM) Coprocessor

Version
Flash Version : 2.1.05.0375
UOS Version : 2.6.38.8-g32944d0
Device Serial Number : ADKC25003311

Board
Vendor ID : 8086
Device ID : 2250
SubSystem ID : 2500
Coprocessor Stepping ID : 3
PCIe Width : Insufficient Privileges
PCIe Speed : Insufficient Privileges
PCIe Max payload size : Insufficient Privileges
PCIe Max read req size : Insufficient Privileges
Coprocessor Model : 0x01
Coprocessor Model Ext : 0x00
Coprocessor Type : 0x00
Coprocessor Family : 0x0b
Coprocessor Family Ext : 0x00
Coprocessor Stepping : B1
Board SKU : B1PRQ-5110P
ECC Mode : Enabled
SMC HW Revision : Product 225W Passive CS

Core
Total No of Active Cores : 60
Voltage : 1032000 uV
Frequency : 1052631 kHz

Thermal
Fan Speed Control : N/A
SMC Firmware Version : 1.7.4172
FSC Strap : 14 MHz
Fan RPM : N/A
Fan PWM : N/A
Die Temp : 81 C

GDDR
GDDR Vendor : Elpida
GDDR Version : 0x1
GDDR Density : 2048 Mb
GDDR Size : 7936 MB
GDDR Technology : GDDR5
GDDR Speed : 5.000000 GT/s
GDDR Frequency : 2500000 kHz
GDDR Voltage : 1000000 uV
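
Comparing two long micinfo dumps by eye is error-prone, so here is a minimal sketch that turns the `Field : value` lines into a dictionary and lists the fields the tool could not read. The helper names `parse_micinfo` and `unavailable` are made up for illustration, and the sample lines are copied from the output above.

```python
def parse_micinfo(text):
    """Map 'Field' -> 'value' for every 'Field : value' line of micinfo output."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(" : ")
        # Skip banner lines and headings, which have no ' : ' separator.
        if sep and not key.startswith("*"):
            fields[key.strip()] = value.strip()
    return fields


def unavailable(fields):
    """Return the names of fields micinfo reported as unreadable."""
    missing = ("NotAvailable", "Insufficient Privileges")
    return sorted(name for name, value in fields.items() if value in missing)


SAMPLE = """\
Driver Version : NotAvailable
MPSS Version : 2.1.4982-15
PCIe Width : Insufficient Privileges
Die Temp : NotAvailable
"""

print(unavailable(parse_micinfo(SAMPLE)))
```

Running the same two functions on the READY and ONLINE dumps shows which fields change between them. The PCIe fields read "Insufficient Privileges" in both runs, which may simply mean micinfo was run without root rights rather than pointing at the card.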

Also, the micsmc command only shows that the cards are disconnected. I attach a file with the dmesg output and the full sequence of commands I ran while writing this post.

In another thread I read that these POST codes (obtained from dmesg) mean something is wrong with the memory. Should I RMA the card?

MIC 0 Resetting (Post Code 3d)
MIC 0 Resetting (Post Code 3d)
MIC 0 Resetting (Post Code 3d)
MIC 0 Resetting (Post Code 3d)
MIC 0 Resetting (Post Code 3d)
lo: Disabled Privacy Extensions
MIC 0 Resetting (Post Code 3E)
MIC 0 Resetting (Post Code 3E)
MIC 0 Resetting (Post Code 3E)
MIC 0 Resetting (Post Code 3E)
MIC 0 Resetting (Post Code 3E)
MIC 0 Resetting (Post Code 3E)
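
A rough way to see how long the card sits in each POST code is to count consecutive identical messages in the dmesg output. This is a sketch only: `post_code_runs` is a hypothetical helper, the pattern is derived from the lines above, and the counts translate into time only if the driver prints these messages at a regular interval, which is an assumption.

```python
import re
from itertools import groupby

# Pattern based on the dmesg lines shown in this post.
POST_RE = re.compile(r"MIC (\d+) Resetting \(Post Code ([0-9A-Fa-f]{2})\)")


def post_code_runs(dmesg_text):
    """Return [(card, code, count), ...] for runs of identical POST codes.

    Unrelated kernel messages between matches are ignored.
    """
    hits = [(int(m.group(1)), m.group(2).upper())
            for m in POST_RE.finditer(dmesg_text)]
    return [(card, code, len(list(group)))
            for (card, code), group in groupby(hits)]


# Example use on a saved log (the file name is up to you):
#   with open("dmesg.txt") as fh:
#       for card, code, count in post_code_runs(fh.read()):
#           print(f"mic{card}: post code {code} x{count}")
```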

Thanks in advance,

Jose 

Attachment: mic-problems.txt (28.63 KB)
Posted by Kevin Davis (Intel)

Hi Jose, I posted earlier to your post in the collateral thread (http://software.intel.com/en-us/forums/topic/366599) but it is more appropriate to continue the discussion under this post.

I looked through this post and the attachment and do not see anything to suggest the card was malfunctioning, at least at the time the posted information was collected. The engineer I asked indicated that 3E is a normal state the card enters during reset, and that the time spent in that state varies between cards. However, from your description it sounds plausible that there is a hardware issue with the card.

If you purchased your system through an OEM then contact them to determine if a replacement card can/should be provided.

I realize that is not much help, but I hope it helps some.

Posted by kankamuso

Thanks Kevin,

I forgot to mention that I am running CentOS 6.3 with the original kernel, no updates. Nevertheless, after waiting more than 300 seconds for a restart and not being able to query the card through micinfo, I am pretty sure something is not working properly. I have already contacted my supplier. Let's see how this develops.

Best,

jose
