Xeon Phi: HW Exception: Segmentation Fault in all examples

Hey,

I just updated my Phi to the latest MPSS version (3.2.1) and also the OpenCL Runtime (14.1) as well as the SDK (2014 4.4.0).

Since then, every OCL example and every other OCL code crashes when I run it on the Phi. The CPU works fine as always.

I tried rebooting and everything else I could think of, but I cannot figure out what is going wrong.

The MonteCarlo Example gives me this output:

Build program options: "-D__DO_FLOAT__ -cl-denorms-are-zero -cl-fast-relaxed-math -cl-single-precision-constant -DNSAMP=262144"
*** OPENCL MIC DEVICE HW EXCEPTION ***: Segmentation fault (Address not mapped to object [0xfffffffffffffff8])

BACKTRACE:
/tmp/coi_procs/1/4991/mic_server[0x407132]
/lib64/libpthread.so.0(+0xf4d0)[0x7f588b47d4d0]
/tmp/coi_procs/1/4991/mic_server[0x41e8dd]
/tmp/coi_procs/1/4991/mic_server[0x4223b8]
/tmp/coi_procs/1/4991/mic_server[0x41fced]
/tmp/coi_procs/1/4991/mic_server[0x41e59d]
/tmp/coi_procs/1/4991/mic_server[0x41672d]
/tmp/coi_procs/1/4991/mic_server(copy_program_to_device+0x21)[0x4165f1]
/usr/lib64/libcoi_device.so.0(+0x31ef0)[0x7f588bd2bef0]
/usr/lib64/libcoi_device.so.0(+0x322c3)[0x7f588bd2c2c3]
/usr/lib64/libcoi_device.so.0(+0x326d9)[0x7f588bd2c6d9]
/lib64/libpthread.so.0(+0x7bce)[0x7f588b475bce]
/lib64/libc.so.6(clone+0x6d)[0x7f588a89d1cd]

******************

terminate called after throwing an instance of 'std::runtime_error'
  what():  Segmentation fault
Segmentation fault

System status for the Phi seems ok:

MicCheck 3.2.1-r1
Copyright 2013 Intel Corporation All Rights Reserved

Executing default tests for host
  Test 0: Check number of devices the OS sees in the system ... pass
  Test 1: Check mic driver is loaded ... pass
  Test 2: Check number of devices driver sees in the system ... pass
  Test 3: Check mpssd daemon is running ... pass
Executing default tests for device: 0
  Test 4 (mic0): Check device is in online state and its postcode is FF ... pass
  Test 5 (mic0): Check ras daemon is available in device ... pass
  Test 6 (mic0): Check running flash version is correct ... pass

Status: OK

MicInfo Utility Log
Copyright 2011-2013 Intel Corporation All Rights Reserved.

Created Fri May 23 16:13:38 2014


	System Info
		HOST OS			: Linux
		OS Version		: 3.0.13-0.27-default
		Driver Version		: 3.2.1-1
		MPSS Version		: 3.2.1
		Host Physical Memory	: 264519 MB

Device No: 0, Device Name: mic0

	Version
		Flash Version 		 : 2.1.02.0390
		SMC Firmware Version	 : 1.16.5078
		SMC Boot Loader Version	 : 1.7.4172
		uOS Version 		 : 2.6.38.8+mpss3.2.1
		Device Serial Number 	 : ADKC25104125

	Board
		Vendor ID 		 : 0x8086
		Device ID 		 : 0x2250
		Subsystem ID 		 : 0x2500
		Coprocessor Stepping ID	 : 3
		PCIe Width 		 : x16
		PCIe Speed 		 : 5 GT/s
		PCIe Max payload size	 : 256 bytes
		PCIe Max read req size	 : 512 bytes
		Coprocessor Model	 : 0x01
		Coprocessor Model Ext	 : 0x00
		Coprocessor Type	 : 0x00
		Coprocessor Family	 : 0x0b
		Coprocessor Family Ext	 : 0x00
		Coprocessor Stepping 	 : B1
		Board SKU 		 : B1PRQ-5110P/5120D
		ECC Mode 		 : Enabled
		SMC HW Revision 	 : Product 225W Passive CS

	Cores
		Total No of Active Cores : 60
		Voltage 		 : 1032000 uV
		Frequency		 : 1052631 kHz

	Thermal
		Fan Speed Control 	 : N/A
		Fan RPM 		 : N/A
		Fan PWM 		 : N/A
		Die Temp		 : 45 C

	GDDR
		GDDR Vendor		 : Elpida
		GDDR Version		 : 0x1
		GDDR Density		 : 2048 Mb
		GDDR Size		 : 7936 MB
		GDDR Technology		 : GDDR5
		GDDR Speed		 : 5.000000 GT/s
		GDDR Frequency		 : 2500000 kHz
		GDDR Voltage		 : 1501000 uV

Any advice would be greatly appreciated!

Thanks, Michael

Hi Michael,

This error message basically says that your application has crashed. That can have many causes, and it's hard to suggest anything without looking at the code.

Can you share the source code?

Thanks, Alexey

Hi,

As I said, it's the MonteCarlo Example from the SDK:

https://software.intel.com/en-us/vcsource/samples/monte-carlo/

But it also happens with every other OCL application I tried. All examples work fine on the CPU.

Best, Michael

The release notes for the OpenCL Runtime and the OpenCL SDK have CONFLICTING version requirements for the MPSS, as Michael H. empirically discovered.

In the SDK release notes:

"NOTE: For Intel Xeon Phi coprocessor device support, you must install the 3.2.1 version of Intel MPSS"

In the Runtime release notes:

"NOTE: Using OpenCL Runtime 14.1 with MPSS 3.2.1 is not recommended, as this combination introduces stability issues."

This needs to be resolved for people to use Intel's OpenCL on the Phi with any hope of success. I don't know what to ask my sysadmin to do in this case.

-- Tim

Is there any solution or workaround from Intel in development? Downgrading, for example? (Although I can't find the old runtimes and SDK anymore...)

I'm experiencing the same problem using OpenCL Runtime 14.1 and MPSS 3.2.1.

Does the above release note mean that with the currently available Intel packages it's NOT possible to run OpenCL code on the Xeon Phi?

Please do not use MPSS 3.2.1 for OpenCL; the two are known not to play nicely together. Please roll back to MPSS 3.2 or move forward to MPSS 3.2.3, which fixed this inconsistency.
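
For anyone who wants to check their own installation against the release-note warning quoted earlier in this thread, here is a minimal sketch. It assumes the micinfo output looks like the log in the first post (a line of the form "MPSS Version : x.y.z"); the function names and the KNOWN_BAD_MPSS set are made up for illustration and are not part of any Intel tool. It only flags the one combination named in the Runtime 14.1 release notes, so a clean result does not guarantee that OpenCL will work.

```python
import re

# MPSS versions flagged for OpenCL Runtime 14.1 in the release notes
# quoted in this thread. This set is an assumption for illustration only.
KNOWN_BAD_MPSS = {"3.2.1"}


def parse_mpss_version(micinfo_text):
    """Return the MPSS version string found in micinfo output, or None."""
    match = re.search(r"MPSS Version\s*:\s*(\S+)", micinfo_text)
    return match.group(1) if match else None


def is_flagged(version):
    """True if the version is one the release notes warn about."""
    return version in KNOWN_BAD_MPSS


# Abbreviated sample in the same layout as the micinfo log posted above.
sample = """
    System Info
        HOST OS            : Linux
        Driver Version     : 3.2.1-1
        MPSS Version       : 3.2.1
"""

version = parse_mpss_version(sample)
if version is None:
    print("Could not find an MPSS version line")
elif is_flagged(version):
    print("MPSS", version, "is flagged for OpenCL Runtime 14.1")
else:
    print("MPSS", version, "is not on the flagged list")
```

To use it on a real system, you could save the micinfo output to a file and pass the file contents to parse_mpss_version instead of the sample string.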

 

I just updated to 3.2.3 and reinstalled the OCL runtime and SDK, and I'm still getting the exact same error. Frustrating.

I will now downgrade to 3.2.

Hello,

We’ve found a critical issue in the latest release package of the OpenCL runtime for Xeon Phi devices.

We’re currently working on a fixed package, which will be released soon.

We’re truly sorry for the inconvenience and will do our best to upload the fixed package as soon as possible.

Thanks, everyone, for the great and important feedback,

Uri

This leads me to wonder what kind of QA is being done on this SDK.

Kind Regards,

Aaron

Hi All,

The issue has been fixed and the fixed package can be downloaded. Sorry for any inconvenience.

Thanks,
Raghu

Thank you so much, it works now!
