Linux freezes when running Python script using the OpenVino Inference Engine in GPU mode

Hello

We have a problem where a Python script using the OpenVino Inference Engine in GPU mode puts Linux into a "zombie mode": the PC does not react to anything (ACPI shutdown does not work, the screen freezes, and even the Magic SysRq keys have no effect), and it generates more network traffic than 100 Mbit switches can handle. After some investigation, a problem with the Intel GPU driver seems likely. See https://software.intel.com/en-us/node/804844?page=0

Problem description:

We are running the attached Python script several times, each instance in a separate Docker container. The hardware is an Intel NUC7BNH, NUC7i7DNH, or NUC8BEH (no freeze has been observed on the NUC8 so far). The OS is Ubuntu 16.04 with either the patched kernel 4.7.0.intel.r5.0 or kernel 4.15.0-15-generic (freezes happen less frequently with kernel 4.15).

Linux freezes randomly after some time: with the NUC7i7DNH and the patched kernel 4.7.0.intel.r5.0 it happens after a few minutes, while with the 4.15 kernel it takes a few hours or even days until the freeze happens. When it freezes, ACPI shutdown does not work, the screen freezes, and even the Magic SysRq keys have no effect. A strange side effect is that a lot of network traffic is generated (so much that the network dies and no PC on the switch can communicate). The logs (kern.log, syslog) show nothing special.

If anyone has observed a similar problem or has an idea what could cause this behaviour, please let me know.

Greetings,

Thomas

Attachment: script.txt (text/plain, 2.25 KB)

I see that a patched kernel is being used (4.7.0.intel.r5.0).

Can you try with a plain kernel without any patches?

The driver works best on plain Linux kernels; legacy patches may cause problems like the ones you observe.


Hi Michal,

The OpenVino Inference Engine needs either kernel 4.14 or newer, or a patched kernel. Because we want to avoid remote kernel updates, we tested with the patched kernel. However, the problem also occurs with kernel 4.15.0-15-generic.

Here is a list of what we tested:

Full software (reads images from an RTSP stream and runs multiple CNNs with the Inference Engine in a thread separate from the image reading; communicates with other parts of the software via ZMQ), with multiple instances running in separate Docker containers:

- with Alpha version of the Inference Engine, kernel 4.7.0.intel.r5.0:

   - driver from the Alpha version: no freezes

   - newer driver: freezes

- with OpenVino R5.0.0 and driver from the Alpha version:

   - with kernel 4.7.0.intel.r5.0: freezes

   - with kernel 4.15.0-15-generic on NUC7: freezes, but apparently less often than with kernel 4.7.0.intel.r5.0

- with OpenVino R5.0.0 and newer driver:

   - with kernel 4.7.0.intel.r5.0: freezes

   - with kernel 4.15.0-15-generic on NUC7: freezes, but apparently less often than with kernel 4.7.0.intel.r5.0

   - with kernel 4.15.0-15-generic on NUC8: no freezes observed (yet)

- with OpenVino R5.0.1 or our own Inference Engine build using clDNN Drop 12.1: freezes

- with updated OpenCV (4.0.0) and libraries from the Ubuntu 18.04 repo instead of the older ones from the Ubuntu 16.04 repo: freezes

Minimal script (the script above and variations of it). "Not freezing" means it did not freeze within a few hours. The tests were run on the NUC7i7BNH with Ubuntu 18.04 and kernel 4.7.0.intel.r5.0 (we will start testing these with kernel 4.15 today):

- test script from above, 2 Docker containers with one script running per container: freezes within a few minutes

- test script from above, only 1 Docker container: not freezing

- static image instead of reading from stream: not freezing

- only reading from stream (no CNNs): not freezing

- CPU mode in the Inference Engine: not freezing
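A minimal sketch of the two-container reproduction: it builds one `docker run` command per container, with one script instance each. The image name `openvino-test:latest`, the script path `/app/script.py`, and the container names are hypothetical placeholders. Passing `/dev/dri` with `--device` is an assumption about how the GPU is exposed to the containers; the thread does not say how this was done. The code only prints the commands (a dry run).

```python
import shlex
from typing import List

# Hypothetical placeholders: the real image name and script path are not given.
IMAGE = "openvino-test:latest"
SCRIPT = "/app/script.py"


def docker_run_cmd(index: int, image: str = IMAGE, script: str = SCRIPT) -> List[str]:
    """Build the `docker run` command for one test container.

    `--device /dev/dri` passes the host's DRI (GPU) device nodes into the
    container; this is an assumption about the setup, adjust as needed.
    """
    return [
        "docker", "run", "--rm", "--detach",
        "--name", f"ie-test-{index}",
        "--device", "/dev/dri",
        image,
        "python3", script,
    ]


def reproduction_cmds(containers: int = 2) -> List[List[str]]:
    """One command per container, one script instance per container."""
    return [docker_run_cmd(i) for i in range(containers)]


if __name__ == "__main__":
    # Dry run: print the shell commands instead of executing them.
    for cmd in reproduction_cmds(2):
        print(shlex.join(cmd))
```

Passing each list to `subprocess.run(cmd, check=True)` would start the containers. With `containers=1` this gives the single-container variant, which did not freeze in the tests.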

Based on these tests, the problem seems to occur only when threads/processes running CNNs with the Inference Engine in GPU mode are combined with threads reading images from an RTSP stream using OpenCV. There might be a locking problem, but this is just a wild guess.
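The combination that freezes (one thread reading frames, another running CNN inference) can be sketched as a producer/consumer pipeline. This is a minimal sketch, not the attached script. `read_frame` and `infer` are hypothetical stand-ins for the OpenCV RTSP read and the Inference Engine GPU call. The optional `serialize` flag makes both threads take one shared lock around their calls, which is one way to probe the locking guess within a single process. Note that a `threading.Lock` does not coordinate anything between separate containers.

```python
import queue
import threading
from typing import Callable, List, Optional

# Hypothetical stand-ins: in the real script, read_frame would wrap an
# OpenCV RTSP read and infer would call the Inference Engine in GPU mode.


def run_pipeline(read_frame: Callable[[], Optional[object]],
                 infer: Callable[[object], object],
                 serialize: bool = False,
                 max_queue: int = 4) -> List[object]:
    """Read frames in one thread and run inference on them in another.

    read_frame returns None at the end of the stream. If serialize is True,
    both threads hold the same lock while calling read_frame / infer, so
    the two calls never overlap inside this process.
    """
    frames = queue.Queue(maxsize=max_queue)
    results: List[object] = []
    lock = threading.Lock() if serialize else None

    def guarded(fn, *args):
        # Call fn, holding the shared lock only if serialization is enabled.
        if lock is None:
            return fn(*args)
        with lock:
            return fn(*args)

    def reader() -> None:
        while True:
            frame = guarded(read_frame)
            frames.put(frame)  # blocks while the queue is full
            if frame is None:  # end-of-stream sentinel for the worker
                return

    def worker() -> None:
        while True:
            frame = frames.get()
            if frame is None:
                return
            results.append(guarded(infer, frame))

    threads = [threading.Thread(target=reader), threading.Thread(target=worker)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    # Fake stream of five "frames" and a fake model that doubles each one.
    source = iter(range(5))
    print(run_pipeline(lambda: next(source, None), lambda f: f * 2, serialize=True))
    # [0, 2, 4, 6, 8]
```

The lock is held only around the read and inference calls, not around the queue operations, so the two threads cannot deadlock on it.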

If someone wants to look into it, we can build a Docker image for you.

Cheers,

Thomas
