Linux freezes when running Python script using the Inference Engine

Hello,

we have a hard-to-pin-down problem with a Python script in which a CNN for license plate recognition runs using the Inference Engine. We are not sure what causes the problem, but it started when we upgraded from the alpha version of the Inference Engine to OpenVino 2018R5 (along with some code changes and a TensorFlow upgrade), so a connection to the Inference Engine seems likely.

Problem description:

We have a Python script that runs 3 different CNNs using the Inference Engine from OpenVino 2018R5 on images from Ethernet cameras, which are retrieved with OpenCV VideoCapture. In addition, ZMQ is used to pass results to other programs. The hardware is an Intel NUC7BNH, a NUC7DNH or a NUC8BEH (on the NUC8 no freeze has been observed so far). The OS is Ubuntu 16.04 with either the patched kernel 4.7.0.intel.r5.0 or kernel 4.15.0-15-generic (freezes happen less frequently with kernel 4.15). The script runs multiple times in separate Docker containers, together with programs in other Docker containers.

What happens is that Linux freezes randomly after some time (sometimes after a few minutes, sometimes after a few hours, while two machines have now been running for many days without a problem). When it freezes, no ACPI shutdown works, the screen freezes, and even the Magic SysRq keys have no effect. A strange side effect is that a lot of network traffic is generated (so much traffic that the network dies and no PC on the switch can communicate). The logs (kern.log, syslog) show nothing special.

If anyone has observed a similar problem or has an idea what could cause this behavior, please let me know.

Greetings,

Thomas


Hi Thomas. Seems like your problem has many moving parts! You mentioned that the problem started when you upgraded to OpenVino 2018R5, along with some code changes and a TensorFlow upgrade. Perhaps you should revert to your old configuration (which didn't have this freezing problem) and reintroduce the changes one by one to pinpoint exactly which one triggers the freezing.

We finally managed to get a configuration that freezes consistently. It's a NUC7i7BNH with Ubuntu 16.04 and kernel 4.7.0.intel.r5.0. We run 2 Docker containers, each running only the old Python script that previously did not freeze. The only difference from the non-freezing configurations is that the NEO driver from OpenVino 2018R5 is used instead of the one from the alpha version of the Inference Engine. (Note that other configurations freeze as well, but this one freezes consistently within 5 minutes.)

We also removed the TensorFlow parts of the Python script and the message sending via ZMQ, but we still observe the freezes.

However, when the Python script prints a lot of debug output, no freezes happen (or it just takes longer). So it's probably a timing-dependent problem. (Note: in one thread we read from the camera stream with OpenCV VideoCapture and push frames onto a queue, and in a second thread we run the CNNs.) I'll come back with more information once we are able to produce some meaningful debug output or logs.

The freeze also happens consistently with the OpenVino2018R5 framework. Attached is a minimal example that also freezes (same hardware, OS and driver).

 

from __future__ import unicode_literals

import os
import sys
import threading
import time
import traceback

import cv2
import numpy as np
from queue import Queue, Full

from prod import config
plugin_path = os.path.join(os.getcwd(), "plugins")
sys.path.insert(0, plugin_path)
from inference_engine import IENetwork, IEPlugin


def cam(cam_url, queue):
    print("opening cam: {}".format(cam_url))
    cap = cv2.VideoCapture(cam_url)

    while True:
        ok, frame = cap.read()
        if not ok:
            print("ERROR: could not read from cam")
            break

        try:
            queue.put_nowait((frame, time.time()))
        except Full:
            pass  # drop the frame if the queue is full
        except Exception:
            traceback.print_exc()


def worker(queue):
    try:
        net = IENetwork(model=os.path.join(str(config.PLATE_DIR), "pnet.xml"),
                        weights=os.path.join(str(config.PLATE_DIR), "pnet.bin"))
        plugin = IEPlugin(device="GPU")
        network = plugin.load(network=net)
        while True:
            image, img_timestamp = queue.get()
            if len(image.shape) == 3:
                image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
            image = np.expand_dims(cv2.resize(image, (360, 446)), 0)
            try:
                out = network.infer({"data": image.astype(np.float32)})
            except Exception:
                print("infer failed")
    except Exception:
        traceback.print_exc()
    finally:
        sys.stderr.write("error in worker")
        sys.stderr.flush()
        os._exit(3)


def main():
    try:
        frame_queue = Queue(maxsize=3)

        cam_thread = threading.Thread(target=cam, args=(config.CAM_STREAM, frame_queue))
        cam_thread.daemon = True

        worker_thread = threading.Thread(target=worker, args=(frame_queue, ))
        worker_thread.daemon = True

        worker_thread.start()
        cam_thread.start()

        while True:
            cam_thread.join(1.0)
            if not cam_thread.is_alive():
                break
    except Exception:
        traceback.print_exc()
    finally:
        sys.stderr.write("error main function (join cam thread)")
        sys.stderr.flush()
        os._exit(2)


if __name__ == "__main__":
    main()

 

Dear Thomas, R5.0.1 is now available. Can you download and try it, or is that the version you are currently using? The NEO driver is for OpenCL. That driver is now open source: https://github.com/intel/compute-runtime

Perhaps having access to the NEO source code will help you narrow down the issue?

It sounds like you've narrowed it down to a NEO driver issue and the freezing you describe sure sounds like a driver issue.

It would be a good idea to post your inquiry in the OpenCL forum as well:

https://software.intel.com/en-us/forums/opencl

Shubha

Hello Thomas,

Just wondering, is the issue seen only with the GPU device (clDNN plug-in)?

Could you possibly let us know if you can repro on the CPU (mkldnn plug-in)?

plugin = IEPlugin(device=str("CPU"))

Thank you,

Nikos

 

Hello Shubha, hello Nikos,

R5.0.1 is now available. Can you download it and try it - or is that the version you are currently using ?

R5.0.1 freezes as well. We also tried an old alpha version, R5.0.0, and a self-compiled version with a newer clDNN version (https://github.com/accessio-gmbh/dldt).

Just wondering, is the issue only with GPU device (clDNN plug-in) ?

Could you possibly let us know if you can repro on CPU (mkldnn plug-in) ?

CPU mode caused no problems in a 6-hour test run, while in GPU mode it freezes within a few minutes.

I also created a thread in the OpenCL forum: https://software.intel.com/en-us/forums/opencl/topic/804936

Hi Thomas, 

Thank you for confirming there are no CPU MKLDNN issues.

As Shubha mentioned, clDNN is open source, so if you like you could further debug the issue with your own debug build of clDNN Drop 9.1:

git clone https://github.com/intel/clDNN
cd clDNN
git checkout 049acb9ddac8d3ca11dae8513b9c16e2b7f9e53a
mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=Debug ..
make -j"$(nproc)"
./out/Linux64/Debug/tests64
# copy the new clDNN library and replace the old one
cp out/Linux64/Debug/libclDNN64.so ~/some_new_place

Cheers,

nikos

Hi nikos,

Our build uses clDNN Drop 12.1 instead of the Drop 9.1 shipped in R5.0.0 (and also in R5.0.1?), but we still observe the same problem.

Here a list of what we tested:

Full software (reading images from an RTSP stream, running multiple CNNs using the Inference Engine in a thread separate from the image reading, communicating with other parts of the software via ZMQ), running multiple times in separate Docker containers:

- with Alpha version of the Inference Engine, kernel 4.7.0.intel.r5.0:

   - driver from the Alpha version: no freezes

   - newer driver: freezes

- with OpenVino R5.0.0 and driver from the Alpha version:

   - with kernel 4.7.0.intel.r5.0: freezes

   - with kernel 4.15.0-15-generic on NUC7: freezes, but seems to happen not as often as with kernel 4.7.0.intel.r5.0

- with OpenVino R5.0.0 and newer driver:

   - with kernel 4.7.0.intel.r5.0: freezes

   - with kernel 4.15.0-15-generic on NUC7: freezes, but seems to happen not as often as with kernel 4.7.0.intel.r5.0

   - with kernel 4.15.0-15-generic on NUC8: no freezes observed (yet)

- with OpenVino R5.0.1 or own Inference Engine Build using clDNN Drop 12.1: freezes

- with updated OpenCV (4.0.0) and libraries from the Ubuntu 18.04 repo instead of the older ones from the Ubuntu 16.04 repo: freezes

Minimal script (the script above and variations of it): "not freezing" means it did not freeze within a few hours; tests were run on the NUC7i7BNH with Ubuntu 18.04 and kernel 4.7.0.intel.r5.0 (we will start testing those with kernel 4.15 today)

- test script from above, 2 Docker containers with one script running per container: freezes within a few minutes

- test script from above, only 1 Docker container: not freezing

- static image instead of reading from stream: not freezing

- only reading from stream (no CNNs): not freezing

- CPU mode in the Inference Engine: not freezing

Based on those tests, it seems the problem only happens when combining threads/processes running CNNs with the Inference Engine on the GPU with threads reading images from an RTSP stream using OpenCV. There might be a locking problem, but this is just a wild guess.

We don't have the time (and maybe the skills) to debug the libraries and the driver, but if you want to take a look into it, we can build a docker image for you.

Cheers,

Thomas

Hi Thomas,

Could you let me know if this is still relevant for you? I got a NUC7i7BNB and would like to try to reproduce your problem. If it is still relevant, could you build a Docker image and provide access to it?

Thanks,
Stepan.
 

Hi Stepan

The problem is still relevant, even more so now, because we have also observed some freezes on the NUC8 with kernel 4.15 (which is problematic, as the old, non-freezing version of the Inference Engine with the old driver does not run on the new NUC).

Freezes seem to happen more often on the 4.7.0.intel.r5.0 kernel, so it's probably a good idea to investigate the problem there. However, freezes have now been observed on all our setups (NUC7i3 and NUC7i7 with both kernel 4.7.0.intel.r5.0 and 4.15, NUC8i3 with kernel 4.15). Freezes also happened with the OpenVino R5.0.1 build and different NEO driver versions (18.28.11080, 19.08.12439).

A freezing Docker image can be found here: https://hub.docker.com/r/accessio/zombie

In the attached archive you can find the Dockerfiles used to create the Docker image. The image is created by:
1. building with Dockerfile_opencv (adds OpenCV 4.0.1; the freeze also happened when we used an older OpenCV version)
2. building with Dockerfile_caffe (adds Caffe, though this is not used in the freeze scripts)
3. building with Dockerfile_opencl (installs the NEO driver and builds the Inference Engine from source; our Inference Engine fork just uses a newer version of clDNN without a "memory leak")
4. building with Dockerfile (installs some Python packages, adds model and source files)

Tests were run with the docker-compose.yml provided in the archive (some tests with a modified version, as stated). We ran one Docker container if not stated otherwise.

The scripts stop after some time (except cam0.py), as freezing was observed to be more likely shortly after startup. So either set the restart always flag in the docker-compose.yml or run the container in a loop.
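Such a restart loop can be sketched, for example, in Python (just an illustration; the docker command in the comment is an example, and the restart always flag in docker-compose.yml achieves the same):

```python
import subprocess
import sys

def run_in_loop(cmd, max_restarts):
    """Re-run `cmd` every time it exits; collect the exit codes."""
    codes = []
    for _ in range(max_restarts):
        codes.append(subprocess.run(cmd).returncode)
    return codes

if __name__ == "__main__":
    # e.g. run_in_loop(["docker", "run", "--rm", "accessio/zombie"], 1000)
    print(run_in_loop([sys.executable, "-c", "pass"], 3))
```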

You can also find test scripts other than the one provided with the Docker image in the attached archive:
cam0.py: most similar to our production code; reads images from the stream and puts them on a queue in one thread, gets images from the queue and runs infer in another thread
    -> freezes when running 4 Docker containers with this script
    -> freezes when running this script 4 times within the same Docker container
    -> freezes when running 4 Docker containers with this script, Docker with "--privileged=true" and "cap_add: -ALL" and no "/dev/dri" device
    -> freezes when running 4 Docker containers with this script, Docker with "--network host"
cam1.py: same as cam0.py, but with 4 image-reading threads and 4 inference threads
    -> freezes
cam2.py: creates a random image instead of reading from the stream, otherwise the same as cam1.py
    -> freezes

cam5.py: random image generation and inference in one thread, no stream reading, no queue
    -> does not freeze (at least within 20 hours)
cam6.py: same code as cam1.py, except that the infer function is never called
    -> does not freeze (at least within 20 hours)
cam7.py: lock around queue.put, queue.get and infer
    -> does not freeze (at least within 20 hours)
cam8.py: same as cam5.py, but with multiple threads doing image generation and infer
    -> does not freeze (at least within 20 hours)

Thanks for looking into that problem. I hope you can find the problem and a solution for it.

Greetings,

Thomas

 

Edit: updated archive to fix some issues in the scripts

Attachment: zombie_files.zip (14.91 KB)

We found the following problem in /var/log/syslog when booting with the drm.debug=0x1e log_buf_len=1M boot parameters. This might or might not be connected to our problem. We are now testing kernel 4.19, which should fix the failed page release.

[   24.479286] ------------[ cut here ]------------                                                                                                                                                             
[   24.479289] Failed to release pages: bind_count=1, pages_pin_count=1, pin_global=0                                                                                   
[   24.479398] WARNING: CPU: 0 PID: 205 at /build/linux-uQJ2um/linux-4.15.0/drivers/gpu/drm/i915/i915_gem_userptr.c:89 cancel_userptr+0xe8/0xf0 [i915]                                                                                             
[   24.479399] Modules linked in: veth xt_nat ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat nf_nat_ipv4 br_netfilter bridge stp llc snd_hda_codec_hdmi aufs overlay ip6t_REJECT nf_reject_ipv6 nf_log_ipv6 xt_hl ip6t_rt nf
_conntrack_ipv6 nf_defrag_ipv6 ipt_REJECT nf_reject_ipv4 nf_log_ipv4 nf_log_common xt_LOG intel_rapl x86_pkg_temp_thermal intel_powerclamp xt_limit coretemp xt_tcpudp xt_addrtype kvm_intel kvm irqbypass crct10dif_pclmul snd_soc_skl crc32_pclmul snd_soc_skl_ipc arc4 ghash_cl
mulni_intel snd_hda_ext_core pcbc snd_soc_sst_dsp snd_soc_sst_ipc snd_soc_acpi snd_soc_core snd_compress aesni_intel ac97_bus snd_pcm_dmaengine aes_x86_64 btusb crypto_simd rtsx_pci_ms btrtl glue_helper snd_hda_codec_realtek iwlmvm snd_hda_codec_generic cryptd mac80211
[   24.479431]  nf_conntrack_ipv4 btbcm nf_defrag_ipv4 input_leds xt_conntrack memstick btintel bluetooth intel_cstate iwlwifi intel_rapl_perf ecdh_generic snd_hda_intel wmi_bmof intel_wmi_thunderbolt snd_hda_codec ip6table_filter ip6_tables snd_hda_core snd_hwdep snd_pcm s
nd_timer snd soundcore mei_me nf_conntrack_netbios_ns nf_conntrack_broadcast cfg80211 acpi_pad shpchp mei intel_pch_thermal mac_hid nf_nat_ftp nf_nat nf_conntrack_ftp nf_conntrack libcrc32c sch_fq_codel iptable_filter ip_tables x_tables autofs4 hid_generic usbhid hid i915 i
2c_algo_bit rtsx_pci_sdmmc e1000e drm_kms_helper syscopyarea sysfillrect ptp sysimgblt pps_core fb_sys_fops rtsx_pci drm ahci libahci wmi video                                                                                
[   24.479463] CPU: 0 PID: 205 Comm: kworker/u8:3 Not tainted 4.15.0-45-generic #48-Ubuntu                                              
[   24.479464] Hardware name: Intel(R) Client Systems NUC8i3BEH/NUC8BEB, BIOS BECFL357.86A.0066.2019.0225.1641 02/25/2019                 
[   24.479485] Workqueue: i915-userptr-release cancel_userptr [i915]                                                                                                                                            
[   24.479502] RIP: 0010:cancel_userptr+0xe8/0xf0 [i915]                                                                             
[   24.479503] RSP: 0018:ffffb20a80a7fe60 EFLAGS: 00010282                                                                          
[   24.479504] RAX: 0000000000000000 RBX: ffff88e8f2448000 RCX: 0000000000000006                                                     
[   24.479505] RDX: 0000000000000007 RSI: 0000000000000082 RDI: ffff88e8fdc16490                                                     
[   24.479506] RBP: ffffb20a80a7fe78 R08: 0000000000000001 R09: 0000000000000884                                                    
[   24.479507] R10: fffff88345c66b00 R11: 0000000000000000 R12: ffff88e8f24481a8                                                    
[   24.479507] R13: 0000000000000000 R14: ffff88e8f1991500 R15: 0000000000000000                                                 
[   24.479509] FS:  0000000000000000(0000) GS:ffff88e8fdc00000(0000) knlGS:0000000000000000                                     
[   24.479510] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033                                                                            
[   24.479510] CR2: 00007f5f93e90f18 CR3: 000000001300a006 CR4: 00000000003606f0                                                   
[   24.479511] Call Trace:                                                                                                                                       
[   24.479517]  process_one_work+0x1de/0x410                                                                                       
[   24.479519]  worker_thread+0x32/0x410                                                                                        
[   24.479521]  kthread+0x121/0x140                                                                                                        
[   24.479522]  ? process_one_work+0x410/0x410                                                                                 
[   24.479524]  ? kthread_create_worker_on_cpu+0x70/0x70                                                                  
[   24.479527]  ret_from_fork+0x35/0x40                                                                                                   
[   24.479528] Code: bf 46 ff ff eb c9 8b 93 c8 01 00 00 8b 8b a4 01 00 00 48 c7 c7 40 04 5f c0 8b b3 9c 01 00 00 c6 05 b5 56 10 00 01 e8 e8 6a 17 f5 <0f> 0b eb bc 0f 1f 40 00 0f 1f 44 00 00 55 ba 08 00 00 00 48 89 
[   24.479553] ---[ end trace e5e74074bfad4c2e ]---  

We just had a freeze with a NUC8i3 and kernel 4.19, so the newer kernel does not fix the problem.

Do you have any updates on the problem? Did you try the docker image and is it also freezing for you?

Greetings,

Thomas

Hi Thomas,

Sorry for the late reply. Yes, I used your image to set up the system. And I have a question about the cam2.py script: according to my experiment, the generated images are not suitable for the network provided with the image; I see the following error:
    could not broadcast input array from shape (1,446,360) into shape (1,3,62,62)

As I understand it, you use several CNNs, and it looks like the script was tested with a different CNN (which wasn't included in the image). Is it OK to update the script according to the network requirements? I mean, can the update affect the reproducibility of the problem?

Minor question: docker-compose.yml mentions several env files and volumes; are they needed to set up the correct environment?

Thanks,
Stepan.

Hi Stepan,

some mistakes were made when adapting the "zombie" scripts for the zombie Docker container. I attached an archive with the fixed scripts and docker-compose.yml (fixed shapes and removed the unnecessary stuff from the docker-compose.yml).

Thank you for looking into our problem,

Thomas

Attachment: zombie_files_fixed.zip (14.91 KB)

Hi Thomas,

OK, currently I'm running 4 Docker containers with the cam2.py script. As you specified in readme.txt, the containers are run in a loop.
To understand your reproducer better: could you clarify why you need the counter (count variable) that stops the cam thread after 30*90 iterations?

BTW, how much RAM do you have on your NUCs?

Thanks,
Stepan.

Hi Stepan,

We observed that the freeze is more likely to happen shortly after starting a container, so the counter was introduced to trigger container restarts.

We have 4GB RAM.

Greetings,

Thomas.

Hi Thomas,

The cam2.py script has been running for several hours and I don't observe any freeze. However, my NUC has 16 GB of RAM, and as far as I can see, running 4 Docker containers takes approx. 10 GB of RAM. As far as I know, a lack of memory may cause a GPU hang and reset; in that case dmesg provides info like the following:
[11014.517959] [drm] GPU HANG: ecode 9:0:0x8ed1fff2, in sample [5587], reason: No progress on rcs0, action: reset
[11014.517978] i915 0000:00:02.0: Resetting rcs0 after gpu hang

Of course, I'm not sure this is your case (taking into account that you didn't have such an issue before the software upgrade), so let me try to launch more Docker containers.

If you have an opportunity to increase the RAM size on your NUCs, it would be great to know your results.

Thanks,
Stepan

Hi Stepan,

For us, one container uses about 150 MB of RAM, so memory is not the problem.

Which kernel version do you use? The consistent freezes happen only with kernel 4.7.0.intel.r5.0; on other kernels it may take a few days to freeze.

Currently we suspect that the problem is connected with the sleep states of the CPU. We are testing with "intel_idle.max_cstate=1" on kernel 4.19, and for a week we had no problems (though we observed no freezes on the NUC8 for over a month, so we cannot say this fixes the problem).
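(Side note: to check that the parameter actually took effect after a reboot, it can be read back from sysfs; a minimal sketch, assuming the intel_idle driver is in use:)

```python
from pathlib import Path

def read_max_cstate(sysfs="/sys/module/intel_idle/parameters/max_cstate"):
    """Return the intel_idle max_cstate value as a string, or None if not exposed."""
    p = Path(sysfs)
    return p.read_text().strip() if p.exists() else None

print("intel_idle max_cstate:", read_max_cstate())
```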

We also filed a kernel bug report (https://bugs.freedesktop.org/show_bug.cgi?id=110334) and are currently also testing with kernel 5.1.0-rc4.

Greetings,
Thomas.

Hi Thomas,

I'm testing kernel 4.15. OK, I will try kernel 4.7.0.intel.r5.0.

Thanks,
Stepan.

Hi Thomas,

We have now been running cam2.py for about a week, and no freeze has been observed. Maybe you can provide some more info? Like BIOS settings, env output before running Docker, the Docker version (the one we use is 18.09.6, build 481bc77), maybe even the exact command line for running Docker.
After 2 days we increased the number of running containers from 4 to 16, but it still didn't help.
We're using a NUC7i7BNH (KBL i7-7567U), Ubuntu 16.04.6 LTS x86_64 with 4.7.0.intel.r5.0. We have 16 GB of RAM, but I can confirm that even with 16 containers only about 2.5 GB is used.

Thanks,
Sergey.

Hi,

sorry for not answering for some time. We thought "intel_idle.max_cstate=1" on kernel 4.19 fixed the problem; however, we recently observed the problem again.

If you want to look into the problem again, we prepared an image of the whole 120 GB SSD, which produces the freeze consistently on a NUC7i3DNH and a NUC7i5DNH within an hour. We also observed a freeze on a NUC7i7BNH with this image, but it took more time to freeze there.

Here you can find the image: https://drive.google.com/file/d/1gGz-92hfzjaDLK1IC0kQYchIuUimonSv/view?usp=sharing
We flashed the image on a 120GB SSD with: gunzip -c ZombieDisk.iso.gz | dd of=/dev/sda

username: alpr
password: dev

To start the "zombie" creation, just execute "./startthezombie.sh" in the home directory. You don't need a network connection or anything else for this.

In the attached zip archive are our BIOS settings (images of the settings and a profile file, which it may be possible to load).

If you have any question, feel free to ask.

Greetings,
Thomas

Attachment: BIOS_SETTINGS.zip (620.35 KB)

Hi,

at the moment we are trying the "acpi=off" kernel parameter. So far (half a week on 4 NUCs) we have gotten no zombie. This looks promising, but other kernel parameters achieved the same at first and zombied later.

We also found another (probably related) problem: processes using OpenCL get stuck in kernel space. The system stays responsive in this case for some time when not doing anything, but it may eventually become unresponsive (e.g. when starting more containers; maybe it's triggered by a lot of disk reads/writes?), with a high load (but low CPU usage, so maybe the load is created by disk usage). Attached is the syslog output (you can see there where the processes stopped).

Regards,

Thomas

Attachment: crash_syslog.txt (19.97 KB)
