Linux & inter-queues dependency

Romain Dolbeau

Hello,

I'm trying the Linux version of the SDK with some pre-existing code, and I've encountered what I think is a bug: whenever the code uses inter-queue dependencies, it hangs.

Data and kernel commands are enqueued in different command queues (to maximize potential overlap). The commands generate events, and queues are made to wait on some of those events to enforce correct ordering. The code uses clEnqueueWaitForEvents() and clEnqueueMarker() quite a lot.

If I don't use those calls, the code runs fine (but is wrong, as proper ordering is no longer enforced). If I do use them, it locks up. The 'main' thread is sleeping in clFinish():

#####
Thread 1 (Thread 0x7f904a355700 (LWP 27795)):
#0 0x00007f9049c60071 in nanosleep () from /lib/libc.so.6
#1 0x00007f9049c8b534 in usleep () from /lib/libc.so.6
#2 0x00007f9047a2d4a7 in Intel::OpenCL::Framework::OclEvent::WaitYield () from /soft/tools/intel/OpenCL_SDK_1.1_Beta_20110510/usr/lib64/OpenCL/vendors/intel/libintelocl.so
#3 0x00007f9047a2f2a0 in Intel::OpenCL::Framework::IOclCommandQueueBase::WaitForCompletion () from /soft/tools/intel/OpenCL_SDK_1.1_Beta_20110510/usr/lib64/OpenCL/vendors/intel/libintelocl.so
#4 0x00007f9047a299b7 in Intel::OpenCL::Framework::ExecutionModule::Finish () from /soft/tools/intel/OpenCL_SDK_1.1_Beta_20110510/usr/lib64/OpenCL/vendors/intel/libintelocl.so
(...)
#####

while the 'worker' threads are all waiting for something to happen:

#####
Thread 2 (Thread 0x411bd950 (LWP 27798)):
#0 0x00007f9049572d29 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
#1 0x00007f9047484589 in tbb::internal::rml::private_worker::run () from /soft/tools/intel/OpenCL_SDK_1.1_Beta_20110510/usr/lib64/OpenCL/vendors/intel/libtbb.so.2
#2 0x00007f90474843c6 in tbb::internal::rml::private_worker::thread_routine () from /soft/tools/intel/OpenCL_SDK_1.1_Beta_20110510/usr/lib64/OpenCL/vendors/intel/libtbb.so.2
#3 0x00007f904956efc7 in start_thread () from /lib/libpthread.so.0
#4 0x00007f9049c9164d in clone () from /lib/libc.so.6
#5 0x0000000000000000 in ?? ()
#####

This code is known to work with the AMD SDK (tested on GPU only), NVidia SDK (GPU), and IBM SDK (tested on CPU only), so I believe it isn't the code that is wrong.

Are there known limitations of this kind in the current Alpha on Linux, and if so, where are they documented? Or is it a bug?

Thanks for your help,

Doron.Singer (Best Reply)

Thank you for the report. One thing to watch out for is that clFinish only flushes (as clFlush would) the queue you call it on. Say you have two command queues, A and B. You enqueue command C1 on A and C2 on B, where C2 depends on C1, and then you call clFinish on queue B. You will most probably get a deadlock: queue A was never flushed, so C1 was never submitted for execution and C2 can never start.
A simple way to check whether this is the issue you're experiencing is to use clWaitForEvents, which is defined to implicitly flush the queues containing the commands associated with the events in its wait list. However, as you've probably seen in the Optimization Guide, clFinish is preferable to clWaitForEvents.
So, if this is really the issue that's stalling your program, one of two solutions should be considered:
a) Flushing all the relevant queues before calling clFinish.
b) Using an Out of Order queue instead of multiple In-Order queues. This is highly recommended if you can afford the time to restructure your code, as performance is expected to be better than with many in-order queues.
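
The A/B scenario and solution (a) can be sketched with a toy model in plain C. This is not OpenCL code and not the Intel runtime's implementation: the types ToyCmd and ToyQueue and the functions toy_enqueue, toy_flush, toy_step, toy_drained, toy_finish and toy_finish_b are all hypothetical names invented for illustration. The model assumes a runtime that submits commands only on a flush, and it reports a stall by returning false where a real clFinish would block.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_CMDS 4

/* A toy command; wait_on may point at a command in another queue. */
typedef struct ToyCmd {
    const struct ToyCmd *wait_on; /* NULL when there is no dependency */
    bool submitted;               /* handed to the device by a flush */
    bool done;                    /* executed by the device */
} ToyCmd;

/* A toy in-order queue that defers submission until it is flushed. */
typedef struct {
    ToyCmd *cmds[TOY_MAX_CMDS];
    size_t count;
} ToyQueue;

static void toy_enqueue(ToyQueue *q, ToyCmd *c, const ToyCmd *wait_on) {
    assert(q->count < TOY_MAX_CMDS);
    c->wait_on = wait_on;
    c->submitted = false;
    c->done = false;
    q->cmds[q->count++] = c;
}

/* Stands in for clFlush: submits what is enqueued on this queue only. */
static void toy_flush(ToyQueue *q) {
    for (size_t i = 0; i < q->count; i++) q->cmds[i]->submitted = true;
}

/* Device pass over one queue: run commands in order and stop at the first
   one that is unsubmitted or still waiting. Returns true on progress. */
static bool toy_step(ToyQueue *q) {
    bool progress = false;
    for (size_t i = 0; i < q->count; i++) {
        ToyCmd *c = q->cmds[i];
        if (c->done) continue;
        if (!c->submitted || (c->wait_on && !c->wait_on->done)) break;
        c->done = true;
        progress = true;
    }
    return progress;
}

static bool toy_drained(const ToyQueue *q) {
    for (size_t i = 0; i < q->count; i++)
        if (!q->cmds[i]->done) return false;
    return true;
}

/* Stands in for clFinish on `target`: flushes only `target`, then lets the
   device work on every queue. Returns false where the real call would
   block forever because no queue can make progress. */
static bool toy_finish(ToyQueue *target, ToyQueue *all[], size_t n) {
    toy_flush(target);
    while (!toy_drained(target)) {
        bool progress = false;
        for (size_t i = 0; i < n; i++)
            if (toy_step(all[i])) progress = true;
        if (!progress) return false;
    }
    return true;
}

/* The scenario above: C1 on A, C2 on B, C2 waits on C1, finish on B. */
bool toy_finish_b(bool flush_a_first) {
    ToyCmd c1, c2;
    ToyQueue a = { {0}, 0 }, b = { {0}, 0 };
    ToyQueue *all[] = { &a, &b };
    toy_enqueue(&a, &c1, NULL);
    toy_enqueue(&b, &c2, &c1);
    if (flush_a_first) toy_flush(&a); /* solution (a) */
    return toy_finish(&b, all, 2);
}
```

In this model, toy_finish_b(false) returns false (C1 is never submitted, so C2 can never run), while toy_finish_b(true) returns true. In real host code the corresponding fix would be a clFlush on queue A before the clFinish on queue B.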

Please let us know if indeed the problem was a missing clFlush - if not, could you post the host code you're using, so we can try reproducing the issue?

Thanks,
Doron Singer

Romain Dolbeau
Quoting Doron.Singer: a) Flushing all the relevant queues before calling clFinish.

Whoa, that was fast :-) That was the answer.

I first thought "can't be that", because the code always flushes after a command... but indeed, it doesn't flush after one of the clEnqueueMarker() calls. Adding a clFlush() after that one solves the problem.

I'll have to re-read the specification to see what it says, and whether my code is legal or not (if it is legal, this is still a bug in the SDK). I suspect it's undefined behavior and I've just been lucky so far.

Thanks for the help,

Doron Singer (Intel)

I can save you the read: commands are only guaranteed to have been sent to the executing device after a call to clFlush() returns (or implicit flushing occurs). They're allowed to be executed before that, as well, which seems to be what other vendors are doing.
I'm glad to hear your problem was solved. I'd be interested to hear about your experience if you decide to try switching to an out-of-order command queue.
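
The guarantee described above leaves room for more than one runtime behavior, which may explain why the same host code ran elsewhere. The sketch below is a toy model in plain C, not OpenCL code: SubmitPolicy, PolicyQueue, pq_enqueue, pq_flush and marker_submitted are hypothetical names, and which policy any particular vendor's runtime actually follows is an assumption made only for illustration.

```c
#include <stdbool.h>

/* Two submission policies that both fit the guarantee quoted above. */
typedef enum {
    SUBMIT_ON_FLUSH,   /* commands reach the device only when flushed */
    SUBMIT_ON_ENQUEUE  /* commands reach the device as soon as enqueued */
} SubmitPolicy;

typedef struct {
    SubmitPolicy policy;
    int enqueued;   /* commands the host has enqueued so far */
    int submitted;  /* how many of them the device is allowed to run */
} PolicyQueue;

static void pq_enqueue(PolicyQueue *q) {
    q->enqueued++;
    if (q->policy == SUBMIT_ON_ENQUEUE) q->submitted = q->enqueued;
}

/* Stands in for clFlush: everything enqueued so far becomes runnable. */
static void pq_flush(PolicyQueue *q) { q->submitted = q->enqueued; }

/* Host sequence modelled on the thread: every command is flushed except,
   optionally, a trailing marker. Returns true if the marker reached the
   device, i.e. a waiter on another queue could eventually proceed. */
bool marker_submitted(SubmitPolicy policy, bool flush_after_marker) {
    PolicyQueue q = { policy, 0, 0 };
    pq_enqueue(&q);                        /* stands in for a kernel or copy */
    pq_flush(&q);
    pq_enqueue(&q);                        /* stands in for clEnqueueMarker */
    if (flush_after_marker) pq_flush(&q);  /* the missing clFlush */
    return q.submitted == q.enqueued;
}
```

Without the extra flush, marker_submitted(SUBMIT_ON_FLUSH, false) returns false while marker_submitted(SUBMIT_ON_ENQUEUE, false) returns true. With the flush, both policies return true, which is why code that relies on an explicit clFlush is the portable choice.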

Doron

Romain Dolbeau
Quoting Doron Singer: I can save you the read: commands are only guaranteed to have been sent to the executing device after a call to clFlush() returns (or implicit flushing occurs). They're allowed to be executed before that, as well, which seems to be what other vendors are doing.
I'm glad to hear your problem was solved. I'd be interested to hear about your experience if you decide to try switching to an out-of-order command queue.

You're absolutely right; I have nothing in the code that would cause an implicit flush of the clEnqueueMarker(), so the lock-up is acceptable behavior. In other words: buggy code, working SDK :-)

As for option b) (the out-of-order queue), that seems unlikely at the moment, since this code is generated, not hand-written.

Thanks again.
