Out of Order Queues -- do they work? Enqueued Barriers with Events -- very slow?

Two questions:

(1) What is the expected behavior of out-of-order queues on GEN9 + NEO?

I'm issuing a number of small kernels into an out of order command queue with profiling enabled and no barriers between the NDRanges.

I'm not seeing kernels being run concurrently despite each kernel only using a fraction of a sub-slice (3 sub-slices available).

The benchmark is being run for one iteration.  

(2) What is the expected profiling behavior of enqueued barriers?

I'm enqueueing a barrier with no wait list between each NDRange and looking at the start and end time of both the barriers and kernels.

Barriers report immensely long execution times (end - start), often in the 6-10 millisecond range, when an event is attached.

Furthermore, an enqueued barrier's start time appears to begin before the kernels that precede it in the out-of-order command queue.

This is unintuitive and the durations seem impossibly long. 

Adding to the confusion, the interleaved kernel NDRanges seem to start and end back-to-back with only a few microseconds of delay, just as in (1).


What am I missing with out-of-order queues on GEN/NEO and are the reported durations of barriers correct?


Each example lists the order the command was issued, its type, and its start/end/duration in nanoseconds (via profiling).

Out-of-order queue with no barriers (which is not what I want):

[0  ] CL_COMPLETE   CL_COMMAND_NDRANGE_KERNEL    :      275065828645407,      275065828856573,               211166
[1  ] CL_COMPLETE   CL_COMMAND_NDRANGE_KERNEL    :      275065828858044,      275065828867044,                 9000
[2  ] CL_COMPLETE   CL_COMMAND_NDRANGE_KERNEL    :      275065828867855,      275065828913105,                45250

Out-of-order queue with barriers between kernels, but with NULL for each barrier's event:

[0  ] CL_COMPLETE   CL_COMMAND_NDRANGE_KERNEL    :      275228853506912,      275228853721495,               214583
[1  ] CL_COMPLETE   CL_COMMAND_NDRANGE_KERNEL    :      275228853722460,      275228853732710,                10250
[2  ] CL_COMPLETE   CL_COMMAND_NDRANGE_KERNEL    :      275228853736931,      275228853776931,                40000

Out-of-order queue with barriers that record an event for profiling:

[0  ] CL_COMPLETE   CL_COMMAND_NDRANGE_KERNEL    :      275372161923465,      275372162135631,               212166
[1  ] CL_COMPLETE   CL_COMMAND_BARRIER           :      275372158781086,      275372162447953,              3666867
[2  ] CL_COMPLETE   CL_COMMAND_NDRANGE_KERNEL    :      275372162137683,      275372162146683,                 9000
[3  ] CL_COMPLETE   CL_COMMAND_BARRIER           :      275372158807180,      275372164875723,              6068543
[4  ] CL_COMPLETE   CL_COMMAND_NDRANGE_KERNEL    :      275372162148451,      275372162192534,                44083
[5  ] CL_COMPLETE   CL_COMMAND_BARRIER           :      275372158836095,      275372167017520,              8181425

( Ignore any minor swings in kernel execution time )


Profiling disables out of order execution.

Generally, when you want to see the performance change in aggregate, the best way to do this is to measure wall-clock time on the CPU.

Any command having an event with profiling will act as a synchronization point.

We have just published a sample that may explain what is happening:


The barrier profiling numbers may indicate a bug: a barrier is usually handled by a synchronization command that doesn't take long to execute, so the execution delta (end - start) should be close to 0. I suggest a follow-up on our GitHub:


In general you should observe concurrent execution with an out-of-order queue when you manually insert clEnqueueBarrierWithWaitList (without events) as synchronization points.

Perfect... that explains what I'm observing.

Thanks for the quick response.



One last question: is there any other way to measure command queue execution time?

Can a command queue with profiling enabled impact another command queue without profiling enabled?

For example, suppose I create two command queues: one without profiling enabled that executes the kernels of interest, and one with profiling enabled that enqueues markers or "noop" commands waiting on non-profiled events from the executing queue?


Profiling in NEO is per-queue, not global, so there shouldn't be an impact from one queue on another.

Technically, profiling through another queue may work, though, as you noticed, profiling may be incorrect for "marker"-like commands.

What would work now is a clEnqueueNDRangeKernel submission of a one-work-item kernel that does nothing.

Unfortunately VTune instrumentation is causing serialization in out of order scenarios.

So this appears to be capturing OOQ concurrency:

It looks like I'm benchmarking shorter execution times with an OOQ vs. IOQ.

All kernels and barriers within the foo_kernels_and_barriers() routine use NULL for their event arg.

Does this make sense... or am I imagining it? :)

Yeah this will work.

Glad to hear that you have a boost :)



For 'NEO' Intel Compute Runtime implementation issues, this forum is a good place for discussion; however, sightings are better suited to being posted directly on GitHub: https://01.org/compute-runtime. That project is open to community contributions, and Intel contributors have indicated it is a better portal for triaging sightings.


VTune Amplifier 2018 XE can run without the profiling-enabled flag set, but it will not give you all the useful data, so I don't think it's suitable for the case described.


This project, the OpenCL Intercept Layer (https://github.com/intel/opencl-intercept-layer), may be another avenue to help you get some performance data.


