This white paper is the fourth in a series on OpenCL™. It describes how to set up and use events in a multithreaded design, and goes over design choices that use OpenCL user events and command-queue events for kernels running on CPUs.
The Intel® OpenCL 1.1 specification Beta implementation for CPU (Intel® Core™2 Duo or later CPUs) can be retrieved from /en-us/articles/opencl-sdk. It is still evolving into a mature product, so feel free to try it and provide feedback to us in the Intel® OpenCL SDK Support Forum. At present, Intel OpenCL 1.1 only runs on Linux* 64 bit, Microsoft Windows 7* (with SP1) and Microsoft Windows Vista* operating systems (32-bit and 64-bit).
Intel OpenCL 1.1 events are used primarily to synchronize commands in a context. An event object can be used to track which of four states (CL_QUEUED, CL_SUBMITTED, CL_RUNNING, or CL_COMPLETE) a given command in a given command queue is in. User events are used to trigger processing when host threads detect that certain conditions are met. Because user events can be triggered whenever needed, and commands in a command queue can wait on them as needed, user events are the best way to organize command execution when commands are submitted to multiple command queues or to out-of-order command queues.
Non-user events start in the CL_QUEUED state, and user events start in the CL_SUBMITTED state. The OpenCL specification does not spell out what should happen when commands are terminated (the behavior is implementation-specific), so programmers need to use the callback function registered at context creation to handle command termination errors effectively.
For a single in-order command queue configuration, events are usually used to synchronize host thread memory management (e.g. managing buffer ownership in CL/GL or CL/D3D10 sharing, or clearing/recycling buffers) with kernel executions. Since the command queue executes all commands in order, there is no need to synchronize commands within the command queue. To synchronize memory management with kernel execution, the host thread may call clFinish() (wait for all submitted commands to finish), or use clWaitForEvents() (wait for specific commands), clGetEventInfo() (poll a command's status without blocking), or clEnqueueBarrier() (make later commands wait until all previously enqueued commands have finished). clFinish() is a heavy-handed, brute-force way to make sure all work is done before proceeding, as it does not return until all submitted work is done. clWaitForEvents() also blocks the host thread, but only until the commands in its event list complete.
A better way is to set up an event callback for CL_COMPLETE (event callbacks were introduced in OpenCL 1.1) and have the callback function set up buffers as needed when the event completes. The callback may be invoked asynchronously, so make sure it is thread safe. This does not block the host thread, which remains free to do other tasks.
Fig 1.0 Using Event and Event Callbacks in single in-order Command Queue
For a single out-of-order command queue configuration, events are used to synchronize host thread memory management with kernel executions, and also to enforce the command execution order that the algorithm requires within the command queue. The host thread may call clFinish() (wait for all submitted commands to finish), or use clWaitForEvents(), clGetEventInfo(), or clEnqueueBarrier() (which ensures all previously enqueued commands are finished before later ones start), to synchronize memory management with kernel execution or to explicitly synchronize various sets of commands.
clFinish() blocks the host thread, which prevents the host from submitting other commands that could have executed while the earlier commands were still running.
clWaitForEvents() is a little better: it also blocks the host thread, but only until the commands in its event list complete. Even so, managing commands explicitly from the host like this is not the best way to fully utilize the device.
Developers should instead submit commands with event wait lists that contain only the dependencies the algorithm really needs. This gives the command queue much more flexibility in deciding which commands can execute while others are in the pipeline. Here is a simple example of this approach.
Fig 2.0 Managing various kernels using Out-Of-Order queues and Event/Event-wait-lists. Use non-blocking reads and writes.
Another even more efficient way is to run commands in separate threads, using event callbacks and user events as shown below.
Fig 3.0 Managing various kernels using Out-Of-Order queues and User Events/Event/Event-wait-lists. Use non-blocking read/writes.
In a multi-device or device-fission context with multiple command queues, the scheme in Fig 3.0 can be extended to use multiple user events (one per command queue). OpenCL events provide a design paradigm similar to graphical user interface (GUI) design, which is driven by events generated by the user. This way, the program performs only the tasks related to the event at hand. The main thread can simply set up the work for each event in the corresponding event callback, and continue to do other work without blocking or waiting.
User events provide a way for the host program to trigger events which are outside the framework of OpenCL commands. OpenCL commands can wait for user event completion before moving forward. This way, fine-grained control over the execution order of various commands can be achieved while keeping code complexity manageable.
Profiling Using Events and Event Callbacks
Profiling with events provides a fine-grained, portable way to collect timing data, in nanoseconds, for almost all commands submitted to the command queue.
Unfortunately, callback functions must not block (no clFinish, clBuildProgram, clWaitForEvents, or any other blocking call). Because the data returned by clGetEventProfilingInfo is only useful once a command is complete, a developer cannot simply call clGetEventProfilingInfo in an event callback for other non-blocking commands that may still be in flight, since the callback cannot wait for them to finish.
Profiling data for an event that triggered a callback at completion can be taken without any issue.
Markers provide points of synchronization based on marker events. Programmers can use a marker-based approach when kernel executions must follow a particular order. For example, using markers and events, programmers can ensure that kernel1 and kernel2 finish and all data is copied to the host before kernel3 starts executing.
Events provide easy ways to identify commands and provide execution status and profiling information, and can also be used to synchronize commands. Events provide developers with a way to control commands at fine command-level granularity.
Event wait lists enforce the order in which commands execute in a command queue. Developers should always profile all commands to see where time gaps exist, and see if they can be filled with other commands. Profiling usually only gives consistent data when used in large loops, so run often and run on various systems to ensure optimal design choices are made.
About the Author
Vinay Awasthi works as an Application Engineer for the Apple* Enabling Team at Intel at Santa Clara. Vinay has a Master’s Degree in Chemical Engineering from Indian Institute of Technology, Kanpur. Vinay enjoys mountain biking and scuba diving in his free time.