Avoid Needless Synchronization
For better results, avoid explicit command synchronization primitives, such as clEnqueueMarker and clEnqueueBarrier. Explicit synchronization commands and event tracking result in cross-module round trips, which decrease performance. The less you use explicit synchronization commands, the better the performance.
Use the following techniques to reduce explicit synchronization:
- Merge kernels whenever possible. It also improves data locality.
- If you need to wait for a kernel to complete execution before reading the resulting buffer, keep submitting and executing other work, and defer the wait until you actually need the first buffer with results.
- If an in-order queue expresses the dependency chain correctly, use it to define a string of dependent kernels. In the in-order execution model, the commands in a command queue are executed in the order of submission, with each command running to completion before the next one begins. This is a typical case for a straightforward processing pipeline. Consider the following:
- Using the blocking OpenCL™ API is more effective than explicit synchronization schemes based on OS synchronization primitives.
- If you are optimizing the kernel pipeline, first measure kernels separately to find the most time-consuming one. In the final pipeline version, avoid calling clFinish or clWaitForEvents frequently, for example, after each kernel invocation. Prefer submitting the whole sequence to the in-order queue and then issuing clFinish once, or waiting on a single OpenCL event object, which reduces host-device round trips.