Using multiple devices requires creating a separate command queue for each device. This section describes potential strategies for partitioning work between the devices (command queues).
Assigning work statically (according to statically determined relative device speed) might result in lower overall performance. Consider allocating work according to the current load and speed of devices. The speed of a device can be affected by OS or driver scheduling decisions and by dynamic frequency scaling.
There are several approaches to dynamic scheduling:
- Coarse-grain partitioning of the work between CPU and GPU devices:
  - Use inter-frame load balancing with naturally independent pieces of data, such as video frames or multiple image files, and distribute them between the devices for processing. This approach minimizes scheduling overheads. However, it requires a sufficiently large number of frames. It might also increase the pressure on shared resources, such as the shared last-level cache and memory bandwidth.
  - Use intra-frame load balancing to split the data that is currently being processed between the devices. For example, if the data is an input image, the CPU processes its first half, and the GPU processes the rest. Adjust the actual splitting ratio dynamically, based on how fast the devices complete their tasks. One specific approach is to keep a performance history for the previous frames. Refer to the dedicated “HDR Tone Mapping for Post Processing using OpenCL - Multi-Device Version” SDK sample for an example.
- Fine-grain partitioning: the work is split into smaller parts that the devices request from a pool of remaining work. This partitioning method simulates a “shared queue”. Faster devices request new input more often, which results in automatic load balancing. The grain size must be large enough to amortize the overheads of the additional scheduling and kernel submission.
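The fine-grain “shared queue” idea can be illustrated with a minimal host-side sketch in C. It is not taken from any SDK sample: `work_pool`, `pool_grab`, and `simulate_two_devices` are hypothetical names, and the per-chunk costs are assumed numbers that stand in for real device timings. The pool hands out chunks of a 1D range through an atomic counter, and a small simulation shows that the faster device ends up with more of the work.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical shared pool over a 1D range of work items. */
typedef struct {
    atomic_size_t next;  /* first work item that has not been handed out */
    size_t total;        /* total number of work items */
    size_t grain;        /* work items per chunk (assumed > 0) */
} work_pool;

static void pool_init(work_pool *p, size_t total, size_t grain) {
    atomic_init(&p->next, 0);
    p->total = total;
    p->grain = grain;
}

/* Claims the next chunk. Returns 1 and fills offset/count on success,
   or 0 when no work remains. The atomic counter lets several host
   threads (for example, one per command queue) call this concurrently. */
static int pool_grab(work_pool *p, size_t *offset, size_t *count) {
    size_t begin = atomic_fetch_add(&p->next, p->grain);
    if (begin >= p->total) return 0;
    size_t left = p->total - begin;
    *offset = begin;
    *count = left < p->grain ? left : p->grain;
    return 1;
}

/* Simulates two devices draining the pool. cost[d] is an assumed time
   that device d needs for one chunk; items[d] receives the number of
   work items that device d processed. */
static void simulate_two_devices(work_pool *p, const double cost[2],
                                 size_t items[2]) {
    double free_at[2] = {0.0, 0.0};  /* time at which each device is idle */
    items[0] = items[1] = 0;
    for (;;) {
        /* The device that becomes idle first requests the next chunk. */
        int d = (free_at[1] < free_at[0]) ? 1 : 0;
        size_t offset, count;
        if (!pool_grab(p, &offset, &count)) break;
        /* A real host would enqueue the kernel on the queue of device d
           here, passing offset as the global work offset and count as
           the global work size. */
        free_at[d] += cost[d];
        items[d] += count;
    }
}
```

With a pool of 1000 work items, a grain of 100, and assumed chunk costs of 1.0 and 3.0, the faster device processes 700 items and the slower one 300. In a real application the grain would be tuned so that each chunk is large enough to hide the kernel submission overhead.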
When deciding how to split the data between devices, take into account the recommended local and global size granularity of each device. When multiple devices write to a shared resource, use sub-resources (for example, sub-buffers) so that each device outputs to its own region.
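As an illustration of combining a dynamically adjusted split with the size granularity, the following C sketch recomputes the number of image rows given to the CPU from the timings of the previous frame. It is a simplified assumption, not the method of the SDK sample: `split_rows` and `round_down` are hypothetical helpers, a single `granularity` value stands in for the per-device recommendations, only one frame of history is used, and `total_rows` is assumed to be at least twice the granularity.

```c
#include <stddef.h>

/* Rounds n down to a multiple of granularity (granularity > 0). */
static size_t round_down(size_t n, size_t granularity) {
    return n - n % granularity;
}

/* Returns the number of rows to give to the CPU for the next frame;
   the GPU takes the remaining rows.
   prev_cpu_rows: rows the CPU processed in the previous frame
   prev_cpu_ms, prev_gpu_ms: measured times of each device for that frame
   Assumes total_rows >= 2 * granularity. */
static size_t split_rows(size_t total_rows, size_t prev_cpu_rows,
                         double prev_cpu_ms, double prev_gpu_ms,
                         size_t granularity) {
    size_t prev_gpu_rows = total_rows - prev_cpu_rows;

    /* Keep the previous split if the measurements are unusable. */
    if (prev_cpu_ms <= 0.0 || prev_gpu_ms <= 0.0) return prev_cpu_rows;

    /* Throughput is rows / time. The CPU share
       cpu_rate / (cpu_rate + gpu_rate) is rewritten without divisions
       by the measured times. */
    double w_cpu = (double)prev_cpu_rows * prev_gpu_ms;
    double w_gpu = (double)prev_gpu_rows * prev_cpu_ms;
    if (w_cpu + w_gpu <= 0.0) return prev_cpu_rows;
    double share = w_cpu / (w_cpu + w_gpu);

    /* Round the ideal row count, then align it to the granularity. */
    size_t ideal = (size_t)(share * (double)total_rows + 0.5);
    size_t cpu_rows = round_down(ideal, granularity);

    /* Leave at least one granule on each device so that both keep
       producing timing data for the next adjustment. */
    size_t max_cpu = round_down(total_rows - granularity, granularity);
    if (cpu_rows < granularity) cpu_rows = granularity;
    if (cpu_rows > max_cpu) cpu_rows = max_cpu;
    return cpu_rows;
}
```

For example, if both devices processed 512 of 1024 rows and the CPU took three times as long as the GPU, the next frame gives the CPU a quarter of the rows. A longer history (for instance, a moving average of the measured times) would make the ratio less sensitive to a single slow frame.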
You can also use a task-parallel scheduler. This approach requires an understanding of both the nature of the tasks and the capabilities of the devices. For example, in a multi-kernel pipeline, the first kernel runs on the CPU, which suits a particular random number generation (RNG) algorithm, and the second kernel runs on the GPU, which suits a specific type of heavy math, such as native_sin and native_cos. This way, different pipeline stages are assigned to different devices. This kind of partitioning may provide a performance gain in some custom partitioning schemes, but in general the adaptability of the approach is limited. For example, if you have just two kernels in the pipeline, you do not have many parameters to tweak, because the scheduler can run either kernel0 on the CPU and kernel1 on the GPU, or vice versa.
It is important to minimize the time that one device spends waiting for another to complete its task. One approach is to place a fixed-size pool between producers and consumers. In a simple double-buffering scheme, the first buffer is processed while the second is populated with new input data.
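The fixed-size pool between a producer stage and a consumer stage can be sketched as a small ring of slots. The C code below is a minimal, single-threaded illustration under stated assumptions: `stage_pool`, `stage_pool_put`, and `stage_pool_get` are hypothetical names, the payload is a plain frame id rather than a memory object, and no locking is shown. With `POOL_SLOTS` set to 2, the pool behaves as the double-buffering scheme described above.

```c
#include <stddef.h>

#define POOL_SLOTS 2  /* two slots correspond to double buffering */

/* Hypothetical fixed-size pool between a producer stage (for example,
   kernel0 on the CPU) and a consumer stage (kernel1 on the GPU). */
typedef struct {
    int slot[POOL_SLOTS];  /* payload: frame ids stand in for buffers */
    size_t head;           /* index of the oldest filled slot */
    size_t count;          /* number of filled slots */
} stage_pool;

static void stage_pool_init(stage_pool *p) {
    p->head = 0;
    p->count = 0;
}

/* Producer side. Returns 0 when the pool is full, which means the
   producer must wait until the consumer frees a slot. */
static int stage_pool_put(stage_pool *p, int frame) {
    if (p->count == POOL_SLOTS) return 0;
    p->slot[(p->head + p->count) % POOL_SLOTS] = frame;
    p->count++;
    return 1;
}

/* Consumer side. Returns 0 when the pool is empty, which means the
   consumer must wait for the producer. Frames leave in FIFO order. */
static int stage_pool_get(stage_pool *p, int *frame) {
    if (p->count == 0) return 0;
    *frame = p->slot[p->head];
    p->head = (p->head + 1) % POOL_SLOTS;
    p->count--;
    return 1;
}
```

A larger `POOL_SLOTS` value absorbs more variation in stage times at the cost of extra memory. In a multi-threaded host, the put and get operations would need synchronization, and the hand-off between the two command queues would typically be ordered with OpenCL events.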