Documentation on device fission implementation (recommendation)

While I know it is experimental, I was hoping there might be more documentation on Intel's implementation of the device fission extension to OpenCL. The OpenCL specification for this extension is very minimal for a subject this complex. If possible, it would be nice to have some information on how devices share memory, how the hardware is divided on typical systems, known issues, and potential problems which may arise. I know you are trying hard to get user feedback on this subject, but it is hard to test effectively knowing so little about how the implementation maps to hardware.

If this exists, I apologize for not finding it. The only information I could find was the couple of paragraphs in the user's guide.




Hi Jim,

We'll take the feedback regarding lack of proper documentation into consideration. In the meantime, let me try and demystify this feature.

The original idea was just to give the programmer some control over which / how many EUs will execute his OpenCL code. This evolved into a full-fledged "sub-device" notion that lets you do crazy things like recompile the same program with different options for portions of the same root device, but I'm going to focus on what we consider the main use of this feature.

The first case to consider is NUMA systems, where you have several physical CPUs. Our implementation considers these as a single "CPU device", unless you use the device fission extension to partition it (based on "affinity domain: NUMA"), in which case you get a cl_device_id handle for each individual CPU. In the current implementation memory sharing between them isn't optimal (meaning it allocates on a node pretty much randomly, and relies on the QPI to move data if kernels execute on another node), but future versions will, in this case, allocate the physical pages on the region pertaining to that NUMA node, and consequently memory will be moved around as applicable.
In the meantime, you can work around this by manually allocating pages on the right node and creating cl_mem objects with the CL_MEM_USE_HOST_PTR flag (please remember proper alignment). The missing bit is how to identify which cl_device_id corresponds to which NUMA node (the spec offers no way to do this), but in our implementation the first ID is node 0 and so forth.
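To make the alignment remark concrete, here is a minimal sketch of allocating a host buffer that starts on a page boundary and spans whole pages, which is the kind of pointer one would then hand to clCreateBuffer with CL_MEM_USE_HOST_PTR. The helper names round_up and alloc_page_aligned are made up for illustration, and page alignment is an assumption about what "proper alignment" means here, so check the SDK documentation for the actual requirement. The sketch does not place the pages on a particular NUMA node; for that you would swap the allocation for something node-aware such as libnuma's numa_alloc_onnode.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

/* Round n up to the next multiple of align (align must be a power of two). */
size_t round_up(size_t n, size_t align)
{
    return (n + align - 1) & ~(align - 1);
}

/* Hypothetical helper: returns a buffer that starts on a page boundary and
 * whose size is a whole number of pages, or NULL on failure.
 * The padded size is written to *padded_out when padded_out is not NULL.
 * NOTE: this does not bind the pages to a NUMA node; a node-aware
 * allocator would replace aligned_alloc for that. */
void *alloc_page_aligned(size_t bytes, size_t *padded_out)
{
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0 || bytes == 0)
        return NULL;

    size_t padded = round_up(bytes, (size_t)page);
    /* C11 aligned_alloc requires the size to be a multiple of the alignment. */
    void *p = aligned_alloc((size_t)page, padded);
    if (p != NULL && padded_out != NULL)
        *padded_out = padded;
    return p;
}
```

The returned pointer and padded size would be passed as the host_ptr and size arguments when creating the buffer, and the memory released with free() only after the cl_mem object has been released.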

For non-NUMA systems, there is less to consider. Partitioning "equally" is a special case of partitioning by counts, where you give a list of sizes and get cl_device_ids representing subsets of the threading pool of those sizes. This does mean that threads are "stolen" from the root device: on an eight-way machine, if you create a sub-device of size two and allocate resources (create a command queue) on it, and then submit a job to the root-level device (whose device ID is still valid after fission), only six of the threads will perform work, since the other two have been "allocated" to the sub-device.
The use case for this mode is scenarios where:
a) You have your own dedicated threading model, and want the OpenCL implementation to play nice with yours. Assuming you're not subscribing the full machine (in which case, "Immediate Execution" or "Thread Local Exec" as it will be called in the next version is the extension to use), this is a way to ensure OpenCL only uses the portion you intend it to.
b) Your application needs to supply some level of QoS - occasionally interrupts arrive that need to be dealt with ASAP, in which case you could dedicate a subset of the threads to such interrupts, and use the rest for the standard computation.
Both BY_COUNTS and EQUALLY fission "anonymous" threads: you just say how many, not which. The 1.1 EXT defines another mode, called BY_NAMES, where you get to choose the indices of the threads you'd like, but the current version doesn't support that. If this is something you consider useful, please let us know.
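The thread accounting described above can be sketched as a toy model in plain C. It makes no OpenCL calls, and the function names root_threads_remaining and equally_as_counts are invented for illustration. It assumes every sub-device in the list has had resources allocated on it (a command queue created), since that is when its threads are taken from the root device. The second function shows why "equally" can be viewed as a by-counts list in which every entry is the same, with leftover units staying unused by the sub-devices.

```c
#include <stddef.h>

/* Toy model only: how many threads the root device can still use after the
 * listed sub-devices (each with an active command queue) claim theirs.
 * Returns -1 if a count is not positive or the counts exceed the pool. */
int root_threads_remaining(int pool_size, const int *counts, size_t n_counts)
{
    int used = 0;
    for (size_t i = 0; i < n_counts; ++i) {
        if (counts[i] <= 0)
            return -1;
        used += counts[i];
        if (used > pool_size)
            return -1;
    }
    return pool_size - used;
}

/* Express "partition equally, per_device threads each" as a by-counts list.
 * Writes at most max_out entries to out and returns how many were written.
 * Threads left over when per_device does not divide pool_size evenly are
 * simply not part of any sub-device. */
size_t equally_as_counts(int pool_size, int per_device, int *out, size_t max_out)
{
    if (per_device <= 0 || pool_size < per_device)
        return 0;

    size_t n = (size_t)(pool_size / per_device);
    if (n > max_out)
        n = max_out;
    for (size_t i = 0; i < n; ++i)
        out[i] = per_device;
    return n;
}
```

For the eight-way example, a single sub-device of size two leaves six threads for the root device, which matches the behavior described in the post.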

Generally the implementation in our SDK is closer to the one detailed in the 1.2 spec than the one in the 1.1 EXT spec - the 1.1 EXT effectively allows dynamically adding and removing devices into an OpenCL context by using the device fission APIs, whereas our implementation and the 1.2 spec for this don't support this kind of use case.

For known issues, please refer to the release notes; they should be up to date. The only thing missing is a note that if you want to ensure release of the threads from a sub-device, you should call clFinish on the command queue pertaining to that sub-device before releasing it.

I hope that cleared things up a bit. Please let us know if you have further questions.

Hi Doron,

Thank you very much for the great information. It really does help very much.

