• 2019 Update 4
  • 03/20/2019

Applying Shared Local Memory

Intel® Graphics devices support Shared Local Memory (SLM), which OpenCL™ exposes through the __local address space qualifier. This type of memory is well suited for scatter operations that would otherwise be directed to global memory. Copy small table buffers, or any frequently reused buffer data, to SLM. Refer to the “Local Memory Consideration” section for more information.
An obvious approach to populating SLM is a for loop. However, this approach is inefficient because every single work-item executes the entire loop:
__kernel void foo_SLM_BAD(global int * table, local int * slmTable /*256 entries*/)
{
    //initialize shared local memory (performed for each work-item!)
    for( uint index = 0; index < 256; index ++ )
        slmTable[index] = table[index];
    //wait until the whole table is visible to the work-group
    barrier(CLK_LOCAL_MEM_FENCE);
    //... (the rest of the kernel, which reads slmTable, is not shown)
}
As a result, a work-group of N work-items copies the same 256 entries N times.
An alternative approach is to keep the for loop, but start it at the local ID of the current work-item and increment the index by the work-group size, so that the work-items of a work-group share the copy:
__kernel void foo_SLM_GOOD(global int * table, local int * slmTable /*256 entries*/)
{
    //initialize shared local memory
    int lidx = get_local_id(0);
    int size_x = get_local_size(0);
    //each work-item copies entries lidx, lidx + size_x, lidx + 2 * size_x, ...
    for( uint index = lidx; index < 256; index += size_x )
        slmTable[index] = table[index];
    barrier(CLK_LOCAL_MEM_FENCE);
    //... (the rest of the kernel, which reads slmTable, is not shown)
}
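To make the difference between the two loops concrete, the following host-side C sketch emulates a single work-group sequentially and counts how many element copies each pattern performs. It is not OpenCL kernel code: the functions copy_redundant and copy_strided, the TABLE_SIZE macro, and the sequential loop over local IDs are illustrative assumptions, and a real device runs the work-items in parallel.

```c
#include <assert.h>
#include <stddef.h>

#define TABLE_SIZE 256  /* table length used by the kernels above */

/* Emulates the foo_SLM_BAD pattern: every work-item of the group
   copies the whole table. Returns the number of element copies. */
size_t copy_redundant(const int *table, int *slm, size_t group_size)
{
    size_t copies = 0;
    for (size_t lid = 0; lid < group_size; ++lid) {       /* one pass per work-item */
        for (size_t index = 0; index < TABLE_SIZE; ++index) {
            slm[index] = table[index];
            ++copies;
        }
    }
    return copies;
}

/* Emulates the foo_SLM_GOOD pattern: work-item `lid` copies entries
   lid, lid + group_size, lid + 2 * group_size, ... */
size_t copy_strided(const int *table, int *slm, size_t group_size)
{
    assert(group_size > 0);  /* a zero stride would never terminate */
    size_t copies = 0;
    for (size_t lid = 0; lid < group_size; ++lid) {
        for (size_t index = lid; index < TABLE_SIZE; index += group_size) {
            slm[index] = table[index];
            ++copies;
        }
    }
    return copies;
}
```

With 64 work-items, copy_redundant performs 64 × 256 = 16384 element copies, while copy_strided performs 256. The strided version also writes each entry exactly once when the group size (for example, 48) does not divide 256.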
You can reduce the overhead of copying to SLM further. When the number of SLM entries equals the number of work-items in the work-group, every work-item can copy just one table entry. Consider populating SLM this way:
__kernel void foo_SLM_BEST(global int * table, local int * slmTable)
{
    //initialize shared local memory
    int lidx = get_local_id(0);
    int lidy = get_local_id(1);
    //flatten the 2D local ID into a single table index
    int index = lidx + lidy * get_local_size(0);
    slmTable[index] = table[index];
    barrier(CLK_LOCAL_MEM_FENCE);
    //... (the rest of the kernel, which reads slmTable, is not shown)
}
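The one-entry-per-work-item scheme relies on the flattened 2D local ID covering every table index exactly once. The host-side C sketch below emulates a work-group sequentially to illustrate this. The names flat_local_index and copy_one_per_item are hypothetical, and the sketch assumes the table holds exactly size_x × size_y entries.

```c
#include <assert.h>
#include <stddef.h>

/* Same index computation as in foo_SLM_BEST:
   index = lidx + lidy * get_local_size(0). */
size_t flat_local_index(size_t lidx, size_t lidy, size_t size_x)
{
    return lidx + lidy * size_x;
}

/* Emulates a size_x by size_y work-group in which each work-item copies
   a single entry. Assumes the table has exactly size_x * size_y entries.
   Returns the number of element copies performed. */
size_t copy_one_per_item(const int *table, int *slm,
                         size_t size_x, size_t size_y)
{
    size_t copies = 0;
    for (size_t lidy = 0; lidy < size_y; ++lidy) {
        for (size_t lidx = 0; lidx < size_x; ++lidx) {
            size_t index = flat_local_index(lidx, lidy, size_x);
            slm[index] = table[index];
            ++copies;
        }
    }
    return copies;
}
```

For a 16 × 16 work-group and a 256-entry table, each of the 256 work-items performs one copy and the loop from the earlier kernels disappears entirely.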
If the table is smaller than the work-group size, you might use the min built-in function to keep the index within the table. If the table is bigger, you might use several lines of code that populate SLM at fixed offsets (which effectively unrolls the original for loop). If the table size is not known in advance, you can use a real for loop.
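The first two cases can be sketched as a host-side C emulation of one work-group. This is one possible reading of the advice, not code from the guide: copy_clamped assumes that min is used to clamp the index, so that surplus work-items re-copy the last entry instead of reading out of bounds, and copy_two_offsets assumes the table is exactly twice the work-group size. Both function names and the min_size helper are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the OpenCL min built-in on unsigned indices. */
static size_t min_size(size_t a, size_t b)
{
    return a < b ? a : b;
}

/* Table smaller than the work-group: clamp the index so that work-items
   with lid >= table_size harmlessly re-copy the last entry. */
void copy_clamped(const int *table, int *slm,
                  size_t table_size, size_t group_size)
{
    assert(table_size > 0);  /* table_size - 1 must not wrap around */
    for (size_t lid = 0; lid < group_size; ++lid) {
        size_t index = min_size(lid, table_size - 1);
        slm[index] = table[index];
    }
}

/* Table exactly twice the work-group size: each work-item copies two
   entries at fixed offsets, which is the for loop unrolled twice. */
void copy_two_offsets(const int *table, int *slm, size_t group_size)
{
    for (size_t lid = 0; lid < group_size; ++lid) {
        slm[lid] = table[lid];
        slm[lid + group_size] = table[lid + group_size];
    }
}
```

Clamping avoids a branch on the local ID at the cost of a few redundant copies of the last entry. The fixed-offset version only works when the ratio between table size and work-group size is known when the kernel is written.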
Applying SLM can improve the Intel Graphics data throughput considerably, but it might slightly reduce the performance of the CPU OpenCL device, so consider keeping a separate version of the kernel for the CPU.