• 2019 Update 4
  • 03/20/2019

Loading and Storing Data in Greatest Chunks

“Saturating” the available memory bandwidth is very important. Byte data types are actually loaded as integer data types (DWORDs), and they also trigger instructions to pack and unpack the data. Using a wider type such as uint4 for buffers saves a lot of compute, even if you unpack the data manually afterward. In other words, you should avoid using byte vector types such as uchar4 for buffers.
See the example below:
__kernel void amp (__constant uchar4* src, __global uchar4* dst)
…
    uint4 tempSrc = convert_uint4(src[offset]); // Load one RGBA8 pixel
    …
    // some processing
    …
    dst[offset] = convert_uchar4(tempDst);
}
Consider accessing the data through the uint4 data type instead:
__kernel void amp (__constant uint4* src, __global uint4* dst)
…
    uint4 tempSrc = src[offset]; // Load 4 RGBA8 pixels
    …
    // some processing in uint4
    uint r0 = (tempSrc.x & 0xff); // Red component of 1st pixel
    uint r1 = (tempSrc.y & 0xff); // Red component of 2nd pixel
    …
    tempSrc.x >>= 8;
    tempSrc.y >>= 8;
    …
    tempSrc.x >>= 8;
    tempSrc.y >>= 8;
    …
    uint a0 = (tempSrc.x & 0xff); // Alpha component of 1st pixel
    uint a1 = (tempSrc.y & 0xff); // Alpha component of 2nd pixel
    // any calculations on the individual components
    …
    uint4 final = 0;
    // repack them (alpha goes into the top byte, hence the shift by 24):
    final.x = (r0) | ((g0) << 8) | ((b0) << 16) | ((a0) << 24); // first pixel
    final.y = (r1) | ((g1) << 8) | ((b1) << 16) | ((a1) << 24); // second pixel
    …
    dst[offset] = final;
}
In the second example, each work-item processes four pixels, so the global size is one quarter of the original size.
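The unpack and repack arithmetic used in the uint4 kernel can be checked on the host. The following is a minimal sketch in plain C, not the guide's own code: it assumes the RGBA8 layout implied by the kernel comments (red in the low byte, alpha in the high byte), and the helper names unpack_rgba8, pack_rgba8, invert_rgb, and global_size are hypothetical, introduced only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Unpack one RGBA8 pixel stored in a 32-bit word (assumed: R in the low byte). */
static void unpack_rgba8(uint32_t px, uint32_t *r, uint32_t *g,
                         uint32_t *b, uint32_t *a)
{
    *r = px & 0xffu;
    *g = (px >> 8) & 0xffu;
    *b = (px >> 16) & 0xffu;
    *a = (px >> 24) & 0xffu;
}

/* Repack four 8-bit components; alpha is shifted by 24, not 16. */
static uint32_t pack_rgba8(uint32_t r, uint32_t g, uint32_t b, uint32_t a)
{
    return (r & 0xffu) | ((g & 0xffu) << 8) | ((b & 0xffu) << 16) |
           ((a & 0xffu) << 24);
}

/* Hypothetical processing step: invert the colour channels, keep alpha. */
static uint32_t invert_rgb(uint32_t px)
{
    uint32_t r, g, b, a;
    unpack_rgba8(px, &r, &g, &b, &a);
    return pack_rgba8(255u - r, 255u - g, 255u - b, a);
}

/* Number of work-items needed when each one handles pixels_per_item pixels. */
static size_t global_size(size_t pixel_count, size_t pixels_per_item)
{
    assert(pixels_per_item != 0 && pixel_count % pixels_per_item == 0);
    return pixel_count / pixels_per_item;
}
```

For a 1024-pixel image, global_size(1024, 1) models the uchar4 kernel and global_size(1024, 4) models the uint4 kernel, which is where the factor of four in the global size comes from.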
If your kernel operates on floating-point data, consider using the float4 data type, which gets four times as much data in one load. It also helps to ensure that each work-item has enough work to do, amortizing the work-item scheduling overhead.
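The effect of float4 access on the amount of work per work-item can be modeled on the host. This is a rough sketch in plain C under stated assumptions: float4_t is a hypothetical stand-in for the OpenCL float4 type, scale_float4 is an illustrative name, and each loop iteration plays the role of one work-item. It does not measure bandwidth; it only shows that one chunked load and store now cover four floats.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for OpenCL's float4: four floats moved as one chunk. */
typedef struct { float x, y, z, w; } float4_t;

/* Each iteration models one work-item: one float4 load, four multiplies,
   and one float4 store. Returns the number of modeled work-items. */
static size_t scale_float4(const float4_t *src, float4_t *dst,
                           size_t float_count, float gain)
{
    assert(float_count % 4 == 0);
    size_t items = float_count / 4;  /* global size is 1/4 of the float count */
    for (size_t i = 0; i < items; ++i) {
        float4_t v = src[i];         /* one chunked load */
        v.x *= gain;
        v.y *= gain;
        v.z *= gain;
        v.w *= gain;
        dst[i] = v;                  /* one chunked store */
    }
    return items;
}
```

A buffer whose length is not a multiple of four would need padding or a scalar tail loop; this sketch simply asserts that case away.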
For the CPU device, this optimization is equivalent to explicit (manual) vectorization; see the “Using Vector Data Types” section for more information.
Accessing data in greater chunks can improve the Intel® Graphics device data throughput, but it might slightly reduce the CPU device performance, as also explained in the “Using Vector Data Types” section.
