OpenCL™ Optimization Guide for Intel® Processor Graphics
Intel® Architecture Processors provide performance acceleration using Single Instruction Multiple Data (SIMD) instruction sets, which include:
- Intel Streaming SIMD Extensions (Intel SSE)
- Intel Advanced Vector Extensions (Intel AVX)
- Intel Advanced Vector Extensions 2 (Intel AVX2)
By processing multiple data elements in a single instruction, these ISA extensions enable data parallelism in scientific, engineering, or graphics applications.
When using SIMD instructions, vector registers hold a group of data elements of the same data type, such as float or char. The number of data elements that fit in one register depends on the microarchitecture and on the width of the data type. For example, starting with the 2nd Generation Intel Core™ Processors, the vector register width is 256 bits, so each vector (YMM) register can store eight float numbers, eight 32-bit integer numbers, and so on.
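To illustrate the register arithmetic above, here is a minimal C sketch using the GCC/Clang `vector_size` extension. This is a compiler extension, not part of OpenCL or of the Intel SDK, and the type name `v8sf` and both helper functions are hypothetical. A 256-bit vector holds 32 bytes, so it has room for eight 4-byte floats or 32 one-byte chars. With AVX enabled (for example, `-mavx`), the compiler can typically emit a single `vaddps` on a YMM register for the addition; on other targets it lowers the same code to narrower instructions.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical name: a 256-bit vector of eight 32-bit floats
   (GCC/Clang vector extension, not standard C). */
typedef float v8sf __attribute__((vector_size(32)));

/* Number of elements of the given size that fit in a 256-bit register. */
static size_t lanes_per_256bit(size_t elem_size)
{
    return 32 / elem_size; /* 256 bits = 32 bytes */
}

/* Adds eight floats using one vector addition. */
static void add8(const float *a, const float *b, float *out)
{
    v8sf va, vb, vc;
    memcpy(&va, a, sizeof va); /* memcpy avoids any alignment requirement */
    memcpy(&vb, b, sizeof vb);
    vc = va + vb;              /* element-wise add across all eight lanes */
    memcpy(out, &vc, sizeof vc);
}
```

For example, `lanes_per_256bit(sizeof(float))` gives 8, matching the eight floats per YMM register described above.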
When using the Single Program, Multiple Data (SPMD) technique, an OpenCL™ implementation can map the work-items to the hardware as:
- Scalar code, when work-items execute one by one.
- SIMD elements, when several work-items fit in one register and run simultaneously.
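The two mappings can be emulated in plain C for a hypothetical kernel that computes `c[gid] = a[gid] + b[gid]`. This is a sketch under stated assumptions, not the code an OpenCL implementation actually generates: `SIMD_WIDTH` assumes eight 32-bit lanes per 256-bit register, the inner lane loop stands in for a single vector instruction, and the scalar tail loop for leftover work-items is an illustrative choice. Both functions produce the same results; only the grouping of work-items differs.

```c
#include <stddef.h>

#define SIMD_WIDTH 8 /* assumption: eight 32-bit lanes per 256-bit register */

/* Hypothetical kernel body for one work-item. */
static void work_item(size_t gid, const float *a, const float *b, float *c)
{
    c[gid] = a[gid] + b[gid];
}

/* Scalar mapping: work-items execute one by one. */
static void run_scalar(size_t global_size, const float *a, const float *b,
                       float *c)
{
    for (size_t gid = 0; gid < global_size; ++gid)
        work_item(gid, a, b, c);
}

/* SIMD mapping: each group of SIMD_WIDTH work-items occupies the lanes of
   one register; the lane loop models one vector instruction. Work-items
   left over after the last full group run in scalar form. */
static void run_simd(size_t global_size, const float *a, const float *b,
                     float *c)
{
    size_t gid = 0;
    for (; gid + SIMD_WIDTH <= global_size; gid += SIMD_WIDTH)
        for (size_t lane = 0; lane < SIMD_WIDTH; ++lane)
            c[gid + lane] = a[gid + lane] + b[gid + lane];
    for (; gid < global_size; ++gid)
        work_item(gid, a, b, c);
}
```

With a global size of 19, for example, `run_simd` processes two full groups of eight work-items and then three work-items one by one.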
The OpenCL Code Builder contains an implicit vectorization module, which maps work-items to SIMD elements. Depending on the kernel code, this vectorization might have some limitations. If the vectorization module optimization is disabled, the SDK falls back to executing work-items as scalar code.
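One kind of kernel code that can reduce the benefit of vectorization is a branch whose outcome differs between work-items. A common way vectorizers handle such a branch is if-conversion: every lane computes both sides, and a per-lane mask selects the result. The C sketch below emulates that idea for one group of eight work-items. The kernel (`branchy`) and the helper `branchy_simd` are hypothetical, and this is an assumed illustration of the general technique rather than the Code Builder's actual output. The cost is visible in the code: both paths are evaluated for every lane.

```c
#include <stddef.h>

#define SIMD_WIDTH 8 /* assumption: eight 32-bit lanes per register */

/* Scalar form of a hypothetical kernel containing a branch. */
static float branchy(float x)
{
    return (x > 0.0f) ? x * 2.0f : -x;
}

/* One group of SIMD_WIDTH work-items after if-conversion. */
static void branchy_simd(const float *x, float *y)
{
    int mask[SIMD_WIDTH];
    float then_val[SIMD_WIDTH], else_val[SIMD_WIDTH];

    for (size_t l = 0; l < SIMD_WIDTH; ++l)
        mask[l] = x[l] > 0.0f;          /* per-lane branch condition */
    for (size_t l = 0; l < SIMD_WIDTH; ++l)
        then_val[l] = x[l] * 2.0f;      /* "then" path for all lanes */
    for (size_t l = 0; l < SIMD_WIDTH; ++l)
        else_val[l] = -x[l];            /* "else" path for all lanes */
    for (size_t l = 0; l < SIMD_WIDTH; ++l)
        y[l] = mask[l] ? then_val[l] : else_val[l]; /* masked select */
}
```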