Execution of OpenCL™ Work Items: the SIMD Machine

This chapter gives an overview of the compute architecture of Intel® Graphics and its building blocks. For more details, refer to the references in the See Also section. An Intel Graphics device is equipped with several Execution Units (EUs), and each EU is a multi-threaded SIMD processor. The compiler generates SIMD code that maps several work-items to execute simultaneously within a given hardware thread. The SIMD width for a kernel is a heuristic-driven compiler choice. SIMD-8, SIMD-16, and SIMD-32 are common SIMD widths.
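The mapping of work-items to hardware threads can be illustrated with a little arithmetic. The sketch below is plain C, not an OpenCL API; the helper names are hypothetical, and it assumes that a work-group is split into as many hardware threads as are needed to hold its work-items, with a partially filled thread still occupying a whole thread.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of hardware threads needed for one work-group
 * when the compiler packs `simd_width` work-items into each thread. */
size_t hw_threads_per_work_group(size_t work_group_size, size_t simd_width)
{
    assert(simd_width > 0);
    /* Round up: a partially filled thread still occupies a whole thread. */
    return (work_group_size + simd_width - 1) / simd_width;
}

/* Hypothetical helper: share of SIMD lanes that carry a work-item,
 * as a whole percentage (rounded down). */
unsigned lane_utilization_percent(size_t work_group_size, size_t simd_width)
{
    size_t threads = hw_threads_per_work_group(work_group_size, simd_width);
    if (threads == 0)
        return 0;
    return (unsigned)(100 * work_group_size / (threads * simd_width));
}
```

Under these assumptions, a work-group of 64 work-items compiled as SIMD-16 fills 4 threads completely, while a work-group of 20 work-items compiled as SIMD-8 needs 3 threads and leaves some lanes of the last thread idle.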

For a given SIMD width, if all kernel instances (work-items) within a thread execute the same instruction, the SIMD lanes are maximally utilized. If one or more of the kernel instances take a divergent branch, the thread executes both paths of the branch and merges the results using a mask. The EU's branch unit keeps track of such branch divergence and branch nesting.
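The cost of divergence can be seen in a small software emulation. The following is a sketch in plain C, not a description of the actual hardware: it assumes a SIMD-8 thread, a hypothetical function name, and a made-up branch (even inputs are doubled, odd inputs get 100 added). Each path is executed once for the whole thread if at least one lane needs it, and a lane mask decides which lanes keep the result.

```c
#include <assert.h>
#include <stdint.h>

#define SIMD_WIDTH 8  /* assumed width for this illustration */

/* Emulates one SIMD thread running, per lane:
 *     out = (in % 2 == 0) ? in * 2 : in + 100;
 * Returns the number of branch paths the thread had to execute (1 or 2). */
int run_divergent_branch(const int in[SIMD_WIDTH], int out[SIMD_WIDTH])
{
    const uint32_t all_lanes = (1u << SIMD_WIDTH) - 1u;
    uint32_t then_mask = 0;
    int paths_executed = 0;

    /* Evaluate the condition on every lane and record it in a bit mask. */
    for (int lane = 0; lane < SIMD_WIDTH; ++lane) {
        if (in[lane] % 2 == 0)
            then_mask |= 1u << lane;
    }
    uint32_t else_mask = ~then_mask & all_lanes;

    /* "Then" path: executed once if any lane takes it; masked lanes idle. */
    if (then_mask != 0) {
        ++paths_executed;
        for (int lane = 0; lane < SIMD_WIDTH; ++lane) {
            if ((then_mask >> lane) & 1u)
                out[lane] = in[lane] * 2;
        }
    }

    /* "Else" path: same idea with the complementary mask. */
    if (else_mask != 0) {
        ++paths_executed;
        for (int lane = 0; lane < SIMD_WIDTH; ++lane) {
            if ((else_mask >> lane) & 1u)
                out[lane] = in[lane] + 100;
        }
    }
    return paths_executed;
}
```

When all eight inputs are even, only one path runs and every lane does useful work. With mixed inputs, both paths run and each pass leaves some lanes idle, which is why uniform control flow within a thread is preferable.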

The Command Streamer and Global Thread Dispatcher logic are responsible for thread scheduling; see the part of Figure 1 highlighted with the white dashed line.

Figure 1. An example product based on the Intel Graphics Compute Architecture. To simplify the picture, only the low-end instantiation, composed of one slice with just one subslice (red dashed rectangle), is shown. Together, execution units, subslices, and slices are the modular building blocks that are composed to create many product variants.

The basic building block of the architecture is the execution unit, commonly abbreviated as EU. EUs are Simultaneous Multi-Threading (SMT) compute processors that drive multiple-issue Single Instruction Multiple Data (SIMD) Arithmetic Logic Units (ALUs). The highly threaded nature of the EUs ensures a continuous stream of ready-to-execute instructions, while also hiding the latency of longer operations such as memory requests.

A group of EUs constitutes a “sub-slice”. The EUs in a sub-slice share:

  • Texture sampler and L1 and L2 texture caches, which are the path for accessing OpenCL images
  • Data port (general memory interface), which is the path for OpenCL buffers
  • Other hardware blocks like instruction cache

Figure 2. Subslice, a cluster of Execution Units, instantiating common Sampler and Data Port units.

In turn, a single sub-slice in low-end GPUs (see the red dashed part of Figure 1), or several sub-slices in the more typical case (see Figure 3), constitute a slice, which adds L3 cache (for OpenCL buffers), Shared Local Memory (SLM), and barriers as common assets.

Figure 3. The slice of Intel® Graphics, containing two Subslices. The Slice adds L3 cache, shared local memory, atomics, barriers, and other supporting fixed function.

The number of (sub-)slices and EUs, the number of samplers, the total amount of SLM, and so on depend on the SKU and generation of the Intel® Graphics device. You can query such values with the regular clGetDeviceInfo routine, for example, with CL_DEVICE_MAX_COMPUTE_UNITS or other parameters. For details on memory and caches for Intel Graphics, refer to the "Memory Access Considerations" section.

Given the high number of EUs, and the multi-threading and SIMD within each EU, it is important to follow the work-group recommendations in order to fully saturate the device. See the "Work-Group Size Recommendations Summary" section for details.
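To get a feel for how much parallel work saturation implies, the three levels of parallelism can simply be multiplied. The sketch below is plain C with a hypothetical structure and function name; the number of hardware threads per EU is not stated in this chapter and is treated as an assumed input that depends on the device generation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of a device; this is not an OpenCL API type. */
struct gpu_shape {
    size_t eu_count;        /* number of EUs on the device */
    size_t threads_per_eu;  /* assumed value; depends on the generation */
    size_t simd_width;      /* chosen by the compiler for each kernel */
};

/* Rough lower bound on the number of work-items needed so that every lane
 * of every hardware thread on every EU has a work-item to execute. */
size_t work_items_to_fill(const struct gpu_shape *gpu)
{
    assert(gpu != NULL);
    return gpu->eu_count * gpu->threads_per_eu * gpu->simd_width;
}
```

For example, with illustrative values of 20 EUs, 7 threads per EU, and a SIMD-16 kernel, the estimate is 20 × 7 × 16 = 2240 work-items in flight. Treat this only as an order-of-magnitude guide; the recommendations in the work-group size section take precedence.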

For further details on the architecture, please refer to the Compute Architecture of Intel Processor Graphics Gen7.5 and Gen8 whitepapers referenced in the See Also section.

See Also

More on the Gen7.5 and Gen8 Compute Architectures: https://software.intel.com/en-us/articles/intel-graphics-developers-guides
Work-Group Size Recommendations Summary
Introduction to OpenCL Code Builder and deep dive to Intel Iris Graphics compute architecture
