Memory Hierarchy

Intel® Graphics Compute Architecture uses system memory as the compute device memory. That memory is unified: the GPU shares the same DRAM with the CPU. The obvious performance advantage is that the shared physical memory enables zero-copy transfers between the host CPU and the Intel® Graphics OpenCL™ device. The same zero-copy path also works for the CPU OpenCL™ device and, finally, for the CPU-GPU shared context. Refer to the "Mapping Memory Objects" section for more information.
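As an illustration of the zero-copy path, the following minimal host-side sketch creates a buffer with CL_MEM_ALLOC_HOST_PTR and accesses it through clEnqueueMapBuffer/clEnqueueUnmapMemObject rather than explicit reads and writes; on shared-DRAM devices this mapping typically avoids any copy. The context, queue, buffer size, and reduced error handling are assumptions made for the example, not prescribed by this guide.

#include <CL/cl.h>
#include <stdio.h>

enum { NUM_FLOATS = 1024 * 1024 };   /* example size, an assumption */

/* Create a buffer whose backing store the runtime allocates in host-visible
   memory; on processors where the CPU and the Graphics device share DRAM,
   mapping such a buffer does not require a copy. */
static cl_mem create_zero_copy_buffer(cl_context ctx, cl_int *err)
{
    return clCreateBuffer(ctx,
                          CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                          NUM_FLOATS * sizeof(float), NULL, err);
}

/* Fill the buffer from the host through the map/unmap (zero-copy) path. */
static void fill_buffer(cl_command_queue queue, cl_mem buf)
{
    cl_int err = CL_SUCCESS;
    float *p = (float *)clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_WRITE,
                                           0, NUM_FLOATS * sizeof(float),
                                           0, NULL, NULL, &err);
    if (err != CL_SUCCESS || p == NULL)
        return;
    for (size_t i = 0; i < NUM_FLOATS; ++i)
        p[i] = (float)i;              /* host writes land in the shared DRAM */
    clEnqueueUnmapMemObject(queue, buf, p, 0, NULL, NULL);
}

A buffer wrapping an existing application pointer via CL_MEM_USE_HOST_PTR can follow the same map/unmap pattern; see the "Mapping Memory Objects" section for the exact requirements.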
The Compute Architecture memory system is augmented with several levels of caches:
  • A read-only memory path for OpenCL images, which includes level-1 (L1) and level-2 (L2) sampler caches. Image writes follow a different path (see below).
  • A level-3 (L3) data cache, which is a slice-shared asset. All reads and writes of OpenCL buffers flow through the L3 data cache in units of 64-byte cache lines. The L3 data cache also services sampler read transactions that miss in the L1 and L2 sampler caches, as well as sampler writes. See the "Execution of OpenCL™ Work-Items: the SIMD Machine" section for details on slice-shared assets.
  • Note: L3 efficiency is highest for accesses that are cache-line aligned and adjacent within a cache line.
  • Shared Local Memory (SLM) is a dedicated structure within the L3 that supports the work-group local memory address space. The read/write bus interface to shared local memory is again 64 bytes wide, but shared local memory is organized as 16 banks at 4-byte granularity. This organization can yield full-bandwidth access for access patterns that are not 64-byte aligned or not contiguously adjacent in memory.
  • Note: The amount of SLM is an important limiting factor for the number of work-groups that can execute simultaneously on the device. Use the clGetDeviceInfo(…CL_DEVICE_LOCAL_MEM_SIZE) call to query the exact value (a query sketch follows this list).
  • Note: As shared local memory is highly banked, it is more important to minimize bank conflicts when accessing local memory than to minimize the number of cache lines accessed (see the kernel sketch after this list).
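As a sketch of the two notes above, the host-side query below reads CL_DEVICE_LOCAL_MEM_SIZE for a device, and the kernel fragment shows a stride-1 local-memory access pattern in which consecutive work-items touch consecutive 4-byte banks, so no bank conflicts occur. The device handle, the 256-float tile size, the kernel name, and the assumption of a work-group size of at most 256 are illustrative choices, not part of this guide.

#include <CL/cl.h>
#include <stdio.h>

/* Query how much Shared Local Memory a work-group may use on 'device'. */
static void print_local_mem_size(cl_device_id device)
{
    cl_ulong local_mem_size = 0;
    if (clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE,
                        sizeof(local_mem_size), &local_mem_size,
                        NULL) == CL_SUCCESS)
        printf("CL_DEVICE_LOCAL_MEM_SIZE: %llu bytes\n",
               (unsigned long long)local_mem_size);
}

/* OpenCL C kernel: each work-item owns one 4-byte element of the tile, so
   reads and writes spread across the 16 SLM banks without conflicts.
   Assumes a work-group size of at most 256. */
__kernel void reverse_tile(__global const float *in, __global float *out)
{
    __local float tile[256];           /* 256 * 4 bytes of SLM */
    size_t lid = get_local_id(0);
    size_t gid = get_global_id(0);
    size_t lsz = get_local_size(0);

    tile[lid] = in[gid];               /* stride-1: one bank per work-item */
    barrier(CLK_LOCAL_MEM_FENCE);
    out[gid] = tile[lsz - 1 - lid];    /* reversed read, still one element per bank */
}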
Finally, the entire architecture interfaces with the rest of the SoC components via a dedicated interface unit called the Graphics Technology Interface (GTI). The rest of the SoC memory hierarchy includes the large Last-Level Cache (LLC), which is shared between the CPU and GPU, possibly embedded DRAM, and finally the system DRAM.
Figure 4. View of memory hierarchy and peak bandwidths (in bytes/cycle) for the Gen7.5 compute architecture (4th Generation Intel® Core™ family of microprocessors).
More details on memory access are provided in the following sections.
