Intel® AVX-512 Gather/Scatter Profiler
Knights Landing introduces Intel® Advanced Vector Extensions 512 (Intel® AVX-512) v(p)gather instructions that are normally more efficient and more widely applicable than the v(p)gather instructions in Intel® Advanced Vector Extensions 2 (Intel® AVX2) or in Knights Corner (which is based on the IMCI ISA). Intel AVX-512 gather and scatter instructions support various combinations of index, offset, and vector width, and take an explicit mask argument. Figure 10.14 provides a typical example of vgather instruction operands and the corresponding Intel intrinsic function syntax.
However, Intel AVX-512 code that uses v(p)gather (and the newly introduced v(p)scatter) instructions still performs substantially worse than similar code using contiguous vector loads and stores. While gather/scatter-based vectorized code is faster than its scalar (or Intel AVX2/IMCI) counterpart, it is still wise to look for opportunities to improve or avoid it.
Instruction example (MASM syntax):
vgatherqpd zmm10, k2, zmmword ptr [r15+zmm0*1]
Corresponding intrinsic function example:
__m512d _mm512_mask_i64gather_pd (__m512d src, __mmask8 k, __m512i vindex, void const* base_addr, int scale)
- zmm10 – destination 512-bit wide register; corresponds to the return value of the intrinsic function
- k2 – mask register (write mask); corresponds to the k parameter of the intrinsic
- zmmword ptr [r15+zmm0*1] – memory source being gathered into the destination, formed from base_addr (r15), vindex (zmm0, the vector of offsets), and scale (1)
Figure 10.14 vgatherqpd instruction and corresponding Intel Intrinsic function
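To make the operand roles in Figure 10.14 concrete, the following is a minimal scalar sketch of what a masked gather of eight doubles with 64-bit indices computes. The function name gather_pd_model is hypothetical and is not an Intel API; the sketch models only the result, not the performance, and assumes that masked-off lanes take their value from src.

```c
#include <stdint.h>
#include <string.h>

/* Scalar reference model of a masked gather of 8 doubles with 64-bit
   indices. gather_pd_model is a hypothetical name used for illustration. */
void gather_pd_model(double dst[8], const double src[8], uint8_t k,
                     const int64_t vindex[8], const void *base_addr,
                     int scale)
{
    const char *base = (const char *)base_addr;
    for (int lane = 0; lane < 8; ++lane) {
        if (k & (1u << lane)) {
            /* Lane is active: load from base_addr + vindex[lane] * scale,
               where the product is a byte offset. */
            memcpy(&dst[lane], base + vindex[lane] * (int64_t)scale,
                   sizeof(double));
        } else {
            /* Lane is masked off: keep the value supplied in src. */
            dst[lane] = src[lane];
        }
    }
}
```

With scale set to 8 (the size of a double), vindex holds plain element indices into an array of doubles; with scale set to 1, as in the instruction example above, vindex holds byte offsets.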
In certain cases, it is possible to improve the performance of gather/scatter vectorized code by replacing gather/scatter instructions with sequences of linear load/store instructions (or more complex memory shuffle/permute manipulations). From the programmer's perspective, this can be achieved by adjusting the algorithm and optimizing memory access, by explicitly communicating the presence of regular accesses to the compiler, or by falling back to intrinsics.
Intel Advisor introduces a Gather/Scatter Profiler with a dedicated Gather/Scatter Report and Recommendations to assist with gather/scatter vs. shuffle/load explicit vectorization choices. The Gather/Scatter Profiler:
- Detects cases where actual data layout is not truly random.
- Guides code transformation to avoid or optimize gather/scatter usage.
- Helps to better characterize vectorization efficiency and hardware utilization of sparse memory access codes, even when there is no room for performance improvements.
The compiler normally generates Intel AVX-512 v(p)gather and v(p)scatter instructions in the following cases:
- Unrecognized regular pattern. Data is accessed with a unit-stride, uniform, or constant-stride pattern, but the compiler cannot detect or prove that the data is accessed in a regular manner. This usually happens when index values (as in array_reference[index[i]] or array_reference[index_function(i)]) are unknown at compile time. The compiler assumes the worst (random) access pattern, while, in fact, the index values have a regular structure. Processing of assumed-shape arrays (discussed in the Digging for Intel AVX-512 speedups with Vectorization Advisor Recommendations section) is another common example.
- Regular constant stride. Data is accessed using a constant stride with a relatively large stride value. In most cases, a compiler targeting Intel AVX-512 uses gather/scatter instructions to vectorize code of this nature, although in a few special corner cases (especially for small stride values), a modern vectorizing compiler uses an alternate non-gather sequence of instructions.
- Truly irregular (or too complex) access pattern. Data in a vectorized loop is accessed in a completely random (unpredictable) manner. In such cases, a gather/scatter-based implementation is often the only viable choice for vectorizing the code. Here, gather/scatter cannot be replaced with an implementation based on contiguous packed load/store without a complete refactoring of the code, such as replacing sparse 2D arrays with a compressed-sparse-row data representation. Code where the index value is computed using complex non-linear (for example, transcendental) functions also fits into this category.
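The first category can be illustrated with a short sketch. It assumes a loop of the form out[i] = in[idx[i]] whose index array happens to be unit-stride at run time, something the compiler cannot prove. The helper names index_is_unit_stride and indexed_copy are hypothetical, and the run-time check is only one possible way to expose the regular access; it is not a transformation prescribed by Intel Advisor.

```c
#include <stddef.h>

/* Returns 1 if idx[i] == idx[0] + i for every i, that is, if the index
   array describes a unit-stride access. Hypothetical helper. */
int index_is_unit_stride(const int *idx, size_t n)
{
    for (size_t i = 1; i < n; ++i)
        if (idx[i] != idx[0] + (int)i)
            return 0;
    return 1;
}

/* Computes out[i] = in[idx[i]], using a contiguous path when the index
   array turns out to be regular. */
void indexed_copy(double *restrict out, const double *restrict in,
                  const int *restrict idx, size_t n)
{
    if (n > 0 && index_is_unit_stride(idx, n)) {
        const double *src = in + idx[0];
        /* Contiguous access: the compiler can use packed loads/stores. */
        #pragma omp simd
        for (size_t i = 0; i < n; ++i)
            out[i] = src[i];
    } else {
        /* Indirect access: a vectorizing compiler typically emits gathers. */
        #pragma omp simd
        for (size_t i = 0; i < n; ++i)
            out[i] = in[idx[i]];
    }
}
```

The check costs one extra pass over idx, so it is likely to pay off only when the same index array is reused across many calls or when the regularity can be established once, outside the hot loop.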
The Gather/Scatter Profiler automatically recognizes the cases in the first two categories, providing information about detected data-layout meta-patterns along with Recommendations about how to transform the code more efficiently. The Vectorization Advisor reports detected meta-patterns in the top part of the Gather/Scatter Report (see Figure 10.15). Figures 10.16 and 10.17 summarize the list of meta-patterns, the corresponding vindex values, and Intel OpenMP syntax according to Gather/Scatter Advisor Recommendations.
In addition to meta-pattern detection and appropriate Recommendations, the Gather/Scatter Report provides detailed statistics about horizontal/vertical gather/scatter offset values and mask utilization.
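As a rough illustration of horizontal and vertical offset statistics, the sketch below measures strides over a recorded sequence of gather offsets. It rests on an assumption about terminology: the horizontal stride is taken to be the difference between offsets of adjacent lanes within one vector, and the vertical stride the difference between offsets of the same lane in consecutive vector iterations. The type stride_stats and the function measure_strides are hypothetical and do not reflect how Intel Advisor is implemented.

```c
#include <stddef.h>

/* Summary of the strides observed in a sequence of offsets. */
typedef struct {
    int  h_constant;  /* 1 if all adjacent-lane differences are equal   */
    long h_stride;    /* that difference (meaningful if h_constant)     */
    int  v_constant;  /* 1 if all same-lane, next-vector diffs are equal */
    long v_stride;    /* that difference (meaningful if v_constant)     */
} stride_stats;

/* off holds n offsets in access order; vl is the vector length in
   elements and must be greater than zero. */
stride_stats measure_strides(const long *off, size_t n, size_t vl)
{
    stride_stats s = {1, 0, 1, 0};
    int h_seen = 0, v_seen = 0;

    for (size_t i = 0; i < n; ++i) {
        /* Horizontal: lane i versus the next lane of the same vector. */
        if ((i % vl) + 1 < vl && i + 1 < n) {
            long d = off[i + 1] - off[i];
            if (!h_seen) { s.h_stride = d; h_seen = 1; }
            else if (d != s.h_stride) s.h_constant = 0;
        }
        /* Vertical: same lane in the next vector iteration. */
        if (i + vl < n) {
            long d = off[i + vl] - off[i];
            if (!v_seen) { s.v_stride = d; v_seen = 1; }
            else if (d != s.v_stride) s.v_constant = 0;
        }
    }
    return s;
}
```

Under these assumptions, a unit-stride access with a vector length of 8 yields a horizontal stride of 1 and a vertical stride of 8, and an access with constant stride X yields X and X times the vector length.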
Horizontal Stride Value; Vertical Stride Value; Example of Corresponding Fix(es):
- Fix: OpenMP uniform clause, simd pragma/directive, refactoring
- Uniform (horizontal invariant). Fix: OpenMP uniform clause, simd pragma/directive
- Fix: OpenMP private clause, simd pragma/directive
- Horizontal stride: 1 or -1. Vertical stride: |Vertical Stride| = Vector Length. Fix: OpenMP linear clause, simd pragma/directive
- Horizontal stride: Constant = X. Vertical stride: Constant = X*VectorLength. Fix: subject for AoS -> SoA transformation
Meta Patterns and corresponding Recommendations identified by Gather/Scatter Report
Gather/Scatter Meta-patterns and Horizontal/Vertical Stride correspondence
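The constant-stride case and its AoS -> SoA fix can be sketched as follows. The types point_aos and points_soa and the three functions are hypothetical names chosen for illustration; the sketch shows only the change in memory layout and makes no claim about the instructions a particular compiler will generate.

```c
#include <stddef.h>

/* Array of structures: consecutive x fields are sizeof(point_aos) bytes
   apart, so reading only x is a constant-stride access. */
typedef struct { double x, y, z; } point_aos;

/* Structure of arrays: each field is stored contiguously. */
typedef struct { double *x, *y, *z; } points_soa;

/* Sums the x fields in the AoS layout (constant stride of 3 doubles). */
double sum_x_aos(const point_aos *p, size_t n)
{
    double s = 0.0;
    #pragma omp simd reduction(+:s)
    for (size_t i = 0; i < n; ++i)
        s += p[i].x;
    return s;
}

/* Sums the x fields in the SoA layout (unit stride). */
double sum_x_soa(const points_soa *p, size_t n)
{
    double s = 0.0;
    #pragma omp simd reduction(+:s)
    for (size_t i = 0; i < n; ++i)
        s += p->x[i];
    return s;
}

/* One-time conversion; the caller provides out->x, out->y, and out->z
   with room for n elements each. */
void aos_to_soa(const point_aos *in, points_soa *out, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        out->x[i] = in[i].x;
        out->y[i] = in[i].y;
        out->z[i] = in[i].z;
    }
}
```

The conversion itself costs a pass over the data, so the transformation is most attractive when the data can be kept in SoA form throughout the hot part of the program.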
The mask utilization information is relevant for Intel AVX-512 code dealing with multiple branches. For example, when detecting extremely low mask utilization for gather/scatter in branch_A and extremely high mask utilization for gather/scatter in branch_B, we can communicate the observed branch frequency to the compiler using the __builtin_expect built-in; this may improve the mask computation/predication strategies chosen by the compiler and lead to performance speedups.
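A minimal sketch of such a hint is shown below, assuming a compiler that provides __builtin_expect (for example GCC, Clang, or the Intel compiler). The function sum_selected, its parameters, and the UNLIKELY macro are hypothetical; the assumption is that profiling showed the first branch (branch_A) to be rarely taken.

```c
#include <stddef.h>

/* Wraps __builtin_expect to mark a condition as rarely true. */
#define UNLIKELY(cond) __builtin_expect(!!(cond), 0)

/* Sums indirectly addressed values. Elements with a nonzero flag take
   branch_A and read from table_a; all others take branch_B and read
   from table_b. */
double sum_selected(const double *table_a, const double *table_b,
                    const int *idx, const int *flag, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; ++i) {
        if (UNLIKELY(flag[i] != 0))
            s += table_a[idx[i]];  /* branch_A: observed to be rare    */
        else
            s += table_b[idx[i]];  /* branch_B: observed to be frequent */
    }
    return s;
}
```

The hint does not change the result of the function; whether it changes the generated code, and whether that helps, depends on the compiler and should be confirmed by measuring again.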
Finally, for the truly irregular access pattern cases (the third category above), use the Gather/Scatter Report to guide replacement of multiple adjacent gathers with a more efficient single-gather-based implementation. Adjacent gather/scatter is out of scope for this chapter; more advanced information about the Gather/Scatter Profiler is available at the Gather Scatter Blog listed in For More Information at the end of this chapter.