Intel® AVX-512 Gather/Scatter Profiler

 

Knights Landing introduces an Intel® Advanced Vector Extensions 512 (Intel® AVX-512) v(p)gather instruction that is normally more effective and more widely applicable/flexible than the v(p)gather instructions in Intel® Advanced Vector Extensions 2 (Intel® AVX2) or on Knights Corner (which is based on the IMCI ISA). Intel AVX-512 gather (and scatter) instructions support various combinations of index, offset, and vector width, and introduce an explicit mask argument. Figure 10.14 provides a typical example of vgather instruction operands and the corresponding Intel Intrinsic function syntax.

However, Intel AVX-512 code utilizing v(p)gather (and the newly introduced v(p)scatter) instructions still demonstrates substantially worse performance than similar code using contiguous vector loads/stores. While gather/scatter–based vectorized code is faster than its scalar (or Intel AVX2/IMCI) counterpart, it is still wise to look for opportunities to improve or avoid it.

Instruction example (MASM syntax):

vgatherqpd zmm10, k2, zmmword ptr [r15+zmm0*1]   

Corresponding intrinsic function example:

__m512d _mm512_mask_i64gather_pd (__m512d src, __mmask8 k, __m512i vindex, void const* base_addr, int scale)

Operand explanation:

  • zmm10 – destination 512-bit wide register, corresponds to return value of intrinsic function
  • k2 – mask register (write mask) – corresponds to k parameter in intrinsic
  • zmmword ptr [r15+zmm0*1] – the memory source being gathered into the destination; a combination of base_addr (r15), vindex (the zmm0 offset index), and scale (1).

Figure 10.14 vgatherqpd instruction and corresponding Intel Intrinsic function
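The per-lane behavior of this masked gather can be modeled in plain scalar C. The following is only an illustrative sketch of the semantics (the function name is hypothetical, and the real vgatherqpd executes in hardware): lanes whose mask bit is set load from base_addr + vindex*scale, while lanes with a clear mask bit keep the corresponding src element.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model (hypothetical helper, for illustration only) of
   vgatherqpd zmm, k, [base + vindex*scale]: each of the 8 double
   lanes whose mask bit is set loads from base + vindex[lane]*scale;
   lanes with a clear mask bit keep the corresponding src element. */
void model_mask_i64gather_pd(double dst[8], const double src[8],
                             uint8_t mask, const int64_t vindex[8],
                             const void *base_addr, int scale) {
    const char *base = (const char *)base_addr;
    for (int lane = 0; lane < 8; ++lane) {
        if (mask & (1u << lane))
            dst[lane] = *(const double *)(base + vindex[lane] * scale);
        else
            dst[lane] = src[lane];
    }
}
```

With scale equal to sizeof(double), vindex holds element indices; with scale equal to 1 it holds byte offsets, matching the [r15+zmm0*1] form above.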

In certain cases, it is possible to improve the performance of gather/scatter vectorized code by replacing gather/scatter instructions with sequences of linear load/store (or more complex memory shuffle/permute manipulations) instructions. From a programmer perspective, it is achievable by adjusting an algorithm and optimizing memory access, explicitly communicating the presence of regular accesses to the compiler, or falling back to intrinsics usage.
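One common fix can be sketched as follows (hypothetical function names; actual code generation depends on compiler version and flags): if an index array is known to encode a simple stride, writing the stride directly in the subscript, optionally with an OpenMP simd pragma, lets the compiler prove regularity and emit contiguous or strided vector loads instead of gathers.

```c
#include <stddef.h>

/* Before: the compiler only sees an indexed access; since idx[] values
   are unknown at compile time, it may assume a random access pattern
   and vectorize with v(p)gather. */
double sum_indexed(const double *a, const int *idx, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; ++i)
        s += a[idx[i]];
    return s;
}

/* After: when the program guarantees idx[i] == 2*i, expressing the
   stride explicitly makes the regular pattern provable, so the compiler
   can use packed/strided loads rather than gathers. */
double sum_strided(const double *a, size_t n) {
    double s = 0.0;
    #pragma omp simd reduction(+:s)
    for (size_t i = 0; i < n; ++i)
        s += a[2 * i];
    return s;
}
```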

Intel Advisor introduces a Gather/Scatter Profiler with a dedicated Gather/Scatter Report and Recommendations to assist with gather/scatter vs. shuffle/load explicit vectorization choices. The Gather/Scatter Profiler:           

  • Detects cases where actual data layout is not truly random.
  • Guides code transformation to avoid or optimize gather/scatter usage.
  • Helps to better characterize vectorization efficiency and hardware utilization of sparse memory access codes, even when there is no room for performance improvements.

Gather/Scatter Report

The compiler normally generates Intel AVX-512 v(p)gather and v(p)scatter instructions in the following cases:

  1. Unrecognized regular pattern. Data is accessed with a unit-stride, uniform, or constant stride pattern, but the compiler cannot detect or prove that the data is accessed in a regular manner. This usually happens when the index values (as in array_reference[index[i]] or array_reference[index_function(i)]) are unknown at compile time. The compiler assumes the worst (random) access pattern while, in fact, the index values have a regular structure. Assumed-shape-array processing (discussed in the Digging for Intel AVX-512 speedups with Vectorization Advisor Recommendations section) is another common example.
  2. Regular constant stride. Data is accessed using constant stride with a relatively large stride value. In most cases, the Intel AVX-512 compiler uses gather/scatter instructions to vectorize code of this nature, although in a few special corner cases (especially for small stride values), a modern vectorizing compiler uses an alternate non-gather sequence of instructions.
  3. Truly irregular (or too complex) access pattern. Data in a vectorized loop is accessed in a completely random (unpredictable) manner. In such cases, gather/scatter–based implementation is often the only viable choice for vectorizing code. Here, gather/scatter cannot be replaced with an implementation based on contiguous packed load/store without complete refactoring of code, such as replacing sparse 2D arrays with compressed-sparse-row data representation. Code where the index value is computed using complex non-linear (for example, transcendental) functions also fits into this category.
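The three categories can be sketched as minimal loops (hypothetical examples; whether the compiler actually emits gather/scatter for each depends on the compiler version and flags):

```c
/* 1. Unrecognized regular pattern: idx[] may happen to hold a simple
      sequence, but the compiler cannot prove it, so it assumes a
      random access pattern and may emit a gather. */
void case1(double *out, const double *a, const int *idx, int n) {
    for (int i = 0; i < n; ++i)
        out[i] = a[idx[i]];
}

/* 2. Regular constant stride: the stride is a compile-time constant;
      relatively large strides are typically vectorized with
      gather/scatter instructions. */
void case2(double *out, const double *a, int n) {
    for (int i = 0; i < n; ++i)
        out[i] = a[16 * i];
}

/* 3. Truly irregular pattern: the index is a non-linear function of i,
      so a gather-based implementation is usually the only option short
      of restructuring the data. */
void case3(double *out, const double *a, const int *perm, int n) {
    for (int i = 0; i < n; ++i)
        out[i] = a[(perm[i] * perm[i]) % n];
}
```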

The Gather/Scatter Profiler automatically recognizes the cases in categories #1 and #2, providing information about the detected data-layout meta-patterns along with Recommendations on how to transform the code into a more efficient form. The Vectorization Advisor reports detected meta-patterns in the top part of the Gather/Scatter Report (see Figure 10.15). Figures 10.16 and 10.17 summarize the list of meta-patterns, the corresponding vindex values, and the Intel OpenMP syntax suggested by the Gather/Scatter Advisor Recommendations.

In addition to meta-pattern detection and appropriate Recommendations, the Gather/Scatter Report provides detailed statistics about horizontal/vertical gather/scatter offset values and mask utilization.

 

Pattern #   Pattern Name                     Horizontal Stride Value   Vertical Stride Value               Example of Corresponding Fix(es)
1           Invariant                        0                         0                                   OpenMP uniform clause, simd pragma/directive, refactoring
2           Uniform (horizontal invariant)   0                         Arbitrary                           OpenMP uniform clause, simd pragma/directive
3           Vertical Invariant               Arbitrary                 0                                   OpenMP private clause, simd pragma/directive
4           Unit                             1 or -1                   |Vertical Stride| = Vector Length   OpenMP linear clause, simd pragma/directive
5           Constant                         Constant = X              Constant = X*VectorLength           Subject for AoS -> SoA transformation

Meta-patterns and corresponding Recommendations identified by the Gather/Scatter Report

Gather/Scatter meta-patterns and Horizontal/Vertical Stride correspondence
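For the constant-stride meta-pattern, the usual fix is an AoS-to-SoA layout change. The following sketch (hypothetical type and function names) shows how the transformation turns a constant-stride field access, which tends to compile to gathers, into a unit-stride access that vectorizes with packed loads:

```c
#include <stddef.h>

/* AoS layout: reading one field across elements yields a constant
   stride of sizeof(struct particle_aos) bytes, i.e. 3 doubles here,
   which typically maps to gather instructions. */
struct particle_aos { double x, y, z; };

double sum_x_aos(const struct particle_aos *p, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; ++i)
        s += p[i].x;          /* stride = 3 doubles */
    return s;
}

/* SoA layout: the same field is contiguous in memory, so the loop
   vectorizes with packed unit-stride loads instead of gathers. */
struct particles_soa { double *x, *y, *z; };

double sum_x_soa(const struct particles_soa *p, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; ++i)
        s += p->x[i];         /* unit stride */
    return s;
}
```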

The mask utilization information is relevant for Intel AVX-512 code dealing with multiple branches. For example, when detecting extremely low mask utilization for a gather/scatter in branch_A and extremely high mask utilization for a gather/scatter in branch_B, we can communicate the observed branch frequency to the compiler using the __builtin_expect built-in; this may improve the mask computation/predication strategies chosen by the compiler and lead to performance speedups.
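A sketch of such an annotation (the kernel below is hypothetical; __builtin_expect is a GCC/Clang built-in, and the actual benefit depends on the compiler):

```c
/* Hypothetical two-branch kernel: flag[i] selects branch_A (rare) or
   branch_B (common). __builtin_expect(expr, 0) tells the compiler that
   branch_A is unlikely to be taken, which can influence the mask
   computation and predication strategy for the vectorized loop body. */
double process(const double *a, const int *flag, int n) {
    double s = 0.0;
    for (int i = 0; i < n; ++i) {
        if (__builtin_expect(flag[i] != 0, 0))  /* branch_A: rarely taken */
            s += a[i] * a[i];
        else                                    /* branch_B: common case */
            s += a[i];
    }
    return s;
}
```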

Finally, for the truly irregular access pattern cases (in category #3 above), use the Gather/Scatter Report to guide replacement of multiple adjacent gathers with a more efficient single-gather-based implementation. Adjacent gather/scatter is out of scope for this chapter; there is more advanced information about the Gather/Scatter Profiler at the Gather Scatter Blog listed in For More Information at the end of this chapter.


 