Selective Use of gatherhint/scatterhint Instructions

By Rakesh Krishnaiyer, published on February 20, 2014

Compiler Methodology for Intel® MIC Architecture

Overview

The -qopt-gather-scatter-unroll=<N> compiler option can be used to generate the gatherhint/scatterhint instructions supported by the coprocessor. This is useful if your code performs non-unit-stride memory accesses or uses indirect addressing through pointers or index arrays.
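To make these access patterns concrete, here is a minimal C sketch of three loops of this kind: an indirect read through an index array, an indirect write through an index array, and a non-unit-stride read. The function and parameter names are hypothetical and are not taken from this guide. Whether the compiler actually vectorizes a given loop with gather/scatter instructions depends on the compiler version and the other options in use.

```c
#include <assert.h>
#include <stddef.h>

/* Indirect read: dst[i] = src[idx[i]].
   A vectorizing compiler may implement the load with gather instructions. */
void gather_indexed(float *restrict dst, const float *restrict src,
                    const int *restrict idx, size_t n)
{
    assert(dst != NULL && src != NULL && idx != NULL);
    for (size_t i = 0; i < n; i++)
        dst[i] = src[idx[i]];
}

/* Indirect write: dst[idx[i]] = src[i].
   A vectorizing compiler may implement the store with scatter instructions.
   Assumption: idx holds no duplicate values, so store order does not matter. */
void scatter_indexed(float *restrict dst, const float *restrict src,
                     const int *restrict idx, size_t n)
{
    assert(dst != NULL && src != NULL && idx != NULL);
    for (size_t i = 0; i < n; i++)
        dst[idx[i]] = src[i];
}

/* Non-unit-stride read: dst[i] = src[i * stride].
   With stride > 1, consecutive elements are not adjacent in memory. */
void copy_strided(float *restrict dst, const float *restrict src,
                  size_t stride, size_t n)
{
    assert(dst != NULL && src != NULL && stride >= 1);
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i * stride];
}
```

Assuming a native coprocessor build with the Intel compiler, a file containing loops like these might be compiled with a command such as `icc -mmic -O3 -qopt-gather-scatter-unroll=3 gather.c`, where the file name and the value 3 are placeholders.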

Topics

Here is the compiler behavior related to gatherhint/scatterhint generation and unrolling of gather/scatter loops:

Behavior in the compiler:

Knights Corner (KNC) has no “one-shot” gather/scatter instructions, so the compiler generates a loop to perform a complete gather or scatter. By default, the loop looks as follows:

L1:
   gather
   jkz L2
   gather
   jknz L1
L2:

The code above works well for most applications, but for some applications the loop is faster when it is unrolled, and the best unroll factor differs from one application to another. In addition, when the loop is unrolled, adding gather/scatter hint instructions before the loop gives an additional benefit. With the option described here, the compiler generates an alternate gather/scatter code sequence that has these properties.

For example, if the -qopt-gather-scatter-unroll=3 option is specified, the compiler generates the following unrolled version instead of the sequence above, preceded by two gather/scatter hint instructions:

   gather hint
   gather hint
   nop
L1:
   gather
   jkz L2
   gather
   gather
   gather
   jknz L1
L2:

The value of N that gives the best performance is data-dependent. When the gather/scatter accesses data in a small number of cache lines (say 1 or 2), the default sequence (a small value of N) works best. When each individual data item falls in a different cache line, a large value of N may be better.
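The data dependence can be illustrated with a toy model in C. This is a sketch under stated assumptions, not a description of the hardware or of the compiler's cost model: it assumes 16 four-byte lanes per vector, 64-byte cache lines, a 64-byte-aligned base address, that each gather instruction completes exactly one pending cache line, and that the default sequence corresponds to n = 1. The names `lines_touched`, `model_loop`, and `loop_cost` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define LANES 16        /* assumed: 32-bit elements per 512-bit vector */
#define LINE_BYTES 64   /* assumed: cache-line size in bytes */

/* Count the distinct cache lines touched by one 16-lane gather of
   4-byte elements. Assumes a 64-byte-aligned base and idx[i] >= 0. */
int lines_touched(const int32_t idx[LANES])
{
    int lines = 0;
    for (int i = 0; i < LANES; i++) {
        assert(idx[i] >= 0);
        int64_t line = ((int64_t)idx[i] * 4) / LINE_BYTES;
        int seen = 0;
        for (int j = 0; j < i; j++)
            if (((int64_t)idx[j] * 4) / LINE_BYTES == line)
                seen = 1;
        if (!seen)
            lines++;
    }
    return lines;
}

typedef struct {
    int gathers;   /* gather instructions issued */
    int branches;  /* mask tests (jkz / jknz) executed */
} loop_cost;

/* Toy model of the loop shape "gather; jkz L2; n x gather; jknz L1".
   Assumption: each gather completes one pending cache line, and a
   gather with nothing pending is still issued but does no work. */
loop_cost model_loop(int lines, int n)
{
    loop_cost c = {0, 0};
    int pending = lines;
    assert(lines >= 1 && n >= 1);
    for (;;) {
        c.gathers++;
        if (pending > 0) pending--;
        c.branches++;                 /* jkz L2 */
        if (pending == 0) break;
        for (int k = 0; k < n; k++) {
            c.gathers++;
            if (pending > 0) pending--;
        }
        c.branches++;                 /* jknz L1 */
        if (pending == 0) break;
    }
    return c;
}
```

Under this model, 16 distinct cache lines cost 16 mask tests with n = 1 but only 8 with n = 3, while 2 cache lines cost 2 gathers with n = 1 and 4 gathers (two of them wasted) with n = 3. Real behavior depends on the hardware and the data, so measure before choosing N.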

Takeaways

The gatherhint/scatterhint instructions and unrolling of gather/scatter loops are useful for codes with non-unit stride memory accesses, and codes using indirect addressing through pointers or index arrays. Use the compiler option above to tune your application.

NEXT STEPS

It is essential that you read this guide from start to finish using the built-in hyperlinks to guide you along a path to a successful port and tuning of your application(s) on Intel® Xeon Phi™ Coprocessors. The paths provided in this guide reflect the steps necessary to get the best possible application performance.

Product and Performance Information

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804