Selective Use of gatherhint/scatterhint Instructions
The -qopt-gather-scatter-unroll=<N> compiler option can be used to generate the gatherhint/scatterhint instructions supported by the coprocessor. This is useful if your code performs non-unit stride memory accesses or uses indirect addressing through pointers or index arrays.
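As an illustration of the access patterns described above, the following C sketch shows three loops that a vectorizing compiler would typically implement with gather or scatter instructions. The function names, the file name kernels.c, and the value N=4 in the build comment are hypothetical and chosen only for this example; whether a given loop is actually vectorized depends on the compiler and its other options.

```c
#include <stddef.h>

/* Hypothetical build line for a native coprocessor object file
   (file name and N=4 are arbitrary choices for illustration):
     icc -mmic -O3 -qopt-gather-scatter-unroll=4 -c kernels.c */

/* Indirect load through an index array: src[idx[i]] is a gather candidate. */
void gather_indexed(float *restrict dst, const float *restrict src,
                    const int *restrict idx, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[idx[i]];
}

/* Indirect store through an index array: dst[idx[i]] is a scatter candidate.
   Assumes idx contains no duplicate entries. */
void scatter_indexed(float *restrict dst, const float *restrict src,
                     const int *restrict idx, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[idx[i]] = src[i];
}

/* Non-unit stride load: consecutive iterations read elements that are
   `stride` apart, so the vectorized load is also a gather candidate. */
void gather_strided(float *restrict dst, const float *restrict src,
                    size_t stride, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i * stride];
}
```

Loops of this shape are the ones for which it may be worth trying different values of N.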
The compiler behavior related to gatherhint/scatterhint generation and unrolling of gather/scatter loops is as follows:
There are no “one-shot” gather/scatter instructions on KNC, so the compiler generates a loop to perform a complete gather/scatter. By default, the loop looks as follows:
L1:
    gather
    jkz L2
    gather
    jknz L1
L2:
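To make the control flow of the default sequence concrete, here is a scalar C model of it. This is a simplified sketch, not the hardware definition: it assumes 16 lanes of 32-bit elements, 64-byte cache lines, a cache-line-aligned base pointer, non-negative indices, and that each gather instruction completes every pending lane that shares a cache line with the lowest pending lane and clears those mask bits. It also reads jkz and jknz as "jump if mask is zero" and "jump if mask is not zero". The names gather_step and gather_all are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define LANES 16       /* 32-bit elements per 512-bit vector (assumed) */
#define LINE_BYTES 64  /* cache-line size used by this model (assumed) */

/* Model of one gather instruction: load every pending lane that falls in the
   same cache line as the lowest pending lane, then clear those mask bits.
   Assumes base is cache-line aligned and all indices are non-negative. */
static uint16_t gather_step(float *dst, const float *base,
                            const int32_t *idx, uint16_t mask)
{
    if (mask == 0)
        return 0;
    int first = 0;
    while (!((mask >> first) & 1u))
        first++;
    size_t line = ((size_t)idx[first] * sizeof(float)) / LINE_BYTES;
    for (int lane = first; lane < LANES; lane++) {
        size_t lane_line = ((size_t)idx[lane] * sizeof(float)) / LINE_BYTES;
        if (((mask >> lane) & 1u) && lane_line == line) {
            dst[lane] = base[idx[lane]];
            mask = (uint16_t)(mask & ~(1u << lane));
        }
    }
    return mask;
}

/* Model of the default loop; returns how many gather instructions were issued. */
int gather_all(float *dst, const float *base, const int32_t *idx, uint16_t mask)
{
    int issued = 0;
    do {                                         /* L1:      */
        mask = gather_step(dst, base, idx, mask); /* gather   */
        issued++;
        if (mask == 0)                           /* jkz L2   */
            break;
        mask = gather_step(dst, base, idx, mask); /* gather   */
        issued++;
    } while (mask != 0);                         /* jknz L1  */
    return issued;                               /* L2:      */
}
```

In this model, 16 indices that all fall in one cache line finish after a single gather, while 16 indices in 16 different cache lines need 16 gathers and eight trips through the loop.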
The code above works well for most applications. For some applications, however, this loop would be faster if it were unrolled, and different applications may need different unroll factors for best performance. When the loop is unrolled, adding gather/scatter hint instructions before the loop gives an additional benefit. With the option described here, the compiler generates an alternate code sequence for gather/scatter that has these properties.
For example, if the -qopt-gather-scatter-unroll=3 option is specified, the compiler generates the following unrolled version instead of the sequence above, with two gather/scatter hint instructions preceding the loop:
    gather hint
    gather hint
    nop
L1:
    gather
    jkz L2
    gather
    gather
    gather
    jknz L1
L2:
The value of N that gives the best performance is data-dependent. When each gather/scatter accesses data in a small number of cache lines (say 1 or 2), the default sequence or a small value of N works best. When each individual data item falls in a different cache line, a large value of N may be better.
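One way to get a rough idea of which case applies to your data is to measure how many distinct cache lines each vector's worth of indices touches. The helper below is a hypothetical diagnostic, not part of the compiler or any Intel tool. It assumes 16 indices per vector, 64-byte cache lines, a cache-line-aligned array base, and non-negative indices. Timing the application with several values of N remains the reliable way to choose.

```c
#include <stddef.h>
#include <stdint.h>

/* Average number of distinct cache lines touched per group of 16 indices.
   elem_bytes is the size of one array element (for example, 4 for float).
   Trailing indices that do not fill a whole group are ignored.
   Assumes a cache-line-aligned array base and non-negative indices. */
double avg_lines_per_vector(const int32_t *idx, size_t n, size_t elem_bytes)
{
    enum { LANES = 16, LINE_BYTES = 64 };
    size_t groups = 0, lines = 0;

    for (size_t g = 0; g + LANES <= n; g += LANES) {
        int64_t seen[LANES];
        size_t nseen = 0;
        for (size_t l = 0; l < LANES; l++) {
            int64_t line = ((int64_t)idx[g + l] * (int64_t)elem_bytes) / LINE_BYTES;
            size_t k = 0;
            while (k < nseen && seen[k] != line)
                k++;
            if (k == nseen)          /* first time this line appears in the group */
                seen[nseen++] = line;
        }
        lines += nseen;
        groups++;
    }
    return groups ? (double)lines / (double)groups : 0.0;
}
```

A result close to 1 or 2 suggests the default sequence or a small N; a result close to 16 suggests that larger values of N are worth trying.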
The gatherhint/scatterhint instructions and unrolling of gather/scatter loops are useful for codes with non-unit stride memory accesses, and codes using indirect addressing through pointers or index arrays. Use the compiler option above to tune your application.
It is essential that you read this guide from start to finish, using the built-in hyperlinks to guide you along a path to a successful port and tuning of your application(s) on Intel® Xeon Phi™ Coprocessors. The paths provided in this guide reflect the steps necessary to get the best possible application performance.