The -qopt-gather-scatter-unroll=<N> compiler option tells the compiler to generate the gatherhint/scatterhint instructions supported by the coprocessor and to unroll gather/scatter loops. This is useful if your code performs non-unit-stride memory accesses or uses indirect addressing through pointers or index arrays.
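As a rough illustration of the access patterns described above, the following C sketch shows three loops that a vectorizing compiler would typically implement with gather or scatter sequences. The function names, the file name and the build line in the comment are hypothetical and are not taken from this guide; only the -qopt-gather-scatter-unroll option itself comes from the text.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Possible native coprocessor build (file name and the other flags are
 * assumptions for illustration):
 *   icc -mmic -O3 -std=c99 -qopt-gather-scatter-unroll=4 -c kernels.c
 */

/* Gather: table[] is read through an index array (indirect addressing). */
void gather_scale(float *restrict out, const float *restrict table,
                  const int *restrict idx, size_t n, float scale)
{
    assert(out != NULL && table != NULL && idx != NULL);
    for (size_t i = 0; i < n; ++i)
        out[i] = scale * table[idx[i]];
}

/* Scatter: dst[] is written through an index array.
 * Assumes idx[] holds no duplicate entries, so iterations are independent. */
void scatter_copy(float *restrict dst, const float *restrict src,
                  const int *restrict idx, size_t n)
{
    assert(dst != NULL && src != NULL && idx != NULL);
    for (size_t i = 0; i < n; ++i)
        dst[idx[i]] = src[i];
}

/* Non-unit stride: consecutive iterations touch elements stride apart. */
float strided_sum(const float *src, size_t n, size_t stride)
{
    assert(src != NULL);
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i)
        sum += src[i * stride];
    return sum;
}
```

Whether a given loop is actually vectorized with gather/scatter depends on the compiler version and on what it can prove about aliasing, so the vectorization report is the place to confirm it.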
Here is the compiler behavior related to gatherhint/scatterhint generation and unrolling of gather/scatter loops:
There are no “one-shot” gather/scatter instructions on KNC, so the compiler generates a loop to perform a complete gather/scatter. By default, the loop looks as follows:
L1:
    gather
    jkz L2
    gather
    jknz L1
L2:
The code above works well for most applications. For some applications, however, this loop runs faster when it is unrolled, and the best unroll factor differs from one application to another. When the loop is unrolled, placing gather/scatter hint instructions before the loop gives an additional benefit. With the option described here, the compiler generates an alternate gather/scatter code sequence that has these properties.
For example, if the -qopt-gather-scatter-unroll=3 option is specified, the compiler replaces the sequence above with the following unrolled version, preceded by two gather/scatter hint instructions:
    gather hint
    gather hint
    nop
L1:
    gather
    jkz L2
    gather
    gather
    gather
    jknz L1
L2:
The value of N that gives the best performance is data-dependent. When the gather/scatter accesses data in a small number of cache lines (say 1 or 2), the default sequence (using a small value of N) works best. When each individual data item falls in a different cache line, a large value of N may be better.
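To get a feel for which case a particular index pattern falls into, the number of distinct cache lines touched by one vector's worth of indices can be estimated on the host. The sketch below is a simplified model, not a description of the hardware: it assumes 16 single-precision elements per vector, 64-byte cache lines, a base address aligned to a cache line, non-negative indices, and that each trip of the gather loop services all pending elements that share one cache line. The function name gather_steps is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define VLEN 16          /* 32-bit elements per 512-bit vector (assumed) */
#define LINE_BYTES 64    /* cache-line size used by the model (assumed) */
#define ELEMS_PER_LINE (LINE_BYTES / (int)sizeof(float))

/* Returns the number of gather steps the model needs for n indices
 * (n <= VLEN): one step per distinct cache line touched. */
int gather_steps(const int32_t *idx, int n)
{
    assert(idx != 0 && n >= 0 && n <= VLEN);

    /* One mask bit per pending lane, like the mask driving the loop. */
    uint32_t mask = 0xFFFFu >> (VLEN - n);
    int steps = 0;

    while (mask != 0) {
        /* Take the lowest pending lane as the leader of this step. */
        int lead = 0;
        while (((mask >> lead) & 1u) == 0)
            ++lead;
        int32_t line = idx[lead] / ELEMS_PER_LINE;

        /* Retire every pending lane that falls in the leader's line. */
        for (int lane = lead; lane < n; ++lane) {
            if (((mask >> lane) & 1u) && idx[lane] / ELEMS_PER_LINE == line)
                mask &= ~(1u << lane);
        }
        ++steps;
    }
    return steps;
}
```

Under this model, 16 contiguous indices starting at a line boundary need a single step, while indices spaced 16 elements apart need 16 steps. A count of 1 or 2 corresponds to the case where the default sequence is expected to work best, and a count near 16 to the case where larger values of N are worth measuring.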
The gatherhint/scatterhint instructions and unrolling of gather/scatter loops are useful for codes with non-unit stride memory accesses, and codes using indirect addressing through pointers or index arrays. Use the compiler option above to tune your application.
It is essential that you read this guide from start to finish, using the built-in hyperlinks to guide you along a path to a successful port and tuning of your application(s) on Intel® Xeon Phi™ Coprocessors. The paths provided in this guide reflect the steps necessary to get the best possible application performance.
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804