Advanced Modeling Options

When you select a Target System of Intel® Xeon Phi™ or Offload to Intel Xeon Phi coprocessor, additional modeling parameters appear below the Runtime Modeling area, under Intel Xeon Phi Advanced Modeling:
  • Select Consider Code Vectorization if you agree to modify your parallel code later to improve vector parallel execution. If checked, you can specify:
    • Reference CPU Vectorization Speedup you expect can be achieved. This value indicates the speedup multiplier gain for the current site from using vectorization techniques on the reference CPU (dual-socket 8-core Intel® Xeon® processor E5-26xx product family at 2.7 GHz, 16 cores total). Base this estimate on the target device characteristics and your judgment of how much and how well this part of the code can be vectorized.
    • Intel Xeon Phi Vectorization Speedup you expect can be achieved. This value indicates the speedup multiplier gain for the current site from using vectorization techniques on an Intel® Xeon Phi™ processor. Base this estimate on the target device characteristics and your judgment of how much and how well this part of the code can be vectorized.
  • When you choose a Target System of Offload to Intel Xeon Phi, you can set the Offload Transfer Data Size to the data transfer size, in KB, that you expect can be achieved.
  • Click Apply after modifying any of these values.
In some cases, you can restructure your code to enable more efficient vector operations. Loop vectorization lets the hardware process independent data elements in small fixed-size units (usually 64 bytes at a time), as in operations on data arrays.
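For example, a simple loop over contiguous arrays is a natural candidate for vectorization, since each iteration is independent. A minimal sketch in C (the function and array names are illustrative, not from this guide):

    /* Each iteration is independent and accesses memory with unit stride,
       so the compiler can emit SIMD instructions that process several
       elements per operation (a 64-byte vector register holds 16 floats). */
    void add_arrays(const float *b, const float *c, float *a, int n)
    {
        for (int i = 0; i < n; ++i)
            a[i] = b[i] + c[i];
    }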
One way to enable more efficient vector operations is to modify a single loop to create a new outer loop, where the two loops together cover the same iteration space. This technique, called strip-mining, allows the innermost loop to use vector operations on small chunks of data.
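The sketch below illustrates strip-mining in C; the chunk size and names are illustrative assumptions, and in practice a vectorizing compiler often performs this transformation itself:

    /* Original loop:
           for (int i = 0; i < n; ++i)
               a[i] = b[i] * s;
       Strip-mined form: the outer loop walks the iteration space in
       fixed-size strips, and the short innermost loop maps naturally
       onto vector operations. */
    #define STRIP 16   /* illustrative chunk size, e.g. one vector width */

    void scale_array(float *a, const float *b, float s, int n)
    {
        for (int ii = 0; ii < n; ii += STRIP) {
            int end = (ii + STRIP < n) ? (ii + STRIP) : n;  /* handle the remainder strip */
            for (int i = ii; i < end; ++i)
                a[i] = b[i] * s;
        }
    }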
Other ways to enable more efficient vector operations include examining outermost loops where threading parallelism is already used and considering whether their innermost loops and/or callee functions can be vectorized.
Certain innermost loops may benefit from OpenMP 4 constructs. That is, under certain conditions you can combine an omp parallel for threading pragma with an omp simd (or similar) vectorization pragma (see the compiler vectorization report and the descriptions at http://openmp.org).
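A minimal sketch of this combination in C (the function and variable names are illustrative; it requires an OpenMP 4-capable compiler):

    #include <omp.h>

    void scale_matrix(float *m, int rows, int cols, float s)
    {
        /* Outer loop: rows are distributed across threads. */
        #pragma omp parallel for
        for (int r = 0; r < rows; ++r) {
            /* Inner loop: each thread processes its row with SIMD lanes. */
            #pragma omp simd
            for (int c = 0; c < cols; ++c)
                m[r * cols + c] *= s;
        }
    }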
The processor microarchitecture determines the type of vector instructions that will be supported and thus the size of data the hardware can process efficiently (see http://en.wikipedia.org/wiki/List_of_Intel_CPU_microarchitectures).
For a description of the Intel® Xeon Phi™ coprocessor architecture, visit the Intel® Developer Zone and read articles such as https://software.intel.com/content/www/us/en/develop/articles/intel-xeon-phi-coprocessor-codename-knights-corner.html.
