Performance Tools for Software Developers - SSE generation and processor-specific optimizations continued

This is a continuation of another article discussing processor-specific compiler options.
 

Can I combine the processor values and target more than one processor?
Yes. Using automatic processor dispatch technology, you can combine options to create a binary that potentially has optimized code paths for more than one Intel processor. For example, you could specify:

/QaxSSSE3,SSE4.2,AVX (Windows*) or -axSSSE3,SSE4.2,AVX (Linux*)
The resulting binary could potentially contain 4 code paths for any particular function, including one default code path that assumes only SSE2 support. The compiler generates an additional code path only if it sees a performance advantage in doing so, so it is unlikely that any particular function would actually get as many as 4 code paths.
You can also combine processor dispatch with the regular processor-targeting options to control the default path. For example, you could use:
/QaxAVX /arch:SSE3  (Windows)  or     -axavx -msse3  (Linux)

This would potentially create 2 code paths: one optimized for the 2nd generation Intel® Core™ processor family with support for Intel® AVX, and a default code path optimized for any Intel or compatible, non-Intel processor with SSE3 support.
Note: As a separate code path may be created for each specified processor, the size of the resulting binary may grow and affect the resulting performance. Targeting too many different processors is likely to decrease the performance of your application.
Note: Any additional code paths generated will be executed on Intel processors only. The default code path may execute on both Intel and compatible, non-Intel processors with support for the SSE2 instruction set. The optimization level and instruction set required for the default code path may be modified using the regular processor targeting options /Qx and /arch on Windows (-x and -m on Linux or Mac OS* X). Automatic processor dispatch technology may result in additional optimizations for Intel microprocessors that are not performed for non-Intel microprocessors.
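
To make the dispatch behavior described in this answer more concrete, here is a minimal sketch in C of how a path might be chosen at run time for a binary built with /QaxSSSE3,SSE4.2,AVX. It is an illustration only, not the compiler's actual dispatch code: the feature bits, the path_id values, the paths table and select_path() are all hypothetical names, and the processor's features are passed in as a plain bit mask rather than read from the hardware.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical feature bits; a real binary would derive these from CPUID. */
enum {
    FEAT_SSE2   = 1u << 0,
    FEAT_SSSE3  = 1u << 1,
    FEAT_SSE4_2 = 1u << 2,
    FEAT_AVX    = 1u << 3
};

/* Identifiers for the code paths a dispatched function might have. */
typedef enum {
    PATH_NONE = -1,       /* processor cannot run even the default path */
    PATH_DEFAULT_SSE2,
    PATH_SSSE3,
    PATH_SSE4_2,
    PATH_AVX
} path_id;

typedef struct {
    path_id  id;
    unsigned required;    /* every bit must be present on the processor */
    int      intel_only;  /* 1 for the additional paths, 0 for the default */
} code_path;

/* Most specialized path first; the default path comes last. */
static const code_path paths[] = {
    { PATH_AVX,          FEAT_SSE2 | FEAT_SSSE3 | FEAT_SSE4_2 | FEAT_AVX, 1 },
    { PATH_SSE4_2,       FEAT_SSE2 | FEAT_SSSE3 | FEAT_SSE4_2,            1 },
    { PATH_SSSE3,        FEAT_SSE2 | FEAT_SSSE3,                          1 },
    { PATH_DEFAULT_SSE2, FEAT_SSE2,                                       0 }
};

/* Return the first path whose requirements the processor satisfies. */
path_id select_path(unsigned available, int is_intel)
{
    size_t n = sizeof paths / sizeof paths[0];
    size_t i;

    /* The last table entry must be the vendor-neutral default path. */
    assert(!paths[n - 1].intel_only);

    for (i = 0; i < n; ++i) {
        int has_features =
            (available & paths[i].required) == paths[i].required;
        if (has_features && (is_intel || !paths[i].intel_only))
            return paths[i].id;
    }
    return PATH_NONE;
}
```

In this sketch, an Intel processor reporting all four feature bits selects PATH_AVX, while a non-Intel processor with the same bits falls through to PATH_DEFAULT_SSE2, mirroring the note that additional code paths are executed on Intel processors only.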

What has changed in version 14.0 from previous releases with respect to these processor-targeting options?

The 14.0 compilers introduced the following new switches:

/QxATOM-SSE4.2 (-xatom_sse4.2 for Linux*) for Intel® Atom™ processor-based systems with support for Intel® SSE4.2 instructions

/QxATOM-SSSE3 (-xatom_ssse3 for Linux*) for Intel® Atom™ processor-based systems with support for Intel® SSSE3 instructions

The switches /QxSSSE3_ATOM and /QxSSE3_ATOM (-xSSSE3_ATOM and -xSSE3_ATOM for Linux*) were deprecated.

What has changed in version 13.0 and 13.1 from previous releases with respect to these processor-targeting options?

The 13.0 and 13.1 compilers introduced the following new switches:

/QxCORE-AVX2 or /QaxCORE-AVX2 (-xcore-avx2 or -axcore-avx2 for Linux*) for systems with support for Intel® Advanced Vector Extensions 2 (Intel® AVX2)

What has changed in version 11.1, 12.0 and 12.1 from previous releases with respect to these processor-targeting options?

The 11.1, 12.0 and 12.1 compilers introduced the following new switches:

/QxAVX or /QaxAVX (-xavx or -axavx for Linux*) for systems with support for Intel® Advanced Vector Extensions

What has changed in version 11.0 from previous releases with respect to these processor-targeting options?
The 11.0 compiler introduced the following new switches:

  • /QxHost (-xHost for Linux* or Mac OS* X) generates instructions for the highest instruction set and processor available on the compilation host machine. See the compiler documentation for more details.
  • /QxSSE4.2 or /QaxSSE4.2 (-xSSE4.2 or -axSSE4.2 for Linux*) for systems with SSE4.2 support
  • /QxSSE3_ATOM (-xSSE3_ATOM for Linux) for Intel® Atom™ processor and Intel® Centrino® Atom™ Processor Technology
  • The new processor default is /arch:SSE2 (Windows*) or -msse2 (Linux*).
  • /arch:SSE3 (-msse3 on Linux) will generate optimized code that runs on both Intel and compatible, non-Intel processors with support for at least SSE3.
  • /arch:IA32 (-mia32 on Linux) will generate optimized code without SSE instructions that will run on older Intel or compatible, non-Intel processors of IA-32 architecture that do not support SSE2 instructions.
 
  • In addition, 11.0 introduced a new naming scheme for the processor-targeting switches. The previous /Qax[K, W, N, O, P, T, S] or /Qx[K, W, N, O, P, T, S] (-ax[K, W, N, O, P, T, S] or -x[K, W, N, O, P, T, S] on Linux) are now /Qax[SSE, SSE2, SSE3, SSSE3, SSE4.1] or /Qx[SSE, SSE2, SSE3, SSSE3, SSE4.1] (-ax[SSE, SSE2, SSE3, SSSE3, SSE4.1] or -x[SSE, SSE2, SSE3, SSSE3, SSE4.1] on Linux).

 

The instruction set default behavior has changed in 11.0 on Windows* and Linux:

When compiling for the IA-32 architecture, /arch:SSE2 (formerly /QxW) is now the default in 11.0 for Windows, and -msse2 (formerly -xW) is the default in 11.0 for Linux. Programs built with /arch:SSE2 (-msse2) must be run on a processor that supports at least SSE2, such as the Intel® Pentium® 4 processor or certain AMD* processors.

Note that this may change floating point results very slightly, since SSE instructions will be used instead of x87 instructions and therefore computations will be done in the declared precision rather than sometimes a higher precision.
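
The effect of declared versus higher intermediate precision can be seen in a small C sketch. This is an illustration under stated assumptions, not output from the Intel compiler: the function names are hypothetical, the volatile float stores are used only to force rounding to single precision after every operation (as scalar SSE code does), and double stands in for the wider x87 register format.

```c
#include <assert.h>

/* Evaluate (a + b) - a with every intermediate rounded to float,
   which is how SSE scalar code computes in the declared precision. */
float diff_declared_precision(float a, float b)
{
    volatile float sum  = a + b;    /* store forces rounding to float */
    volatile float diff = sum - a;
    return diff;
}

/* Evaluate the same expression with intermediates kept in a wider
   format; double is used here as a stand-in for x87 registers. */
double diff_wider_precision(float a, float b)
{
    double sum = (double)a + (double)b;
    return sum - (double)a;
}

/* Self-check for a = 2^24, b = 1: 2^24 + 1 is not representable in
   single precision, so the float version loses b entirely. */
void check_precision_example(void)
{
    assert(diff_declared_precision(16777216.0f, 1.0f) == 0.0f);
    assert(diff_wider_precision(16777216.0f, 1.0f) == 1.0);
}
```

Most real programs see far smaller differences than this deliberately chosen case, but the mechanism is the same: an intermediate value that survives in a wider register is rounded away when each operation is performed in the declared precision.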

All Intel 64 architecture processors support SSE2.

To set the default to generic IA-32 with no SSE support, as in 10.1 and earlier compilers, specify /arch:IA32 (-mia32 on Linux).

How can I generate code that will run optimally on any processor from Intel or AMD*?
The compiler's default optimizations, /O2 (-O2 on Linux and Mac OS X), generate optimized code for both Intel and compatible, non-Intel processors of IA-32 or Intel® 64 architecture that support at least SSE2. In addition, /Qipo (inter-procedural optimization or IPO, -ipo on Linux and Mac OS X), /Qprof_use (profile-guided optimization or PGO, -prof_use), and /O3 (high-level loop/memory optimizations, -O3) can provide additional performance for many types of applications. These options are available for both Intel and non-Intel microprocessors, but they may result in more optimizations for Intel microprocessors than for non-Intel microprocessors.

Why is there a need for a run-time check of the processor in the /Qx[CORE-AVX2, AVX, SSE4.2, SSE4.1, SSSE3, SSE3, SSE2] ( -x[CORE-AVX2, AVX, SSE4.2, SSE4.1, SSSE3, SSE3, SSE2] on Linux*), processor-specific options?
These options generate processor-specific instructions, such as Intel® Advanced Vector Extensions, SSE4 Efficient Accelerated String and Text Processing Instructions, SSE4 Vectorizing Compiler and Media Accelerators, SSSE3, SSE3, or SSE2, that may or may not be supported on other Intel and non-Intel processors. The compilers now provide a safeguard that verifies that the processor on which the application is running is indeed the processor that was targeted. A run-time check is inserted in the resulting executable that halts the application with an explanatory message if it is run on an incompatible processor. Without this run-time check, an application may crash with an illegal instruction fault or silently exhibit unexpected behavior if run on an incompatible processor.
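
The idea behind such a run-time check can be sketched in C as follows. This is not the code the Intel compiler inserts. It is a hand-written approximation that assumes a GCC- or Clang-compatible compiler on x86 and uses their __builtin_cpu_init and __builtin_cpu_supports built-ins; the feature bits and the functions detect_features(), target_supported() and require_target() are hypothetical names chosen for this example.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical feature bits used only within this example. */
enum {
    FEAT_SSE2   = 1u << 0,
    FEAT_SSE4_2 = 1u << 1,
    FEAT_AVX    = 1u << 2,
    FEAT_AVX2   = 1u << 3
};

/* Query the processor the program is running on. On compilers or
   architectures without the built-ins, no features are reported. */
unsigned detect_features(void)
{
    unsigned f = 0;
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("sse2"))   f |= FEAT_SSE2;
    if (__builtin_cpu_supports("sse4.2")) f |= FEAT_SSE4_2;
    if (__builtin_cpu_supports("avx"))    f |= FEAT_AVX;
    if (__builtin_cpu_supports("avx2"))   f |= FEAT_AVX2;
#endif
    return f;
}

/* Nonzero if every required feature bit is available. */
int target_supported(unsigned available, unsigned required)
{
    return (available & required) == required;
}

/* Halt with an explanatory message instead of risking an
   illegal instruction fault later in the program. */
void require_target(unsigned required)
{
    if (!target_supported(detect_features(), required)) {
        fprintf(stderr,
                "Fatal: this program was built for instruction sets "
                "that the current processor does not support.\n");
        exit(EXIT_FAILURE);
    }
}

/* Self-check of the mask logic with made-up feature sets. */
void check_mask_logic(void)
{
    assert(target_supported(FEAT_SSE2 | FEAT_AVX, FEAT_AVX));
    assert(!target_supported(FEAT_SSE2, FEAT_AVX));
}
```

A program built for AVX would call something like require_target(FEAT_SSE2 | FEAT_AVX) at the start of main, before any AVX instruction can execute. Unlike the compiler's own check, this sketch tests only for instruction-set support and does not distinguish Intel from non-Intel processors.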

 

Does Intel test the Intel® Compilers on all processor types including non-Intel processors?
We cannot test on all processor and platform combinations, but we do perform extensive testing and benchmarking on many platforms. This gives us confidence that our optimizations, such as /O2, /O3, /Qipo, profile-guided optimizations using /Qprof_use (-O2, -O3, -ipo, -prof_use on Linux and Mac OS X), and other processor-independent compiler options, work well on all Intel processors and compatible, non-Intel processors. Processor options /arch:IA32, /arch:SSE2 and /arch:SSE3 (-mia32, -msse2 and -msse3 on Linux) are tested on various Intel and compatible, non-Intel processors. Intel processor-specific compiler options such as /QxCORE-AVX2, /QxAVX, /QxSSE4.2, /QxSSE4.1, /QxSSSE3, /QxSSE3, /QxSSE2 (-xcore-avx2, -xavx, -xsse4.2, -xsse4.1, -xssse3, -xsse3, -xsse2 on Linux) and the corresponding /Qax (-ax on Linux) options are validated only on the corresponding Intel processors.

 

Does Intel offer customer support for Intel® Compilers used on non-Intel processors?
Yes. Our goal is for the Intel compiler to deliver performance that is competitive with other compilers on non-Intel processors as well. Intel will therefore accept problem reports and fix issues reported on non-Intel processor-based systems.

 

If a user still has an Intel® Pentium® III processor to support, what is Intel's recommendation for using the Intel® Compilers?
With the 11.x and later compilers, we no longer have a Pentium® III processor-specific option. The option /arch:IA32 (-mia32 on Linux*) may be used to generate optimized code without Intel® SSE instructions that will run on Pentium III or older Intel or compatible, non-Intel processors of IA-32 architecture that do not support Intel® SSE2 instructions. When combined with /arch:IA32, the non-processor-specific optimizations such as /O2, /O3, /Qipo and profile-guided optimizations using /Qprof_use (-O2, -O3, -ipo, -prof_use on Linux* and Mac OS X*) generate optimized code without making use of Intel® Streaming SIMD Extensions (Intel® SSE). The option /arch:IA32 (-mia32 on Linux) may also be used in conjunction with the automatic processor-dispatch options to generate a default code path that will work on any IA-32 Intel or compatible, non-Intel processor.

 

Where can I find more information on processor-specific optimizations?
For more detail, see the main C++ and Fortran Compiler User and Reference Guides.

 
For more information about compiler optimizations, see the Optimization Notice.