The following recommendations assume that you already have a working application. Before you start, check the correctness of your program by building it without optimization, using /Od on Windows* or -O0 on Linux* and OS X*. While following the steps, if you learn of a new tool or feature, be sure to follow the links to the summary tools section and from there into the Developer Guides and/or User Reference materials for the specific tool.
1. Establish a baseline
Begin by measuring performance with the basic optimizing compiler options: /O1, /O2 or /O3 on Windows* or -O1, -O2, or -O3 on Linux* and OS X*. Use whichever gives the best performance as your performance baseline. Most users should start with default optimization, O2, before trying optimizations that are more advanced. The O3 option will perform more aggressive loop transformations.
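A baseline is only useful if the measurement is repeatable. The following is a minimal sketch of a timing helper, not part of the product: the names kernel and time_kernel are hypothetical, and kernel stands in for whatever hot loop your application really runs. It assumes the standard C clock() function is an adequate timer for your workload.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical hot loop standing in for your application's real work:
   returns the sum of squares of x[0..n-1]. */
double kernel(const double *x, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i)
        sum += x[i] * x[i];
    return sum;
}

/* Run kernel() `reps` times and return the smallest CPU time observed,
   in seconds. The last kernel result is stored in *result so that the
   compiler cannot discard the work. Requires reps >= 1. */
double time_kernel(const double *x, size_t n, int reps, double *result)
{
    assert(reps >= 1 && result != NULL);   /* precondition */
    double best = -1.0;
    for (int r = 0; r < reps; ++r) {
        clock_t t0 = clock();
        *result = kernel(x, n);
        clock_t t1 = clock();
        double secs = (double)(t1 - t0) / CLOCKS_PER_SEC;
        if (best < 0.0 || secs < best)
            best = secs;
    }
    return best;
}
```

Call time_kernel() from your own main(), build the program once each with -O1, -O2 and -O3 (or /O1, /O2 and /O3 on Windows*), and keep the fastest build as the baseline. Taking the minimum over several repetitions reduces noise from other activity on the machine.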
2. Fine-tune performance using processor-specific options
The compilers provide options that allow you to specify which processors you would like to target. The /Qx option on Windows* and the -x option on Linux* and OS X* allow you to specify an instruction set to target a particular Intel processor. For non-Intel processors, you may use /arch on Windows* or -m on Linux* or OS X* for similar functionality that ensures you will not use Intel-specific optimizations. The /QxHost and -xHost options will automatically set the architecture to target based on the host computer used for compilation.
For more information, see the Code Generation Options section of the C++ or Fortran Developer Guide and Reference or see the online article “Intel® Compiler Options for Intel® SSE and Intel® AVX generation (SSE2, SSE3, SSSE3, ATOM_SSSE3, SSE4.1, SSE4.2, ATOM_SSE4.2, AVX, AVX2, AVX-512) and processor-specific optimizations”.
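To confirm which instruction set a given build actually targets, you can inspect the compiler's predefined macros. The sketch below assumes the conventional __SSE2__, __AVX__, __AVX2__ and __AVX512F__ macros used by GCC-compatible compilers on Linux* and OS X*; whether your compiler defines each of them for a particular option is an assumption to verify against its documentation. The function name targeted_isa is hypothetical.

```c
/* Report the widest instruction set this translation unit was compiled
   for, based on predefined macros. This reflects the compile-time target
   only; it does not probe the processor the program is running on. */
const char *targeted_isa(void)
{
#if defined(__AVX512F__)
    return "AVX-512";
#elif defined(__AVX2__)
    return "AVX2";
#elif defined(__AVX__)
    return "AVX";
#elif defined(__SSE2__)
    return "SSE2";
#else
    return "baseline (no SSE2/AVX macro defined)";
#endif
}
```

Printing the returned string from builds made with different processor-specific options is a quick way to check that the option had the effect you expected before you compare timings.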
3. Add interprocedural optimizations (IPO) and/or profile-guided optimizations (PGO)
Interprocedural optimization is a multi-step process that allows the compiler to analyze your code and apply a wide range of optimizations during compilation and link. To turn on multi-file compilation analysis, use the /Qipo option on Windows* or -ipo option on Linux* and OS X*.
For more information, see the “Interprocedural Optimization (IPO)” and “Interprocedural Optimization (IPO) Options” sections of the Developer Guide and Reference.
Profile-guided Optimization (PGO) improves application performance by shrinking code size, reducing branch mispredictions, and reorganizing code layout to reduce instruction-cache problems. This multi-step process involves:
- Compiling with /Qprof-gen on Windows* or -prof-gen on Linux* and OS X*
- Generating one or more profiles by running the instrumented executable one or more times on representative input
- Re-compiling with /Qprof-use on Windows* or -prof-use on Linux* and OS X* to apply optimizations
After trying these options, measure performance again to determine whether your application benefits from one or both of them.
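The kind of code that tends to benefit from PGO has branches whose usual direction depends on the input data. The following is a minimal sketch of such a function; the name count_below is hypothetical, and the build steps in the comment simply restate the options listed above.

```c
#include <assert.h>
#include <stddef.h>

/* Count how many samples fall below a threshold.
   Which way the branch usually goes depends on the data, which is
   exactly the information a profiling run records.

   Workflow on Linux* and OS X* (Windows* uses the /Q forms):
     1. compile the application with -prof-gen
     2. run it on representative input to write the profile
     3. recompile the same sources with -prof-use */
size_t count_below(const int *samples, size_t n, int threshold)
{
    assert(samples != NULL || n == 0);   /* precondition */
    size_t count = 0;
    for (size_t i = 0; i < n; ++i) {
        if (samples[i] < threshold)      /* data-dependent branch */
            ++count;
    }
    return count;
}
```

The quality of the result depends on how representative the profiling input is: if the runs in step 2 exercise the branch differently from production data, the recorded profile can steer the compiler in the wrong direction.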
For more information, see the “Profile-Guided Optimization (PGO)” and “Profile Guided Optimization (PGO) Options” sections of the Developer Guide and Reference.
4. Use compiler features and options to automatically detect threading and vectorization opportunities
Optimize your application for vector and parallel execution on multi-threaded, multi-core and multi-processor systems using:
- Advice from the Guided Auto-Parallelism (GAP) feature: /Qguide on Windows* or -guide on Linux* and OS X*
- Optimization reports generated with /Qopt-report or -qopt-report
- Parallel performance options: /Qparallel or -parallel for automatic parallelization, and /Qopenmp or -qopenmp for multi-threaded code based on OpenMP*
- Coarray feature of Fortran 2008
- Intel® Performance Libraries included with the product
For more information, see the “Guided Auto Parallelism”, “Automatic Parallelization”, “OpenMP* Support”, and “Coarrays” sections of the C++ or Fortran compiler Developer Guide and Reference, and the individual reference documents for the Intel® Performance Libraries. For more information on optimization reports, visit “Getting the Most out of your Intel® Compiler with the New Optimization Reports”.
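As an illustration of a loop that lends itself to both threading and vectorization, here is a minimal sketch in C. The function name saxpy is an illustrative choice rather than anything provided by the compiler or its libraries, and the example assumes a C99 compiler so that the restrict qualifier is available.

```c
#include <assert.h>
#include <stddef.h>

/* Compute y[i] = a * x[i] + y[i] for i in [0, n).
   Every iteration is independent of the others, and `restrict` promises
   the compiler that x and y do not overlap, so the loop can be divided
   among threads and vectorized.

   When the program is built with the OpenMP* option listed above, the
   pragma splits the iterations across threads. Without that option the
   pragma is ignored and the loop runs serially with the same result.
   An optimization report for the build indicates whether the loop was
   vectorized. */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    assert((x != NULL && y != NULL) || n == 0);   /* precondition */
    #pragma omp parallel for
    for (long i = 0; i < (long)n; ++i)
        y[i] = a * x[i] + y[i];
}
```

A signed loop index is used because it is accepted by all OpenMP* versions. If the optimization report says the loop was not vectorized, the reason it gives (for example, an assumed dependence between x and y) is usually the best starting point for the next change.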
5. Use Intel® VTune™ Amplifier XE and Intel® Advisor to help you identify serial and parallel performance hotspots
Use Intel® VTune™ Amplifier XE to help you identify serial and parallel performance “hotspots” so that you know which specific parts of your application could benefit from further tuning. Use Intel® Advisor to vectorize and thread your code by following an easy optimization workflow.
If you are new to these profilers, you may want to start with the Intel® Performance Snapshot tools, which are available as preview features in this release.
Application Performance Snapshot
The Application Performance Snapshot provides a quick view into a non-MPI application's use of available hardware (CPU, FPU, and Memory).
Storage Performance Snapshot
The Storage Performance Snapshot provides an idea of how effectively a workload uses the available storage, CPU, memory, and network.
Intel® Inspector XE
Use Intel® Inspector XE to reduce the time to market for threaded applications by diagnosing memory and threading errors and speeding up the development process. Intel Inspector XE cannot be used on non-Intel microprocessors.
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804