Application Performance: A Step-by-Step Introduction to Application Tuning with Intel® Compilers
Before you begin performance tuning, you may want to check the correctness of your application by building it without optimization using /Od (Windows*) or -O0 (Linux* or Mac OS* X). All optimization levels assume support for the Intel® SSE2 instruction set by default.
Use the general optimization options (Windows /O1, /O2 or /O3; Linux and Mac OS X -O1, -O2, or -O3) and determine which one works best for your application by measuring performance with each. Most users should start at /O2 (-O2), the default, before trying more advanced optimizations. Next, for loop-intensive applications, try /O3 (-O3)¹.
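One way to compare optimization levels is to time a representative kernel in a small harness and rebuild it at each level. The sketch below is illustrative only: `saxpy` and `time_saxpy` are hypothetical names rather than part of any Intel product, the file name in the comments is made up, and `clock()` reports CPU time, which may differ from wall-clock time.

```c
#include <stddef.h>
#include <time.h>

/* Hypothetical loop-heavy kernel: y = a*x + y over n elements. */
static void saxpy(float a, const float *x, float *y, size_t n)
{
    size_t i;
    for (i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

/* Run the kernel `reps` times and return the elapsed CPU time in seconds. */
static double time_saxpy(float a, const float *x, float *y, size_t n, int reps)
{
    int r;
    clock_t start = clock();
    for (r = 0; r < reps; ++r)
        saxpy(a, x, y, n);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/*
 * Assumed Linux builds to compare (bench.c is a hypothetical file name):
 *   icc -O1 bench.c -o bench_O1
 *   icc -O2 bench.c -o bench_O2
 *   icc -O3 bench.c -o bench_O3
 */
```

When timing this way, use the results afterwards (for example, print one element of `y`) so the compiler cannot discard the work, and repeat each measurement a few times to smooth out noise.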
Fine-tune performance to target IA-32 and Intel 64-based systems using processor-specific options. Examples are /QxAVX (-xavx) for the 2nd- and 3rd-Generation Intel® Core™ processor families, /QxSSE4.2 (-xsse4.2) for the Intel® Core™ processor family and /arch:SSE3 (-msse3) for compatible, non-Intel processors that support at least the SSE3 instruction set. Alternatively, you can use /QxHOST (-xhost), which will use the most advanced instruction set for the processor on which you compiled. For a more extensive list and description of options that optimize for specific processors or instruction sets, please see the online article “Intel® compiler options for SSE generation and processor-specific optimizations” and the Intel Compiler User and Reference Guides¹.
Add interprocedural optimization (IPO), /Qipo (-ipo) and/or profile-guided optimization (PGO), /Qprof-gen and /Qprof-use (-prof-gen and -prof-use); then measure performance again to determine whether your application benefits from one or both of them¹.
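As a rough illustration of where PGO can help, consider a hot function with a data-dependent branch: a profile from a representative run tells the compiler which side of the branch is common. The function `sum_nonnegative`, the file name, and the assumption that negative values are rare are all hypothetical; the build steps in the comment sketch the usual instrument, run, rebuild sequence on Linux.

```c
#include <stddef.h>

/* Hypothetical hot function: sums only the non-negative entries.
 * Whether the branch is rarely or often taken depends on the input,
 * which is what a training run records. */
static long sum_nonnegative(const int *v, size_t n)
{
    long total = 0;
    size_t i;
    for (i = 0; i < n; ++i) {
        if (v[i] < 0)   /* assumed rare in the representative workload */
            continue;
        total += v[i];
    }
    return total;
}

/*
 * Assumed three-step build (app.c and app are hypothetical names):
 *   icc -O2 -prof-gen app.c -o app         build an instrumented binary
 *   ./app                                  run it on representative input
 *   icc -O2 -prof-use -ipo app.c -o app    rebuild using the collected profile
 */
```

The profile is only as good as the training input, so it is worth measuring the final binary on data that was not used for the training run.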
Optimize your application for vector and parallel execution on multi-threaded, multi-core and multi-processor systems using:
advice from the Guided Auto-Parallelism (GAP) feature, /Qguide (-guide);
the Intel® Cilk™ Plus language extensions for C/C++;
the parallel performance options /Qparallel (-parallel) or /Qopenmp (-openmp);
the CoArray feature of Fortran 2008;
or the Intel® Performance Libraries included with the product¹.
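Of the approaches listed above, OpenMP is perhaps the simplest to show in a few lines. The following is a minimal sketch, not a tuned implementation: `dot` is a hypothetical function, and it assumes the code is built with /Qopenmp (-openmp). Without that option the pragma is ignored and the loop runs serially with the same result.

```c
#include <stddef.h>

/* Dot product parallelized with an OpenMP reduction.
 * Each thread accumulates a private partial sum; OpenMP combines
 * the partial sums into `sum` when the loop finishes. */
static double dot(const double *a, const double *b, int n)
{
    double sum = 0.0;
    int i;
    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}

/*
 * Assumed Linux build (dot.c is a hypothetical file name):
 *   icc -O2 -openmp dot.c -o dot
 */
```

Because a parallel reduction may add the partial sums in a different order than the serial loop, floating-point results can differ in the last few bits between runs or thread counts.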
Use Intel® VTune™ Amplifier XE to help you identify serial and parallel performance “hotspots” so that you know which specific parts of your application could benefit from further tuning. Use Intel® Inspector XE to reduce the time to market for threaded applications by diagnosing memory and threading errors and speeding up the development process. These products cannot be used on non-Intel microprocessors.
Please consult the main product documentation for C/C++ and Fortran for more details. For tuning applications for Intel® Xeon Phi™ coprocessors, see the references in the section "Options for the Intel® Xeon Phi™ Coprocessor." A brief summary of the major optimization options of the Intel Compiler is available in the Quick-Reference Guide to Optimization with Intel® Compilers version 13.
¹ Several of these options or optimization features are available for both Intel® and non-Intel microprocessors, but they may perform more optimization for Intel microprocessors than they perform for non-Intel microprocessors.