Application Performance: A Step-by-Step Introduction to Application Tuning with Intel® Compilers
Before you begin performance tuning, you may want to check the correctness of your application by building it without optimization using /Od (Windows*) or -O0 (Linux* or Mac OS* X). In compiler versions 11 and later, all optimization levels assume support for the SSE2 instruction set by default.
1. Use the general optimization options (Windows /O1, /O2 or /O3; Linux and Mac OS X -O1, -O2, or -O3) and determine which one works best for your application by measuring performance with each. Most users should start at /O2 (-O2), the default, before trying more advanced optimizations. Next, for loop-intensive applications, try /O3 (-O3). These options are available for both Intel® and non-Intel microprocessors but they may perform more optimizations for Intel microprocessors than they perform for non-Intel microprocessors.
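To compare optimization levels in this way, it helps to have a small, loop-intensive kernel whose run time can be measured with each build. The following is a minimal sketch in C and is not taken from the Intel documentation: the function name dot_product, the file name dot.c and the icc invocations in the comment are illustrative assumptions.

```c
#include <stddef.h>

/*
 * Hypothetical loop-intensive kernel for comparing optimization levels.
 * Example builds on Linux (the file name dot.c is an assumption):
 *   icc -O0 -c dot.c    no optimization, for checking correctness
 *   icc -O2 -c dot.c    default optimization level
 *   icc -O3 -c dot.c    more aggressive loop optimizations
 * Time the same workload with each build and keep the fastest.
 */
double dot_product(const double *a, const double *b, size_t n)
{
    double sum = 0.0;
    size_t i;

    /* Simple unit-stride loop: a typical candidate for loop optimization. */
    for (i = 0; i < n; i++) {
        sum += a[i] * b[i];
    }
    return sum;
}
```

Calling dot_product from a timing harness with arrays large enough to run for a measurable time gives one number per optimization level to compare.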
2. Fine-tune performance to target IA-32 and Intel 64-based systems using processor-specific options. Examples are /QxSSE4.2 (-xsse4.2) for the Intel® Core™ processor family, e.g. the Intel Core i7 processor, and /arch:SSE3 (-msse3) for compatible, non-Intel processors that support at least the SSE3 instruction set. Alternatively, you can use /QxHOST (-xhost) which will use the most advanced instruction set for the processor on which you compiled. This option is available for both Intel® and non-Intel microprocessors but it may perform more optimizations for Intel microprocessors than it performs for non-Intel microprocessors. For a more extensive list and description of options that optimize for specific processors or instruction sets, please see the online article “Intel® compiler options for SSE generation and processor-specific optimizations” and the Intel Compiler User and Reference Guides.
3. Add interprocedural optimization (IPO), /Qipo (-ipo) and/or profile-guided optimization (PGO), /Qprof-gen and /Qprof-use (-prof-gen and -prof-use); then measure performance again to determine whether your application benefits from one or both of them.
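Profile-guided optimization is a multi-step build: compile an instrumented binary, run it on representative input, then recompile using the collected profile. The sketch below shows the kind of data-dependent code where a profile may help, with the build sequence outlined in the comment. The function name count_above, the file names and the input name are hypothetical, and whether PGO pays off for any given application still has to be measured.

```c
#include <stddef.h>

/*
 * Hypothetical example of code that may benefit from PGO: the branch below
 * depends on the input data, so the compiler cannot tell statically how
 * often it is taken.
 *
 * Sketch of a PGO build on Linux (file and input names are assumptions):
 *   icc -prof-gen -o app main.c count.c         1. instrumented build
 *   ./app typical_input                         2. run on representative data
 *   icc -prof-use -ipo -o app main.c count.c    3. rebuild using the profile
 */
size_t count_above(const int *values, size_t n, int threshold)
{
    size_t count = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        if (values[i] > threshold) {   /* how often this is true depends on input */
            count++;
        }
    }
    return count;
}
```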
4. Optimize your application for vector and parallel execution on multi-threaded, multi-core and multi-processor systems using:
advice from the new Guided Auto-Parallelism (GAP) feature, /Qguide (-guide);
the Intel® Cilk™ Plus language extensions for C/C++;
the parallel performance options /Qparallel (-parallel) or /Qopenmp (-openmp);
the CoArray feature of Fortran 2008;
or by using the Intel® Performance Libraries included with the product.
These optimization steps are applicable to both Intel and non-Intel microprocessors, but may result in a greater performance gain on Intel microprocessors than on non-Intel microprocessors.
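As one illustration of the parallel options listed above, the following minimal sketch uses an OpenMP reduction in C. It is an assumed example, not code from the Intel documentation, and the function name sum_of_squares is hypothetical.

```c
/*
 * Hypothetical OpenMP reduction. Build with -openmp (Linux, Mac OS X) or
 * /Qopenmp (Windows) to enable the pragma. Without that option the pragma
 * is not acted on and the loop runs serially, giving the same result up to
 * floating-point rounding.
 */
double sum_of_squares(const double *x, long n)
{
    double sum = 0.0;
    long i;

    /* Each thread accumulates a private partial sum; the partial sums
       are combined into sum when the loop ends. */
    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < n; i++) {
        sum += x[i] * x[i];
    }
    return sum;
}
```

The reduction clause is what makes the loop safe to parallelize: without it, concurrent updates to sum would be a data race.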
5. Use Intel® VTune™ Amplifier XE to help you identify serial and parallel performance “hotspots” so that you know which specific parts of your application could benefit from further tuning. Use Intel® Inspector XE to reduce the time to market for threaded applications by diagnosing memory and threading errors and speeding up the development process. These products cannot be used on non-Intel microprocessors.
For more details, please consult the main product documentation, e.g. in the Intel® Software Documentation Library. A brief summary of the major optimization options of the Intel Compiler is available in the Quick-Reference Guide to Optimization with Intel® Compilers version 12.