Intel® Advisor provides two design tools to help ensure your Fortran, C, and C++ native/managed applications realize their full performance potential on modern processors, such as Intel® Xeon® and Intel® Xeon Phi™ processors.
The following is a glossary for the Vectorization Advisor. It is a work in progress.
alignment of code: Placement of a contiguous code section (loop or function) in memory such that the address of its first byte is divisible by n, where n is a power of two. Such a code section is called n-byte aligned.
alignment of data: Placement of contiguous data (such as a variable or a C/C++ struct/class) in memory such that its starting address is divisible by n, where n is a power of two. Such data is called n-byte aligned. You may achieve better performance if data is aligned at least to its size.
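As a rough illustration of data alignment, the following C++ sketch requests a 64-byte boundary for a small struct and checks an address against a power-of-two boundary. The names `is_aligned` and `AlignedBlock` are hypothetical and chosen only for this example, and the 64-byte figure is an assumption, not a recommendation from this glossary.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: returns true if `p` lies on an `n`-byte boundary.
// `n` must be a nonzero power of two.
inline bool is_aligned(const void* p, std::size_t n) {
    assert(n != 0 && (n & (n - 1)) == 0);  // reject non-power-of-two boundaries
    return (reinterpret_cast<std::uintptr_t>(p) & (n - 1)) == 0;
}

// Hypothetical type: 16 floats whose storage is requested on a 64-byte boundary.
struct alignas(64) AlignedBlock {
    float values[16];
};
```

With these definitions, an `AlignedBlock` object and its `values` array both start at an address divisible by 64, so `is_aligned(&block, 64)` should hold for any such object.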
CPU front-end: The part of a CPU core that reads instructions from memory, decodes them, and sends them to the execution core (back-end). Under certain circumstances, the front-end may process too few instructions per clock cycle, which results in under-utilization of the back-end.
FMA: Fused multiply-add instructions that improve the performance and accuracy of floating-point computations. Sample syntax: A = A * B + C. These instructions are faster because the multiplication and addition are performed as a single operation rather than in separate steps, and more accurate because the intermediate product is kept at full precision and only the final result is rounded.
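The accuracy effect can be seen in a small C++ sketch that compares the standard library function `std::fma` with a multiply followed by a separate add. The function names `fused` and `separate` are hypothetical, and the `volatile` intermediate is an assumption used here only to stop the compiler from fusing the two steps on its own.

```cpp
#include <cmath>

// Single rounding: a * b + c is computed as if with unlimited precision,
// then rounded once to double.
inline double fused(double a, double b, double c) {
    return std::fma(a, b, c);
}

// Two roundings: the product is rounded to double before the addition.
// The volatile store keeps the compiler from contracting this into an FMA.
inline double separate(double a, double b, double c) {
    volatile double product = a * b;
    return product + c;
}
```

For example, with a = 1 + 2^-30, b = 1 - 2^-30, and c = -1, the exact product is 1 - 2^-60. Rounding that product to double gives 1, so `separate` returns 0, while `fused` keeps the low-order bits and returns -2^-60.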
peeled loop: A small, (usually) compiler-generated loop created to align the memory accesses inside the loop body and maximize its efficiency. The compiler peels off any initial iterations containing misaligned accesses, which leaves the remaining iterations’ memory accesses optimally aligned. A peeled loop always has a trip count smaller than the vector length.
remainder loop: A (usually) compiler-generated loop created to clean up any remaining iterations that do not fit within the scope of the loop body. The compiler typically generates remainder loops when the source loop trip count is not a multiple of the vector length.
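To make the relationship between the peeled loop, the vectorized loop body, and the remainder loop concrete, here is a minimal C++ sketch that splits a trip count into those three parts. It is not how any particular compiler decides the split; `LoopSplit` and `split_loop` are hypothetical names, and the sketch assumes the array start address is a multiple of the element size and that the vector size in bytes is a multiple of the element size.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Trip counts, in source-loop iterations, for each generated loop.
struct LoopSplit {
    std::size_t peel;       // scalar iterations executed until the access is aligned
    std::size_t body;       // iterations covered by full vectors
    std::size_t remainder;  // leftover iterations after the last full vector
};

// address:      start address of the array accessed in the loop
// elem_size:    size of one element in bytes
// vector_bytes: vector register width in bytes (also the target alignment)
// n:            trip count of the source loop
inline LoopSplit split_loop(std::uintptr_t address, std::size_t elem_size,
                            std::size_t vector_bytes, std::size_t n) {
    assert(elem_size != 0 && vector_bytes % elem_size == 0);
    assert(address % elem_size == 0);

    const std::size_t vector_length = vector_bytes / elem_size;
    const std::size_t misalign = static_cast<std::size_t>(address % vector_bytes);

    // Elements to process one at a time before the address becomes aligned.
    std::size_t peel = (misalign == 0) ? 0 : (vector_bytes - misalign) / elem_size;
    peel = std::min(peel, n);

    const std::size_t rest = n - peel;
    return {peel, rest - rest % vector_length, rest % vector_length};
}
```

For 100 four-byte elements starting at address 0x1008 and a 32-byte vector, this yields a peel of 6 iterations, a body of 88 iterations (11 full vectors of 8 elements), and a remainder of 6 iterations. Both the peel and the remainder counts stay below the vector length of 8.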
unroll: Optimize a loop by duplicating its body, thus reducing the branching overhead and the number of loop iterations that must execute. A complete unroll fully duplicates the loop body such that no repetition is required. A partial unroll of size n duplicates the body n times and reduces the number of iterations to 1/n of the original iteration count.
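A partial unroll of size 4 might look like the following hand-written C++ sketch. Compilers normally apply this transformation themselves; the function names `sum_rolled` and `sum_unrolled4` are hypothetical and serve only to contrast the two forms.

```cpp
#include <cstddef>

// Original loop: one addition and one loop branch per element.
inline float sum_rolled(const float* x, std::size_t n) {
    float s = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        s += x[i];
    }
    return s;
}

// Partial unroll of size 4: the body appears four times per iteration,
// so the main loop runs about n / 4 times. A short cleanup loop handles
// the iterations left over when n is not a multiple of 4.
inline float sum_unrolled4(const float* x, std::size_t n) {
    float s = 0.0f;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s += x[i];
        s += x[i + 1];
        s += x[i + 2];
        s += x[i + 3];
    }
    for (; i < n; ++i) {
        s += x[i];
    }
    return s;
}
```

The additions happen in the same order in both versions, so the two functions return the same value. Only the number of loop branches changes.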
vector register width: The number of bits in the processor vector registers. Intel® Streaming SIMD Extensions 2 (Intel® SSE2) instructions operate on 128-bit registers; Intel® Advanced Vector Extensions (Intel® AVX) instructions operate on 256-bit registers; Intel® Many Integrated Core Instructions (Intel® MIC Instructions) operate on 512-bit registers.
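The register width determines how many elements of a given type fit in one vector register. The C++ sketch below shows the arithmetic; `lanes` is a hypothetical helper name, and the element sizes depend on the platform.

```cpp
#include <climits>
#include <cstddef>

// Number of elements of type T that fit in a vector register
// that is `register_bits` wide.
template <typename T>
constexpr std::size_t lanes(std::size_t register_bits) {
    return register_bits / (CHAR_BIT * sizeof(T));
}
```

Assuming a 4-byte `float` and an 8-byte `double`, a 128-bit register holds 4 floats or 2 doubles, a 256-bit register holds 8 floats or 4 doubles, and a 512-bit register holds 16 floats or 8 doubles.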
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804