Floating-Point Performance and Vectorization


Challenge

Obtain excellent floating-point performance. Application- and source-level optimizations in this area help ensure that floating-point performance contributes as much as possible to an application's overall performance.


Solution

Enable the compiler's use of SIMD instructions with the appropriate switches, and add hand-coded SIMD optimizations where appropriate. These switches help your application take advantage of the SIMD capabilities of the Streaming SIMD Extensions (SSE) and Streaming SIMD Extensions 2 (SSE2) instructions.
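
As an illustrative sketch, a simple loop such as the one below is a typical candidate for automatic vectorization once SIMD code generation is enabled; the switches shown in the comment are common examples only, and the exact names depend on your compiler and version.

   /* Sketch: a loop most vectorizing compilers can turn into SSE code
      when SIMD code generation is enabled, for example:
         GCC:                gcc -O3 -msse2 saxpy.c
         Microsoft compiler: cl /O2 /arch:SSE2 saxpy.c            */
   void saxpy(float *y, const float *x, float a, int n)
   {
       int i;
       for (i = 0; i < n; i++)     /* iterations are independent */
           y[i] = a * x[i] + y[i];
   }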

Use the smallest possible floating-point or SIMD data type, so that more elements fit in a SIMD vector and more operations execute in parallel. For example, use single precision instead of double precision where possible, and short integers instead of long integers. The integer instructions of the SIMD extensions are primarily targeted at 16-bit operands; not every operation is supported for 32-bit operands, so some source code cannot be vectorized unless smaller operands are used.
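
For instance, in a sketch like the one below (the function names are illustrative), the single-precision loop can process four elements per 128-bit SSE register, while the equivalent double-precision loop can process only two.

   /* Four floats fit in one 128-bit XMM register, but only two doubles,
      so the float version offers twice the potential SIMD parallelism. */
   void scale_f(float *a, float s, int n)
   {
       int i;
       for (i = 0; i < n; i++)
           a[i] *= s;              /* vectorizes 4-wide with SSE  */
   }

   void scale_d(double *a, double s, int n)
   {
       int i;
       for (i = 0; i < n; i++)
           a[i] *= s;              /* vectorizes 2-wide with SSE2 */
   }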

Arrange the nesting of loops so that the innermost nesting level is free of inter-iteration dependencies. In particular, avoid the case where the store of data in an earlier iteration occurs lexically after the load of that data in a later iteration, a pattern known as a lexically backward dependence.
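
The following sketch (with illustrative array and function names) shows the pattern to avoid and an equivalent reordering: in the first loop the store to a[i] appears lexically after the load of a[i-1] that a later iteration depends on, while the second loop expresses the same work with a lexically forward dependence that vectorizers generally handle.

   /* Lexically backward dependence: the store to a[i] (second statement)
      feeds the load of a[i-1] (first statement) in the next iteration,
      which typically prevents vectorization. */
   void backward_dep(float *a, float *b, const float *c, const float *d, int n)
   {
       int i;
       for (i = 1; i < n; i++) {
           b[i] = a[i-1] + c[i];
           a[i] = d[i] * 2.0f;
       }
   }

   /* Reordering the statements (legal here because b[i] still reads the
      a[i-1] value produced by the previous iteration) makes the
      dependence lexically forward, which vectorizers can handle. */
   void forward_dep(float *a, float *b, const float *c, const float *d, int n)
   {
       int i;
       for (i = 1; i < n; i++) {
           a[i] = d[i] * 2.0f;
           b[i] = a[i-1] + c[i];
       }
   }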

Avoid the use of conditionals inside loops, and try to keep induction (loop) variable expressions simple. Also, try to replace pointers with arrays and indices. Avoid denormalized input values, denormalized output values, and explicit constants that could cause denormal exceptions; out-of-range numbers incur very high overhead.
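
As a small sketch of the pointer-to-array rewrite (the function names are illustrative), the indexed form below exposes the trip count and access pattern explicitly, which makes dependence analysis and vectorization easier than in the pointer-chasing form.

   /* Pointer-chasing form: harder for the compiler to analyze. */
   float sum_ptr(const float *p, const float *end)
   {
       float s = 0.0f;
       while (p != end)
           s += *p++;
       return s;
   }

   /* Array-and-index form: explicit trip count and stride. */
   float sum_idx(const float a[], int n)
   {
       float s = 0.0f;
       int i;
       for (i = 0; i < n; i++)
           s += a[i];
       return s;
   }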

Do not use double precision unless necessary. Set the precision-control (PC) field in the x87 FPU control word to "Single Precision". This allows single-precision (32-bit) computation to complete faster on some operations (for example, divides, due to early out). Be careful, however, not to use more than a total of two values for the floating-point control word, or there will be a large performance penalty.
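
One way to set the precision-control field from C is through the Microsoft C run-time routine _controlfp_s, as in the sketch below; other compilers provide their own control-word routines, so treat the exact call as an assumption about your toolchain.

   #include <float.h>

   /* Sketch (Microsoft CRT): set the x87 precision-control field to
      24-bit (single precision) so operations such as divides can take
      the early-out path. Avoid switching the control word frequently;
      the change itself is expensive. */
   void use_single_precision(void)
   {
       unsigned int current;
       _controlfp_s(&current, _PC_24, _MCW_PC);
   }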

Dependence chains can sometimes hurt performance by introducing artificial dependencies that are an artifact of how an expression is written rather than true data dependencies. For best performance, break dependence chains where possible. The following example shows a dependence chain and a simple rewrite that improves overall performance and parallelism:

To calculate z = a + b + c + d, instead of

   x = a + b;
   y = x + c;
   z = y + d;

use

   x = a + b;
   y = c + d;
   z = x + y;

 

In some cases, complete vectorization is not possible, and you may want to include hand-coded SIMD instructions for the best possible performance. There are several excellent resources on the Intel® Developer Zone and Intel Resources for Hardware Developers to help you create optimized SIMD code that can significantly improve the performance of CPU-intensive code.
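
As a minimal hand-coded sketch using the SSE intrinsics (assuming the arrays are 16-byte aligned and the length is a multiple of four):

   #include <xmmintrin.h>

   /* Add two float arrays four elements at a time with SSE intrinsics.
      Assumes a, b, and c are 16-byte aligned and n is a multiple of 4. */
   void add_sse(float *c, const float *a, const float *b, int n)
   {
       int i;
       for (i = 0; i < n; i += 4) {
           __m128 va = _mm_load_ps(&a[i]);
           __m128 vb = _mm_load_ps(&b[i]);
           _mm_store_ps(&c[i], _mm_add_ps(va, vb));
       }
   }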

To help reduce the impact of denormal input or output when using assembly or assembly-language intrinsics, be sure to enable flush-to-zero (FTZ) mode and denormals-are-zero (DAZ) mode as described on page 2-58 of the Intel® 64 and IA-32 Architectures Optimization Reference Manual.
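
From C, both modes can be enabled through the SSE control/status register macros, as in the sketch below; note that these modes trade strict IEEE behavior near zero for speed, and that the DAZ macro requires the <pmmintrin.h> header and hardware support.

   #include <xmmintrin.h>   /* _MM_SET_FLUSH_ZERO_MODE      */
   #include <pmmintrin.h>   /* _MM_SET_DENORMALS_ZERO_MODE  */

   /* Sketch: flush denormal results to zero (FTZ) and treat denormal
      inputs as zero (DAZ) for SSE/SSE2 code in the current thread. */
   void enable_ftz_daz(void)
   {
       _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
       _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
   }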

In addition, be sure to use the fast float-to-int instructions cvttss2si and cvttsd2si if coding with SSE2 (Streaming SIMD Extensions 2).
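
In C these truncating conversions are available through the _mm_cvttss_si32 and _mm_cvttsd_si32 intrinsics, as sketched below; SSE2-enabled compilers typically emit the same instructions for an ordinary cast from float or double to int.

   #include <xmmintrin.h>   /* _mm_cvttss_si32 */
   #include <emmintrin.h>   /* _mm_cvttsd_si32 */

   /* Sketch: truncating conversions that map to cvttss2si / cvttsd2si,
      avoiding the cost of changing the x87 rounding mode. */
   int trunc_float(float f)
   {
       return _mm_cvttss_si32(_mm_set_ss(f));
   }

   int trunc_double(double d)
   {
       return _mm_cvttsd_si32(_mm_set_sd(d));
   }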


Source

Optimizing Software for Intel® Centrino™ Mobile Technology and Intel® NetBurst™ Microarchitecture

 

