Minimize the latency associated with converting a floating-point number to a 32-bit integer on the Intel® Pentium® 4 and Intel Xeon® processors. This is a common task, which, according to the ANSI C/C++ definitions, should be handled by simply truncating the fractional portion of the number.
Microsoft* Visual C++, version 6.0, handles this operation by issuing a call to the _ftol (float to long int) function, which is included in the Microsoft C runtime library. The call does return a correctly truncated, ANSI-compliant integer.
The _ftol function is a general-purpose call, which can handle both 32-bit and 64-bit conversion. It temporarily changes the floating-point rounding mode to truncation, performs the rounding and type conversion, and then restores the rounding mode to its previous state. This sequence, according to Intel research, is a long-latency operation, much longer than you might think.
To see if your application has float-to-int operations, do one of the following:
- Create and scan an assembly listing, and look for an explicit float-to-int cast that is compiled as a call to _ftol.
- Check the compiler messages to see if there is an implicit cast – that should generate a warning.
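As a rough illustration of the two cases in the list above, the sketch below shows an explicit cast next to an implicit conversion. The function names are hypothetical. Under Visual C++ 6.0 without /QIfist, both forms would be expected to end up as a call to _ftol; the implicit one is the kind of conversion a compiler typically flags with a "possible loss of data" warning.

```c
/* Explicit float-to-int cast: ANSI C requires the fractional part
   to be discarded (truncation toward zero). */
int explicit_cast(float f)
{
    return (int)f;
}

/* Implicit conversion: same truncation, but there is no cast in the
   source, so look for a compiler warning instead of a visible cast. */
int implicit_conversion(float f)
{
    int i = f;
    return i;
}
```

Searching the assembly listing for _ftol near code shaped like either function is one way to find the conversions worth optimizing.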
Enable the /QIfist compiler switch or rewrite the operation using SSE2 instructions. The /QIfist compiler switch suppresses the use of _ftol when performing conversions, and makes the operation go much faster. That is the good news. The bad news is that the results may not quite conform to the ANSI C standard, because the conversion then follows the current rounding mode instead of truncating, unless you change the default rounding mode to "truncate" using the _controlfp C runtime function declared in the float.h header. Use this trick with caution, however: changing that default might have unwanted side effects in your application. So be sure to test the app thoroughly.
See the MSDN description of /QIfist (Suppress _ftol)* for complete details.
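The conformance issue can be demonstrated without Visual C++. The following is a minimal sketch, assuming an x86 compiler that provides <xmmintrin.h>. It uses the SSE rounding control (MXCSR) as a stand-in for the x87 control word that /QIfist and _controlfp actually act on, so it is an analogy rather than the code the compiler generates. The helper names are hypothetical.

```c
#include <xmmintrin.h>

/* Converts using whatever rounding mode is currently set (CVTSS2SI).
   This is analogous to a /QIfist conversion, which also honors the
   current rounding mode instead of truncating. */
int convert_current_mode(float f)
{
    return _mm_cvtss_si32(_mm_set_ss(f));
}

/* Saves the rounding mode, switches to truncation, converts, and
   restores the mode: the save/set/convert/restore pattern that the
   article describes for _ftol. */
int convert_with_truncate_mode(float f)
{
    /* volatile keeps the compiler from folding the conversion or
       moving it across the rounding-mode changes. */
    volatile float in = f;
    volatile int out;
    unsigned int saved = _MM_GET_ROUNDING_MODE();

    _MM_SET_ROUNDING_MODE(_MM_ROUND_TOWARD_ZERO);
    out = _mm_cvtss_si32(_mm_set_ss(in));
    _MM_SET_ROUNDING_MODE(saved);

    return out;
}
```

With the default round-to-nearest mode, convert_current_mode(2.7f) yields 3, whereas the ANSI result of (int)2.7f is 2. convert_with_truncate_mode(2.7f) yields 2 and leaves the rounding mode as it found it.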
To rewrite the operation using SSE2 instructions for the Intel® NetBurst® microarchitecture, use a truncating convert instruction: CVTTSS2SI for single-precision values (this form dates from the original SSE extensions) or its SSE2 counterpart, CVTTSD2SI, for double-precision values. Intel recommends that you implement the conversion using intrinsics, rather than inlining the assembly code. The only problem here, of course, is that you will have to set up code branches for older processors: the Intel® Pentium® III processor, for example, lacks the SSE2 extensions needed for the double-precision form.
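A minimal sketch of the intrinsic-based approach follows, assuming a compiler that provides <xmmintrin.h> and <emmintrin.h> and a target with SSE2 enabled. The wrapper names are hypothetical, and no run-time check for older processors is shown. _mm_cvttss_si32 corresponds to CVTTSS2SI; _mm_cvttsd_si32 corresponds to CVTTSD2SI, the double-precision form.

```c
#include <xmmintrin.h>  /* _mm_set_ss, _mm_cvttss_si32 */
#include <emmintrin.h>  /* _mm_set_sd, _mm_cvttsd_si32 */

/* Truncating float-to-int conversion. The result does not depend on
   the current rounding mode, so no mode switch is needed. */
int float_to_int_trunc(float f)
{
    return _mm_cvttss_si32(_mm_set_ss(f));
}

/* Truncating double-to-int conversion (requires SSE2). */
int double_to_int_trunc(double d)
{
    return _mm_cvttsd_si32(_mm_set_sd(d));
}
```

Because the truncation is built into the instruction, these wrappers give the ANSI result without touching the rounding mode, which is what makes them cheaper than the _ftol sequence.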
For more, see Microsoft's online C++ reference on SSE2-based numerical conversions*. Another resource is "Floating-Point Intrinsics Using Streaming SIMD Extensions*."
Turn Performance Killers into Performance Enablers