Simple floating-point to integer conversions may seem harmless when viewed in C code, but when compiled using the Microsoft* Visual C++* 6.0 compiler they may cause large, unnecessary performance decreases on the Intel® Pentium® 4 processor. This paper will give an overview of how to detect these conversions using Intel® VTune™ Performance Analyzer and how to subsequently fix them using compiler switches or alternate coding options. The solutions provided here are very low-impact and do not require switching to a different compiler. If you are developing a floating-point application, this may be a quick way to gain dramatic performance improvements from your code.
A Closer Look at Float-to-Int Conversions
The problem with casting from floating-point numbers to 32-bit integers stems from the ANSI C standard, which states the conversion should be effected by truncating the fractional portion of the number and retaining the integer result. Because of this, whenever the Microsoft Visual C++ 6.0 compiler encounters an (int) or a (long) cast, it inserts a call to the _ftol C run-time function. This function modifies the floating-point rounding mode to 'truncate', performs the conversion, and then resets the rounding mode to its original state prior to the cast. This code sequence is a detriment to application performance because modifying the rounding mode requires use of the long-latency FLDCW (Floating Point Load Control Word) instruction. The full assembly code listing for the _ftol function is given in Figure 1.
Figure 1: Disassembly for the _ftol function
In addition to the cumbersome control word manipulation, the closing code to set up the return value induces a "store-forwarding" violation. This violation is caused by the sequence starting with the FISTP instruction, which saves eight bytes to memory. This is followed by separate 4-byte moves of the same data into the eax & edx registers. Because the first move is aligned to the same address as the preceding store, the hardware is able to forward the data directly. However, the second load to edx cannot be forwarded because it is not aligned to the same address as the store. The exact clock cycle penalty associated with this type of store-forwarding violation cannot be given here due to Intel confidentiality issues, but in general these sequences will lead to less than optimal behavior within the Pentium 4 processor and should be avoided whenever possible. The _ftol code was written this way so that it could be used to convert to __int64 types as well as 32-bit integer types, but the extra versatility comes at a great cost in performance. For more on store-forwarding issues, see the Intel® Pentium® 4 Processor Optimization Reference Manual.
Detecting FTOL in Your Code
So we've established that _ftol calls are slow, but how do you find them in your application? The answer is to use Intel® VTune™ Performance Analyzer. If you are not already familiar with this powerful profiling tool, you can find more information at:
Intel® VTune™ Performance Analyzer 9.0 for Windows* - Overview.
Once you have established some basic skills in the environment, you can attempt to detect _ftol calls by initiating a time-based sampling session. If you are linking with the static Microsoft C-runtime library, you will see samples in the _ftol function within your own application module. Conversely, if you are using the dynamic C-runtime library, you will see the _ftol samples in "msvcrt.dll". In either case you will then want to figure out exactly where the casts are within your code. The Intel® VTune™ Performance Analyzer Call Graphing feature is designed to help you to accomplish this. The results of the Call Graph run will show which functions in your application called _ftol. It will also display the number of times it was called from each call site, so you will know what code to focus on first.
When examining the code at the _ftol call site, you may find an explicit integer cast that is being compiled as a call to _ftol. You can check this by creating an assembly listing, or debugging into that code and looking at the disassembly view. If you don't see anything that obvious in the source code, you might have an implicit cast in effect. This would cause a compiler warning, with the accompanying message "conversion from 'float' to 'int', possible loss of data." Whether implicit or explicit, the resulting assembly code would include a call to _ftol.
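As an illustration of the two cases, consider the following sketch. The function names are hypothetical. Both functions perform the same float-to-int conversion, but only the first shows a cast in the source.

```c
/* Explicit cast: visible in the source. */
int scale_explicit(float f)
{
    return (int)(f * 2.0f);
}

/* Implicit conversion: no cast in the source, but the assignment to an
   int still truncates and still needs a float-to-int conversion. */
int scale_implicit(float f)
{
    int result = f * 2.0f;   /* compiler may warn: possible loss of data */
    return result;
}
```

Adding an explicit cast to the second function silences the warning but does not change the generated conversion.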
A Quick Fix
The quickest way to remove the _ftol overhead is to apply the /Qifist switch with the Microsoft compiler. This switch tells the compiler not to change the rounding mode for float-to-int casts. So, in place of the _ftol call, you will see a single FISTP instruction to convert the top-of-stack value from floating-point to integer, save the integer to memory, and pop the floating-point value off the stack.
While this will greatly improve performance, it does not guarantee conformance with the ANSI C standard. The choice of rounding mode is left to the programmer and can be conveniently set using the _controlfp() C-runtime call. This function is declared in "float.h". This header also contains some #define constants to be used in the _controlfp() argument list. So, a call to set the rounding mode to truncation might look like this:
    _controlfp(_RC_CHOP, _MCW_RC);
Just insert this code at the top of your main module and your program results should be the same with or without /Qifist (assuming the rounding mode is not permanently modified somewhere else in your application). Note that the default setting upon launching a process is round-to-nearest. The FP control word register is part of the thread state, so you don't need to worry about other applications altering the rounding mode and affecting your program results.
For Maximum Performance...
While the /Qifist switch is very convenient, you may get even better performance by employing the Streaming SIMD Extensions (SSE) and SSE2 conversion instructions. Specifically, the instructions CVTTSS2SI and CVTTSD2SI perform conversion with truncation for floats and doubles, respectively. Rather than inlining assembly code, these instructions are best implemented with intrinsic functions, as shown in Figure 2.
Figure 2: SSE & SSE2 conversion intrinsics
Here, the "x" variable holds a floating-point value, and "i" is of type int or long. This code will only compile if the Microsoft Visual C++ Processor Pack* (found in Service Pack 5) has been installed and the "emmintrin.h" header is specified as an #include.
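A minimal sketch of how these conversions might be wrapped with intrinsics is given below. The wrapper function names are hypothetical, and the code assumes an x86 compiler with SSE and SSE2 code generation available.

```c
#include <emmintrin.h>   /* SSE2 intrinsics; also pulls in the SSE header */

/* float -> int with truncation (CVTTSS2SI, requires SSE) */
int float_to_int_sse(float x)
{
    return _mm_cvttss_si32(_mm_load_ss(&x));
}

/* double -> int with truncation (CVTTSD2SI, requires SSE2) */
int double_to_int_sse2(double x)
{
    return _mm_cvttsd_si32(_mm_load_sd(&x));
}
```

Because both instructions always truncate, no rounding-mode change is needed before or after the conversion.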
Most applications still require support for the Intel Pentium® III Processor (which does not have SSE2) and the Intel Pentium® II Processor (which has neither SSE nor SSE2), so a generic code path will be necessary. This is usually facilitated by doing a CPUID check at the beginning of a program and setting global variables to indicate the presence of SSE & SSE2. You can then place a conditional check on these global variables ahead of your optimized code to determine which version should be executed. For best results, try to place this check outside of any tight loops that contain SSE or SSE2 code. Doing these checks repeatedly inside your inner loops may greatly reduce the benefit of the optimizations.
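One way to keep the capability check out of the inner loop is sketched below. The global g_has_sse and the function names are hypothetical. The CPUID detection itself is not shown, so the flag is assumed to be set once at program start.

```c
#include <stddef.h>
#include <xmmintrin.h>

/* Hypothetical flag: set once at startup from a CPUID check (not shown). */
int g_has_sse = 0;

/* Generic path: standard C cast, safe on any processor. */
static void convert_generic(const float *src, int *dst, size_t n)
{
    size_t i;
    for (i = 0; i < n; ++i)
        dst[i] = (int)src[i];
}

/* SSE path: truncating conversion with CVTTSS2SI. */
static void convert_sse(const float *src, int *dst, size_t n)
{
    size_t i;
    for (i = 0; i < n; ++i)
        dst[i] = _mm_cvttss_si32(_mm_load_ss(&src[i]));
}

/* The flag is tested once per array, not once per element. */
void convert_array(const float *src, int *dst, size_t n)
{
    if (g_has_sse)
        convert_sse(src, dst, n);
    else
        convert_generic(src, dst, n);
}
```

Selecting the path per array rather than per element keeps the conditional branch out of the loop body.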
Because of differences in the precision limits of the x87 FP unit and the XMM registers used for SSE/SSE2 computations, the results will not necessarily match when converting numbers whose fractional component is very close to 1.0 (e.g., 0.999999999...). Consider the case where the number to be converted is 5.99999999. Since the x87 stack represents all numbers at 80-bit extended precision, the _ftol call will process this number directly on the stack without losing any precision, giving a truncated result of 5. However, if the CVTTSS2SI sequence is used, the number will be saved off the stack to memory as a 32-bit single precision value and then reloaded into an XMM register. During the memory store, the number will be rounded to 6.0, since 5.99999999 cannot be represented exactly in 32-bit single precision and 6.0 is the nearest representable value. So, the subsequent conversion to integer will give a result of 6. This scenario also exists when saving to the double data type, although it takes several more "9's" in the fraction to exceed the limits of 64-bit precision. Because of this issue, the SSE/SSE2 conversions should not be used in a mission-critical application unless it is fully validated for these outlying cases.
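The effect of the intermediate single-precision store can be reproduced in plain C. In this sketch the function names are hypothetical, and the volatile float is an assumption used to force the value through a 32-bit memory store, as in the sequence described above.

```c
/* Truncate the full-precision value directly. */
int truncate_direct(double x)
{
    return (int)x;
}

/* Narrow to single precision first, then truncate. */
int truncate_via_float(double x)
{
    volatile float narrowed = (float)x;  /* store rounds to the nearest float */
    return (int)narrowed;
}
```

For an input of 5.99999999, truncate_direct returns 5, while truncate_via_float returns 6 because the nearest single-precision value to the input is exactly 6.0.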
Also note that the SSE/SSE2 instructions cannot be used to convert to either unsigned 32-bit integers or 64-bit integer data types. The /Qifist switch is still the best option for handling these cases.
Table 1 gives a performance breakdown for both the /Qifist switch and the SSE/SSE2 conversions, as compared to the standard _ftol function. These measurements were taken on a 1.7 GHz Intel Pentium® 4 processor system.
The testcase consisted of a loop that converts an array of 500 floating-point numbers, saving each result into an array of 500 integers. The input array was preloaded with pseudo-random numbers in the range of a signed 32-bit integer.
|Speedup relative to _ftol|
Table 1: Pentium® 4 Processor performance
So, the /Qifist switch provides a healthy 5x gain, but even greater performance is possible with the SSE/SSE2 conversions (assuming the SSE/SSE2 capability check is coded efficiently).
This paper presented two methods for avoiding the _ftol C run-time call. Hopefully, you can employ one of these options gracefully in your application. The C run-time library may eventually be updated with an improved _ftol function, but until then software developers will have to be proactive to eliminate the undesirable conversion overhead.
About the Author
Mike Stoner is a Senior Applications Engineer with Intel's Software Solutions Group. He has been with Intel since 1996, mainly in the role of helping independent software vendors develop optimized code for Intel platforms. Prior to joining Intel, Mike received Bachelor's and Master's degrees in Electrical Engineering from The Ohio State University. He can be reached at firstname.lastname@example.org.