This is a question received by Intel Software Network Support, followed by a series of responses from our Application Engineering team:
Q. I've been a programmer for many years in the compute-intensive field of prepress RIP systems. A new subsystem I've written is an order of magnitude slower than similar previous code. The old code is ALL integer and the new code is mostly floating point (double precision). Can I expect a significant speedup (on 2,3 GHz dual Pentium processors) if I take the trouble to recode, replacing floating point with fixed point or integer calculations only? An even mix of multiplications and additions is involved. If yes, how big a speedup is possible? i.e. what is a typical ratio of FP mults to integer mults per second?
A. The following are the responses several of our engineers provided to these questions:
1. Integer operations tend to be faster if you use "normal" code; there are typically 3 integer execution units that can run in parallel, while floating point code tends to be far more serialized and, as a result, benefits far less from out-of-order execution and parallelism (and our chips tend to have only 1 floating point multiplier). Another factor is that floating point numbers are bigger, so they take more cache and memory bandwidth; depending on your application, that can be a significant factor.
There is another option besides going back to fixed point integer: SSE. With SSE you trade some precision, but you can then do 4 or 8 operations in parallel. Not all algorithms lend themselves to using SSE, so it won't always work. Most compilers have assists for using SSE, but in general you'll need to change your code to fit the SSE paradigm; Intel has tools for this sort of analysis. (Arjan Van de Ven)
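As a rough illustration of what "fitting the SSE paradigm" can look like, here is a minimal sketch that processes 4 single-precision values per instruction. The function name scale_offset and the gain/offset operation are made up for this example and are not from the original code; the intrinsics come from xmmintrin.h, and a scalar loop handles the leftover elements (and the whole array when SSE is not available at compile time).

```c
#include <stddef.h>

/* Use SSE intrinsics only when the compiler targets SSE-capable x86. */
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define USE_SSE 1
#else
#define USE_SSE 0
#endif

/* Hypothetical example kernel: out[i] = a[i] * gain + offset for n floats. */
void scale_offset(float *out, const float *a, float gain, float offset, size_t n)
{
    size_t i = 0;
#if USE_SSE
    const __m128 vgain = _mm_set1_ps(gain);     /* gain in all 4 lanes   */
    const __m128 voffset = _mm_set1_ps(offset); /* offset in all 4 lanes */
    for (; i + 4 <= n; i += 4) {
        __m128 v = _mm_loadu_ps(a + i);         /* unaligned load of 4 floats */
        v = _mm_add_ps(_mm_mul_ps(v, vgain), voffset);
        _mm_storeu_ps(out + i, v);              /* unaligned store of 4 floats */
    }
#endif
    /* Scalar tail: remaining elements, or everything if SSE is unavailable. */
    for (; i < n; ++i)
        out[i] = a[i] * gain + offset;
}
```

Note that this only pays off if the data can be held as single-precision floats; that is the precision trade mentioned above.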
2. As to the question about integer vs floating point multiply, most CPUs, not only those from Intel, provide better performance for floating point multiply than for integer multiply, and even more so when taking advantage of vectorization.
3. I don't believe it is possible to definitively answer this question given the provided information.
One problem is that the facts provided just don't add up. You indicated the code is similar, but contains floating point and integer and not just integer. If the code is really similar in function and workload, then it should not be an order of magnitude slower (10x) just because it contains floating point. Depending on the processor, one might expect FP code to actually be faster, as has been pointed out.
Some speculations: It could be that the new code is encountering floating point denormal exceptions or other FP exceptions causing significant stalls that might be fixable by changing the FPU control register. FPU exceptions are very painful. Also, we know nothing about memory access patterns and whether the original integer implementation was 16 bit or 32 bit integers. Double precision floating point will incur ~2x the cache misses compared to a 32 bit integer implementation, and ~4x the cache misses relative to a 16 bit integer implementation, both of which would make the differences in integer and floating point execution times almost a moot point depending on the details of the memory access patterns, i.e. number of misses, compared to the computational complexity.
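Both speculations can be checked cheaply before any recoding. The sketch below is only a diagnostic aid under stated assumptions: count_subnormals uses the standard C99 fpclassify macro to see whether a buffer of intermediate results actually contains denormals, and cache_lines_touched estimates how many cache lines a sequential pass over n elements touches, assuming a 64-byte line (an assumption; check your processor). Both function names are invented for this example.

```c
#include <math.h>
#include <stddef.h>

/* Count values in buf that are denormal (subnormal) doubles. */
size_t count_subnormals(const double *buf, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; ++i) {
        if (fpclassify(buf[i]) == FP_SUBNORMAL)
            ++count;
    }
    return count;
}

/* Assumed cache line size in bytes; verify for the target processor. */
#define CACHE_LINE_BYTES 64u

/* Cache lines touched by one sequential pass over n elements of
 * elem_size bytes each, assuming the buffer starts on a line boundary. */
size_t cache_lines_touched(size_t n, size_t elem_size)
{
    return (n * elem_size + CACHE_LINE_BYTES - 1) / CACHE_LINE_BYTES;
}
```

For 1024 elements, the estimate gives 128 lines for 8-byte doubles, 64 for 32-bit integers and 32 for 16-bit integers, which is the ~2x and ~4x ratio described above.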
My advice would be to profile the two implementations on a similar workload and then evaluate the hot functions to determine where the time is spent and why (cache misses? FP denormals? etc.). Going down the path of a recode without understanding the real performance issue is shooting in the dark. (Garrett Drysdale)
4. A single addition to the points above: Once the exceptions are removed and cache utilization is good, the way to achieve the best performance on compute-intensive code is to make sure it is vectorized, i.e. uses packed SSE/2/3 processing. To find out if the compiler vectorizes perf-critical code, one can (in preference order) use vectorization reports generated by the Intel Compiler, look at code disassembly in the Intel VTune(TM) Performance Analyzer, or check out assembler listings.
From my experience, the Intel Compiler does a much better job of vectorizing FP code, while integer fixed-point code tends to be too cryptic for the compiler to vectorize and optimize. (Alex Klimovitski)
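To make the contrast concrete, here is a sketch of the same blend operation written both ways. Both functions are hypothetical examples, not taken from the original code. The FP version states the arithmetic directly, and the restrict qualifiers tell the compiler the arrays do not overlap. The fixed-point version has to widen the 8-bit samples, add a rounding bias, shift, and narrow again, which is the kind of detail a compiler has to see through. Whether either loop is actually vectorized depends on the compiler and options, so check the vectorization report or the disassembly.

```c
#include <stddef.h>
#include <stdint.h>

/* FP blend: out = a*w + b*(1-w), with w in [0, 1]. */
void blend_float(float *restrict out, const float *restrict a,
                 const float *restrict b, float w, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = a[i] * w + b[i] * (1.0f - w);
}

/* Fixed-point blend on 8-bit samples. The weight w8 is in 0..256,
 * i.e. it carries 8 fractional bits (256 means 1.0). */
void blend_fixed(uint8_t *restrict out, const uint8_t *restrict a,
                 const uint8_t *restrict b, unsigned w8, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        /* Widen to unsigned, add 128 to round to nearest, drop 8 bits. */
        unsigned v = (a[i] * w8 + b[i] * (256u - w8) + 128u) >> 8;
        out[i] = (uint8_t)v; /* v <= 255 whenever w8 <= 256 */
    }
}
```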
5. The processes and algorithms typical of print/prepress/RIP present challenges that span integer and FP functionality available in present microprocessor implementations. A typical pipeline used in page assembly may include the following operations:
(1) decompression from among a variety of formats (TIFF, JPEG...), resolutions (4-Base, 16-Base...) and encoded color spaces (YCrCb...) with requisite precisions
(2) multiple color space conversions (YCrCb -> RGB -> XYZ -> CIE Lab) at requisite precisions
(3) affine transformations (scaling and rotation)
(4) unsharp masking
(5) additive-to-subtractive color space conversion (CIE Lab -> CMYK)
(6) scaling to physical medium
(7) shading to physical medium
(8) half-toning (CYMK to physical medium's channels, i.e. C[l]Y[m]M[n]K[o])
The algorithms and operations implicated in (1) above are well represented in our processors, so there is not much difference here. The chasm begins with (2) through (8), and optimizing and apportioning those is complicated. Not all color space conversions enumerated in (2) and (5) are linear; CIE Lab involves non-integral powers of component ratios, so these have typically been implemented with LUTs and tri-linear interpolation. Specifically, it is not trivial to implement these while taking advantage of SIMD/vectorization capabilities. Consider that it is relatively expensive to move data between the FP and integer units of our present microarchitectures, but doing so is necessary when intermediate results are used as operands in generating effective addresses for dependent data.
The affine transformations and convolution operations enumerated in (3), (4) and (6) are 2-D re-sampling processes with bi-linear operations on multiple channels; they are straightforward enough to implement but are complicated by the type and precision of the component data supplied by previous steps. The shading and half-toning operations to the physical medium enumerated in (7) and (8) are essentially non-linear and logical operations, typically implemented using the non-FP side of the processor. The half-toning operation (8) might be done using FP, since it involves a multi-channel, multi-level error diffusion kernel maintaining at least ~2 decimal places of accuracy, but it is, in the end, constrained by the format of the data that is delivered.
An understanding of which algorithms comprise your pipeline will help guide you through the trade-offs in choosing an implementation, i.e. one that is dominated by algorithm and format conversions to take advantage of our processors' relative FP/vectorization strengths vs. one from an optimized fixed-point integer/SIMD lineage. (Antony Bruner)
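The LUT-plus-interpolation approach mentioned for the non-linear color conversions can be sketched as follows. This is a minimal single-channel illustration under assumptions, not a production color engine: the grid size LUT_N, the [r][g][b] memory layout, and the function names are all invented for the example, and inputs are assumed to lie in [0, 1]. The FP-to-integer conversions that produce the cell indices are exactly the "intermediate results used for effective addresses" that make this pattern awkward to vectorize.

```c
#include <stddef.h>

#define LUT_N 17 /* grid points per axis; an assumed size */

/* One output channel sampled on an LUT_N^3 grid, laid out as [r][g][b]. */
static float lut_at(const float *lut, int r, int g, int b)
{
    return lut[((size_t)r * LUT_N + (size_t)g) * LUT_N + (size_t)b];
}

static float lerp_f32(float a, float b, float t)
{
    return a + (b - a) * t;
}

/* Tri-linear lookup; r, g, b are assumed to be in [0, 1]. */
float lut_trilinear(const float *lut, float r, float g, float b)
{
    const float scale = (float)(LUT_N - 1);
    float fr = r * scale, fg = g * scale, fb = b * scale;

    /* FP -> integer conversion: cell indices feed address generation. */
    int ir = (int)fr, ig = (int)fg, ib = (int)fb;
    if (ir > LUT_N - 2) ir = LUT_N - 2; /* keep ir + 1 inside the grid */
    if (ig > LUT_N - 2) ig = LUT_N - 2;
    if (ib > LUT_N - 2) ib = LUT_N - 2;

    /* Fractional position inside the cell, back in FP. */
    float tr = fr - (float)ir, tg = fg - (float)ig, tb = fb - (float)ib;

    /* Interpolate along b, then g, then r (8 table reads, 7 lerps). */
    float c00 = lerp_f32(lut_at(lut, ir, ig, ib),         lut_at(lut, ir, ig, ib + 1),         tb);
    float c01 = lerp_f32(lut_at(lut, ir, ig + 1, ib),     lut_at(lut, ir, ig + 1, ib + 1),     tb);
    float c10 = lerp_f32(lut_at(lut, ir + 1, ig, ib),     lut_at(lut, ir + 1, ig, ib + 1),     tb);
    float c11 = lerp_f32(lut_at(lut, ir + 1, ig + 1, ib), lut_at(lut, ir + 1, ig + 1, ib + 1), tb);
    float c0 = lerp_f32(c00, c01, tg);
    float c1 = lerp_f32(c10, c11, tg);
    return lerp_f32(c0, c1, tr);
}
```

A full conversion would call this once per output channel (or use an interleaved table), and a fixed-point variant would replace the float fractions with integer weights.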
6. There are cases where integer or fixed point code can be preferred. This includes all operations where you need to convert between FP and integer often, e.g. for table lookups, some graphics algorithms, etc. If code can't be vectorized, fixed point can be better. I'm not very familiar with the algorithms used in RIP, but for Intel Pentium 4 processors, integer addition is 0.5 clock, while FP addition in SSE is 5-6 clocks.
For multiplication, integer and FP are on par in scalar code, but FP wins if vectorized, so it also heavily depends on the type of computations. For rendering, where you need to generate coordinates, antialiasing, etc., you'll need many conversions to get the integer part of a number.
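One common way to avoid those per-pixel conversions is to carry coordinates in fixed point, so that the integer part is just a shift. The sketch below is a hypothetical example (the function name and the 16.16 format are assumptions): it rescales one row of 8-bit samples with nearest-neighbour sampling, stepping the source position in 16.16 fixed point. It assumes dst_w > 0 and 0 < src_w < 65536 so the position fits in 32 bits.

```c
#include <stddef.h>
#include <stdint.h>

/* Nearest-neighbour horizontal rescale of one row of 8-bit samples.
 * Assumes dst_w > 0 and 0 < src_w < 65536. */
void scale_row_fixed(uint8_t *dst, size_t dst_w,
                     const uint8_t *src, size_t src_w)
{
    /* Source step per destination pixel, in 16.16 fixed point. */
    uint32_t step = (uint32_t)(((uint64_t)src_w << 16) / dst_w);
    uint32_t pos = 0; /* current source position, 16.16 */

    for (size_t x = 0; x < dst_w; ++x) {
        size_t sx = pos >> 16;      /* integer part: a shift, no FP->int conversion */
        if (sx >= src_w)
            sx = src_w - 1;         /* defensive clamp at the right edge */
        dst[x] = src[sx];
        pos += step;                /* one integer add per pixel */
    }
}
```

The low 16 bits of pos hold the fractional position, which could serve as the interpolation weight in a bi-linear version.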