Integer VS fp performance

I am writing a program intended to run on a P4 (Xeon). I would like to know whether I would be better off using real*4 or integer*4. Basically I have to do something like 40 multiplies on 200-element vectors whose values all fall in a small range (so if I multiplied them by, let's say, 1000, integers would give me the accuracy I need), but at some point I will need to convert them to reals (unless I start with reals).

More than just getting the right answer, I would like to know what I should take into account given my specific architecture: the differences between the floating-point and integer hardware, and the cost of converting from one to the other.

Thanks
Joan

Integer multiply on P4 or Xeon has been quite slow, something on the order of 30 clock cycles latency, compared to about 7 for floating point multiply. If you use IFL or ifc with a P4 option (/G7 or -tpp7 or /QxW or -axW), the compiler will attempt to replace integer multiplication by a constant with a sequence of shifts and adds. Integer addition, of course, is faster than floating point addition.

It is quite unusual for fixed point arithmetic based on integer operations to be faster than floating point on any system which has hardware floating point arithmetic. So there is absolutely no justification for the extra programming effort of fixed point arithmetic.

Conversions between integer and floating point on P4 and Xeon are reasonably efficient if you stick to SSE/SSE2 code, such as IFL/ifc generate with the appropriate SSE options. The standard INT() and NINT() conversions in the default x87 floating point mode are extremely slow on P4 and Xeon: they involve serialized operations to change the control-word rounding mode, which go through memory, as do the moves between x87 and integer registers, for a total of 40 or more clock cycles. It's much worse than that with some compilers still in wide use, like MSVC6.

All of this points out that the situation is probably more complicated than you bargained for.

Xeon/P4 have no significant penalty for double precision (real*8, selected_real_kind(15)) compared to single precision, except where you run into cache size limitations. So, if you want more than the 24-bit significance of single precision, double precision presents no performance obstacle.

However, if you use SSE/SSE2, with -xW, the SIMD instructions can potentially handle 4 single precision reals at once, but only two double precision reals. So if the code is vectorized, single precision is certainly faster.
Consistent with Tim's comments above, the first choice would be to use real arithmetic with -xW. If you write the code in such a way that it can be vectorized (see the User's Guide), better still.

Martyn

Yes, my code is in fact vectorized and I use optimization for the P4 architecture. Precision is not an issue so far, and it probably won't be with real*4. I read somewhere that part of the P4 chip runs at twice the processor speed, so I figured it might be the integer unit. I guess this raises the question of what the Rapid Execution Engine is and how I can take advantage of it?

Thanks
Joan

Yes, integer additions and some similar operations are very fast, 2 per cycle. But integer multiplications are much slower, as explained above. Using Streaming SIMD Extensions is your best all-round solution.
This can even be faster than the rapid execution engine for integers. For example, if you add 8 pairs of short integers in a single instruction, this beats adding one such pair in half a cycle. But float-to-integer conversions are also expensive.
The recommendation is to keep it simple: use 32-bit floating point if that gives sufficient accuracy, and enable the SIMD instructions with -xW (or -xK).

Martyn
