por efficiency

I have written quite a lengthy SIMD algorithm to simultaneously process a bunch of 32-bit integers. Most of the algorithm is just adds and shifts; however, there are two por instructions on xmm registers. These two little por instructions drop the efficiency from 2.3 million iterations per second to 1.5 million iterations per second. Adding even 10 more adds and shifts does not drop the efficiency that much. My algorithm without SIMD ran at 1.8 million iterations per second, so with the pors in there I end up losing ground by using SIMD. I always thought or was a fairly fast instruction, much more so than shifts at least. Is there a reason for this kind of drastic effect?

We forwarded your question to our Application Engineers, who responded as follows:

The Pentium 4 processor is fairly efficient at hiding instruction latency by pipelining instructions through its out-of-order execution engine. The shift and add instructions use different execution units, so there is no conflict for resources between these types of instructions. The por instruction, however, uses the same execution unit as the padd instruction, and this could be causing delays while the execution unit's pipeline clears. This is especially true if you interleave the padd and por instructions. If there is some way to stack the padd instructions so that they can execute sequentially, then you can take advantage of the pipelining in the processor. If each padd instruction is dependent on the result of a por instruction, then you will have to look at other strategies to streamline the execution.
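As a rough illustration of the reordering suggested above, here is a minimal sketch in C with SSE2 intrinsics. It is not the poster's algorithm: the kernel (out = (a + b) | mask over 16 integers) and the function names kernel_interleaved and kernel_grouped are hypothetical, chosen only to show the two instruction orderings. _mm_add_epi32 normally compiles to paddd and _mm_or_si128 to por. Note that the compiler is free to reschedule intrinsics, so the source order is only a hint; check the generated assembly and measure before drawing conclusions.

```c
#include <emmintrin.h>  /* SSE2: _mm_add_epi32 (paddd), _mm_or_si128 (por) */
#include <stdint.h>

/* Hypothetical kernel: out[i] = (a[i] + b[i]) | mask[i] for 16 integers.
   Interleaved form: every paddd is immediately followed by a por. */
static void kernel_interleaved(const uint32_t *a, const uint32_t *b,
                               const uint32_t *mask, uint32_t *out)
{
    for (int i = 0; i < 16; i += 4) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        __m128i vm = _mm_loadu_si128((const __m128i *)(mask + i));
        __m128i sum = _mm_add_epi32(va, vb);              /* paddd */
        _mm_storeu_si128((__m128i *)(out + i),
                         _mm_or_si128(sum, vm));          /* por   */
    }
}

/* Grouped form: the four independent paddd operations are written
   back to back, and the four por operations follow them. */
static void kernel_grouped(const uint32_t *a, const uint32_t *b,
                           const uint32_t *mask, uint32_t *out)
{
    const __m128i *pa = (const __m128i *)a;
    const __m128i *pb = (const __m128i *)b;
    const __m128i *pm = (const __m128i *)mask;
    __m128i *po = (__m128i *)out;

    /* Stack the adds: none of them depends on another. */
    __m128i s0 = _mm_add_epi32(_mm_loadu_si128(pa + 0), _mm_loadu_si128(pb + 0));
    __m128i s1 = _mm_add_epi32(_mm_loadu_si128(pa + 1), _mm_loadu_si128(pb + 1));
    __m128i s2 = _mm_add_epi32(_mm_loadu_si128(pa + 2), _mm_loadu_si128(pb + 2));
    __m128i s3 = _mm_add_epi32(_mm_loadu_si128(pa + 3), _mm_loadu_si128(pb + 3));

    /* Then the ors, each consuming a sum that is already in flight. */
    _mm_storeu_si128(po + 0, _mm_or_si128(s0, _mm_loadu_si128(pm + 0)));
    _mm_storeu_si128(po + 1, _mm_or_si128(s1, _mm_loadu_si128(pm + 1)));
    _mm_storeu_si128(po + 2, _mm_or_si128(s2, _mm_loadu_si128(pm + 2)));
    _mm_storeu_si128(po + 3, _mm_or_si128(s3, _mm_loadu_si128(pm + 3)));
}
```

Both functions produce the same results; only the order in which the adds and ors are written differs. This kind of regrouping only applies when the adds are independent of the or results. If each padd consumes the output of the preceding por, the dependency chain remains and reordering alone will not help.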

We hope this is helpful.

Regards,

Lexi S.

Intel Software Network Support

http://www.intel.com/software

Message Edited by intel.software.network.support on 12-07-2005 04:39 PM
