I have written a fairly lengthy SIMD algorithm that processes a batch of 32-bit integers simultaneously. Most of the algorithm is just adds and shifts, but there are two por instructions on xmm registers. These two little por instructions drop the throughput from 2.3 million iterations per second to 1.5 million iterations per second. Adding even 10 more adds and shifts does not drop it by that much. My algorithm without SIMD ran at 1.8 million iterations per second, so with the pors in there I end up losing ground by using SIMD. I always thought a bitwise or was a fairly fast instruction, faster than a shift at least. Is there a reason for such a drastic effect?
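Purely as an illustration of the pattern described above (this is not the actual algorithm), here is a minimal sketch of an add/shift step that ends in a por, written with SSE2 intrinsics. The names mix4, mix1 and check_mix4, the rotate amount of 7, and the test values are all hypothetical; the only point is to show where _mm_or_si128 (por) sits relative to the adds and shifts, with a scalar version of the same step for comparison.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical SIMD step on four 32-bit lanes: add, then rotate left by 7.
   The rotate is built from two shifts whose results are combined with por. */
static __m128i mix4(__m128i x, __m128i k)
{
    __m128i s  = _mm_add_epi32(x, k);    /* paddd */
    __m128i hi = _mm_slli_epi32(s, 7);   /* pslld */
    __m128i lo = _mm_srli_epi32(s, 25);  /* psrld */
    return _mm_or_si128(hi, lo);         /* por   */
}

/* Scalar version of the same step for a single 32-bit value. */
static uint32_t mix1(uint32_t x, uint32_t k)
{
    uint32_t s = x + k;
    return (s << 7) | (s >> 25);
}

/* Convenience wrapper: run the SIMD step on plain arrays of four values. */
static void mix4_array(const uint32_t in[4], const uint32_t k[4], uint32_t out[4])
{
    __m128i x  = _mm_loadu_si128((const __m128i *)in);
    __m128i kk = _mm_loadu_si128((const __m128i *)k);
    _mm_storeu_si128((__m128i *)out, mix4(x, kk));
}

/* Check that every SIMD lane agrees with the scalar version. */
static void check_mix4(void)
{
    const uint32_t in[4] = { 1u, 0x80000000u, 0xFFFFFFFFu, 12345u };
    const uint32_t k[4]  = { 2u, 3u, 1u, 0u };
    uint32_t out[4];
    int i;

    mix4_array(in, k, out);
    for (i = 0; i < 4; i++)
        assert(out[i] == mix1(in[i], k[i]));
}
```

Timing a loop over mix4 with and without the final _mm_or_si128 would be one way to reproduce the comparison in isolation, assuming the compiler does not optimize the loop away.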