I'm continuing my attempts to write n-way interleaving objects for basic floating-point math using SSE & AVX (with a view to also supporting the Xeon Phi instruction set when that arrives, & maybe some non-x86 SIMD sets like NEON too). The aim is to get good ILP when running interleaved data through a long chain of instructions, and to be able to tune both vector width & interleave count for the platform. Interleaving is intended to reduce the time spent waiting on instruction latency.
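To make the idea concrete, here is a minimal sketch of what an interleaving object could look like. It is a simplified illustration, not the attached code: `Vec4`, `Interleaved`, `broadcast`, `poly3` and `element` are hypothetical names, and it assumes an x86/x64 target with SSE available.

```cpp
#include <cstddef>
#include <xmmintrin.h>  // SSE intrinsics; assumes an x86/x64 target

// Hypothetical thin POD wrapper around one SSE register (4 floats).
struct Vec4 {
    __m128 v;
};
inline Vec4 operator+(Vec4 a, Vec4 b) { Vec4 r; r.v = _mm_add_ps(a.v, b.v); return r; }
inline Vec4 operator*(Vec4 a, Vec4 b) { Vec4 r; r.v = _mm_mul_ps(a.v, b.v); return r; }

// Hypothetical N-way interleave: N independent vectors processed in lockstep,
// so each arithmetic step issues N independent instructions.
template <std::size_t N>
struct Interleaved {
    Vec4 lane[N];
};

template <std::size_t N>
inline Interleaved<N> operator+(const Interleaved<N>& a, const Interleaved<N>& b) {
    Interleaved<N> r;
    for (std::size_t i = 0; i < N; ++i) r.lane[i] = a.lane[i] + b.lane[i];  // should unroll
    return r;
}

template <std::size_t N>
inline Interleaved<N> operator*(const Interleaved<N>& a, const Interleaved<N>& b) {
    Interleaved<N> r;
    for (std::size_t i = 0; i < N; ++i) r.lane[i] = a.lane[i] * b.lane[i];  // should unroll
    return r;
}

// Fill every float of every lane with the same value.
template <std::size_t N>
inline Interleaved<N> broadcast(float x) {
    Interleaved<N> r;
    for (std::size_t i = 0; i < N; ++i) r.lane[i].v = _mm_set1_ps(x);
    return r;
}

// A short dependent chain: Horner evaluation of c0 + c1*x + c2*x^2 + c3*x^3.
// Within one lane every step waits on the previous one; across lanes the
// steps are independent, which is where the ILP is supposed to come from.
template <std::size_t N>
inline Interleaved<N> poly3(const Interleaved<N>& x,
                            float c0, float c1, float c2, float c3) {
    Interleaved<N> acc = broadcast<N>(c3);
    acc = acc * x + broadcast<N>(c2);
    acc = acc * x + broadcast<N>(c1);
    acc = acc * x + broadcast<N>(c0);
    return acc;
}

// Read float number i (0 .. 4*N-1) back out, for checking results.
template <std::size_t N>
inline float element(const Interleaved<N>& a, std::size_t i) {
    float tmp[4];
    _mm_storeu_ps(tmp, a.lane[i / 4].v);
    return tmp[i % 4];
}
```

With `N = 4` this corresponds to the 4xSSE (16-way) case; changing `N` changes the interleave count without touching the math.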
For this code to run well, the compiler needs to be able to do loop unrolling, RVO, and ideally be able to treat temporary POD object lifetimes in the same way as it treats simple types - storing temporary objects in (sets of) registers rather than on the stack wherever possible.
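As a point of comparison, this is roughly what I would hope the wrapper objects reduce to once the loops are unrolled and the temporaries live in registers. It is a hand-written sketch for a 2x interleave of a cubic Horner chain; the function name and signature are hypothetical and it is not taken from the attached code. It assumes an x86/x64 target with SSE.

```cpp
#include <xmmintrin.h>  // SSE intrinsics; assumes an x86/x64 target

// Hand-unrolled 2x interleave of ((c3*x + c2)*x + c1)*x + c0 over 8 floats.
// Hypothetical helper, for illustration only.
inline void poly3_2x_by_hand(const float* in, float* out,
                             float c0, float c1, float c2, float c3) {
    // One load per input vector.
    const __m128 x0 = _mm_loadu_ps(in);      // floats 0..3
    const __m128 x1 = _mm_loadu_ps(in + 4);  // floats 4..7

    // Two independent accumulators, one per interleaved lane.
    __m128 a0 = _mm_set1_ps(c3);
    __m128 a1 = a0;

    // Each step issues two independent mul/add pairs, so the second lane
    // can start while the first is still waiting on its result.
    __m128 k = _mm_set1_ps(c2);
    a0 = _mm_add_ps(_mm_mul_ps(a0, x0), k);
    a1 = _mm_add_ps(_mm_mul_ps(a1, x1), k);

    k = _mm_set1_ps(c1);
    a0 = _mm_add_ps(_mm_mul_ps(a0, x0), k);
    a1 = _mm_add_ps(_mm_mul_ps(a1, x1), k);

    k = _mm_set1_ps(c0);
    a0 = _mm_add_ps(_mm_mul_ps(a0, x0), k);
    a1 = _mm_add_ps(_mm_mul_ps(a1, x1), k);

    // One store per output vector; nothing else should need to touch memory.
    _mm_storeu_ps(out, a0);
    _mm_storeu_ps(out + 4, a1);
}
```

Written this way there are two loads and two stores in total. Any extra memory traffic in the compiled output of the object-based version would be the overhead I'm asking about.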
While Intel's compiler performs better than any other currently on the market (by around 2x, well done all!), the generated assembly still seems far from optimal; in particular, it contains a whole lot more loads/stores than ought to be necessary. (The AVX version of the code is broken in other ways, but I'm concentrating on the SSE4 version for now.) I'm not sure whether this is an issue with the compiler itself, or whether there are semantics in my code that force it to behave the way it does.
There is one other issue I'm running into that I'd like to understand. The performance of the code improves quite nicely from 1x to 4x interleaving: the 4xSSE (16-way) interleaved version is about twice as efficient as the 1xSSE (4-way) version. I'd expect the performance to fall off somewhat when going to 8x on x86, and indeed it does, as the register file is too small to cope and the benefits of interleaving are lost (4-way is enough to accommodate SSE add/mul instruction latencies, it seems). But x64 has twice as many registers (16 SSE registers) and so should offer reasonable efficiency up to 8x interleave. However, it doesn't offer any appreciable advantage over x86 with its 8 registers; going up to 16-way interleave causes drastic deterioration on both x86 & x64.
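The expectation above can be written as back-of-envelope arithmetic. This is only a rough model under stated assumptions, not a measurement: it assumes each interleaved lane needs one live accumulator register, that inputs are folded in as memory operands, and that one further register is shared for constants or scratch. The helper names are hypothetical.

```cpp
// Rough register-budget estimate (hypothetical helpers, illustration only).
// Assumption: `per_lane` registers stay live for each interleaved lane and
// `shared` registers are needed for broadcast constants or scratch.
inline int live_registers(int interleave, int per_lane = 1, int shared = 1) {
    return interleave * per_lane + shared;
}

// True if the estimated live set fits in `register_file` SSE registers
// (8 on x86, 16 on x64).
inline bool fits(int interleave, int register_file) {
    return live_registers(interleave) <= register_file;
}
```

Under these assumptions 4x fits in 8 registers, 8x fits only in 16, and 16x fits in neither, which is the behaviour I was expecting. If the real code keeps more than one register live per lane, the usable interleave count on x64 would be correspondingly lower.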
Example data points: with the current setup, for 32-bit code, I'm getting about 100ms execution time for the 4-way version and 230ms for 8-way, on a 2.8GHz Harpertown Xeon under Win32.
Any tips on how to make this code go faster? Supplied code will compile on Windows, Mac and (probably) Linux.
(Edit - have attached some sample asm output)