I am attempting to use SSE and AVX instructions to optimise my program. I have three versions of my code: scalar, SSE, and AVX.
After much optimization, my SSE version is almost exactly 4x as fast as my scalar code. This was actually quite surprising; I did not expect to get so close to a 4x improvement.
However, my AVX version is 20% slower than my scalar code!
The program operates on SoA data, so the difference between the SSE and AVX versions is very small (just halving the upper bound of the loop and advancing the pointers by twice as many elements per iteration).
If I write a simple test program that sums two arrays, I can indeed see that AVX is 2x as fast as SSE, and 8x as fast as scalar code.
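A minimal sketch of what such a test might look like, to show how little changes between the SSE and AVX loops. This is illustrative only, not the actual test code: the function names `add_scalar`, `add_sse`, and `add_avx` and the `TARGET_AVX` macro are made up for the example, unaligned loads and stores are used for simplicity, and `n` is assumed to be a multiple of 8.

```cpp
#include <immintrin.h>  // SSE and AVX intrinsics
#include <cstddef>

// Assumption: GCC/Clang need AVX enabled per function (or via -mavx);
// with MSVC the macro expands to nothing.
#if defined(__GNUC__) || defined(__clang__)
#define TARGET_AVX __attribute__((target("avx")))
#else
#define TARGET_AVX
#endif

// Scalar baseline: one float per iteration.
void add_scalar(const float* a, const float* b, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}

// SSE: four floats per iteration (n assumed to be a multiple of 4).
void add_sse(const float* a, const float* b, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
    }
}

// AVX: eight floats per iteration (n assumed to be a multiple of 8).
// Compared with add_sse, only the vector type, the intrinsic prefix,
// and the step size change.
TARGET_AVX
void add_avx(const float* a, const float* b, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(out + i, _mm256_add_ps(va, vb));
    }
}
```

Timing each function over the same large arrays is the kind of comparison meant here; the sketch deliberately leaves out the timing harness.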
My actual algorithm is pretty benign in terms of instructions: I do not use many exotic ones, mostly mulps, addps, and rcpps.
I'm using intrinsic functions in VS2010 SP1, and I have an i5 2500 CPU.
Is there something subtle that I might be doing wrong?
Thanks in advance for any ideas.