I'd like to talk about two weird things in the optimization process of the compiler.
#1 : sqrtsd seems to be preferred over sqrtpd. I just got a 30% performance boost by forcing the use of 2 sqrtpd instead of the 4 sqrtsd previously generated for my code. I know sqrt is not often applied to several values in a row, so I won't mind if that optimization never appears. However, here comes #2.
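To make #1 concrete, here is a minimal sketch of the two forms being compared. It is not the actual code from my project: the function names sqrt4_scalar and sqrt4_packed are hypothetical, and the instructions a given compiler emits for the scalar loop will depend on version and flags.

```cpp
#include <cmath>
#include <emmintrin.h>  // SSE2 intrinsics: _mm_loadu_pd, _mm_sqrt_pd, _mm_storeu_pd

// Hypothetical scalar form: four independent square roots.
// A compiler that does not vectorize this may emit four sqrtsd instructions.
void sqrt4_scalar(const double* in, double* out) {
    for (int i = 0; i < 4; ++i)
        out[i] = std::sqrt(in[i]);
}

// Hypothetical packed form: the same work forced onto two sqrtpd,
// each one computing two double-precision square roots at once.
void sqrt4_packed(const double* in, double* out) {
    __m128d lo = _mm_loadu_pd(in);      // in[0], in[1]
    __m128d hi = _mm_loadu_pd(in + 2);  // in[2], in[3]
    _mm_storeu_pd(out,     _mm_sqrt_pd(lo));
    _mm_storeu_pd(out + 2, _mm_sqrt_pd(hi));
}
```

Both forms should give identical results, since sqrtsd and sqrtpd are both correctly rounded; only the instruction count differs.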
#2 : in an intrinsics-based function, the result must be _mm_store..'d to be returned, so I expected some RVO-like optimization to show up in the assembly when the returned value is loaded into an xmm register right after that. Instead there is a store/load pair to/from an otherwise unused local stack variable, which could be simplified to a mov between 2 xmm registers. I find that a bit strange, being used to seeing ICC get rid of everything it can, which makes dead code elimination hell to avoid in performance tests btw :)
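A minimal sketch of the pattern in #2, under assumptions: the Vec2 type and the functions add_vec2, sum_of_add and add_vec2_reg are hypothetical stand-ins, not the actual code, and whether the store/reload survives in the generated assembly depends on the compiler and on inlining.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics

// Hypothetical plain struct used as the return type.
struct Vec2 { double v[2]; };

// The result lives in an xmm register but has to be stored
// into the returned struct before it can leave the function.
inline Vec2 add_vec2(const Vec2& a, const Vec2& b) {
    Vec2 r;
    _mm_storeu_pd(r.v, _mm_add_pd(_mm_loadu_pd(a.v), _mm_loadu_pd(b.v)));
    return r;
}

// The caller reloads the value into an xmm register right away.
// Ideally the store above and the load below would collapse into
// a register-to-register move (or disappear entirely).
inline double sum_of_add(const Vec2& a, const Vec2& b) {
    Vec2 t = add_vec2(a, b);
    __m128d x = _mm_loadu_pd(t.v);                      // reload of the stored value
    __m128d s = _mm_add_sd(x, _mm_unpackhi_pd(x, x));   // low lane = x[0] + x[1]
    return _mm_cvtsd_f64(s);
}

// One possible workaround: keep the value as __m128d across the call,
// so no store is needed in the first place.
inline __m128d add_vec2_reg(__m128d a, __m128d b) {
    return _mm_add_pd(a, b);
}
```

Returning __m128d directly avoids the round trip through memory, at the cost of exposing the SSE type in the function's interface.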
Note : the code is compiled with the /Ox flag by the latest ICC integrated into MSVC2013.
Thank you in advance for your answers.
An Intel fan, trying to optimize even what doesn't need to be optimized.