I am experiencing some SSE instructions, such as __mm_dp_ps and __mm_blend_ps, performance issues.
The problem is when I move from my old combination of _mm_add_ps, _mm_mult_ps and _mm_shuffle_ps to __mm_dp_ps and __mm_blend_ps, the overall performance is about the same, or slightly worse.
Although the total instruction number has reduced in my example. I had 18 multiply, 13 add, 1 movehl, 1 shuffle before. Now I have 14 dot product. 3 add, 6 blend and 2 shuffles.
This is done in a highly repeated loop. The instruction count is for each iteration. My data accumulation is for 6, so I need to use two registries.
Any help? Thanks.