My code uses AVX most of the time.
However, from time to time I have to work with vectors of four floats, and for those I use SSE intrinsics such as _mm_sub_ps, _mm_add_ps, ...
I read that there is a huge penalty involved in mixing SSE and AVX code like this.
Why is that, and how large is that penalty? Should I even use scalar operations for the vec4s instead?
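To make the vec4 case concrete, here is a minimal sketch of the two variants I am comparing. Vec4, add_sub_sse and add_sub_scalar are made-up names for illustration only, not my real code, and the operation (a + b) - c is just a stand-in for the kind of arithmetic involved.

```cpp
#include <xmmintrin.h>  // SSE intrinsics: _mm_loadu_ps, _mm_add_ps, _mm_sub_ps, _mm_storeu_ps

// Hypothetical 4-float vector type, used only for this illustration.
struct Vec4 {
    float v[4];
};

// SSE variant: computes (a + b) - c on all four lanes at once.
// Unaligned loads/stores are used so no alignment is assumed for Vec4.
inline Vec4 add_sub_sse(const Vec4& a, const Vec4& b, const Vec4& c) {
    __m128 va = _mm_loadu_ps(a.v);
    __m128 vb = _mm_loadu_ps(b.v);
    __m128 vc = _mm_loadu_ps(c.v);
    __m128 r  = _mm_sub_ps(_mm_add_ps(va, vb), vc);
    Vec4 out;
    _mm_storeu_ps(out.v, r);
    return out;
}

// Scalar variant: the same computation, one element at a time.
inline Vec4 add_sub_scalar(const Vec4& a, const Vec4& b, const Vec4& c) {
    Vec4 out;
    for (int i = 0; i < 4; ++i) {
        out.v[i] = (a.v[i] + b.v[i]) - c.v[i];
    }
    return out;
}
```

Both functions should produce the same result; the question is only which one is cheaper when the surrounding code is AVX.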
How large is the penalty when a function that is already converted to AVX calls an old function that still uses SSE (not inlined)?
All in all, the speedup from using 64-bit and AVX is currently only about 30 percent for my app; maybe I can get more out of it.