The Intel 64 and IA-32 Architectures Optimization Reference Manual states (see section 5.1):
"Code sequences containing cross-typed usage produce the same result across
different implementations but incur a significant performance penalty. Using
SSE/SSE2/SSE3/SSSE3 instructions to operate on type-mismatched SIMD data
in the XMM register is strongly discouraged." (The underline is mine.)
Is there any exact data on these performance penalties?
Specifically, what would be the penalty of mixing movhlps (single-precision type) with addsd (double-precision type), e.g.
How much more efficient would it be to use the following instead?
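As an illustrative sketch only, and not the exact sequences asked about, the two variants being compared might look like this in C with SSE2 intrinsics. The function names `hsum_cross_typed` and `hsum_type_matched` are hypothetical, the choice of unpckhpd as the type-matched replacement for movhlps is an assumption, and a compiler is free to emit different instructions than the intrinsics suggest.

```c
#include <emmintrin.h>  /* SSE2 intrinsics (also pulls in the SSE header) */

/* Cross-typed variant: movhlps (single-precision type) feeds addsd
 * (double-precision type). The casts only reinterpret the register
 * and generate no instructions. */
static double hsum_cross_typed(__m128d v)
{
    __m128  vf = _mm_castpd_ps(v);
    /* movhlps: copy the high 64 bits of the source into the low 64 bits */
    __m128d hi = _mm_castps_pd(_mm_movehl_ps(vf, vf));
    /* addsd: add the low doubles, then extract the low double */
    return _mm_cvtsd_f64(_mm_add_sd(v, hi));
}

/* Type-matched variant (assumed alternative): unpckhpd is a
 * double-precision instruction, so both steps use the same data type. */
static double hsum_type_matched(__m128d v)
{
    /* unpckhpd: place the high double of v in both lanes */
    __m128d hi = _mm_unpackhi_pd(v, v);
    return _mm_cvtsd_f64(_mm_add_sd(v, hi));
}
```

Both functions return the sum of the two doubles in `v`, so any difference between them is purely one of performance. Whether a penalty is measurable has to be checked per microarchitecture, for example by timing each function in a long dependent loop.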