I have yet to figure out the point of these instructions. I have tested them in many cases, mostly in an attempt to replace the following structure:
movq xmmreg1, memory1
movq xmmreg2, memory2
punpcklbw xmmreg1, zeroreg
punpcklbw xmmreg2, zeroreg
psubw xmmreg1, xmmreg2
with:
pmovzxbw xmmreg1, memory1
pmovzxbw xmmreg2, memory2
psubw xmmreg1, xmmreg2
However, in every case I have tested, the pmovzx version is either slower or, at best, the same speed, despite being fewer instructions and despite mubench listing pmovzx as a 1-latency, 1-throughput instruction.
What is this instruction supposed to be used for if it is slower than movq/punpck? Or, if it's not supposed to be slower than movq/punpck, why might it be slower in these cases?



