Another benchmark: while testing compare performance, my next step was to measure branching on the compare results, so I wanted to show the impact of ptest in comparison to pmovmskb + cmp. But my results show that ptest is slower in almost all cases. See the first page of compare.pdf for the results. I would understand ptest and pmovmskb showing the same speed if both instructions count as being in the "integer domain" and therefore both incur the same 1-cycle domain-crossing penalty (is this correct?).
Am I understanding correctly that, in principle, ptest and pmovmskb execute equally fast, and that the cmp + jump pair can be optimized via macro-fusion, so that both vector-branching implementations really are equivalent (except for the one additional GPR that the pmovmskb version requires)? Where, then, could the difference come from?
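For concreteness, the two branching variants can be sketched roughly as follows with C intrinsics. This is only an illustration, not the benchmarked code: the function names are hypothetical, and the use of pcmpeqd as the compare and of an all-ones second operand for ptest are assumptions. The `target` attribute assumes GCC or Clang on x86-64.

```c
#include <emmintrin.h>  /* SSE2: _mm_cmpeq_epi32, _mm_movemask_epi8 */
#include <smmintrin.h>  /* SSE4.1: _mm_testc_si128 (ptest) */

/* Variant A (hypothetical name): pcmpeqd, then pmovmskb into a GPR,
 * then cmp + jcc on that GPR. Costs one extra general-purpose register. */
static int all_equal_movmsk(__m128i a, __m128i b)
{
    __m128i eq = _mm_cmpeq_epi32(a, b);       /* all-ones lanes where a == b */
    return _mm_movemask_epi8(eq) == 0xFFFF;   /* all 16 byte sign bits set?  */
}

/* Variant B (hypothetical name): pcmpeqd, then ptest against an all-ones
 * register, then jcc directly on the carry flag. No GPR is needed. */
__attribute__((target("sse4.1")))
static int all_equal_ptest(__m128i a, __m128i b)
{
    __m128i eq   = _mm_cmpeq_epi32(a, b);
    __m128i ones = _mm_set1_epi32(-1);        /* assumed all-ones operand */
    /* ptest sets CF = ((~eq & ones) == 0), i.e. CF = 1 iff eq is all ones */
    return _mm_testc_si128(eq, ones);
}
```

Both functions return 1 only when every 32-bit lane of `a` equals the corresponding lane of `b`. What the compiler actually emits for the final branch depends on the compiler and flags, so the generated assembly should be checked before timing anything.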
(Yes, I will have to try out the simulator. I have not yet found the time to do so.)
(where xmm2 is 0xfffff...)