I'm wondering what the guidelines are for using the LDDQU vs. MOVDQU instructions on the latest and future Intel CPUs.
I know that in the NetBurst era, LDDQU was supposed to be the more efficient way to load unaligned data that would not be modified soon. Later, in the Core architectures, MOVDQU was updated to become equivalent to LDDQU. The general guideline was therefore to use LDDQU: it would be no worse than MOVDQU on newer CPUs and faster on older ones.
However, in Agner Fog's latest instruction tables for Skylake, LDDQU is listed with a latency one cycle longer than MOVDQU's, which raises the following questions:
1. Does this mean that LDDQU is no longer equivalent to MOVDQU? If so, what is the difference?
2. Is this discrepancy an unfortunate (mis)feature of the Skylake architecture that is intended to be "fixed" in future architectures, or is the change permanent?
3. What are the guidelines for choosing one instruction over the other? I'm interested in modern architectures (say, Haswell and later) as well as future CPU architectures.
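
For context, this is roughly how the two loads look from C. It is only a sketch: it assumes GCC or Clang on x86 (the `target("sse3")` attribute is there so it builds without `-msse3`), and the helper names `load_movdqu`, `load_lddqu` and `loads_agree` are made up for illustration. It checks only that both intrinsics return the same bytes at every misalignment; it says nothing about latency, and the compiler is free to pick a different encoding (for example a VEX form when AVX is enabled).

```c
#include <immintrin.h>  /* provides _mm_loadu_si128 (SSE2) and _mm_lddqu_si128 (SSE3) */
#include <stdint.h>
#include <string.h>

/* Assumption: GCC/Clang. Enables SSE3 per function instead of via -msse3. */
#if defined(__GNUC__) || defined(__clang__)
#define TARGET_SSE3 __attribute__((target("sse3")))
#else
#define TARGET_SSE3
#endif

/* Load 16 bytes from a possibly unaligned address; normally compiles to MOVDQU. */
TARGET_SSE3
static __m128i load_movdqu(const void *p) {
    return _mm_loadu_si128((const __m128i *)p);
}

/* Same load through the SSE3 intrinsic; normally compiles to LDDQU. */
TARGET_SSE3
static __m128i load_lddqu(const void *p) {
    return _mm_lddqu_si128((const __m128i *)p);
}

/* Hypothetical helper: returns 1 if both loads yield identical bytes
   at every byte offset 0..15 within a 32-byte buffer, 0 otherwise. */
TARGET_SSE3
static int loads_agree(void) {
    uint8_t buf[32];
    for (int i = 0; i < 32; i++) {
        buf[i] = (uint8_t)(i * 7 + 1);
    }
    for (int off = 0; off < 16; off++) {
        __m128i a = load_movdqu(buf + off);
        __m128i b = load_lddqu(buf + off);
        if (memcmp(&a, &b, sizeof a) != 0) {
            return 0;
        }
    }
    return 1;
}
```

Since the results are the same, my question is purely about performance: which of the two forms should I prefer?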