History of … one CPU instruction: Part 1. LDDQU/movdqu explained

Once upon a time, back in 2000, Intel brought to market the NetBurst microarchitecture (http://en.wikipedia.org/wiki/NetBurst_%28microarchitecture%29) with the Pentium 4 CPUs.
In 2004, with the Prescott revision of that core, and as part of the SSE3 instruction set, we got the LDDQU instruction.

The main focus area at the time was video encoding:
The most compute-intensive part of a video encoder is usually Motion Estimation (ME) where blocks from the
current frame are checked against blocks from the previous frame to find the best match. Many metrics can
be used to define the best match. The most common is the L1 metric: the sum of absolute differences. Due to
the nature of ME, loads of the blocks from the previous frame are unaligned whereas loads of the blocks from
the current frame are aligned. Unaligned loads suffer two penalties:
• cost of handling the unaligned access
• impact of the cache line splits
The NetBurst microarchitecture does not support a uop to load 128-bit unaligned data. For that reason, 128-bit
unaligned load instructions, such as movups and movdqu, are emulated with microcode, using two 64-
bit loads whose results are merged to form the 128-bit result. In addition to the cost of the emulation, unaligned
loads are penalized by the cost of handling cache line splits if the access crosses a 64-byte boundary.
SSE3 adds lddqu to solve the cache line split problem on 128-bit unaligned loads. The instruction works by
loading a 32-byte block aligned on a 16-byte boundary, extracting the 16 bytes corresponding to the unaligned
access. Because the instruction loads more bytes than requested, some usage restrictions apply. Lddqu should
be avoided on Uncached (UC) and Write-Combining (USWC) memory regions. Also, by its implementation,
lddqu should be avoided in situations where store-load forwarding is expected. In load-only situations, and with
memory regions that are not UC or USWC, lddqu can advantageously replace movdqu/movups/movupd.
The code below shows an example of using the new instruction. Both code sequences are similar except that
the load unaligned (movdqu) is replaced by the new unaligned load (lddqu). With the assumption that 25%
of the unaligned loads are across a cache line, the new instruction can improve the performance of ME by up to
30%. MPEG-4 encoders have demonstrated speedups greater than 10%.

Now for some code snippets.

Motion Estimator without SSE3:
movdqa xmm0, <current>    ; aligned 128-bit load from the current frame
movdqu xmm1, <previous>   ; unaligned 128-bit load from the previous frame
psadbw xmm0, xmm1         ; sums of absolute differences per 8-byte lane
paddw  xmm2, xmm0         ; accumulate the SAD

Motion Estimator with SSE3:
movdqa xmm0, <current>    ; aligned 128-bit load from the current frame
lddqu  xmm1, <previous>   ; unaligned load without the cache-line-split penalty
psadbw xmm0, xmm1         ; sums of absolute differences per 8-byte lane
paddw  xmm2, xmm0         ; accumulate the SAD

The explanation and code above are quoted from http://download.intel.com/technology/itj/2004/volume08issue01/art01_microarchitecture/vol8iss1_art01.pdf (Intel Technology Journal, Volume 8, Issue 1).
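
For readers who prefer intrinsics over bare assembly, here is a minimal C sketch of the same SAD accumulation, compiled with SSE3 enabled (e.g. gcc -msse3). The function name sad_16x16_sse3, the 16x16 block size and the stride parameter are my own illustrative assumptions, not from the article; the mapping to the instructions above is _mm_load_si128 = movdqa, _mm_lddqu_si128 = lddqu, _mm_sad_epu8 = psadbw, _mm_add_epi16 = paddw.

#include <emmintrin.h>  /* SSE2: _mm_load_si128, _mm_sad_epu8, _mm_add_epi16, _mm_extract_epi16 */
#include <pmmintrin.h>  /* SSE3: _mm_lddqu_si128 */
#include <stddef.h>
#include <stdint.h>

/* Sum of absolute differences for one 16x16 block (illustrative sketch).
   cur points at a 16-byte-aligned block of the current frame,
   prev at a possibly unaligned block of the previous frame,
   stride is the frame pitch in bytes. */
static uint32_t sad_16x16_sse3(const uint8_t *cur, const uint8_t *prev, size_t stride)
{
    __m128i acc = _mm_setzero_si128();
    for (int row = 0; row < 16; ++row) {
        __m128i c = _mm_load_si128((const __m128i *)(cur  + row * stride));  /* movdqa */
        __m128i p = _mm_lddqu_si128((const __m128i *)(prev + row * stride)); /* lddqu  */
        acc = _mm_add_epi16(acc, _mm_sad_epu8(c, p));                        /* psadbw + paddw */
    }
    /* psadbw leaves one 16-bit sum in the low word of each 64-bit half; for a
       16x16 block of bytes the sums fit in 16 bits, so paddw does not overflow. */
    return (uint32_t)_mm_extract_epi16(acc, 0) + (uint32_t)_mm_extract_epi16(acc, 4);
}

On a Prescott-era CPU this trades the movdqu cache-line-split penalty for the wider lddqu load; on Core 2 and later the two loads behave identically, as summarized below.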

A bit later there were some follow-ups, the most notable being:
/en-us/forums/showthread.php
and
http://x264dev.multimedia.cx/archives/8

So, in summary: starting with the Intel Core 2 brand (the Core microarchitecture, from mid-2006, Merom CPUs and later) and onward, lddqu does the same thing as movdqu.

In other words:
if the CPU supports Supplemental Streaming SIMD Extensions 3 (SSSE3) -> lddqu does the same thing as movdqu;
if the CPU doesn't support SSSE3 but does support SSE3 -> go for lddqu
(and keep in mind the note above about memory types).
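
If you want to express that decision in code rather than in prose, here is a hedged sketch using GCC/Clang's __builtin_cpu_supports; the helper name load_unaligned_16 is a made-up illustration, and the whole thing assumes the file is compiled with SSE3 enabled (e.g. -msse3) so the lddqu intrinsic is available.

#include <emmintrin.h>  /* _mm_loadu_si128 -> movdqu */
#include <pmmintrin.h>  /* _mm_lddqu_si128 -> lddqu  */

/* Pick the unaligned 128-bit load per the rule above (illustrative sketch):
   - SSSE3 present (Core 2 and later): lddqu == movdqu, either is fine;
   - SSE3 only (NetBurst/Prescott): prefer lddqu, unless the memory is
     UC/USWC or store-to-load forwarding is expected. */
static __m128i load_unaligned_16(const void *p)
{
    if (__builtin_cpu_supports("sse3") && !__builtin_cpu_supports("ssse3"))
        return _mm_lddqu_si128((const __m128i *)p);  /* lddqu */
    return _mm_loadu_si128((const __m128i *)p);      /* movdqu */
}

In practice, on anything with SSSE3 the branch makes no difference, so most code simply picks one of the two loads unconditionally.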

And the last point: from the patent point of view, be aware of patent number 6721866
http://www.google.com/patents/US6721866
as the approach used is actually protected.

Ultrabooks, on the other hand, being on the cutting edge, tend to use an even more advanced technology feature called Quick Sync Video (QSV), which allows all video decode and encode activity to be offloaded from the main CPU to the integrated graphics, making it faster and more power efficient.

About development in this area, just note a key link for now: /en-us/articles/vcsource-tools-media-sdk/

PS: FYI, one good place for an "all Intel microarchitectures" overview: http://en.wikipedia.org/wiki/List_of_Intel_CPU_microarchitectures