Avoid cache splits on 128-bit unaligned memory accesses with SSE3 instructions. Streaming SIMD Extensions 2 (SSE2) provides the MOVDQU instruction for loading data from addresses that are not aligned on 16-byte boundaries. Code sequences that use MOVDQU frequently encounter situations where the source spans a 64-byte (cache-line) boundary. A load that spans a cache-line boundary causes a hardware stall and degrades software performance.
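To make the split condition concrete, the following is a minimal sketch that tests whether a 16-byte load starting at a given address would cross a cache-line boundary. The 64-byte line size is taken from the text above; the helper name load_splits_cache_line and the two macros are hypothetical, introduced only for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 64u /* assumed line size, per the 64-byte boundary above */
#define VECTOR_BYTES     16u /* size of one 128-bit load */

/* Hypothetical helper: returns true when a 16-byte load starting at addr
 * would touch two cache lines, i.e. when its last byte lies in the next line. */
static bool load_splits_cache_line(uintptr_t addr)
{
    uintptr_t offset = addr & (CACHE_LINE_BYTES - 1u); /* position within the line */
    return offset > CACHE_LINE_BYTES - VECTOR_BYTES;   /* offsets 49..63 split */
}
```

Under these assumptions, a load at offset 48 within a line still fits (bytes 48 to 63), while a load at offset 49 spills one byte into the following line.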
Use LDDQU, a special 128-bit unaligned load designed to avoid cache-line splits. If the address of the load is aligned on a 16-byte boundary, LDDQU loads the 16 bytes requested. If the address of the load is not aligned on a 16-byte boundary, LDDQU loads a 32-byte block starting at the 16-byte aligned address immediately below the address of the load request. It then provides the requested 16 bytes. If the address is aligned on a 16-byte boundary, the effective number of memory requests is implementation-dependent (one or more). Because LDDQU usually accesses more data than is needed (32 bytes when 16 are needed), and because the number of memory accesses is implementation-dependent, take great care when dealing with uncached or write-combining (WC) memory regions.
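As an illustration, the sketch below reads integer data from an arbitrarily aligned pointer using the compiler intrinsic for LDDQU, _mm_lddqu_si128 from pmmintrin.h. The function name sum_bytes_lddqu and the byte-summing task are hypothetical, chosen only to give the load something to do. The target("sse3") attribute is a GCC/Clang extension assumed here so the code builds without a separate SSE3 compiler flag; the buffer is assumed to be ordinary write-back cacheable memory.

```c
#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h> /* SSE2: _mm_sad_epu8, _mm_add_epi64, _mm_storeu_si128 */
#include <pmmintrin.h> /* SSE3: _mm_lddqu_si128 */

/* Hypothetical helper: sums n bytes starting at src, which may have any alignment. */
__attribute__((target("sse3")))
uint64_t sum_bytes_lddqu(const uint8_t *src, size_t n)
{
    const __m128i zero = _mm_setzero_si128();
    __m128i acc = zero;
    size_t i = 0;

    /* Process full 16-byte chunks with the LDDQU unaligned load. */
    for (; i + 16 <= n; i += 16) {
        __m128i v = _mm_lddqu_si128((const __m128i *)(src + i));
        /* Sum of absolute differences against zero adds up each group of
         * 8 bytes into one of the two 64-bit lanes. */
        acc = _mm_add_epi64(acc, _mm_sad_epu8(v, zero));
    }

    /* Combine the two 64-bit partial sums. */
    uint64_t lanes[2];
    _mm_storeu_si128((__m128i *)lanes, acc);
    uint64_t total = lanes[0] + lanes[1];

    /* Handle the remaining tail bytes one at a time. */
    for (; i < n; ++i)
        total += src[i];
    return total;
}
```

The data here is only read, never written just before the load, so no store-to-load forwarding is involved. Whether LDDQU is actually faster than MOVDQU for a given loop depends on the processor and should be measured.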
LDDQU is typed for integer data and is best used with integer data. Because of implementation issues, restrict the use of LDDQU to situations where no store-to-load forwarding is expected. Where store-to-load forwarding is expected, use regular store/load pairs (aligned or unaligned, depending on the alignment of the data accessed).