Misalignment of memory access is a problem commonly encountered when optimizing code with Streaming SIMD Extensions 2 (SSE2). An SSE2 algorithm often requires loading and storing data 16 bytes at a time to match the size of the XMM registers. If alignment cannot be guaranteed, some part of the performance gain achieved by processing multiple data elements in parallel will be lost because either the compiler or assembly programmer must use unaligned move instructions.
How much of a penalty hit will you experience? Empirical evaluation using a 2.8 Ghz Pentium® 4 processor system shows that an unaligned 16-byte load contained within one cache line (128 bytes) is only moderately slower–about 40%–compared to an aligned access. The cost rises sharply though when the 16-byte chunk crosses a cache line boundary. Such cache line splitting loads can be up to five times slower!
These penalties can sometimes be avoided by forcing 16-byte alignment on the data structures from which the SIMD operands are being drawn. You can do this rather cleanly using either the "__declspec(align(16))" directive for static variables, or the "_mm_malloc()" call for dynamic memory allocation. Both of these language extensions originated with the Intel® Compiler but are also supported in more recent versions of the Microsoft Visual C++* compiler (see the Intel Compiler documentation for details).
Given the nature of a particular algorithm, some misalignments cannot be resolved by any means. Motion estimation or motion compensation algorithms used in a video codec are good examples. These algorithms typically process pixel data in 16x16 chunks, also known as macroblocks. While this matches well with the XMM register size (each pixel is one byte in length), misaligned loads are prevalent because the 16x16 block can reside anywhere within the video frame.
Where unaligned access is unavoidable, several techniques can be employed to minimize the performance loss. This paper will present several guidelines to consider, using a "Quarter Pixel Interpolation" routine to illustrate.
Quarter Pixel Interpolation
Without getting too immersed in the theory of motion estimation, let's go through the algorithm in question: The quarter pixel average is to be computed over a 16x16 block of pixels. You can envision this as a small square shape within a video image displayed on your monitor. For all 256 pixels in the block, an average will be taken between the given pixel, the pixels adjacent to the right and below, and it's diagonal neighbor to the lower right. The four byte-length values are summed and then divided by four to obtain the average. Note that for the pixels on the bottom row and right-most column of the 16x16 block, we need to access data outside the block to obtain the quarter pixel average.
Since we're adding four bytes together, each intermediate sum needs to be represented as a word-length value (two bytes) to avoid saturation. So, the 16-byte XMM registers only allow us to process half a row (eight pixels) at a time.
Figure 1 shows a first attempt at coding this routine using SSE2 assembly. The computational sections are condensed to pseudo-code comments to keep the focus on the memory access pattern. The 'RowLoop' produces one row of quarter pixel averages per iteration and executes a total of sixteen times. The top half of the loop processes the left side of the row, while the bottom processes the right side.
For the left side calculations, the SIMD algorithm requires four vector operands, all represented as packed word values: TopRow[0:7], TopRow[1:8], BottomRow[0:7], & BottomRow[1:8]. Figure 2 should help visualize how they are drawn from the macroblock. These operands are arranged through a series of unpack and shift instructions. Once this organizational overhead is complete, eight quarter pixel averages can be computed in just a handful of SSE2 instructions.
The second half of the loop on the right side of the row works in very similar fashion. The four operands here are TopRow[8:15], TopRow[9:16], BottomRow[8:15], & BottomRow[9:16]. Once the averages from both sides of the row are calculated, they can be packed back into one XMM register as byte values and stored to the destination buffer.
Reduce Your Exposure
If you have experience developing SSE2 assembly code, you may have already identified some inefficiency with the code in Figure 1. In processing the left side of the row, we twice load 16 bytes using MOVDQU even though we only need the first nine bytes of each row to fill out the four vector operands. The MOVDQU instruction is convenient because it brings in all the data at once, but it also fetches seven extra bytes that are eventually discarded.
We can improve upon this by looking at the memory access pattern of the entire loop. The seven leftover bytes from the left side processing would actually give a nice head start to the right side work further down in the loop. With a little tweaking, we can completely remove two of the four MOVDQU instructions from the loop, as shown in Figure 3. The instructions colored green indicate changes from the code in Figure 1. The row data loaded in the top half of the iteration is saved in a register for the bottom, where bytes [8:15] can be used for the right side averaging.
Note that some extra finagling is required to set up the [9:16] operands. The PSRLDQ-PUNPCKHBW sequence shifts zero into the upper byte of the register and unpacks the upper eight bytes to word values. This leaves pixels [9:15] and an empty slot where we need . This is accommodated by MOVZX & PINSRW, which load byte  into a general register, zero extend it to a 32-bit value, and insert it into the final XMM word position to complete the [9:16] operand.
This bit of rework increased performance of the function by 13%, as we reduced the total memory load bandwidth from 64 bytes to 34. While minimizing the number of memory accesses in a tight loop is generally a good idea, it is especially good when accesses are unaligned. A cache line split could be lurking between any two pixels, so this optimization cuts our exposure to that perfor mance killer in half.
Look Between The Loops
The next performance opportunity is not so obvious because it lies between successive passes through the loop. At the close of each iteration, the row pointer (held in register esi) is advanced to the next row of the macroblock. Upon returning to the top of the loop, the bottom row from the previous time through becomes the top row in the ensuing pass. The code then gathers the same vector operands for this row as were gathered just a few cycles ago in the prior iteration. This is very inefficient. With some additional code revision, these operands can be passed to the next iteration, avoiding duplication of work.
Figure 4 contains the modified routine. First, note that the computations are partitioned differently so that we can pass along the largest subexpressions common to successive row averages. These subexpressions are the sum (BottomRow[0:7] + BottomRow[1:8]) for the left side and (BottomRow[8:15] + BottomRow[9:16]) for the right. Instead of doing the entire left side average first, followed by the right side, we compute the four subexpressions drawn from one row, add them to the values carried over from the previous iteration, and divide by four to get the quarter pixel average.
This structure is akin to a software-pipelined loop, with a prolog, kernel, and epilog. The code ahead of the RowLoop constitutes the prolog, loading data for the top row of the macroblock and computing the intermediate sums into xmm0 (left side) &xmm1 (right side). The loop itself is the kernel. Starting at the top, it expects that xmm0&xmm1 contain the intermediate data from the top row. The loop then generates the same intermediates from the bottom row and leaves them in xmm2&xmm3. After computing and storing the quarter pixel average for the row, a couple of MOVDQA instructions transfer the contents of xmm2&xmm3 into xmm0&xmm1, respectively, to set up for the next iteration. An epilog is not coded here but could be accomplished by reducing the loop count to fifteen and unrolling the 16th iteration, minus the two MOVDQA instructions. These are extraneous the last time through since the program flow is not returning to the top of the loop, but the performance impact is negligible and not worth the increase in code size.
This second pass of optimization gave an additional 22% speedup on the function. Much of that is due to removing yet another MOVDQU instruction from the loop.
Hoisting Loads To Hide Memory Latency
We've already eliminated three of four MOVDQU unaligned loads and reduced execution time by 30%. What more can be done? Well, the Intel VTune™ Analyzer i s usually helpful in spotlighting remaining performance opportunities. Figure 5 highlights a time-based sampling session. The red shadow pinpoints a hotspot in between the lone MOVDQU instruction in the loop and the MOVDQA following it. Note that the MOVDQA and almost every other instruction in the loop are dependent on the data being loaded by the MOVDQU. This long dependency chain means that the CPU will have very little work to do while the memory subsystem gathers the data. That is why we see a large collection of samples (187) on the MOVDQA instruction. The processor is stalled here waiting for the MOVDQU to return data.
While the out-of-order execution pipeline strives to hide memory latency, it can always use some help. This is especially true for an unaligned load that may occasionally cross a cache-line boundary. In the quarter pixel averaging loop, we can easily specify what data will be needed in subsequent iterations. Why not give the CPU a head start in fetching them?
This notion is applied in Figure 6, where the code employs a more fine-grained software-pipelining technique to help hide MOVDQU latency. The changes are highlighted in blue. The MOVDQU into xmm2 is moved up into the prolog code, and in its place is a MOVDQU into xmm6 to fetch the row that will be processed in the following iteration. Another MOVDQA is needed at the loop end to transfer xmm6 into xmm2, setting up for the next iteration.
This technique is sometimes referred to as "hoisting" a memory load, and it is effective at hiding latency for unaligned loads as well as aligned accesses that frequently miss in the data cache. By hoisting the MOVDQU one iteration ahead, we give the CPU nearly a loop's worth of instructions to execute in parallel while the misaligned data is obtained. This tacked on another 5% speedup, not trivial when you consider that the technique can be applied to many motion estimation routines in a video codec.
On the last run through the loop, we now have one MOVDQU and three MOVDQA's that are extraneous. These instructions are preparing data for the next iteration, which never occurs. This may provide a little more incentive to include an epilog, but real performance trials didn't indicate any measurable gain from it.
Platform and Technology Discussion Forums
This paper stepped through three successive optimizations on a quarter pixel interpolation routine, each reducing the impact of misaligned memory access. These code changes improved function-level performance by over 35%.
Ardent followers of Intel® chip technology may already know about the LDDQU instruction to be included in Prescott, code name for the new generation of IA-32 processors. LDDQU will provide a faster alternative to MOVDQU when loading data that is not 16-byte aligned. Performance data on LDDQU was unavailable as of April 2003), but nonetheless the techniques outlined in this paper should remain essential to achieving optimum performance.
About the Author
Mike Stoner is a Senior Applications Engineer with Intel's Software Solutions Group. He has been with Intel since 1996, mainly in the role of helping independent software vendors develop optimized code for Intel platforms. Prior to joining Intel Mike received Bachelors and Masters degrees in Electrical Engineering from The Ohio State University. He can be reached at firstname.lastname@example.org.
Figure 1. Original SSE2 code for Quarter Pixel Interpolation
Figure 2. Four SIMD operands for left side average calculation
Figure 3. Reducing the memory load bandwidth of the loop
Figure 4. Reusing intermediate values between loop iterations