Introduction
The media extensions to the Intel Architecture (IA) instruction set include single-instruction, multiple-data (SIMD) instructions. Streaming SIMD Extensions 2 (SSE2) instructions extend SIMD for the Intel NetBurst® microarchitecture with 144 new instructions. This paper describes an SSE2 implementation to calculate the absolute difference between two 16x16 blocks of pixels. This paper also compares the performance gains of that solution to non-SSE2 implementations. This SSE2 solution can be an integral part of a motion-estimation kernel.
Motion Estimation
One important technique used in video compression is to predict movement between consecutive frames. In many cases, the visual representation of a moving object stays the same from frame to frame, merely changing position within the viewing field. A substantially-improved compression ratio can be achieved by producing displacement vectors of the object from frame to frame, instead of compressing the object separately in each frame. Calculating these vectors is called motion estimation, and it requires the calculation of an absolute difference between blocks of frames.
Absolute Difference
The motion-estimation function sums the absolute differences (or squared differences) between the pixel values of two different 16x16 blocks and finds the best match. In MPEG1, for example, the calculation can be made in four ways. Either the absolute difference between the pixel values (L1 norm) or the square of the differences (L2 Norm) is summed. Orthogonal to the difference equation, this sum can be accumulated with respect to a reference block that has been shifted, either by some number of whole pixel positions or by some number of half-pixel positions.
A C code example for a 16x16 pixel full pel motion estimation using absolute differences is given in Example 1. The code has a fast-out, so that if the difference so far accumulated across a subset of rows is more than the best match so far, then it aborts the rest of the absolute-difference calculation.
Example 1: C Code to Perform 16x16 Pixel-Block Motion Estimation
char *btpr; /* pointer to start row of 16x16 pixel block being compressed */
char *cptr; /* pointer to start row of 16x16 pixel reference block */
val = 0;
for (i=0; i<16; i++) {
for (j=0; j<16; j++) {
data = (*(bptr++)- *(cptr++));
if (data<0){val -= data;}
else {val += data;}
}
/* Fast out after this row if best match has been exceeded */
if (val > best_value) break;
/* Update pointer to next row */
bptr += (rm->width - 16);
/* Update pointer
to next row */
cptr += (cm->width - 16);
}
Optimized SSE2 Code
The flow of coding the motion-estimation inner loop using SSE2 instructions is shown in Figure 1. The operation uses a psubusb (packed byte subtract unsigned with saturation) to generate the absolute differences without requiring 16-bit precision.
Usually, subtraction of two 8-bit unsigned numbers produces a 9-bit signed result. Thus, in order to retain precision, it may seem that a conversation to 16 bits is needed before the subtract operation, but such conversion is unnecessary when using the psubusb instruction. By subtracting source 2 from source 1 and then performing the opposite operation (on a copy of one of the sources), each result register has the absolute difference value for the case when a respective element in source 1 was larger than a respective element in source 2, and zero otherwise (because a negative result saturates to zero). The same result occurs for the opposite subtraction, which generates the absolute differences in those cases where the elements in source 2 were larger than the elements in source 1.
Performing an OR operation on the results of the two subtractions generates the eight desired absolute-difference results. This technique allows the absolute difference to be performed in byte precision, which is substantially faster than if it were done in 16 bit precision.
Figure 1. Motion Estimation Inner Loop Using Packed Data
The SSE2 code for the inner-most computation of absolute difference is presented in Example 2. This code does not include the fast-out option that was incorporated into the C code of Example 1. In general, this fast-out is not relevant after each 16x1 estimation, as the overhead cost of adding the fast-out for every iteration of the SSE2 instruction loop is prohibitive.
Example 2. Inner-Core Absolute Differences for 16x16 Pixel Block
__asm {
movdqu xmm0, [m1]
movdqu xmm1, [m2]
movdqa xmm2, xmm0
psubusb xmm0, xmm1
psubusb xmm1, xmm2
por xmm0, xmm1
movdqa xmm1, xmm0
punpcklbw xmm0, xmm6
punpcklbw xmm1, xmm6
movdqa xmm3, xmm1
pshufd xmm1, xmm0, 238
pshufd xmm3, xmm0, 68
paddw xmm1, xmm3
movdqa xmm4, xmm1
pshufd xmm4, xmm4, 78
paddw xmm1, xmm4
}
Optimized Scalar Implementation
It is possible to perform an absolute-difference calculation using the same technique without SSE2 instructions, but only on one data element at a time, as opposed to the parallelism that SSE2 instruction provide. This technique results in the more-optimized scalar code listed in Example 3.
Example 3. Optimized Scalar Code
__asm {
push esi
push edi
push ebx
push m1
push m2
pop edi
pop esi
xor ecx, ecx
xor edx, edx
xor ebx, ebx
xor eax, eax
mov bl, byte ptr [m1]
mov cl, byte ptr [m2]
sub ecx, ebx
mov bl, byte ptr 1[m1]
xor cl, ch
mov dl, byte ptr 1[m2]
sub cl, ch
sub edx,ebx
and ecx,0xff
xor dl, dh
add eax,ecx
sub dl, dh
and edx,0xff
mov bl, byte ptr 2[m1]
add eax,edx
mov cl, byte ptr 2[m2]
sub ecx,ebx
mov bl, byte ptr 3[m1]
xor cl, ch
mov dl, byte ptr 3[m2]
sub cl, ch
sub edx,ebx
and ecx,0xff
xor dl, dh
add eax,ecx
sub dl, dh
and edx,0xff
mov bl, byte ptr 4[m1]
add eax, edx
mov cl, byte ptr 4[m2]
sub ecx, ebx
mov bl, byte ptr 5[m1]
xor cl, ch
mov dl, byte ptr 5[m2]
sub cl, ch
sub edx,ebx
&
nbsp; and ecx,0xff
xor dl, dh
add eax,ecx
sub dl, dh
and edx,0xff
mov bl, byte ptr 6[m1]
add eax, edx
mov cl, byte ptr 6[m2]
sub ecx, ebx
mov bl, byte ptr 7[m1]
xor cl, ch
mov dl, byte ptr 7[m2]
sub cl, ch
sub edx, ebx
and ecx, 0xff
xor dl, dh
add eax, ecx
sub dl, dh
and edx, 0xff
mov bl, byte ptr 8[m1]
add eax, edx
mov cl, byte ptr 8[m2]
sub ecx, ebx
mov bl, byte ptr 9[m1]
xor cl, ch
mov dl, byte ptr 9[m2]
sub cl, ch
sub edx, ebx
and ecx, 0xff
xor dl, dh
add eax, ecx
sub dl, dh
and edx, 0xff
mov bl, byte ptr 10[m1]
add eax,edx
mov cl, byte ptr 10[m2]
sub ecx, ebx
mov bl, byte ptr 11[m1]
xor cl, ch
mov dl, byte ptr 11[m2]
sub cl, ch
sub edx, ebx
and ecx, 0xff
xor dl, dh
add eax,ecx
sub dl, dh
and edx, 0xff
mov bl, byte ptr 12[m1]
add eax, edx
mov cl, byte ptr 12[m2]
sub ecx, ebx
mov bl, byte ptr 13[m1]
xor cl, ch
mov dl, byte ptr 13[m2]
sub cl, ch
sub edx,ebx
and ecx, 0xff
xor dl, dh
add eax, ecx
sub dl, dh
and edx,0xff
mov bl, byte ptr 14[m1]
add eax, edx
mov cl, byte ptr 14[m2]
sub ecx, ebx
mov bl, byte ptr 15[m1]
xor cl, ch
mov dl, byte ptr 15[m2]
sub cl, ch
sub edx,ebx
and ecx, 0xff
xor dl, dh
add eax, ecx
sub dl, dh
and edx, 0xff
pop ebx
add eax, edx
}
Performance Gains
The motion-estimation technique using SSE2 instructions that is described in this paper compares extremely favorably with the corresponding C implementation, as well as with the optimized scalar version. On an Intel® Pentium® processor implementation, the speedups are as follows:
- SSE2 technology speedup over C implementation: 13X
- SSE2 technology speedup over optimized scalar: 5X
Absolute Difference Function Code Listing
.586 include ia_emm.inc ASSUME ds:FLAT, cs:FLAT, ss:FLAT _TEXT SEGMENT DWORD PUBLIC USE32 'CODE' _TEXT ENDS _TEXT SEGMENT DWORD PUBLIC USE32 'CODE' ; performing saturated subtractions and merging ; results of differences (por). ; unpacking is then performed to 16-bit precision ; to accumulate differences without ; overflow. Accumulation into xmm1, and result is ; broadcast across 16-byte register. estim Proc C Public uses m1: DWORD, m1:DWORD movdqu xmm0, XMMWORD PTR [m1] ; grab bptr row movdqu xmm1, XMMWORD PTR [m2] ; grab cptr row movdqa xmm2, xmm2 ; make a copy for abs diff psubusb xmm0, xmm1 ; do subtraction one way psubusb xmm1, xmm2 ; do subtraction the other way por xmm0, xmm1 ; merge results movdqa xmm1, xmm0 ; keep a copy punpcklbw xmm0, xmm6 ; unpack to higher precision punpcklbw xmm1, xmm6 ; unpack to higher precision movdqa xmm3, xmm1 ; keep a copy for accumulation pshufd xmm1, xmm0, 238 ; shuffle high bits into xmm1 pshufd xmm3, xmm0, 68 ; shuffle low bits into xmm3 paddw xmm1, xmm3 ; add results movdqa xmm4, xmm1 ; keep a copy pshufd xmm4, xmm4, 78 ; shuffle bits to add paddw xmm1, xmm4 ; add results and store xmm1 estim EndP
Related Resources
