# Absolute-Difference Motion Estimation for Intel® Pentium® 4 Processors

### Introduction

The media extensions to the Intel Architecture (IA) instruction set include single-instruction, multiple-data (SIMD) instructions. Streaming SIMD Extensions 2 (SSE2) instructions extend SIMD for the Intel NetBurst® microarchitecture with 144 new instructions. This paper describes an SSE2 implementation to calculate the absolute difference between two 16x16 blocks of pixels. This paper also compares the performance gains of that solution to non-SSE2 implementations. This SSE2 solution can be an integral part of a motion-estimation kernel.

### Motion Estimation

One important technique used in video compression is to predict movement between consecutive frames. In many cases, the visual representation of a moving object stays the same from frame to frame, merely changing position within the viewing field. A substantially-improved compression ratio can be achieved by producing displacement vectors of the object from frame to frame, instead of compressing the object separately in each frame. Calculating these vectors is called motion estimation, and it requires the calculation of an absolute difference between blocks of frames.

### Absolute Difference

The motion-estimation function sums the absolute differences (or squared differences) between the pixel values of two different 16x16 blocks and finds the best match. In MPEG1, for example, the calculation can be made in four ways. Either the absolute difference between the pixel values (L1 norm) or the square of the differences (L2 Norm) is summed. Orthogonal to the difference equation, this sum can be accumulated with respect to a reference block that has been shifted, either by some number of whole pixel positions or by some number of half-pixel positions.

A C code example for a 16x16 pixel full pel motion estimation using absolute differences is given in Example 1. The code has a fast-out, so that if the difference so far accumulated across a subset of rows is more than the best match so far, then it aborts the rest of the absolute-difference calculation.

Example 1: C Code to Perform 16x16 Pixel-Block Motion Estimation

char *btpr; /* pointer to start row of 16x16 pixel block being compressed */

char *cptr; /* pointer to start row of 16x16 pixel reference block */

val = 0;

for (i=0; i<16; i++) {

for (j=0; j<16; j++) {

data = (*(bptr++)- *(cptr++));

if (data<0){val -= data;}

else {val += data;}

}

/* Fast out after this row if best match has been exceeded */

if (val > best_value) break;

/* Update pointer to next row */

bptr += (rm->width - 16);

/* Update pointer
to next row */

cptr += (cm->width - 16);

}

### Optimized SSE2 Code

The flow of coding the motion-estimation inner loop using SSE2 instructions is shown in Figure 1. The operation uses a psubusb (packed byte subtract unsigned with saturation) to generate the absolute differences without requiring 16-bit precision.

Usually, subtraction of two 8-bit unsigned numbers produces a 9-bit signed result. Thus, in order to retain precision, it may seem that a conversation to 16 bits is needed before the subtract operation, but such conversion is unnecessary when using the psubusb instruction. By subtracting source 2 from source 1 and then performing the opposite operation (on a copy of one of the sources), each result register has the absolute difference value for the case when a respective element in source 1 was larger than a respective element in source 2, and zero otherwise (because a negative result saturates to zero). The same result occurs for the opposite subtraction, which generates the absolute differences in those cases where the elements in source 2 were larger than the elements in source 1.

Performing an OR operation on the results of the two subtractions generates the eight desired absolute-difference results. This technique allows the absolute difference to be performed in byte precision, which is substantially faster than if it were done in 16 bit precision.

Figure 1. Motion Estimation Inner Loop Using Packed Data

The SSE2 code for the inner-most computation of absolute difference is presented in Example 2. This code does not include the fast-out option that was incorporated into the C code of Example 1. In general, this fast-out is not relevant after each 16x1 estimation, as the overhead cost of adding the fast-out for every iteration of the SSE2 instruction loop is prohibitive.

Example 2. Inner-Core Absolute Differences for 16x16 Pixel Block

__asm {

movdqu xmm0, [m1]

movdqu xmm1, [m2]

movdqa xmm2, xmm0

psubusb xmm0, xmm1

psubusb xmm1, xmm2

por xmm0, xmm1

movdqa xmm1, xmm0

punpcklbw xmm0, xmm6

punpcklbw xmm1, xmm6

movdqa xmm3, xmm1

pshufd xmm1, xmm0, 238

pshufd xmm3, xmm0, 68

movdqa xmm4, xmm1

pshufd xmm4, xmm4, 78

}

### Optimized Scalar Implementation

It is possible to perform an absolute-difference calculation using the same technique without SSE2 instructions, but only on one data element at a time, as opposed to the parallelism that SSE2 instruction provide. This technique results in the more-optimized scalar code listed in Example 3.

Example 3. Optimized Scalar Code

__asm {

push   esi

push   edi

push   ebx

push   m1

push   m2

pop   edi

pop   esi

xor   ecx, ecx

xor   edx, edx

xor   ebx, ebx

xor   eax, eax

mov    bl, byte ptr [m1]

mov   cl, byte ptr [m2]

sub   ecx, ebx

mov   bl, byte ptr 1[m1]

xor   cl, ch

mov   dl, byte ptr 1[m2]

sub   cl, ch

sub   edx,ebx

and   ecx,0xff

xor   dl, dh

sub   dl, dh

and   edx,0xff

mov   bl, byte ptr 2[m1]

mov   cl, byte ptr 2[m2]

sub   ecx,ebx

mov   bl, byte ptr 3[m1]

xor   cl, ch

mov   dl, byte ptr 3[m2]

sub   cl, ch

sub   edx,ebx

and   ecx,0xff

xor   dl, dh

sub   dl, dh

and   edx,0xff

mov   bl, byte ptr 4[m1]

mov   cl, byte ptr 4[m2]

sub   ecx, ebx

mov   bl, byte ptr 5[m1]

xor   cl, ch

mov   dl, byte ptr 5[m2]

sub   cl, ch

sub   edx,ebx

&
nbsp;     and   ecx,0xff

xor   dl, dh

sub   dl, dh

and   edx,0xff

mov   bl, byte ptr 6[m1]

mov   cl,   byte ptr 6[m2]

sub   ecx, ebx

mov   bl, byte ptr 7[m1]

xor   cl, ch

mov   dl, byte ptr 7[m2]

sub   cl, ch

sub   edx, ebx

and   ecx, 0xff

xor   dl, dh

sub   dl, dh

and   edx, 0xff

mov   bl, byte ptr 8[m1]

mov   cl, byte ptr 8[m2]

sub   ecx, ebx

mov   bl, byte ptr 9[m1]

xor   cl, ch

mov   dl, byte ptr 9[m2]

sub   cl, ch

sub   edx, ebx

and   ecx, 0xff

xor   dl, dh

sub   dl, dh

and   edx, 0xff

mov   bl, byte ptr 10[m1]

mov   cl, byte ptr 10[m2]

sub   ecx, ebx

mov   bl, byte ptr 11[m1]

xor   cl, ch

mov   dl, byte ptr 11[m2]

sub   cl, ch

sub   edx, ebx

and   ecx, 0xff

xor   dl, dh

sub   dl, dh

and   edx, 0xff

mov   bl, byte ptr 12[m1]

mov   cl, byte ptr 12[m2]

sub   ecx, ebx

mov   bl, byte ptr 13[m1]

xor   cl, ch

mov   dl, byte ptr 13[m2]

sub   cl, ch

sub   edx,ebx

and   ecx, 0xff

xor   dl, dh

sub   dl, dh

and   edx,0xff

mov   bl, byte ptr 14[m1]

mov   cl, byte ptr 14[m2]

sub   ecx, ebx

mov   bl, byte ptr 15[m1]

xor   cl, ch

mov   dl, byte ptr 15[m2]

sub   cl, ch

sub   edx,ebx

and   ecx, 0xff

xor   dl, dh

sub   dl, dh

and   edx, 0xff

pop   ebx

}

### Performance Gains

The motion-estimation technique using SSE2 instructions that is described in this paper compares extremely favorably with the corresponding C implementation, as well as with the optimized scalar version. On an Intel® Pentium® processor implementation, the speedups are as follows:

• SSE2 technology speedup over C implementation: 13X
• SSE2 technology speedup over optimized scalar: 5X

### Absolute Difference Function Code Listing

.586
include ia_emm.inc
ASSUME   ds:FLAT, cs:FLAT, ss:FLAT
_TEXT SEGMENT DWORD PUBLIC USE32 'CODE'
_TEXT ENDS

_TEXT SEGMENT DWORD PUBLIC USE32 'CODE'

; performing saturated subtractions and merging
; results of differences (por).
; unpacking is then performed to 16-bit precision
; to accumulate differences without
; overflow. Accumulation into xmm1, and result is

estim Proc C Public uses m1: DWORD,
m1:DWORD
movdqu   xmm0, XMMWORD PTR [m1]   ; grab bptr row
movdqu   xmm1, XMMWORD PTR [m2]    ; grab cptr row
movdqa   xmm2, xmm2 ; make a copy for abs diff
psubusb   xmm0, xmm1 ; do subtraction one way
psubusb   xmm1, xmm2 ; do subtraction the other way
por   xmm0, xmm1 ; merge results
movdqa   xmm1, xmm0 ; keep a copy
punpcklbw   xmm0, xmm6 ; unpack to higher precision
punpcklbw   xmm1, xmm6 ; unpack to higher precision
movdqa   xmm3, xmm1 ; keep a copy for accumulation
pshufd   xmm1, xmm0, 238   ; shuffle high bits into xmm1
pshufd   xmm3, xmm0, 68   ; shuffle low bits into xmm3
movdqa   xmm4, xmm1       ; keep a copy
pshufd   xmm4, xmm4, 78   ; shuffle bits to add