Motion Estimation with Intel® Streaming SIMD Extensions 4 (Intel® SSE4)

by Kiefer Kuah
April 2007
Intel Software Solutions Group

Abstract

Intel® SSE4 is a new set of Single Instruction Multiple Data (SIMD) instructions that will be introduced in the 45nm Next Generation Intel® Core™2 processor family (Penryn) to improve the performance and energy efficiency of a broad range of applications.

This white paper describes how video encoders can utilize Intel SSE4 instructions to achieve 1.6x to 3.8x performance speedups in integer motion vector search, a frequently used motion estimation function.

Contents

  1. Introduction
  2. Motion Estimation Using MPSADBW and PHMINPOSUW
  3. Results
  4. Conclusion

   A.   SSE2 – Optimized Function for 4x4 Blocks
   B.   Intel® SSE4 – Optimized Function for 4x4 Blocks
   C.   SSE2 – Optimized Function for 8x8 Blocks
   D.   Intel® SSE4 – Optimized Function for 8x8 Blocks
   E.   SSE2 – Optimized Function for 16x16 Blocks
   F.   Intel® SSE4 – Optimized Function for 16x16 Blocks

 

1. Introduction

Intel® Streaming SIMD Extensions 4 (Intel® SSE4) is a new set of Single Instruction Multiple Data (SIMD) instructions designed to improve the performance of various applications, such as video encoders, image processing, and 3D games. Intel SSE4 builds upon the Intel® 64 and IA-32 instruction set, the most popular and broadly used computer architecture for developing 32-bit and 64-bit applications. Intel SSE4 will be introduced in the 45nm Next Generation Intel® Core™2 processor family (Penryn).

This white paper describes how video encoders can benefit from the Intel SSE4 instructions, achieving 1.6x to 3.8x performance speedups in integer motion vector search, a frequently used motion estimation function. Three block sizes, 4x4, 8x8, and 16x16, are used in this paper to represent some of the variations used in motion estimation and to illustrate how the code can be adapted to suit them.

2. Motion Estimation Using MPSADBW and PHMINPOSUW

Motion estimation is one of the main bottlenecks in video encoders. It involves searching reference frames for best matches and often accounts for about 40% of the total CPU cycles consumed by an encoder. The quality of the search is a factor that determines the compression ratio and the quality of the encoded video. This search operation is therefore a frequent target of algorithmic and SIMD optimizations to improve encoding speed. An unoptimized version of the block matching function for the 4x4 block size is shown in Figure 2-1. The example code in this paper performs only the integer motion vector search of the motion estimation stage.

Figure 2-1. Unoptimized Version of an Integer Block Matching Function

int blockMatch4x4(const unsigned char* refFrame, int stepBytesRF /*stride*/,
                  const unsigned char* curBlock, int stepBytesCB /*stride*/,
                  int* matchBlock /*results*/, int frameWidth, int frameHeight)
{
    int lowSum = INT_MAX;
    int temSum = 0;
    int blockHeight = 4;
    int blockWidth = 4;
    const unsigned char *pRef, *pCur;

    for (int i=0; i<=frameHeight-blockHeight; i++)
    {
        for (int j=0; j<=frameWidth-blockWidth; j++)
        {
            temSum = 0;
            pCur = curBlock;
            pRef = refFrame+i*stepBytesRF+j;
            for (int k=0; k<blockHeight; k++)
            {
                for (int l=0; l<blockWidth; l++)
                {
                    temSum += labs(*pRef-*pCur);
                    pCur++;
                    pRef++;
                }
                pCur+=stepBytesCB-blockWidth;
                pRef+=stepBytesRF-blockWidth;
            }
            if (temSum < lowSum)
            {
                lowSum = temSum;
                *matchBlock = j;
                *(matchBlock+1) = i;
            }
        }
    }
    return 0;
}

 

SIMD instructions are well suited to motion estimation because the arithmetic in the search operation is performed on blocks of pixels and is inherently parallel. The SSE2 instruction PSADBW is widely used by developers to optimize this operation. PSADBW computes two sums of absolute differences from two sets of 16 unsigned byte integers: one sum comes from the eight low unsigned bytes and the other from the eight high unsigned bytes[1].
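
To make PSADBW concrete, the following minimal sketch (illustrative only, not code from this paper) computes the SAD of 16 bytes with the _mm_sad_epu8 intrinsic; the two partial sums land in the low 16-bit words of the two 64-bit halves of the result:

#include <emmintrin.h>   // SSE2 intrinsics

// SAD of 16 bytes: PSADBW leaves the SAD of bytes 0-7 in word 0
// and the SAD of bytes 8-15 in word 4 of the 128-bit result.
unsigned int sad16Bytes(const unsigned char* a, const unsigned char* b)
{
    __m128i va  = _mm_loadu_si128((const __m128i*)a);
    __m128i vb  = _mm_loadu_si128((const __m128i*)b);
    __m128i sad = _mm_sad_epu8(va, vb);
    return (unsigned int)_mm_extract_epi16(sad, 0) +
           (unsigned int)_mm_extract_epi16(sad, 4);
}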

Figure 2-2 shows the motion estimation function optimized to use the PSADBW instruction. This function finds the matching blocks for four 4x4 blocks in each call. PSADBW computes each sum of absolute differences over eight unsigned byte integers, but the width of a 4x4 block is only 4 bytes. To use this instruction, two rows are first unpacked to concatenate two 4-byte rows into 8 bytes. In addition, since each load fetches 16 consecutive bytes, data from four consecutive blocks are loaded at once. It therefore makes sense to write this function to find the matching block for each of four blocks in one call.

Figure 2-2. SSE2 – Optimized Integer Block Matching Function, Finding the Matching Block for Four 4x4 Blocks in Each Call

//finds matching blocks for four 4x4 blocks in each call
int blockMatch4x4SSE2(const unsigned char* refFrame, int stepBytesRF,
                      const unsigned char* curBlock, int stepBytesCB,
                      int* matchBlock, int frameWidth, int frameHeight)
{
    unsigned int lowSum[4] = {UINT_MAX, UINT_MAX, UINT_MAX, UINT_MAX};
    unsigned int temSum = 0;
    int blockHeight = 4;
    int blockWidth = 4;
    const unsigned char *pRef, *pCur;
    __m128i s0, s1, s2, s3, s4, s5, s6, s7, s8, s9, s10, s11;

    pCur = curBlock;
    s0 = _mm_loadu_si128((__m128i*)pCur);
    s1 = _mm_loadu_si128((__m128i*)(pCur+stepBytesCB));
    s2 = _mm_loadu_si128((__m128i*)(pCur+2*stepBytesCB));
    s3 = _mm_loadu_si128((__m128i*)(pCur+3*stepBytesCB));

    s8 = _mm_unpacklo_epi32(s0, s1);
    s9 = _mm_unpacklo_epi32(s2, s3);
    s10 = _mm_unpackhi_epi32(s0, s1);
    s11 = _mm_unpackhi_epi32(s2, s3);

    for (int i=0; i<=frameHeight-blockHeight; i++)
    {
        for (int j=0; j<=frameWidth-blockWidth; j++)
        {
            pRef = refFrame+i*stepBytesRF+j;

            s6 = _mm_unpacklo_epi32(
                     _mm_cvtsi32_si128(*(unsigned int*)pRef),
                     _mm_cvtsi32_si128(*(unsigned int*)(pRef+stepBytesRF))
                 );
            s6 = _mm_shuffle_epi32(s6, 0x44);
            s7 = _mm_unpacklo_epi32(
                     _mm_cvtsi32_si128(*(unsigned int*)(pRef+2*stepBytesRF)),
                     _mm_cvtsi32_si128(*(unsigned int*)(pRef+3*stepBytesRF))
                 );
            s7 = _mm_shuffle_epi32(s7, 0x44);

            s0 = _mm_adds_epu16(_mm_sad_epu8(s6, s8), _mm_sad_epu8(s7, s9));
            s1 = _mm_adds_epu16(_mm_sad_epu8(s6, s10), _mm_sad_epu8(s7, s11));

            temSum = _mm_extract_epi16(s0,0);
            if (temSum < lowSum[0])
            {
                lowSum[0] = temSum;
                *matchBlock = j;
                *(matchBlock+1) = i;
            }

            ...

            //Repeat for the remaining 3 sums

            ...
        }
    }
    return 0;
}

 

The Intel SSE4 instruction, MPSADBW, computes eight sums of absolute differences in one instruction. Each sum is computed from the absolute differences of a pair of four unsigned byte integers. Figure 2-3 illustrates how the eight sums are computed using MPSADBW. MPSADBW takes an immediate as a third operand. Bits 0 and 1 of the immediate select one of the four groups of 4 bytes from the source operand. Bit 2 of the immediate selects one of the two groups of 11 bytes from the destination operand. In Figure 2-3, the box with the darkened solid outline indicates the block selected in this illustration; the boxes with the darkened broken outlines indicate the other blocks that could be selected by setting the corresponding bits in the immediate.
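
As a reference for the description above, the following scalar sketch (a hypothetical helper for illustration, not code from this paper) models the MPSADBW semantics, with 'a' playing the role of the destination operand and 'b' the source operand:

#include <stdlib.h>

// Scalar model of MPSADBW: eight overlapping 4-byte windows of 'a' are each
// compared against one fixed 4-byte group of 'b' selected by the immediate.
void mpsadbwModel(const unsigned char a[16], const unsigned char b[16],
                  int imm, unsigned short result[8])
{
    int bOff = (imm & 3) * 4;        // bits 0-1: 4-byte group of the source operand
    int aOff = ((imm >> 2) & 1) * 4; // bit 2: start of the 11-byte group of the destination
    for (int i = 0; i < 8; i++)
    {
        unsigned short sum = 0;
        for (int k = 0; k < 4; k++)
            sum += (unsigned short)abs(a[aOff + i + k] - b[bOff + k]);
        result[i] = sum;
    }
}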

While the ideal block size for this instruction is 4x4, other block sizes, such as 8x4 or 8x8, can benefit from it as well. Bits 0 and 1 of the immediate value can be used to select a different 4-pixel group for the computation, as shown in Figure 2-3. Therefore, to compute sums of absolute differences for block sizes that are multiples of 4x4, we repeat the MPSADBW operation with a different immediate value each time and then add the results of the multiple MPSADBW operations together using PADDUSW to yield the final results; a sketch of this combination follows Figure 2-3.

Figure 2-3. Eight Sums that are Computed using MPSADBW
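
To make the combination described above concrete, here is a hedged sketch, modeled on the 8x8 function in Appendix D, that produces the SADs of one 8-pixel-wide row at eight consecutive search positions. Immediate value 0 compares the first 4 pixels of the current row against reference offsets 0 through 7, and immediate value 5 (binary 101) compares the last 4 pixels against reference offsets 4 through 11; PADDUSW combines them:

#include <smmintrin.h>   // Intel SSE4.1 intrinsics

// Eight 8-pixel-wide row SADs in one 128-bit register.
static __m128i rowSad8(const unsigned char* pRef, const unsigned char* pCur)
{
    __m128i r = _mm_loadu_si128((const __m128i*)pRef);  // 16 reference bytes
    __m128i c = _mm_loadu_si128((const __m128i*)pCur);  // current row (low 8 bytes used)
    return _mm_adds_epu16(_mm_mpsadbw_epu8(r, c, 0),    // left half of the row
                          _mm_mpsadbw_epu8(r, c, 5));   // right half of the row
}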

 

After the sums are computed, the PHMINPOSUW instruction is used to locate the minimum of the eight SADs and its position within the register. A code sample using Intel SSE4 for the motion search is shown in Figure 2-4.
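
Before the full function in Figure 2-4, the following minimal sketch (illustrative only) shows how _mm_minpos_epu16 returns both the smallest of the eight 16-bit SADs and its position, replacing eight scalar comparisons with one instruction:

#include <smmintrin.h>   // Intel SSE4.1 intrinsics

// PHMINPOSUW: word 0 of the result holds the minimum unsigned 16-bit value,
// word 1 holds its index (0..7) within the input register.
static void bestOfEight(__m128i sads, unsigned int* minSad, unsigned int* position)
{
    __m128i m = _mm_minpos_epu16(sads);
    *minSad   = (unsigned int)_mm_extract_epi16(m, 0);
    *position = (unsigned int)_mm_extract_epi16(m, 1);
}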

Figure 2-4. Intel® SSE4 – Optimized Integer Block Matching Function, Finding the Matching Block for Four 4x4 Blocks in Each Call

//finds matching blocks for four 4x4 blocks in each call
int blockMatch4x4SSE4(const unsigned char* refFrame, int stepBytesRF,
                      const unsigned char* curBlock, int stepBytesCB,
                      int* matchBlock, int frameWidth, int frameHeight)
{
    unsigned int lowSum[4] = {UINT_MAX, UINT_MAX, UINT_MAX, UINT_MAX};
    unsigned int temSum = 0;
    int blockHeight = 4;
    int blockWidth = 4;
    int k;
    const unsigned char *pRef, *pCur;
    __m128i s0, s1, s2, s3, s4, s5, s6, s7, s8, s9, s10, s11;

    pCur = curBlock;
    s0 = _mm_loadu_si128((__m128i*)pCur);
    s1 = _mm_loadu_si128((__m128i*)(pCur+stepBytesCB));
    s2 = _mm_loadu_si128((__m128i*)(pCur+2*stepBytesCB));
    s3 = _mm_loadu_si128((__m128i*)(pCur+3*stepBytesCB));

    s8 = _mm_unpacklo_epi32(s0, s1);
    s9 = _mm_unpacklo_epi32(s2, s3);
    s10 = _mm_unpackhi_epi32(s0, s1);
    s11 = _mm_unpackhi_epi32(s2, s3);

    for (int i=0; i<=frameHeight-blockHeight; i++)
    {
        int j = 0;
        for (j=0; j<=frameWidth-16; j+=8)
        {
            pCur = curBlock;
            pRef = refFrame+i*stepBytesRF+j;
            s2 = _mm_setzero_si128();
            s3 = _mm_setzero_si128();
            s4 = _mm_setzero_si128();
            s5 = _mm_setzero_si128();
            for (k=0; k<blockHeight; k++)
            {
                s0 = _mm_loadu_si128((__m128i*)pRef);
                s1 = _mm_loadu_si128((__m128i*)pCur);
                s2 = _mm_adds_epu16(s2, _mm_mpsadbw_epu8(s0, s1, 0));
                s3 = _mm_adds_epu16(s3, _mm_mpsadbw_epu8(s0, s1, 1));
                s4 = _mm_adds_epu16(s4, _mm_mpsadbw_epu8(s0, s1, 2));
                s5 = _mm_adds_epu16(s5, _mm_mpsadbw_epu8(s0, s1, 3));
                pCur+=stepBytesCB;
                pRef+=stepBytesRF;
            }
            s6 = _mm_minpos_epu16(s2);
            temSum = _mm_extract_epi16(s6,0);
            if (temSum < lowSum[0])
            {
                lowSum[0] = temSum;
                k = _mm_extract_epi16(s6,1);
                *matchBlock = j+k;
                *(matchBlock+1) = i;
            }

            ...

            // Repeat for the remaining 3 sums

            ...
        }
        for (; j<=frameWidth-blockWidth; j++)
        {

            ...

            // Use SSE2 code to process the rest, the edge case.

            ...
        }
    }
    return 0;
}

 

3. Results

The speedups obtained from the optimizations described in the previous sections are shown in Table 3-1. The results in the table are expressed as the number of cycles per block SAD computed[2]. The speedup column contains ratios computed using the SSE2 results as the baseline. Three different block sizes, 4x4, 8x8, and 16x16, were tested. At first glance, it might appear that the MPSADBW instruction can only be applied to 4-byte-wide blocks. However, MPSADBW can also be used to compute 8-byte-wide and 16-byte-wide sums of absolute differences, as shown in Appendices D and F.
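
As a quick check of the per-block counts reported in footnote [2]: for a b x b block searched over a W x W region, the number of candidate positions is (W - b + 1)^2, so the 128x128 region used here yields (128 - 4 + 1)^2 = 15,625 SADs per 4x4 block, (128 - 8 + 1)^2 = 14,641 per 8x8 block, and (128 - 16 + 1)^2 = 12,769 per 16x16 block.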

The Intel® Compiler 10.0.018 beta was used to build the code with the 'O2' and 'QxS' compiler flags. 'QxS' is a new flag that directs the compiler to generate code optimized specifically for Penryn. The speedups from the SSE2-optimized functions to the Intel SSE4-optimized functions ranged from 1.6x to 3.8x. In addition to the speedups from Intel SSE4, we also observed further speedups from multithreading (Figure 3-1).
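
For reference, the functions were built with a command line of roughly the following form (illustrative only; the source file name is hypothetical and the exact invocation depends on the compiler installation):

icl /O2 /QxS /c blockMatch.cpp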

Table 3-1. Number of Cycles per Block SAD Computed

Code Sample      Cycles / Block SAD    Speedup

4x4 Block
  C++            54.84
  SSE2           4.32                  1.00
  Intel SSE4     2.71                  1.59

8x8 Block
  C++            180.55
  SSE2           25.29                 1.00
  Intel SSE4     6.73                  3.83

16x16 Block
  C++            173.01
  SSE2           71.42                 1.00
  Intel SSE4     26.86                 2.66

 

 

Figure 3-1. Chart Showing the Speedups Obtained Compared to the SSE2 Non-Threaded Version

 


4. Conclusion

The use of Intel SSE4 instructions was shown to improve the performance of integer motion vector search functions in motion estimation. The performance gain over the baseline SSE2 functions ranged from 1.6x to 3.8x. In some versions of the block matching functions, such as the Intel SSE4 version for the 8x8 block size, the function was written to find the motion vectors for more than one block in each call. This was done to occupy all the data slots in the 128-bit registers and maximize the output of the vector operations. These examples show that the code must be specifically adapted to each block size to get the full benefit of the 128-bit vector operations.

We also demonstrated that the MPSADBW instruction can be used to compute sums of absolute differences not only for 4-byte-wide blocks, but also for 8-byte-wide and 16-byte-wide blocks, and therefore can be used in algorithms that work with different block sizes. In addition to the gains from the Intel SSE4 instructions, multithreaded implementations provided further performance boosts. Intel SSE4 instructions will enable developers to optimize compute-intensive applications such as video codecs for higher performance.

 

 

 

A. SSE2 – Optimized Function for 4x4 Blocks

//finds matching blocks for four 4x4 blocks in each call

int blockMatch4x4SSE2(const unsigned char* refFrame, int stepBytesRF, const unsigned 

char* curBlock, int stepBytesCB, int* matchBlock, int frameWidth, int frameHeight)
{

unsigned int lowSum[4] = {UINT_MAX, UINT_MAX, UINT_MAX, UINT_MAX};

unsigned int temSum = 0;

int blockHeight = 4;

int blockWidth = 4;

const unsigned char *pRef, *pCur;

__m128i s0, s1, s2, s3, s4, s5, s6, s7, s8, s9, s10, s11;


pCur = curBlock;

s0 = _mm_loadu_si128((__m128i*)pCur);

s1 = _mm_loadu_si128((__m128i*)(pCur+stepBytesCB));

s2 = _mm_loadu_si128((__m128i*)(pCur+2*stepBytesCB));

s3 = _mm_loadu_si128((__m128i*)(pCur+3*stepBytesCB));


s8 = _mm_unpacklo_epi32(s0, s1);

s9 = _mm_unpacklo_epi32(s2, s3);

s10 = _mm_unpackhi_epi32(s0, s1);

s11 = _mm_unpackhi_epi32(s2, s3);


for (int i=0; i<=frameHeight-blockHeight; i++)

{

for (int j=0; j<=frameWidth-blockWidth; j++)

{

pRef = refFrame+i*stepBytesRF+j;


s6 = _mm_unpacklo_epi32(

_mm_cvtsi32_si128(*(unsigned int*)pRef), 

_mm_cvtsi32_si128(*(unsigned int*)(pRef+stepBytesRF))

);

s6 = _mm_shuffle_epi32(s6, 0x44);

s7 = _mm_unpacklo_epi32(

_mm_cvtsi32_si128(*(unsigned int*)(pRef+2*stepBytesRF)),

_mm_cvtsi32_si128(*(unsigned int*)(pRef+3*stepBytesRF))

);

s7 = _mm_shuffle_epi32(s7, 0x44);


s0 = _mm_adds_epu16(_mm_sad_epu8(s6, s8), _mm_sad_epu8(s7, s9));

s1 = _mm_adds_epu16(

_mm_sad_epu8(s6, s10), 

_mm_sad_epu8(s7, s11)

);


temSum = _mm_extract_epi16(s0,0);


if (temSum < lowSum[0])

{

lowSum[0] = temSum;

*matchBlock = j;

*(matchBlock+1) = i;

}


temSum = _mm_extract_epi16(s0,4);

if (temSum < lowSum[1])

{

lowSum[1] = temSum;

*(matchBlock+2) = j;

*(matchBlock+3) = i;

}


temSum = _mm_extract_epi16(s1,0);

if (temSum < lowSum[2])

{

lowSum[2] = temSum;

*(matchBlock+4) = j;

*(matchBlock+5) = i;

}


temSum = _mm_extract_epi16(s1,4);

if (temSum < lowSum[3])

{

lowSum[3] = temSum;

*(matchBlock+6) = j;

*(matchBlock+7) = i;

}

}

}

return 0;

}

 


B. Intel® SSE4 – Optimized Function for 4x4 Blocks

//finds matching blocks for four 4x4 blocks in each call 

int blockMatch4x4SSE4(const unsigned char* refFrame, int stepBytesRF, const unsigned 

char* curBlock, int stepBytesCB, int* matchBlock, int frameWidth, int frameHeight) 

{ 

unsigned int lowSum[4] = {UINT_MAX, UINT_MAX, UINT_MAX, UINT_MAX}; 

unsigned int temSum = 0; 

int blockHeight = 4;  

int blockWidth = 4; 

int k; 

const unsigned char *pRef, *pCur; 

__m128i s0, s1, s2, s3, s4, s5, s6, s7, s8, s9, s10, s11;


pCur = curBlock; 

s0 = _mm_loadu_si128((__m128i*)pCur); 

s1 = _mm_loadu_si128((__m128i*)(pCur+stepBytesCB)); 

s2 = _mm_loadu_si128((__m128i*)(pCur+2*stepBytesCB)); 

s3 = _mm_loadu_si128((__m128i*)(pCur+3*stepBytesCB)); 

s8 = _mm_unpacklo_epi32(s0, s1); 

s9 = _mm_unpacklo_epi32(s2, s3); 

s10 = _mm_unpackhi_epi32(s0, s1); 

s11 = _mm_unpackhi_epi32(s2, s3);

for (int i=0; i<=frameHeight-blockHeight; i++) 

{ 

int j = 0; 

for (j=0; j<=frameWidth-16; j+=8) 

{ 

pCur = curBlock; 

pRef = refFrame+i*stepBytesRF+j; 

s2 = _mm_setzero_si128(); 

s3 = _mm_setzero_si128(); 

s4 = _mm_setzero_si128(); 

s5 = _mm_setzero_si128(); 

for (k=0; k<blockHeight; k++) 

{ 

s0 = _mm_loadu_si128((__m128i*)pRef); 

s1 = _mm_loadu_si128((__m128i*)pCur); 

s2 = _mm_adds_epu16(s2, _mm_mpsadbw_epu8(s0, s1, 0)); 

s3 = _mm_adds_epu16(s3, _mm_mpsadbw_epu8(s0, s1, 1)); 

s4 = _mm_adds_epu16(s4, _mm_mpsadbw_epu8(s0, s1, 2)); 

s5 = _mm_adds_epu16(s5, _mm_mpsadbw_epu8(s0, s1, 3)); 

pCur+=stepBytesCB; 

pRef+=stepBytesRF; 

} 

s6 = _mm_minpos_epu16(s2); 

temSum = _mm_extract_epi16(s6,0); 

if (temSum < lowSum[0]) 

{ 

lowSum[0] = temSum; 

k = _mm_extract_epi16(s6,1); 

*matchBlock = j+k; 

*(matchBlock+1) = i; 

} 

s6 = _mm_minpos_epu16(s3); 

temSum = _mm_extract_epi16(s6,0); 

if (temSum < lowSum[1]) 

{ 

lowSum[1] = temSum; 

k = _mm_extract_epi16(s6,1); 

*(matchBlock+2) = j+k; 

*(matchBlock+3) = i; 

} 

s6 = _mm_minpos_epu16(s4); 

temSum = _mm_extract_epi16(s6,0); 

if (temSum < lowSum[2]) 

{ 

lowSum[2] = temSum; 

k = _mm_extract_epi16(s6,1); 

*(matchBlock+4) = j+k; 

*(matchBlock+5) = i; 

} 

s6 = _mm_minpos_epu16(s5); 

temSum = _mm_extract_epi16(s6,0); 

if (temSum < lowSum[3]) 

{ 

lowSum[3] = temSum; 

k = _mm_extract_epi16(s6,1); 

*(matchBlock+6) = j+k; 

*(matchBlock+7) = i; 

} 

} 

for (; j<=frameWidth-blockWidth; j++) 

{ 

pRef = refFrame+i*stepBytesRF+j; 

s6 = _mm_unpacklo_epi32( 

_mm_cvtsi32_si128(*(unsigned int*)pRef), 

_mm_cvtsi32_si128(*(unsigned int*)(pRef+stepBytesRF)) 

); 

s6 = _mm_shuffle_epi32(s6, 0x44); 

s7 = _mm_unpacklo_epi32( 

_mm_cvtsi32_si128(*(unsigned int*)(pRef+2*stepBytesRF)), 

_mm_cvtsi32_si128(*(unsigned int*)(pRef+3*stepBytesRF)) 

); 

s7 = _mm_shuffle_epi32(s7, 0x44); 

s0 = _mm_adds_epu16(_mm_sad_epu8(s6, s8), _mm_sad_epu8(s7, s9)); 

s1 = _mm_adds_epu16( 

_mm_sad_epu8(s6, s10), 

_mm_sad_epu8(s7, s11) 

); 

temSum = _mm_extract_epi16(s0,0); 

if (temSum < lowSum[0]) 

{ 

lowSum[0] = temSum; 

*matchBlock = j; 

*(matchBlock+1) = i; 

} 

temSum = _mm_extract_epi16(s0,4); 

if (temSum < lowSum[1]) 

{ 

lowSum[1] = temSum; 

*(matchBlock+2) = j; 

*(matchBlock+3) = i; 

} 

temSum = _mm_extract_epi16(s1,0); 

if (temSum < lowSum[2]) 

{ 

lowSum[2] = temSum; 

*(matchBlock+4) = j; 

*(matchBlock+5) = i; 

} 

temSum = _mm_extract_epi16(s1,4); 

if (temSum < lowSum[3]) 

{ 

lowSum[3] = temSum; 

*(matchBlock+6) = j; 

*(matchBlock+7) = i; 

} 

} 

} 

return 0; 

}

 


C. SSE2 – Optimized Function for 8x8 Blocks

int blockMatch8x8SSE2(const unsigned char* refFrame, int stepBytesRF, const unsigned 

char* curBlock, int stepBytesCB, int* matchBlock, int frameWidth, int frameHeight)
{

unsigned int lowSum = UINT_MAX;

unsigned int temSum[2] = {0, 0};

int blockHeight = 8;

int blockWidth = 8;

const unsigned char *pRef, *pCur;

__m128i s0, s1, s2, s3, s4, s5, s6, s7, s8;


pCur = curBlock;

s0 = _mm_shuffle_epi32(_mm_loadu_si128((__m128i*)pCur), 0x44);

s1 = _mm_shuffle_epi32(_mm_loadu_si128((__m128i*)(pCur+stepBytesCB)), 0x44);

s2 = _mm_shuffle_epi32(_mm_loadu_si128((__m128i*)(pCur+2*stepBytesCB)), 0x44);

s3 = _mm_shuffle_epi32(_mm_loadu_si128((__m128i*)(pCur+3*stepBytesCB)), 0x44);

s4 = _mm_shuffle_epi32(_mm_loadu_si128((__m128i*)(pCur+4*stepBytesCB)), 0x44);

s5 = _mm_shuffle_epi32(_mm_loadu_si128((__m128i*)(pCur+5*stepBytesCB)), 0x44);

s6 = _mm_shuffle_epi32(_mm_loadu_si128((__m128i*)(pCur+6*stepBytesCB)), 0x44);

s7 = _mm_shuffle_epi32(_mm_loadu_si128((__m128i*)(pCur+7*stepBytesCB)), 0x44);


for (int i=0; i<=frameHeight-blockHeight; i++)

{

int j;

for (j=0; j<frameWidth-16; j+=16)

{

for (int k=0; k<8; k++)

{

pRef = refFrame+i*stepBytesRF+j+k;


s8 = _mm_sad_epu8(s0, _mm_loadu_si128((__m128i*)pRef));

s8 = _mm_adds_epu16(s8, 

_mm_sad_epu8(s1, 

_mm_loadu_si128((__m128i*)(pRef+stepBytesRF))));

s8 = _mm_adds_epu16(s8,

_mm_sad_epu8(s2, 

_mm_loadu_si128((__m128i*)(pRef+2*stepBytesRF))));

s8 = _mm_adds_epu16(s8,

_mm_sad_epu8(s3, 

_mm_loadu_si128((__m128i*)(pRef+3*stepBytesRF))));

s8 = _mm_adds_epu16(s8, 

_mm_sad_epu8(s4,

_mm_loadu_si128((__m128i*)(pRef+4*stepBytesRF))));

s8 = _mm_adds_epu16(s8, 

_mm_sad_epu8(s5,

_mm_loadu_si128((__m128i*)(pRef+5*stepBytesRF))));

s8 = _mm_adds_epu16(s8,

_mm_sad_epu8(s6,

_mm_loadu_si128((__m128i*)(pRef+6*stepBytesRF))));

s8 = _mm_adds_epu16(s8, 

_mm_sad_epu8(s7,

_mm_loadu_si128((__m128i*)(pRef+7*stepBytesRF))));

temSum[0] = _mm_extract_epi16(s8,0);

temSum[1] = _mm_extract_epi16(s8,4);

if (temSum[0] <= temSum[1] && temSum[0] < lowSum)

{

lowSum = temSum[0];

*matchBlock = j+k;

*(matchBlock+1) = i;

}

else if (temSum[1] < lowSum)

{

lowSum = temSum[1];

*matchBlock = j+k+8;

*(matchBlock+1) = i;

}

}

}

for (; j<=frameWidth-blockWidth; j++)

{

pRef = refFrame+i*stepBytesRF+j;


s8 = _mm_sad_epu8(s0, _mm_loadl_epi64((__m128i*)pRef));

s8 = _mm_adds_epu16(s8, 

_mm_sad_epu8(s1, 

_mm_loadl_epi64((__m128i*)(pRef+stepBytesRF))));

s8 = _mm_adds_epu16(s8, 

_mm_sad_epu8(s2, 

_mm_loadl_epi64((__m128i*)(pRef+2*stepBytesRF))));

s8 = _mm_adds_epu16(s8, 

_mm_sad_epu8(s3, 

_mm_loadl_epi64((__m128i*)(pRef+3*stepBytesRF))));

s8 = _mm_adds_epu16(s8,

_mm_sad_epu8(s4, 

_mm_loadl_epi64((__m128i*)(pRef+4*stepBytesRF))));

s8 = _mm_adds_epu16(s8, 

_mm_sad_epu8(s5,

_mm_loadl_epi64((__m128i*)(pRef+5*stepBytesRF))));

s8 = _mm_adds_epu16(s8,

_mm_sad_epu8(s6,

_mm_loadl_epi64((__m128i*)(pRef+6*stepBytesRF))));

s8 = _mm_adds_epu16(s8, 

_mm_sad_epu8(s7, 

_mm_loadl_epi64((__m128i*)(pRef+7*stepBytesRF))));


temSum[0] = _mm_extract_epi16(s8,0);

if (temSum[0] < lowSum)

{

lowSum = temSum[0];

*matchBlock = j;

*(matchBlock+1) = i;

}

}

}

return 0;

}

 


D. Intel® SSE4 – Optimized Function for 8x8 Blocks

//searches two 8x8 blocks in each call

int blockMatch8x8SSE4(const unsigned char* refFrame, int stepBytesRF, const unsigned

char* curBlock, int stepBytesCB, int* matchBlock, int frameWidth, int frameHeight)
{

unsigned int lowSum[2] = {UINT_MAX, UINT_MAX};

unsigned int temSum = 0;

int blockHeight = 8;

int blockWidth = 8;

int k;

const unsigned char *pRef, *pCur;

__m128i s0, s1, s2, s3, s4, s5, s6;


for (int i=0; i<=frameHeight-blockHeight; i++)

{

int j=0;

for (j=0; j<=frameWidth-16; j+=8)

{

pCur = curBlock;

pRef = refFrame+i*stepBytesRF+j;

s2 = _mm_setzero_si128();

s3 = _mm_setzero_si128();

s4 = _mm_setzero_si128();

s5 = _mm_setzero_si128();

for (k=0; k<blockHeight; k++)

{

s0 = _mm_loadu_si128((__m128i*)pRef);

s1 = _mm_loadu_si128((__m128i*)pCur);

s2 = _mm_adds_epu16(s2, _mm_mpsadbw_epu8(s0, s1, 0));

s3 = _mm_adds_epu16(s3, _mm_mpsadbw_epu8(s0, s1, 5));

s4 = _mm_adds_epu16(s4, _mm_mpsadbw_epu8(s0, s1, 2));

s5 = _mm_adds_epu16(s5, _mm_mpsadbw_epu8(s0, s1, 7));

pCur+=stepBytesCB;

pRef+=stepBytesRF;

}

s6 = _mm_minpos_epu16(_mm_adds_epu16(s2, s3));

temSum = _mm_extract_epi16(s6,0);

if (temSum < lowSum[0])

{

lowSum[0] = temSum;

k = _mm_extract_epi16(s6,1);

*matchBlock = j+k;

*(matchBlock+1) = i;

}

s6 = _mm_minpos_epu16(_mm_adds_epu16(s4, s5));

temSum = _mm_extract_epi16(s6,0);

if (temSum < lowSum[1])

{

lowSum[1] = temSum;

k = _mm_extract_epi16(s6,1);

*(matchBlock+2) = j+k;

*(matchBlock+3) = i;

}

}


for (; j<=frameWidth-blockWidth; j++)

{

pCur = curBlock;

pRef = refFrame+i*stepBytesRF+j;


s2 = _mm_setzero_si128();

for (k=0; k<blockHeight; k++)

{

s0 = _mm_loadl_epi64((__m128i*)pRef);

s0 = _mm_shuffle_epi32(s0, 0x44);

s1 = _mm_loadu_si128((__m128i*)pCur);

s2 = _mm_adds_epu16(s2, _mm_sad_epu8(s0, s1));


pCur+=stepBytesCB;

pRef+=stepBytesRF;

}


temSum = _mm_extract_epi16(s2,0);

if (temSum < lowSum[0])

{

lowSum[0] = temSum;

*matchBlock = j;

*(matchBlock+1) = i;

}

temSum = _mm_extract_epi16(s2,4);

if (temSum < lowSum[1])

{

lowSum[1] = temSum;

*(matchBlock+2) = j;

*(matchBlock+3) = i;

}

}

}

return 0;

}

 


E. SSE2 – Optimized Function for 16x16 Blocks

int blockMatch16x16SSE2(const unsigned char* refFrame, int stepBytesRF, const unsigned

char* curBlock, int stepBytesCB, int* matchBlock, int frameWidth, int frameHeight)
{

unsigned int lowSum = UINT_MAX;

unsigned int temSum = 0;

int blockHeight = 16;

int blockWidth = 16;

const unsigned char *pRef, *pCur;

__m128i s0, s1, s2, s3, s4, s5, s6, s7, s8, s9, s10, s11, s12, s13, s14, s15, s16;


pCur = curBlock;

s0 = _mm_loadu_si128((__m128i*)pCur);

s1 = _mm_loadu_si128((__m128i*)(pCur+stepBytesCB));

s2 = _mm_loadu_si128((__m128i*)(pCur+2*stepBytesCB));

s3 = _mm_loadu_si128((__m128i*)(pCur+3*stepBytesCB));

s4 = _mm_loadu_si128((__m128i*)(pCur+4*stepBytesCB));

s5 = _mm_loadu_si128((__m128i*)(pCur+5*stepBytesCB));

s6 = _mm_loadu_si128((__m128i*)(pCur+6*stepBytesCB));

s7 = _mm_loadu_si128((__m128i*)(pCur+7*stepBytesCB));

s8 = _mm_loadu_si128((__m128i*)(pCur+8*stepBytesCB));

s9 = _mm_loadu_si128((__m128i*)(pCur+9*stepBytesCB));

s10 = _mm_loadu_si128((__m128i*)(pCur+10*stepBytesCB));

s11 = _mm_loadu_si128((__m128i*)(pCur+11*stepBytesCB));

s12 = _mm_loadu_si128((__m128i*)(pCur+12*stepBytesCB));

s13 = _mm_loadu_si128((__m128i*)(pCur+13*stepBytesCB));

s14 = _mm_loadu_si128((__m128i*)(pCur+14*stepBytesCB));

s15 = _mm_loadu_si128((__m128i*)(pCur+15*stepBytesCB));


for (int i=0; i<=frameHeight-blockHeight; i++)

{

for (int j=0; j<=frameWidth-blockWidth; j++)

{

pRef = refFrame+i*stepBytesRF+j;


s16 = _mm_sad_epu8(s0, _mm_loadu_si128((__m128i*)pRef));

s16 = _mm_adds_epu16(s16, 

_mm_sad_epu8(s1, 

_mm_loadu_si128((__m128i*)(pRef+stepBytesRF))));

s16 = _mm_adds_epu16(s16, 

_mm_sad_epu8(s2,

_mm_loadu_si128((__m128i*)(pRef+2*stepBytesRF))));

s16 = _mm_adds_epu16(s16, 

_mm_sad_epu8(s3, 

_mm_loadu_si128((__m128i*)(pRef+3*stepBytesRF))));

s16 = _mm_adds_epu16(s16, 

_mm_sad_epu8(s4,

_mm_loadu_si128((__m128i*)(pRef+4*stepBytesRF))));

s16 = _mm_adds_epu16(s16, 

_mm_sad_epu8(s5,

_mm_loadu_si128((__m128i*)(pRef+5*stepBytesRF))));

s16 = _mm_adds_epu16(s16,

_mm_sad_epu8(s6,

_mm_loadu_si128((__m128i*)(pRef+6*stepBytesRF))));

s16 = _mm_adds_epu16(s16, 

_mm_sad_epu8(s7,

_mm_loadu_si128((__m128i*)(pRef+7*stepBytesRF))));

s16 = _mm_adds_epu16(s16,

_mm_sad_epu8(s8,

_mm_loadu_si128((__m128i*)(pRef+8*stepBytesRF))));

s16 = _mm_adds_epu16(s16,

_mm_sad_epu8(s9,


_mm_loadu_si128((__m128i*)(pRef+9*stepBytesRF))));

s16 = _mm_adds_epu16(s16,

_mm_sad_epu8(s10,

_mm_loadu_si128((__m128i*)(pRef+10*stepBytesRF))));

s16 = _mm_adds_epu16(s16,

_mm_sad_epu8(s11,

_mm_loadu_si128((__m128i*)(pRef+11*stepBytesRF))));

s16 = _mm_adds_epu16(s16,

_mm_sad_epu8(s12,

_mm_loadu_si128((__m128i*)(pRef+12*stepBytesRF))));

s16 = _mm_adds_epu16(s16, 

_mm_sad_epu8(s13, 

_mm_loadu_si128((__m128i*)(pRef+13*stepBytesRF))));

s16 = _mm_adds_epu16(s16, 

_mm_sad_epu8(s14,

_mm_loadu_si128((__m128i*)(pRef+14*stepBytesRF))));

s16 = _mm_adds_epu16(s16,

_mm_sad_epu8(s15,

_mm_loadu_si128((__m128i*)(pRef+15*stepBytesRF))));

temSum = _mm_extract_epi16(s16,0) + _mm_extract_epi16(s16,4);

if (temSum < lowSum)

{

lowSum = temSum;

*matchBlock = j;

*(matchBlock+1) = i;

}

}

}

return 0;

}

 


F. Intel® SSE4 – Optimized Function for 16x16 Blocks

int blockMatch16x16SSE4(const unsigned char* refFrame, int stepBytesRF, const unsigned

char* curBlock, int stepBytesCB, int* matchBlock, int frameWidth, int frameHeight)
{

unsigned int lowSum = UINT_MAX;

unsigned int temSum = 0;

int blockHeight = 16;

int blockWidth = 16;

int k;

const unsigned char *pRef, *pCur;

__m128i s0, s1, s2, s3, s4, s5, s6, s7;

for (int i=0; i<=frameHeight-blockHeight; i++)

{

int j=0;

for (j=0; j<=frameWidth-24; j+=8)

{

pCur = curBlock;

pRef = refFrame+i*stepBytesRF+j;

s3 = _mm_setzero_si128();

s4 = _mm_setzero_si128();

s5 = _mm_setzero_si128();

s6 = _mm_setzero_si128();

for (k=0; k<blockHeight; k++)

{

s0 = _mm_loadu_si128((__m128i*)pRef);

s1 = _mm_loadu_si128((__m128i*)(pRef+8));

s2 = _mm_loadu_si128((__m128i*)pCur);

s3 = _mm_adds_epu16(s3, _mm_mpsadbw_epu8(s0, s2, 0));

s4 = _mm_adds_epu16(s4, _mm_mpsadbw_epu8(s0, s2, 5));

s5 = _mm_adds_epu16(s5, _mm_mpsadbw_epu8(s1, s2, 2));

s6 = _mm_adds_epu16(s6, _mm_mpsadbw_epu8(s1, s2, 7));

pCur+=stepBytesCB;

pRef+=stepBytesRF;

}

s7 = _mm_adds_epu16(_mm_adds_epu16(s3, s4), _mm_adds_epu16(s5, s6));

s7 = _mm_minpos_epu16(s7);

temSum = _mm_extract_epi16(s7,0);

if (temSum < lowSum)

{

lowSum = temSum;

k = _mm_extract_epi16(s7,1);

*matchBlock = j+k;

*(matchBlock+1) = i;

}

}


for (; j<=frameWidth-blockWidth; j++)

{

pCur = curBlock;

pRef = refFrame+i*stepBytesRF+j;


s2 = _mm_setzero_si128();

for (k=0; k<blockHeight; k++)

{

s0 = _mm_loadu_si128((__m128i*)pRef);

s1 = _mm_loadu_si128((__m128i*)pCur);

s2 = _mm_adds_epu16(s2, _mm_sad_epu8(s0, s1));


pCur+=stepBytesCB;

pRef+=stepBytesRF;


}


temSum = _mm_extract_epi16(s2,0) + _mm_extract_epi16(s2,4);

if (temSum < lowSum)

{

lowSum = temSum;

*matchBlock = j;

*(matchBlock+1) = i;

}

}

}

return 0;

}

 


[1] For more details on PSADBW, see the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 2B.

[2] The size of the search region was 128x128 pixels. For a search region of this size, the number of block SADs computed was 15,625 for each 4x4 block, 14,641 for each 8x8 block, and 12,769 for each 16x16 block. The speedups use the SSE2 version as the baseline. The code was built using Intel Compiler 10.0.018 Beta within Microsoft* Visual Studio* 2005. The O2 and QxS compiler flags were used in building these functions. The test system consisted of a 45nm Intel® quad-core desktop processor (Yorkfield), a pre-release "Bearlake" motherboard, and 2GB of DDR2 PC2-6400 RAM (400 MHz). The operating system was Windows* XP Professional with Service Pack 2.
