How to Get Faster Video Rendering on the Intel® Pentium® 4 Processor


June 18, 2009 12:00 AM PDT


by Eric L. Palmer

Abstract

This document describes how to make sure that your video rendering code is as fast as it can be on Intel® Pentium® 4 processor-based systems. This method applies to any application that handles data in a YUV 4:2:0 color-space and renders the video pictures to a display device of a different color-space. The key point is that the data needs to be written to the output buffer strictly in a linear order to optimize the performance of the memory sub-system. If this is not done, your code may run slower on a faster Pentium® 4 processor-based system. The coding pitfall to be avoided here is called a write-combining order violation. Also presented is a related data-organization optimization for video codecs that do Motion Compensation or Motion Estimation, which can lead to a significant application-level speedup.

Introduction

If you measure the time a part of your application takes to execute on a 1.7GHz Pentium® 4 processor, and on a 2.4GHz Pentium® 4 processor, you would expect the time on the 2.4GHz system to be less, or at least the same, right? If your application processes planar YUV data, such as YUV 4:2:0 or YV12 data, the part of your application that is responsible for sending the images to the display may have a problem that is so severe that it could actually take longer on a 2.4GHz system than on a 1.7GHz system. The speed of video rendering is limited by the speed of your system's main memory and by the AGP bus speed, so if everything else is the same except the frequency of the CPUs in two systems, the video rendering speed should be the same. This document describes how to identify whether your code has a write-combining order violation that causes a slowdown, and if so, how to eliminate it and get the best possible performance on any Pentium® 4 processor-based system.

Applications that use planar YUV data include MPEG-1/2/4, M-JPEG, and DV codecs, as well as applications that call on these codecs such as DVD-playback software or video-editing software. This data is called planar because it is stored in three different blocks of memory, as shown in the "YUV 4:2:0 Memory Layout" figure. The YUV 4:2:0 data format has one U sample and one V sample for every four Y samples, thus the U and V planes are each ¼ the size of the Y plane. The U and V samples are logically in the center of four Y samples, as shown in the "YUV 4:2:0 Sample Locations" figure.
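The plane sizes described above can be worked out with a few lines of C. This is only a sketch under stated assumptions: the `Yuv420Layout` type and `yuv420_layout` function are hypothetical names, the frame is assumed to be tightly packed (pitch equal to width) with even dimensions, and the planes are assumed to follow each other in Y, U, V order (YV12 stores V before U).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper type: sizes and offsets, in bytes, of the three
   planes of a tightly packed YUV 4:2:0 frame stored in Y, U, V order. */
typedef struct {
    size_t y_size;    /* width * height */
    size_t c_size;    /* (width / 2) * (height / 2), for U and again for V */
    size_t u_offset;  /* start of the U plane */
    size_t v_offset;  /* start of the V plane */
    size_t total;     /* whole frame */
} Yuv420Layout;

static Yuv420Layout yuv420_layout(size_t width, size_t height)
{
    Yuv420Layout l;

    /* 4:2:0 subsamples chroma by two in both directions. */
    assert(width % 2 == 0 && height % 2 == 0);

    l.y_size   = width * height;
    l.c_size   = (width / 2) * (height / 2);
    l.u_offset = l.y_size;
    l.v_offset = l.y_size + l.c_size;
    l.total    = l.y_size + 2 * l.c_size;
    return l;
}
```

For a 720x480 frame this gives a 345,600-byte Y plane and two 86,400-byte chroma planes, each one quarter the size of the Y plane, for 518,400 bytes in total.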

Not all PC-based video display hardware directly supports the display of YUV 4:2:0 video, and it is common to convert it into the YUV 4:2:2 format or into a regular RGB format. Video codecs that employ motion compensation may want to convert to another output format for performance reasons, as described in the Further Optimization section below. Another reason for converting formats is to make software-based post-processing, such as deinterlacing, easier and more efficient.

Before trying to identify a write-combining order violation, you may want to understand more about what write-combining memory is and how it differs from write-back memory. There are three main types of memory in a PC: Write-back (WB), Uncacheable (UC), and Write-combining (WC or USWC). Write-back memory is the type of memory normally used by most programs. Data written to WB memory is first written to the processor's cache, and is written to memory when it is evicted from the cache. On the Pentium® 4 processor, 64 bytes of data are written from the cache at a time. Data written to UC memory is written directly to memory and is never allowed to reside in the processor's cache. This data is written to memory 4 bytes at a time, and these memory writes are much less efficient than the 64-byte writes that occur with WB memory.

The WC memory type was developed in conjunction with AGP to allow applications to write data to the UC memory on AGP devices (video cards), but with performance more like that of WB memory. To do this, the processor has a small number of 64-byte write-combining buffers that are the initial recipients of all WC writes (see the "Understanding Write Combining" figure). When each buffer is filled, it writes the entire 64 bytes of data to memory. To achieve maximum WC performance, the data must be written to contiguous linear addresses. If, for example, only 4 bytes out of every 64-byte contiguous memory region are written to each WC buffer, then the contents of the buffers will be evicted with 4-byte memory writes. Any time a WC buffer is evicted when it is not full (called a WC Partial Write), its contents are written to memory with very inefficient 4-byte writes. If this occurs in your code, you will see a 4x or more slowdown, so it is well worth some effort to avoid WC Partial Writes!
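To put rough numbers on the difference, the sketch below counts how many memory writes it takes to move one row of output when every write carries a full 64-byte WC buffer, compared with 4-byte partial writes. The 64-byte and 4-byte figures come from the description above; the function and constant names are illustrative, and the count is of transactions only, not of elapsed time.

```c
#include <assert.h>
#include <stddef.h>

enum {
    WC_BUFFER_BYTES     = 64, /* full write-combining buffer eviction */
    PARTIAL_WRITE_BYTES = 4   /* size of each write in a WC Partial Write */
};

/* Number of memory writes needed to move `bytes` bytes when each write
   carries at most `chunk` bytes (rounded up). */
static size_t writes_needed(size_t bytes, size_t chunk)
{
    assert(chunk > 0);
    return (bytes + chunk - 1) / chunk;
}
```

For a 720-pixel YUY2 row (1,440 bytes), `writes_needed(1440, WC_BUFFER_BYTES)` is 23, while `writes_needed(1440, PARTIAL_WRITE_BYTES)` is 360.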

Finding Write-Combining Order Violations in Video-Rendering Code

Write-combining order violations occur when you have a piece of code that writes data to WC memory in an order that causes a large number of those writes to be WC Partial Writes. As mentioned above, WC Partial Writes are very inefficient — they cripple the performance of the AGP memory interface, and can cause your program to crawl where it should be flying. First, find the part of your code that writes to WC memory. In a DirectShow filter that is connected to the input of the overlay mixer, this will be where data is written to the memory associated with the filter's output pin. In a DirectX application, find the function that writes to the memory referenced by a pSurface->Lock() call. Examine such code as follows:

  • Verify that the WC memory is never read from. Reads from WC memory are even slower than WC Partial Writes, and your application should be designed such that they never occur. If you need to read the image data to modify it or create a new image, keep a copy of the data that will need to be read in regular WB memory, and then copy it to the WC memory.
  • Verify that the function(s) writing to the WC memory write to addresses that are contiguous and increasing. This means that only one row of the image should be written at a time. If your algorithm needs to process multiple output rows at a time, make sure that it uses temporary buffers in WB memory, and then copies each row to the WC destination.

    Note: You can use the WC Partial Write counter in VTune™ to verify that you have reduced the number of WC Partial Writes by removing WC order violations.
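When reviewing a function by hand is impractical, one option is to log the offset and length of every write to the WC surface in a debug build and check the log afterwards. The sketch below is a hypothetical helper, not part of any tool named in this article: it counts the writes that start before the end of the previous write, that is, the writes that move backwards in memory. Forward gaps, such as the padding at the end of a row, are allowed.

```c
#include <assert.h>
#include <stddef.h>

/* One logged write: byte offset from the start of the WC surface, and
   the number of bytes written. (Hypothetical debug-only record.) */
typedef struct {
    size_t offset;
    size_t length;
} WriteRecord;

/* Returns how many writes in the trace begin before the end of the
   write that preceded them. A trace written in strictly increasing
   order returns 0. */
static size_t count_order_violations(const WriteRecord *trace, size_t count)
{
    size_t violations = 0;
    size_t prev_end = 0;
    size_t i;

    assert(trace != NULL || count == 0);

    for (i = 0; i < count; ++i) {
        if (trace[i].offset < prev_end)
            ++violations;
        prev_end = trace[i].offset + trace[i].length;
    }
    return violations;
}
```

A trace that alternates between two rows, for example offsets 0, pitch, 4, pitch + 4, reports a violation every time it steps back to the first row, whereas a row-by-row trace reports none.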

 

Case Study — YUV 4:2:0 to YUV 4:2:2 (YUY2) Output Format Conversion

Now we will look specifically at the case of WC order violations in the rendering part of an MPEG-2 video decoder. The decoder stores video internally in the planar YUV 4:2:0 format. The data is rendered to the video overlay in the YUV 4:2:2 format corresponding to the YUY2 fourCC code. The decoder could render the data in the YV12 format corresponding to YUV 4:2:0, but YV12 is not supported on all systems, and since each row of the YUY2 format is independent, it is easier to perform deinterlacing while converting to YUY2 (deinterlacing implementation not shown here). The figure below depicts the conversion. Look at the original C code for the conversion function, YV12toYUY2.

YUV 4:2:0 Sample Locations -> YUV 4:2:2 Sample Locations

void YV12toYUY2(BYTE *curY, BYTE *curU, BYTE *curV,
                BYTE *pDst, int XSize, int YSize,
                int srcpitch, int dstpitch /* bytes wide for YUY2 surface */) {
    int row, col;
    int dstpadbytes = dstpitch - 2*XSize;
    int srcpadbytes = srcpitch - XSize;

    for (row=0; row < (int)YSize; row += 2) {
        // Original, partial writes
        for (col=0; col < (int)XSize; col += 2) {
            // first row, YUYV
            *pDst = *curY;
            *(pDst+1) = *curU;
            *(pDst+2) = *(curY+1);
            *(pDst+3) = *curV;

            // second row, YUYV
            // WC ORDER VIOLATION!
            *(pDst+dstpitch) = *(curY+srcpitch);
            *(pDst+dstpitch + 1) = *curU;
            *(pDst+dstpitch + 2) = *(curY+srcpitch+1);
            *(pDst+dstpitch + 3) = *curV;

            pDst += 4;
            curY += 2;
            curU++;
            curV++;
        }
        // output at end of first row,
        // jump to start of third row
        pDst += dstpadbytes + dstpitch;
        curY += srcpadbytes + srcpitch;
        curU += srcpadbytes >> 1;
        curV += srcpadbytes >> 1;
    }
}

 

Notice the WC order violation in the code above. Four bytes are written to the first row and then four bytes to the second row. Because the AGP memory cannot keep up with the rate at which faster Pentium® 4 processor-based systems can write the data, this causes early write-combining buffer evictions and the dreaded WC Partial Writes. The code below shows how the order violations are removed by using a temporary buffer in WB memory.

#include <string.h>  /* memcpy */

void YV12toYUY2_tmp(BYTE *curY, BYTE *curU, BYTE *curV,
                    BYTE *pDst, BYTE *pTmp, int XSize, int YSize,
                    int srcpitch, int dstpitch /* bytes wide for YUY2 surface */) {
    int row, col;
    int dstpadbytes = dstpitch - 2*XSize;
    int srcpadbytes = srcpitch - XSize;
    BYTE *plbuf;

    for (row=0; row < (int)YSize; row += 2) {
        // pTmp is a row buffer of at least 2*XSize bytes in WB memory
        plbuf = pTmp;
        for (col=0; col < (int)XSize; col += 2) {
            // first row, YUYV, written directly to WC memory
            *pDst = *curY;
            *(pDst+1) = *curU;
            *(pDst+2) = *(curY+1);
            *(pDst+3) = *curV;

            // second row, YUYV, staged in the WB buffer
            // NO WC order violation
            *plbuf = *(curY+srcpitch);
            *(plbuf+1) = *curU;
            *(plbuf+2) = *(curY+srcpitch+1);
            *(plbuf+3) = *curV;

            pDst += 4;
            plbuf += 4;
            curY += 2;
            curU++;
            curV++;
        }
        // output at end of first row: skip the padding, then write the
        // whole second row to WC memory in one linear pass
        pDst += dstpadbytes;
        memcpy(pDst, pTmp, XSize*2);
        // jump to start of third row
        pDst += dstpitch;
        curY += srcpadbytes + srcpitch;
        curU += srcpadbytes >> 1;
        curV += srcpadbytes >> 1;
    }
}

See Appendix 1 for an SSE-2 optimized version of the above function.
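The temporary row buffer needs some care when it is shared with the SSE-2 versions in the appendices. Those loops write 64 bytes per iteration and use aligned 16-byte loads and stores, which is presumably what the "watch for buffer size issues" comments refer to. The sketch below shows one way to size and check such a buffer; `tmp_row_bytes` and `is_aligned_16` are hypothetical helper names, and the rounding rule is an inference from the appendix loops rather than a documented requirement.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes to reserve for one YUY2 row of `width` pixels (2 bytes per
   pixel), rounded up to a multiple of 64 so that a loop writing 64
   bytes per iteration never runs past the end of the buffer. */
static size_t tmp_row_bytes(size_t width)
{
    size_t bytes = 2 * width;
    return (bytes + 63) & ~(size_t)63;
}

/* Returns 1 if `p` sits on a 16-byte boundary, as aligned 128-bit
   loads and stores require, and 0 otherwise. */
static int is_aligned_16(const void *p)
{
    return ((uintptr_t)p % 16) == 0;
}
```

For a 720-pixel row the result is 1,472 bytes rather than 1,440. The same rounding would apply to the destination pitch, since the final loop iteration writes a full 64 bytes to the surface as well. The buffer itself can come from any allocator that guarantees 16-byte alignment, such as `aligned_alloc` in C11 or `_aligned_malloc` on Windows.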

 

Further Optimization

Though the internal format of the MPEG decoder's data is YUV 4:2:0, it is not the best format to use when optimizing for performance on the Pentium® 4 processor with SSE-2. Motion Compensation (MC) takes a significant portion of the time of an MPEG-2 decoder, and can be optimized using the 16-byte integer SIMD instructions in SSE-2. The Y data is processed in 16x16 blocks, matching the 16-byte instructions perfectly. The U and V data, however, only contain 8 bytes in each row of the 8x8 U and V blocks. In order to process the U and V data just as efficiently as the Y data, convert the internal data format to the "Combined-UV" format. This means that instead of a plane of U data and a plane of V data, there is one plane of UV data. To do this, store the data with the U and V bytes interleaved (U V U V U V…). Then in MC, process 16 bytes of UV data per row of 16x8 UV blocks. Since the internal data is now in a format that no video hardware recognizes, it needs to be converted. The conversion is very similar to YV12toYUY2, however, and an SSE-2 optimized YCUVtoYUY2 function is included in Appendix 2. In the table in the Performance Summary section, note that the Combined-UV conversion is actually faster than the regular YV12toYUY2.
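The byte order of the Combined-UV plane can be illustrated with a plain C routine that interleaves separate U and V planes. This is a sketch only: `interleave_uv` and its parameters are hypothetical, and a decoder built around this format would normally produce the interleaved layout directly rather than converting each frame.

```c
#include <assert.h>
#include <stddef.h>

/* Interleaves separate U and V planes into one Combined-UV plane laid
   out as U0 V0 U1 V1 ... on every row.
   chroma_width and chroma_height are in samples; src_pitch is the row
   stride of the U and V planes, dst_pitch that of the UV plane. */
static void interleave_uv(const unsigned char *u, const unsigned char *v,
                          unsigned char *uv,
                          size_t chroma_width, size_t chroma_height,
                          size_t src_pitch, size_t dst_pitch)
{
    size_t row, col;

    assert(src_pitch >= chroma_width);
    assert(dst_pitch >= 2 * chroma_width);

    for (row = 0; row < chroma_height; ++row) {
        const unsigned char *urow = u + row * src_pitch;
        const unsigned char *vrow = v + row * src_pitch;
        unsigned char *dst = uv + row * dst_pitch;

        for (col = 0; col < chroma_width; ++col) {
            dst[2 * col]     = urow[col];
            dst[2 * col + 1] = vrow[col];
        }
    }
}
```

Each row of the resulting plane is as wide in bytes as a row of the Y plane, so a 16-byte load picks up 8 U and 8 V samples together.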

This Combined-UV optimization is a good way to get a small speedup on a codec that has already been optimized using SSE-2 instructions. In the test case below, the speedup is 4%, which may seem small, but it is significant because the original code is already well optimized such that it is very difficult to get any additional speedup. The Combined-UV method helps by allowing for the use of 16-byte SIMD integer instructions, and by providing a more efficient pattern of memory accesses.

MPEG-2 Decoder mode                  Overall FPS   Speedup
YUV 4:2:0, 3 planes                  121.4
YUV 4:2:0, Combined UV (2 planes)    125.7         1.04

 

Performance Summary

The table below shows the performance of the versions of the YV12toYUY2 function discussed above, measured on a 2.4GHz Pentium® 4 processor. The SSE-2 optimized version runs over 40% faster with the WC order violation removed. Note that the C version with the order violation removed is slower because it calls the memcpy function, which is very inefficient.

YV12toYUY2                                          Frames per second
Original C version                                  625
C version, removed WC order violation               389
SSE-2 version                                       564
SSE-2 version, removed WC order violation           808
SSE-2 version, Combined-UV, no WC order violation   841
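One way the C version might avoid the memcpy call is to copy the staged row with the same streaming-store pattern that the copy loop in Appendix 1 uses. The sketch below is an assumption-laden illustration, not a measured result: `copy_row_streaming` is a hypothetical name, both pointers are assumed to be 16-byte aligned, the byte count is assumed to be a multiple of 16, and the code falls back to memcpy when SSE-2 intrinsics are not available at compile time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define HAVE_SSE2_STREAM 1
#endif

/* Copies one row from a WB staging buffer to the destination surface in
   strictly increasing address order, 16 bytes per store. */
static void copy_row_streaming(unsigned char *dst, const unsigned char *src,
                               size_t bytes)
{
    assert(bytes % 16 == 0);
    assert((uintptr_t)dst % 16 == 0 && (uintptr_t)src % 16 == 0);

#ifdef HAVE_SSE2_STREAM
    {
        size_t i;
        for (i = 0; i < bytes; i += 16) {
            __m128i v = _mm_load_si128((const __m128i *)(src + i));
            _mm_stream_si128((__m128i *)(dst + i), v);
        }
        /* Order the streaming stores before any later stores. */
        _mm_sfence();
    }
#else
    memcpy(dst, src, bytes);
#endif
}
```

The appendix functions issue a single `_mm_sfence()` at the end of the frame rather than one per row; the per-row fence here only keeps the helper safe to call on its own.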

 

The figure below shows how the SSE-2 version of YV12toYUY2 performs with and without the WC order violation. Notice that with the WC order violation (top line), it is actually slower at 2.0GHz than at 1.7GHz. The bottom line shows that the fully optimized version is limited by memory, and gets little to no speedup as frequency increases.

YV12toYUY2 Runtime - With and Without WC Order Violation


Appendix 1

SSE-2 Optimized YV12 to YUY2 Conversion Function

void YV12toYUY2_SSE2_tmp(BYTE *curY, BYTE *curU, BYTE *curV,

BYTE *pDst, BYTE *pTmp, int XSize, int YSize,

int srcPitch, int dstPitch /* bytes wide for YUY2 surface */) {

int row, col;

int XSize_2 = XSize >> 1;

int srcPitch_2 = srcPitch >> 1;


__m128i vzero;

__m128i vtmp0, vtmp1, vtmp2, vtmp3, vtmp4, vtmp5, vtmp6;


vzero = _mm_setzero_si128();


for (row=0; row < YSize; row += 2) {

// watch for buffer size issues

for (col=0; col < XSize_2; col += 16) {

// Load 16 Y's, row 0

vtmp0 = _mm_load_si128((__m128i*)(curY+2*col));

vtmp1 = _mm_loadl_epi64((__m128i*)(curU+col));

// Load 8 U's

vtmp2 = _mm_loadl_epi64((__m128i*)(curV+col));

// Load 8 V's

vtmp6 = _mm_load_si128((__m128i*)(curY+2*col+srcPitch));

// Load 16 Y's, row 1


vtmp3 = vtmp0;

vtmp0 = _mm_unpacklo_epi8(vtmp0, vzero);

// __Y7__Y6__Y5__Y4__Y3__Y2__Y1__Y0

vtmp1 = _mm_unpacklo_epi8(vtmp1, vzero);

// __U7__U6__U5__U4__U3__U2__U1__U0

vtmp2 = _mm_unpacklo_epi8(vtmp2, vzero);

// __V7__V6__V5__V4__V3__V2__V1__V0

vtmp3 = _mm_unpackhi_epi8(vtmp3, vzero);

// __YF__YE__YD__YC__YB__YA__Y9__Y8


vtmp4 = vtmp1;

vtmp1 = _mm_unpacklo_epi16(vtmp1, vzero);

// ______U3______U2______U1______U0

vtmp5 = vtmp2;

vtmp2 = _mm_unpacklo_epi16(vtmp2, vzero);

// ______V3______V2______V1______V0


vtmp4 = _mm_unpackhi_epi16(vtmp4, vzero);

// ______U7______U6______U5______U4

vtmp5 = _mm_unpackhi_epi16(vtmp5, vzero);

// ______V7______V6______V5______V4


vtmp1 = _mm_slli_epi32(vtmp1, 8);

// ____U3______U2______U1______U0__

vtmp2 = _mm_slli_epi32(vtmp2, 24);

// V3______V2______V1______V0______

vtmp4 = _mm_slli_epi32(vtmp4, 8);

// ____U7______U6______U5______U4__

vtmp5 = _mm_slli_epi32(vtmp5, 24);

// V7______V6______V5______V4______


// All 8 xmm regs used

vtmp0 = _mm_or_si128(vtmp0, vtmp1);

// __Y7U3Y6__Y5U2Y4__Y3U1Y2__Y1U0Y0

vtmp3 = _mm_or_si128(vtmp3, vtmp4);

// __YFU7YE__YDU6YC__YBU5YA__Y9U4Y8

vtmp0 = _mm_or_si128(vtmp0, vtmp2);

// V3Y7U3Y6V2Y5U2Y4V1Y3U1Y2V0Y1U0Y0

vtmp3 = _mm_or_si128(vtmp3, vtmp5);

// V7YFU7YEV6YDU6YCV5YBU5YAV4Y9U4Y8


_mm_stream_si128((__m128i*)(pDst+4*col), vtmp0);

// store first 8 pixels of row 0

vtmp0 = vtmp6;

_mm_stream_si128((__m128i*)(pDst+4*col+16), vtmp3);

// store second 8 pixels of row 0


vtmp6 = _mm_unpacklo_epi8(vtmp6, vzero);

// __Y7__Y6__Y5__Y4__Y3__Y2__Y1__Y0, row 1

vtmp0 = _mm_unpackhi_epi8(vtmp0, vzero);

// __YF__YE__YD__YC__YB__YA__Y9__Y8, row 1


vtmp6 = _mm_or_si128(vtmp6, vtmp1);

// __Y7U3Y6__Y5U2Y4__Y3U1Y2__Y1U0Y0

vtmp0 = _mm_or_si128(vtmp0, vtmp4);

// __YFU7YE__YDU6YC__YBU5YA__Y9U4Y8

vtmp6 = _mm_or_si128(vtmp6, vtmp2);

// V3Y7U3Y6V2Y5U2Y4V1Y3U1Y2V0Y1U0Y0

vtmp0 = _mm_or_si128(vtmp0, vtmp5);

// V7YFU7YEV6YDU6YCV5YBU5YAV4Y9U4Y8


// store first 8 pixels of row 1

_mm_store_si128((__m128i*)(pTmp+4*col), vtmp6);

// store second 8 pixels of row 1

_mm_store_si128((__m128i*)(pTmp+4*col+16), vtmp0);


// ------------ Second set ---------------

vtmp0 = _mm_load_si128((__m128i*)(curY+2*col+16));

// Load 16 Y's, row 0

vtmp1 = _mm_loadl_epi64((__m128i*)(curU+col+8));

// Load 8 U's

vtmp2 = _mm_loadl_epi64((__m128i*)(curV+col+8));

// Load 8 V's

vtmp6 = _mm_load_si128((__m128i*)(curY+2*col+srcPitch+16));

// Load 16 Y's, row 1


vtmp3 = vtmp0;

vtmp0 = _mm_unpacklo_epi8(vtmp0, vzero);

// __Y7__Y6__Y5__Y4__Y3__Y2__Y1__Y0

vtmp1 = _mm_unpacklo_epi8(vtmp1, vzero);

// __U7__U6__U5__U4__U3__U2__U1__U0

vtmp2 = _mm_unpacklo_epi8(vtmp2, vzero);

// __V7__V6__V5__V4__V3__V2__V1__V0

vtmp3 = _mm_unpackhi_epi8(vtmp3, vzero);

// __YF__YE__YD__YC__YB__YA__Y9__Y8


vtmp4 = vtmp1;

vtmp1 = _mm_unpacklo_epi16(vtmp1, vzero);

// ______U3______U2______U1______U0

vtmp5 = vtmp2;

vtmp2 = _mm_unpacklo_epi16(vtmp2, vzero);

// ______V3______V2______V1______V0


vtmp4 = _mm_unpackhi_epi16(vtmp4, vzero);

// ______U7______U6______U5______U4

vtmp5 = _mm_unpackhi_epi16(vtmp5, vzero);

// ______V7______V6______V5______V4


vtmp1 = _mm_slli_epi32(vtmp1, 8);

// ____U3______U2______U1______U0__

vtmp2 = _mm_slli_epi32(vtmp2, 24);

// V3______V2______V1______V0______

vtmp4 = _mm_slli_epi32(vtmp4, 8);

// ____U7______U6______U5______U4__

vtmp5 = _mm_slli_epi32(vtmp5, 24);

// V7______V6______V5______V4______


// All 8 xmm regs used

vtmp0 = _mm_or_si128(vtmp0, vtmp1);

// __Y7U3Y6__Y5U2Y4__Y3U1Y2__Y1U0Y0

vtmp3 = _mm_or_si128(vtmp3, vtmp4);

// __YFU7YE__YDU6YC__YBU5YA__Y9U4Y8

vtmp0 = _mm_or_si128(vtmp0, vtmp2);

// V3Y7U3Y6V2Y5U2Y4V1Y3U1Y2V0Y1U0Y0

vtmp3 = _mm_or_si128(vtmp3, vtmp5);

// V7YFU7YEV6YDU6YCV5YBU5YAV4Y9U4Y8


_mm_stream_si128((__m128i*)(pDst+4*col+32), vtmp0);

// store first 8 pixels of row 0

vtmp0 = vtmp6;

_mm_stream_si128((__m128i*)(pDst+4*col+48), vtmp3);

// store second 8 pixels of row 0


vtmp6 = _mm_unpacklo_epi8(vtmp6, vzero);

// __Y7__Y6__Y5__Y4__Y3__Y2__Y1__Y0, row 1

vtmp0 = _mm_unpackhi_epi8(vtmp0, vzero);

// __YF__YE__YD__YC__YB__YA__Y9__Y8, row 1


vtmp6 = _mm_or_si128(vtmp6, vtmp1);

// __Y7U3Y6__Y5U2Y4__Y3U1Y2__Y1U0Y0

vtmp0 = _mm_or_si128(vtmp0, vtmp4);

// __YFU7YE__YDU6YC__YBU5YA__Y9U4Y8

vtmp6 = _mm_or_si128(vtmp6, vtmp2);

// V3Y7U3Y6V2Y5U2Y4V1Y3U1Y2V0Y1U0Y0

vtmp0 = _mm_or_si128(vtmp0, vtmp5);

// V7YFU7YEV6YDU6YCV5YBU5YAV4Y9U4Y8


_mm_store_si128((__m128i*)(pTmp+4*col+32), vtmp6);

// store first 8 pixels of row 1

_mm_store_si128((__m128i*)(pTmp+4*col+48), vtmp0);

// store second 8 pixels of row 1

}

pDst += dstPitch;

// copy temp buffer

// watch for buffer size issues

for (col=0; col < XSize_2; col += 16) {

vtmp0 = _mm_load_si128((__m128i*)(pTmp+4*col));

vtmp1 = _mm_load_si128((__m128i*)(pTmp+4*col+16));

vtmp2 = _mm_load_si128((__m128i*)(pTmp+4*col+32));

vtmp3 = _mm_load_si128((__m128i*)(pTmp+4*col+48));


_mm_stream_si128((__m128i*)(pDst+4*col), vtmp0);

_mm_stream_si128((__m128i*)(pDst+4*col+16), vtmp1);

_mm_stream_si128((__m128i*)(pDst+4*col+32), vtmp2);

_mm_stream_si128((__m128i*)(pDst+4*col+48), vtmp3);

}

curY += 2*srcPitch;

curU += srcPitch_2;

curV += srcPitch_2;

pDst += dstPitch;

}

_mm_sfence();

}

 


Appendix 2

SSE-2 Optimized Combined-UV to YUY2 Conversion Function

void YCUVtoYUY2_SSE2_64B_tmp(BYTE *curY, BYTE *curUV, BYTE *pDst,

BYTE *pTmp, int XSize, int YSize,

int srcPitch, int dstPitch /* bytes wide for YUY2 surface */) {

int row, col;

int XSize_2 = XSize >> 1;

int srcPitch_2 = srcPitch >> 1;


__m128i vzero;

__m128i vtmp0, vtmp1, vtmp2, vtmp3, vtmp4, vtmp6;


vzero = _mm_setzero_si128();


for (row=0; row < YSize; row += 2) {

// watch for buffer size issues

for (col=0; col < XSize_2; col += 16) {

if ((col & 63) == 0) {

_mm_prefetch((const char *)(curY+2*col+8*128), _MM_HINT_NTA);

_mm_prefetch((const char *)(curUV+2*col+8*128), _MM_HINT_NTA);

_mm_prefetch((const char *)(curY + 2*col + srcPitch + 8*128), _MM_HINT_NTA);

}

vtmp0 = _mm_load_si128((__m128i*)(curY+2*col));

// Load 16 Y's, row 0

vtmp1 = _mm_load_si128((__m128i*)(curUV+2*col));

// Load 8 UV's

vtmp6 = _mm_load_si128((__m128i*)(curY+2*col+srcPitch));

// Load 16 Y's, row 1


vtmp2 = vtmp1;

// V7U7V6U6V5U5V4U4V3U3V2U2V1U1V0U0

vtmp3 = vtmp0;

vtmp4 = vtmp6;

vtmp0 = _mm_unpacklo_epi8(vtmp0, vzero);

// __Y7__Y6__Y5__Y4__Y3__Y2__Y1__Y0

vtmp1 = _mm_unpacklo_epi8(vtmp1, vzero);

// __V3__U3__V2__U2__V1__U1__V0__U0

vtmp2 = _mm_unpackhi_epi8(vtmp2, vzero);

// __V7__U7__V6__U6__V5__U5__V4__U4

vtmp3 = _mm_unpackhi_epi8(vtmp3, vzero);

// __YF__YE__YD__YC__YB__YA__Y9__Y8


vtmp1 = _mm_slli_epi32(vtmp1, 8);

// V3__U3__V2__U2__V1__U1__V0__U0__

vtmp2 = _mm_slli_epi32(vtmp2, 8);

// V7__U7__V6__U6__V5__U5__V4__U4__

vtmp6 = _mm_unpacklo_epi8(vtmp6, vzero);

// __Y7__Y6__Y5__Y4__Y3__Y2__Y1__Y0, row 1

vtmp4 = _mm_unpackhi_epi8(vtmp4, vzero);

// __YF__YE__YD__YC__YB__YA__Y9__Y8, row 1


// All 8 xmm regs used

vtmp0 = _mm_or_si128(vtmp0, vtmp1);

// V3Y7U3Y6V2Y5U2Y4V1Y3U1Y2V0Y1U0Y0

vtmp3 = _mm_or_si128(vtmp3, vtmp2);

// V7YFU7YEV6YDU6YCV5YBU5YAV4Y9U4Y8

vtmp6 = _mm_or_si128(vtmp6, vtmp1);

// V3Y7U3Y6V2Y5U2Y4V1Y3U1Y2V0Y1U0Y0

vtmp4 = _mm_or_si128(vtmp4, vtmp2);

// V7YFU7YEV6YDU6YCV5YBU5YAV4Y9U4Y8


_mm_stream_si128((__m128i*)(pDst+4*col), vtmp0);

// store first 8 pixels of row 0

_mm_stream_si128((__m128i*)(pDst+4*col+16), vtmp3);

// store second 8 pixels of row 0

_mm_store_si128((__m128i*)(pTmp+4*col), vtmp6);

// store first 8 pixels of row 1

_mm_store_si128((__m128i*)(pTmp+4*col+16), vtmp4);

// store second 8 pixels of row 1


// ------------ Second set ---------------

vtmp0 = _mm_load_si128((__m128i*)(curY+2*col+16));

// Load 16 Y's, row 0

vtmp1 = _mm_load_si128((__m128i*)(curUV+2*col+16));

// Load 8 UV's

vtmp6 = _mm_load_si128((__m128i*)(curY+2*col+srcPitch+16));

// Load 16 Y's, row 1


vtmp2 = vtmp1;

// V7U7V6U6V5U5V4U4V3U3V2U2V1U1V0U0

vtmp3 = vtmp0;

vtmp4 = vtmp6;

vtmp0 = _mm_unpacklo_epi8(vtmp0, vzero);

// __Y7__Y6__Y5__Y4__Y3__Y2__Y1__Y0

vtmp1 = _mm_unpacklo_epi8(vtmp1, vzero);

// __V3__U3__V2__U2__V1__U1__V0__U0

vtmp2 = _mm_unpackhi_epi8(vtmp2, vzero);

// __V7__U7__V6__U6__V5__U5__V4__U4

vtmp3 = _mm_unpackhi_epi8(vtmp3, vzero);

// __YF__YE__YD__YC__YB__YA__Y9__Y8


vtmp1 = _mm_slli_epi32(vtmp1, 8);

// V3__U3__V2__U2__V1__U1__V0__U0__

vtmp2 = _mm_slli_epi32(vtmp2, 8);

// V7__U7__V6__U6__V5__U5__V4__U4__

vtmp6 = _mm_unpacklo_epi8(vtmp6, vzero);

// __Y7__Y6__Y5__Y4__Y3__Y2__Y1__Y0, row 1

vtmp4 = _mm_unpackhi_epi8(vtmp4, vzero);

// __YF__YE__YD__YC__YB__YA__Y9__Y8, row 1


// All 8 xmm regs used

vtmp0 = _mm_or_si128(vtmp0, vtmp1);

// V3Y7U3Y6V2Y5U2Y4V1Y3U1Y2V0Y1U0Y0

vtmp3 = _mm_or_si128(vtmp3, vtmp2);

// V7YFU7YEV6YDU6YCV5YBU5YAV4Y9U4Y8

vtmp6 = _mm_or_si128(vtmp6, vtmp1);

// V3Y7U3Y6V2Y5U2Y4V1Y3U1Y2V0Y1U0Y0

vtmp4 = _mm_or_si128(vtmp4, vtmp2);

// V7YFU7YEV6YDU6YCV5YBU5YAV4Y9U4Y8


_mm_stream_si128((__m128i*)(pDst+4*col+32), vtmp0);

// store first 8 pixels of row 0

_mm_stream_si128((__m128i*)(pDst+4*col+48), vtmp3);

// store second 8 pixels of row 0

_mm_store_si128((__m128i*)(pTmp+4*col+32), vtmp6);

// store first 8 pixels of row 1

_mm_store_si128((__m128i*)(pTmp+4*col+48), vtmp4);

// store second 8 pixels of row 1

}

pDst += dstPitch;

// copy temp buffer

// watch for buffer size issues

for (col=0; col < XSize_2; col += 16) {

vtmp0 = _mm_load_si128((__m128i*)(pTmp+4*col));

vtmp1 = _mm_load_si128((__m128i*)(pTmp+4*col+16));

vtmp2 = _mm_load_si128((__m128i*)(pTmp+4*col+32));

vtmp3 = _mm_load_si128((__m128i*)(pTmp+4*col+48));


_mm_stream_si128((__m128i*)(pDst+4*col), vtmp0);

_mm_stream_si128((__m128i*)(pDst+4*col+16), vtmp1);

_mm_stream_si128((__m128i*)(pDst+4*col+32), vtmp2);

_mm_stream_si128((__m128i*)(pDst+4*col+48), vtmp3);

}

curY += 2*srcPitch;

curUV += srcPitch;

pDst += dstPitch;

}

_mm_sfence();

}