I have the code snippet below:
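BYTE *pD, *pY, *pU, *pV;      /* pD: packed output; pY, pU, pV: source planes */
int n;
for (n = 0; n < size; n++) {
    *pD++ = *pY++;            /* first Y sample */
    *pD++ = *pU++; pU++;      /* one U sample, then skip the next one */
    *pD++ = *pY++;            /* second Y sample */
    *pD++ = *pV++; pV++;      /* one V sample, then skip the next one */
}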
I tried to rewrite it with SSE2, but I do not know how to interleave (combine) bytes with SSE. Any help?
BTW, I tested the following simple SSE code and found that CPU usage in my program got worse, going from 80% to 95%. I had expected SSE to reduce CPU usage. The speed improved only a little. What is going wrong? I am running Windows 8 on a Core i7 ultrabook.
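int n;
__m128i tmp;
for (n = 0; n < size; n += 16) {
    /* Test only: copies 16 Y bytes; U and V are not written yet. */
    /* _mm_load_si128 / _mm_store_si128 require 16-byte-aligned pointers. */
    tmp = _mm_load_si128((__m128i *)pY);
    _mm_store_si128((__m128i *)pD, tmp);
    pD += 32; pY += 16; pU += 8; pV += 8;
}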
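For reference, one possible way to do the byte interleave with SSE2 is sketched below. It is a minimal sketch under several assumptions, not a tuned implementation: the function names pack_scalar and pack_sse2 are hypothetical, BYTE is typedef'd to uint8_t as a stand-in for the Windows type, size counts iterations of the scalar loop (so pY, pU and pV each hold at least 2*size bytes and pD holds 4*size bytes), and unaligned loads and stores are used so that no alignment is required. The even U and V bytes are merged with a mask and a shift, and _mm_unpacklo_epi8 / _mm_unpackhi_epi8 then interleave them with Y.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

typedef uint8_t BYTE;   /* assumption: stand-in for the Windows BYTE type */

/* Scalar reference: same access pattern as the loop in the question. */
static void pack_scalar(BYTE *pD, const BYTE *pY, const BYTE *pU,
                        const BYTE *pV, size_t size)
{
    size_t n;
    for (n = 0; n < size; n++) {
        *pD++ = *pY++;
        *pD++ = *pU++; pU++;
        *pD++ = *pY++;
        *pD++ = *pV++; pV++;
    }
}

/* SSE2 sketch: each vector step covers 8 scalar iterations (32 output bytes). */
static void pack_sse2(BYTE *pD, const BYTE *pY, const BYTE *pU,
                      const BYTE *pV, size_t size)
{
    const __m128i lowByte = _mm_set1_epi16(0x00FF);
    size_t n = 0;

    assert(pD && pY && pU && pV);

    for (; n + 8 <= size; n += 8) {
        __m128i y = _mm_loadu_si128((const __m128i *)pY);
        __m128i u = _mm_loadu_si128((const __m128i *)pU);
        __m128i v = _mm_loadu_si128((const __m128i *)pV);

        /* Keep the even U bytes in the low byte of each 16-bit lane and
           move the even V bytes into the high byte: U0 V0 U2 V2 ... */
        __m128i uv = _mm_or_si128(_mm_and_si128(u, lowByte),
                                  _mm_slli_epi16(v, 8));

        /* Interleave Y with UV: Y0 U0 Y1 V0 Y2 U2 Y3 V2 ... */
        _mm_storeu_si128((__m128i *)pD, _mm_unpacklo_epi8(y, uv));
        _mm_storeu_si128((__m128i *)(pD + 16), _mm_unpackhi_epi8(y, uv));

        pY += 16; pU += 16; pV += 16; pD += 32;
    }

    /* Remaining iterations (fewer than 8) use the scalar path. */
    pack_scalar(pD, pY, pU, pV, size - n);
}
```

If all four buffers are known to be 16-byte aligned, the unaligned loads and stores could be swapped for _mm_load_si128 and _mm_store_si128. Whether this is faster in practice likely depends on memory bandwidth, since the loop does very little arithmetic per byte.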




