MMX intrinsics performed badly

I could not understand why the MMX code was slower than the plain C++ code. The C++ version took 0.000180 ms and the MMX intrinsics version took 0.000280 ms. Any explanation? I thought parallel addition was faster than serial addition!

#include "stdafx.h"
#include <windows.h>   // QueryPerformanceCounter, QueryPerformanceFrequency
#include <stdio.h>     // printf
#include <mmintrin.h>  // MMX intrinsics (__m64, _mm_add_pi16, ...)

int _tmain(int argc, _TCHAR* argv[])
{
    UINT64 startCount, endCount, diffCount, freq;

    short block[4][4] = {1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4};
    int j;

    // Start the timer immediately before the code being measured.
    QueryPerformanceCounter((LARGE_INTEGER*)&startCount);

    // C++ code
    /*
    for (j = 0; j < 4; j++)
    {
        int s0 = block[0][j] + block[3][j];
        int s3 = block[0][j] - block[3][j];
        int s1 = block[1][j] + block[2][j];
        int s2 = block[1][j] - block[2][j];
        block[0][j] = s0 + s1;
        block[2][j] = s0 - s1;
        block[1][j] = s2 + (s3 << 1);
        block[3][j] = s3 - (s2 << 1);
    }
    */

    // MMX code: each __m64 holds one row of four 16-bit values,
    // so all four columns are processed at once.
    __m64* block2 = (__m64*)block;
    __m64 s0, s1, s2, s3;
    j = 0;
    s0 = _mm_add_pi16(block2[j], block2[3+j]);
    s3 = _mm_sub_pi16(block2[j], block2[3+j]);
    s1 = _mm_add_pi16(block2[1+j], block2[2+j]);
    s2 = _mm_sub_pi16(block2[1+j], block2[2+j]);
    block2[j]   = _mm_add_pi16(s0, s1);
    block2[2+j] = _mm_sub_pi16(s0, s1);
    block2[1+j] = _mm_add_pi16(s2, _mm_slli_pi16(s3, 1));
    block2[3+j] = _mm_sub_pi16(s3, _mm_slli_pi16(s2, 1));
    _mm_empty();  // clear the MMX state before any floating-point code runs

    // Stop the timer after the measured code. In the version originally posted,
    // both counter reads came before the code, so the code itself was never timed.
    QueryPerformanceCounter((LARGE_INTEGER*)&endCount);

    diffCount = endCount - startCount;
    QueryPerformanceFrequency((LARGE_INTEGER*)&freq);
    double exeTime_in_ms = (double)diffCount * 1000.0 / freq;
    printf("Executing time : %fms\n", exeTime_in_ms);
    return 0;
}

It is really hard to measure such a short time precisely. I suggest that you put a loop around your code and execute it 1000 times.
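
The looping suggestion could look something like the sketch below. It is only an illustration, not the original poster's code: it uses the portable `std::chrono::steady_clock` instead of `QueryPerformanceCounter`, it times the scalar version of the transform, and the names `transform_scalar`, `average_ms_per_call` and `time_scalar_transform` are made up for this example. The block is reset from a fixed input on every iteration so the values do not grow from one run to the next.

```cpp
#include <cassert>
#include <chrono>
#include <cstring>

// Scalar version of the 4x4 transform from the question, one column at a time.
// Multiplication by 2 replaces the original "<< 1" to avoid shifting negative values.
void transform_scalar(short block[4][4])
{
    for (int j = 0; j < 4; ++j) {
        int s0 = block[0][j] + block[3][j];
        int s3 = block[0][j] - block[3][j];
        int s1 = block[1][j] + block[2][j];
        int s2 = block[1][j] - block[2][j];
        block[0][j] = static_cast<short>(s0 + s1);
        block[2][j] = static_cast<short>(s0 - s1);
        block[1][j] = static_cast<short>(s2 + s3 * 2);
        block[3][j] = static_cast<short>(s3 - s2 * 2);
    }
}

// Hypothetical helper: runs fn() `iterations` times and returns the
// average wall-clock time per call in milliseconds.
template <typename Fn>
double average_ms_per_call(Fn fn, int iterations)
{
    assert(iterations > 0);
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        fn();
    }
    const auto stop = std::chrono::steady_clock::now();
    const std::chrono::duration<double, std::milli> total = stop - start;
    return total.count() / iterations;
}

// Times the scalar transform. The volatile sink is an attempt to keep the
// optimizer from discarding the work; it is not a guarantee, so checking the
// generated assembly is still worthwhile.
double time_scalar_transform(int iterations)
{
    const short input[4][4] = {{1,1,1,1},{2,2,2,2},{3,3,3,3},{4,4,4,4}};
    volatile short sink = 0;
    return average_ms_per_call([&] {
        short block[4][4];
        std::memcpy(block, input, sizeof(block));  // fresh copy each run
        transform_scalar(block);
        sink = block[0][0];
    }, iterations);
}
```

Calling `time_scalar_transform(1000000)` returns the average time per call in milliseconds. An MMX variant could be timed the same way by passing a different callable to `average_ms_per_call`, which would put both versions under the same measurement conditions.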

Furthermore, I suggest that you have a look at the generated assembly code to verify that the compiler generates the code that you expect.
