MMX intrinsics performed badly

I can't understand why my MMX code is slower than the plain C++ version. The C++ version took 0.000180 ms and the MMX intrinsics version took 0.000280 ms. Any explanation? I thought parallel addition was faster than serial addition!

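#include "stdafx.h"
#include <windows.h>   // UINT64
#include <mmintrin.h>  // __m64 and the _mm_*_pi16 MMX intrinsics

int _tmain(int argc, _TCHAR* argv[])
{
    // startCount, endCount and freq are set by timer calls around the
    // measured code (not shown).
    UINT64 startCount, endCount, diffCount, freq;

    short block[4][4] = {{1,1,1,1},{2,2,2,2},{3,3,3,3},{4,4,4,4}};
    int j;

    // C++ code: one column of the block per iteration
    for (j = 0; j < 4; j++)
    {
        int s0 = block[0][j] + block[3][j];
        int s3 = block[0][j] - block[3][j];
        int s1 = block[1][j] + block[2][j];
        int s2 = block[1][j] - block[2][j];

        block[2][j] = s0 - s1;
        block[1][j] = s2 + (s3 << 1);
        block[3][j] = s3 - (s2 << 1);
    }

    // MMX code: each row of block is one __m64 (4 x 16 bit),
    // so all four columns are processed at once
    j = 0;  // reset: the loop above leaves j == 4, which would index past the array
    __m64* block2 = (__m64*)block;
    __m64 s0, s1, s2, s3;

    s0 = _mm_add_pi16(block2[j], block2[3 + j]);
    s3 = _mm_sub_pi16(block2[j], block2[3 + j]);
    s1 = _mm_add_pi16(block2[1 + j], block2[2 + j]);
    s2 = _mm_sub_pi16(block2[1 + j], block2[2 + j]);

    block2[2 + j] = _mm_sub_pi16(s0, s1);
    block2[1 + j] = _mm_add_pi16(s2, _mm_slli_pi16(s3, 1));
    block2[3 + j] = _mm_sub_pi16(s3, _mm_slli_pi16(s2, 1));

    _mm_empty();  // leave MMX state before the floating-point code below

    diffCount = endCount - startCount;

    double exeTime_in_ms = (double)diffCount * 1000.0 / freq;
    printf("Executing time : %fms\n", exeTime_in_ms);

    return 0;
}
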

It is really hard to measure such a short time precisely. I suggest that you put a loop around your code and execute it 1000 times.

Furthermore, I suggest that you have a look at the generated assembly code to verify that the compiler generates the code that you expect.
