SSE2 slower than MMX/SSE(1)

OK, here are two functions:
one is SSE (P3/MMX, et cetera), the other is SSE2.
The SSE2 one is performing slower. Any advice why?

I've compiled with and without profile-guided compilation; the results are the same. SSE2 is slower.

SSE2 function:

int SSE2_Copy16x16NA_E(BYTE* RESTRICT pSrc, BYTE* RESTRICT pDst, int w_Src, int w_Dst)
{
    int i, result;
    __m128i e = _mm_setzero_si128();
    for (i = 0; i < 16; i++) {
        __m128i unaligned = _mm_loadu_si128((__m128i*)pSrc);
        e = _mm_add_epi16(e, _mm_sad_epu8(((__m128i*)pDst)[0], unaligned));
        pDst += w_Dst;
        pSrc += w_Src;
    }
    e = _mm_srli_si128(e, 8);
    e = _mm_add_epi32(e, e);
    result = _mm_cvtsi128_si32(e);
    _mm_empty();
    return result;
}

SSE/MMX function:

int MMX_Copy16x16NA_E(BYTE* RESTRICT pSrc, BYTE* RESTRICT pDst, int w_Src, int w_Dst)
{
    int i, result;
    __m64 e0 = _mm_setzero_si64(), e1 = _mm_setzero_si64();
    for (i = 0; i < 16; i++) {
        e0 = _mm_add_pi32(e0, _mm_sad_pu8(((__m64*)pDst)[0], ((__m64*)pSrc)[0]));
        e1 = _mm_add_pi32(e1, _mm_sad_pu8(((__m64*)pDst)[1], ((__m64*)pSrc)[1]));
        pDst += w_Dst;
        pSrc += w_Src;
    }
    e1 = _mm_add_pi32(e1, e0);
    result = _m_to_int(e1);
    _mm_empty();
    return result;
}

Thanks,
Alex Telitsine
Streambox Inc.

Just in case, here is what the plain C code does:

int C_Copy16x16NA_E(BYTE* RESTRICT pSrc, BYTE* RESTRICT pDst, int w_Src, int w_Dst)
{
    int i, err = 0;
    for (i = 0; i < 16; i++) {
        err += ABS(pDst[ 0] - pSrc[ 0]);
        err += ABS(pDst[ 1] - pSrc[ 1]);
        err += ABS(pDst[ 2] - pSrc[ 2]);
        err += ABS(pDst[ 3] - pSrc[ 3]);
        err += ABS(pDst[ 4] - pSrc[ 4]);
        err += ABS(pDst[ 5] - pSrc[ 5]);
        err += ABS(pDst[ 6] - pSrc[ 6]);
        err += ABS(pDst[ 7] - pSrc[ 7]);
        err += ABS(pDst[ 8] - pSrc[ 8]);
        err += ABS(pDst[ 9] - pSrc[ 9]);
        err += ABS(pDst[10] - pSrc[10]);
        err += ABS(pDst[11] - pSrc[11]);
        err += ABS(pDst[12] - pSrc[12]);
        err += ABS(pDst[13] - pSrc[13]);
        err += ABS(pDst[14] - pSrc[14]);
        err += ABS(pDst[15] - pSrc[15]);
        pDst += w_Dst;
        pSrc += w_Src;
    }
    return err;
}

and, for the Intel platform, the following definitions are used:

#ifdef __ICL // intel compiler
#define _X86_COMPATABLE_CPU_
#define RESTRICT restrict // -Qrestrict option should be ON
#define ABS(i) abs(i)
#ifndef INLINE
#define INLINE _inline // inline is always available on MMX
#endif
#define CAN_RW_UNALIGNED
#endif
.......
#ifndef BYTE
typedef unsigned char BYTE;
#endif

If you are incurring cache line splits in your SSE2 code, reduced performance is to be expected. I don't see that PGO is likely to do anything for this.
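For what it's worth, a quick way to estimate how often those unaligned 16-byte loads cross a cache line is a small check like the one below (a rough sketch, assuming 64-byte cache lines; the helper name is made up):

#include <stdint.h>

/* Returns non-zero if a 16-byte load starting at p crosses a 64-byte
   cache line boundary, i.e. the load touches two lines. */
static int crosses_cache_line_16(const void* p)
{
    return ((uintptr_t)p & 63u) > 48u;
}

Counting this over the pSrc addresses your motion search actually visits gives a feel for how many of the _mm_loadu_si128 calls are split loads.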

Well, according to the SSE2 application note, I should get a 1.26x speedup over the SSE/MMX code:

ftp://download.intel.com/design/perftool/cbts/appnotes/sse2/w_me_alg.pdf

Regarding cache line splits, the application note says that an unaligned load should still be faster than loading two 64-bit values.
Could it be that the application note is not "entirely true"?
Anyway, I'll try to see what's going on in VTune today.

OK, the problem was in the last lines of the SSE2 code:
only half of the result was being used, which caused the motion search to perform more block comparisons:
e = _mm_srli_si128(e, 8);
e = _mm_add_epi32 (e, e);
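The shift overwrites e before the add, so the low 64-bit partial SAD is discarded and the remaining upper half just gets doubled. A reduction that actually sums both halves, as used in the final version below, looks like this:

/* Sum the two 64-bit partial SADs produced by _mm_sad_epu8 */
e = _mm_add_epi32(e, _mm_srli_si128(e, 8));
result = _mm_cvtsi128_si32(e);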

Unrolling the loop and interleaving e0/e1 in the loop gave a little improvement as well.

Overall, SSE2 gave a 1.27x improvement in clock ticks:
SSE - 1969 clocks, SSE2 - 1542 clocks.

Below is the final version of the SSE2 code:

int SSE2_Copy16x16NA_E(BYTE* RESTRICT pSrc, BYTE* RESTRICT pDst, int w_Src, int w_Dst)
{
    int result;
    __m128i e0, e1;

    e0 = _mm_sad_epu8(*(__m128i*)(pDst + ( 0)*w_Dst), _mm_loadu_si128((__m128i*)(pSrc + ( 0)*w_Src)));
    e1 = _mm_sad_epu8(*(__m128i*)(pDst + ( 1)*w_Dst), _mm_loadu_si128((__m128i*)(pSrc + ( 1)*w_Src)));
    e0 = _mm_add_epi16(e0, _mm_sad_epu8(*(__m128i*)(pDst + ( 2)*w_Dst), _mm_loadu_si128((__m128i*)(pSrc + ( 2)*w_Src))));
    e1 = _mm_add_epi16(e1, _mm_sad_epu8(*(__m128i*)(pDst + ( 3)*w_Dst), _mm_loadu_si128((__m128i*)(pSrc + ( 3)*w_Src))));
    e0 = _mm_add_epi16(e0, _mm_sad_epu8(*(__m128i*)(pDst + ( 4)*w_Dst), _mm_loadu_si128((__m128i*)(pSrc + ( 4)*w_Src))));
    e1 = _mm_add_epi16(e1, _mm_sad_epu8(*(__m128i*)(pDst + ( 5)*w_Dst), _mm_loadu_si128((__m128i*)(pSrc + ( 5)*w_Src))));
    e0 = _mm_add_epi16(e0, _mm_sad_epu8(*(__m128i*)(pDst + ( 6)*w_Dst), _mm_loadu_si128((__m128i*)(pSrc + ( 6)*w_Src))));
    e1 = _mm_add_epi16(e1, _mm_sad_epu8(*(__m128i*)(pDst + ( 7)*w_Dst), _mm_loadu_si128((__m128i*)(pSrc + ( 7)*w_Src))));
    e0 = _mm_add_epi16(e0, _mm_sad_epu8(*(__m128i*)(pDst + ( 8)*w_Dst), _mm_loadu_si128((__m128i*)(pSrc + ( 8)*w_Src))));
    e1 = _mm_add_epi16(e1, _mm_sad_epu8(*(__m128i*)(pDst + ( 9)*w_Dst), _mm_loadu_si128((__m128i*)(pSrc + ( 9)*w_Src))));
    e0 = _mm_add_epi16(e0, _mm_sad_epu8(*(__m128i*)(pDst + (10)*w_Dst), _mm_loadu_si128((__m128i*)(pSrc + (10)*w_Src))));
    e1 = _mm_add_epi16(e1, _mm_sad_epu8(*(__m128i*)(pDst + (11)*w_Dst), _mm_loadu_si128((__m128i*)(pSrc + (11)*w_Src))));
    e0 = _mm_add_epi16(e0, _mm_sad_epu8(*(__m128i*)(pDst + (12)*w_Dst), _mm_loadu_si128((__m128i*)(pSrc + (12)*w_Src))));
    e1 = _mm_add_epi16(e1, _mm_sad_epu8(*(__m128i*)(pDst + (13)*w_Dst), _mm_loadu_si128((__m128i*)(pSrc + (13)*w_Src))));
    e0 = _mm_add_epi16(e0, _mm_sad_epu8(*(__m128i*)(pDst + (14)*w_Dst), _mm_loadu_si128((__m128i*)(pSrc + (14)*w_Src))));
    e1 = _mm_add_epi16(e1, _mm_sad_epu8(*(__m128i*)(pDst + (15)*w_Dst), _mm_loadu_si128((__m128i*)(pSrc + (15)*w_Src))));
    e0 = _mm_add_epi32(e0, e1);
    e0 = _mm_add_epi32(e0, _mm_srli_si128(e0, 8));
    result = _mm_cvtsi128_si32(e0);
    _mm_empty();
    return result;
}

Alex

>> Overall, SSE2 gave a 1.27x improvement in clock ticks:
>> SSE - 1969 clocks, SSE2 - 1542 clocks.

How did you arrive at the number of clock cycles?

Thanks in advance.

Regards,

kiran

Quote:

kiran N. wrote:

>> Overall, SSE2 gave a 1.27x improvement in clock ticks:
>> SSE - 1969 clocks, SSE2 - 1542 clocks.

How did you arrive at the number of clock cycles?

Thanks in advance.

Regards,

kiran

Probably by using _asm rdtsc or the __rdtsc() intrinsic.
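For example, a rough timing harness with the __rdtsc() intrinsic could look like this (a sketch, not the original poster's actual measurement code; a real measurement would repeat the call many times, take a minimum, and serialize around the reads):

#ifdef _MSC_VER
#include <intrin.h>      /* __rdtsc on MSVC/ICL */
#else
#include <x86intrin.h>   /* __rdtsc on GCC/Clang */
#endif

/* Hypothetical harness; assumes BYTE and SSE2_Copy16x16NA_E from the posts above. */
unsigned long long time_sad_once(BYTE* pSrc, BYTE* pDst, int w_Src, int w_Dst)
{
    unsigned long long t0, t1;
    volatile int sink;                 /* keep the call from being optimized away */
    t0 = __rdtsc();
    sink = SSE2_Copy16x16NA_E(pSrc, pDst, w_Src, w_Dst);
    t1 = __rdtsc();
    (void)sink;
    return t1 - t0;                    /* elapsed TSC ticks for one 16x16 SAD */
}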

Quote:

alex-telitsine wrote:

ftp://download.intel.com/design/perftool/cbts/appnotes/sse2/w_me_alg.pdf

Regarding cache line splits, the application note says that an unaligned load should still be faster than loading two 64-bit values.
Could it be that the application note is not "entirely true"?

On the Westmere CPU, I got consistently better results by splitting 128-bit loads, which some compilers did automatically when using source code, in a case where 50% of the loads were unaligned. I could never verify the often-stated recommendation to build with the SSE4.2 option for Westmere, when the earlier architecture options frequently use specific strategies for unaligned loads.

This difference may have gone away for 128-bit loads on Sandy Bridge, but there the usual splitting of 256-bit unaligned loads is quite important. Even on Ivy Bridge, where a late fix went in to reduce the penalty for 256-bit unaligned loads, my SSE2 intrinsics run faster than the AVX ones for this reason (with the AVX transition penalties handled either by AVX-128 translation or explicit vzeroupper). That has changed with the Haswell CPUs.
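To illustrate what "splitting 128-bit loads" means here (a sketch of the general technique, not my exact code; the helper name is made up), an unaligned 16-byte load can be composed from two 8-byte halves with movq/movhpd-style intrinsics:

#include <emmintrin.h>   /* SSE2 */

/* Build an unaligned 16-byte value from two 8-byte loads; each half
   crosses a cache line far less often than a single 16-byte access would. */
static __m128i loadu_split_8x2(const unsigned char* p)
{
    __m128i lo = _mm_loadl_epi64((const __m128i*)p);                          /* bytes 0..7  */
    __m128d v  = _mm_loadh_pd(_mm_castsi128_pd(lo), (const double*)(p + 8));  /* bytes 8..15 */
    return _mm_castpd_si128(v);
}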

It's difficult to infer this from VTune, unless you can count cache line splits and correlate them with stalls on instructions which consume the results of memory loads.

Anyway, the assumption that seems to be made here, namely that the specific CPU architecture or stepping can be ignored when considering the effect of misalignment, is not a good one.

>>> OK, here are two functions:
one is SSE (P3/MMX, et cetera), the other is SSE2.
The SSE2 one is performing slower. Any advice why? >>>

Try posting your question on the ISA forum.

>>> Overall, SSE2 gave a 1.27x improvement in clock ticks:
SSE - 1969 clocks, SSE2 - 1542 clocks >>>

In this case the SSE2 version does not rely on any for-loop logic, so without the overhead of the compiled loop instructions the code executes faster. Bear in mind that your code mixes SIMD vector operations with scalar integer operations; Haswell can probably schedule the loop instructions to execute on Port 6, freeing resources for the SIMD ALUs. An interesting question also arises about the low-level implementation of the integer ALU vs. the SIMD ALU: I would like to know whether scalar integer code is executed by the same ALU as SIMD integer code.

Thanks for the info.. I was looking for it for a long time :)

>>> Thanks for the info.. I was looking for it for a long time :) >>>

You are welcome.

Try replacing _mm_loadu_si128 with _mm_lddqu_si128; you might get better performance in your case (purely empirical).

And I guess _mm_empty() is useless if you don't use MMX anymore in that SSE2 function.
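A minimal sketch of both suggestions applied to the original loop version (assuming an SSE3-capable target so _mm_lddqu_si128 is available; the function name here is just illustrative):

#include <pmmintrin.h>   /* SSE3: _mm_lddqu_si128 */

int SSE3_Copy16x16NA_E(BYTE* RESTRICT pSrc, BYTE* RESTRICT pDst, int w_Src, int w_Dst)
{
    int i;
    __m128i e = _mm_setzero_si128();
    for (i = 0; i < 16; i++) {
        __m128i unaligned = _mm_lddqu_si128((const __m128i*)pSrc);   /* was _mm_loadu_si128 */
        e = _mm_add_epi16(e, _mm_sad_epu8(*(const __m128i*)pDst, unaligned));
        pDst += w_Dst;
        pSrc += w_Src;
    }
    e = _mm_add_epi32(e, _mm_srli_si128(e, 8));   /* sum the two 64-bit partial SADs */
    return _mm_cvtsi128_si32(e);                  /* no _mm_empty(): no MMX state is used */
}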

Wow, that's a blast from the past, a 2003 email? Yes, _mm_lddqu_si128 is used for SSE3 and higher code :-)
