I'm experimenting a little with the SSE2 instructions and IPP. To add two complex vectors, I have written the following procedure:
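#include <xmmintrin.h>  // __m128, _mm_add_ps

// Complex add on split real/imaginary arrays; all pointers must be 16-byte aligned.
void AddSSE(const float* ar, const float* ai, const float* br, const float* bi, float* cr, float* ci)
{
    const __m128* aiSSE = (const __m128*)ai;
    const __m128* biSSE = (const __m128*)bi;
    __m128* ciSSE = (__m128*)ci;
    const __m128* arSSE = (const __m128*)ar;
    const __m128* brSSE = (const __m128*)br;
    __m128* crSSE = (__m128*)cr;
    // Loop reconstructed from context: 1024 floats per array = 256 vectors of 4 floats.
    for (int k = 0; k < 1024 / 4; k++) {
        ciSSE[k] = _mm_add_ps(aiSSE[k], biSSE[k]);
        crSSE[k] = _mm_add_ps(arSSE[k], brSSE[k]);
    }
}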
where the vectors ar, ai, ... each have length 1024. Then I compute the same add using IPP functions:
void AddIPP(const Ipp32fc* a,const Ipp32fc* b,Ipp32fc* c)
The IPP version runs approximately 10 times faster than the SSE2 version I wrote. What am I doing wrong here, and what can I do to speed this up? When computing real-valued adds and multiplies, I am able to make the SSE code run as fast as IPP, but for the complex ones I am far off.
Why is this so slow?
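For comparison, here is a minimal sketch of the same add on interleaved data, which is the layout Ipp32fc uses (re, im, re, im, ...), whereas my AddSSE works on separate real and imaginary arrays. Since a complex add is just a component-wise add, n complex values can be treated as 2*n plain floats. The name AddInterleavedSSE is hypothetical (it is not an IPP function), the inputs are assumed to be plain float arrays in interleaved order, and unaligned loads are used so no alignment is assumed. I have not timed this version.

```cpp
#include <cstddef>
#include <cassert>
#include <xmmintrin.h>  // __m128, _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps

// Hypothetical helper: adds nComplex interleaved complex values, c[i] = a[i] + b[i].
// Each array holds 2 * nComplex floats laid out as re0, im0, re1, im1, ...
void AddInterleavedSSE(const float* a, const float* b, float* c, std::size_t nComplex)
{
    assert(a != nullptr && b != nullptr && c != nullptr);
    const std::size_t total = 2 * nComplex;  // number of floats to add
    std::size_t i = 0;

    // Main loop: 4 floats (2 complex values) per iteration, unaligned access.
    for (; i + 4 <= total; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(c + i, _mm_add_ps(va, vb));
    }

    // Scalar tail for the remaining floats (at most one complex value).
    for (; i < total; ++i) {
        c[i] = a[i] + b[i];
    }
}
```

With 1024 complex values this does the same number of float adds as AddSSE (2048), just on one contiguous array per operand instead of two.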