I was trying out these two code snippets. My arrays are all 16-byte aligned:
// Needs <emmintrin.h>. Assumes x and y are 16-byte aligned and n is a multiple of 4.
inline void vecDotSSE(double * s, double * x, double * y, int n)
{
    int ii;
    __m128d XMM0 = _mm_setzero_pd();   // accumulator for elements ii, ii+1
    __m128d XMM1 = _mm_setzero_pd();   // accumulator for elements ii+2, ii+3
    __m128d XMM2, XMM3, XMM4, XMM5;
    for (ii = 0; ii < (n); ii += 4)
    {
        XMM2 = _mm_load_pd((x)+ii);
        XMM3 = _mm_load_pd((x)+ii+2);
        XMM4 = _mm_load_pd((y)+ii);
        XMM5 = _mm_load_pd((y)+ii+2);
        XMM2 = _mm_mul_pd(XMM2, XMM4);
        XMM3 = _mm_mul_pd(XMM3, XMM5);
        XMM0 = _mm_add_pd(XMM0, XMM2);
        XMM1 = _mm_add_pd(XMM1, XMM3);
    }
    // Horizontal sum: combine both accumulators, then add the high lane to the low lane.
    XMM0 = _mm_add_pd(XMM0, XMM1);
    XMM1 = _mm_shuffle_pd(XMM0, XMM0, _MM_SHUFFLE2(1, 1));
    XMM0 = _mm_add_pd(XMM0, XMM1);
    _mm_store_sd((s), XMM0);
}
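The loop in vecDotSSE reads four doubles per iteration, so it only works when n is a multiple of 4. A minimal sketch of a variant that accepts any n is shown here. The name vecDotSSEAnyN is hypothetical, it returns the sum instead of writing through a pointer, and it uses unaligned loads (_mm_loadu_pd) as an assumption so that the sketch is safe for any pointer, not because the original code needs them.

```cpp
#include <cassert>
#include <emmintrin.h>  // SSE2 intrinsics: __m128d, _mm_*_pd

// Hypothetical variant of vecDotSSE that works for any n >= 0.
// Assumption: unaligned loads are used so x and y need no special alignment.
inline double vecDotSSEAnyN(const double* x, const double* y, int n)
{
    assert(n >= 0);
    __m128d acc0 = _mm_setzero_pd();
    __m128d acc1 = _mm_setzero_pd();
    int i = 0;

    // Main loop: four doubles per iteration, two independent accumulators.
    for (; i + 4 <= n; i += 4)
    {
        acc0 = _mm_add_pd(acc0, _mm_mul_pd(_mm_loadu_pd(x + i),
                                           _mm_loadu_pd(y + i)));
        acc1 = _mm_add_pd(acc1, _mm_mul_pd(_mm_loadu_pd(x + i + 2),
                                           _mm_loadu_pd(y + i + 2)));
    }

    // Horizontal sum: merge the accumulators, then add high lane to low lane.
    acc0 = _mm_add_pd(acc0, acc1);
    acc0 = _mm_add_sd(acc0, _mm_unpackhi_pd(acc0, acc0));
    double sum = _mm_cvtsd_f64(acc0);

    // Scalar tail: the 0 to 3 elements the main loop did not cover.
    for (; i < n; ++i)
    {
        sum += x[i] * y[i];
    }
    return sum;
}
```

Because the vector version adds the products in a different order than the scalar loop, the two results can differ in the last few bits for general floating-point input.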
inline void vecDot(double * s, double * x, double * y, int n)
{
    int i;
    *s = 0.;
    for (i = 0; i < n; ++i)
    {
        *s += x[i] * y[i];
    }
}
My compile flags:
g++ -Wall -O3 -msse3
These are my runtimes on vectors of size 1M:
SSE : 0.0263s
Non-SSE : 1.87996e-07s
Does that even make sense?
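If the non-SSE figure is in seconds, it would correspond to roughly 5e12 multiply-adds per second for 1M elements, which suggests the measurement itself may be off. One possible cause (an assumption, since the timing code is not shown) is that at -O3 the compiler can drop or shrink work whose result is never used. A minimal sketch of a timing harness that tries to rule this out follows. The names dotScalar and timeDot are hypothetical, and std::chrono requires C++11 or later.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Plain scalar dot product, same arithmetic as vecDot.
inline double dotScalar(const double* x, const double* y, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; ++i)
    {
        s += x[i] * y[i];
    }
    return s;
}

// Hypothetical helper: average seconds per call over `reps` calls.
// Every result is added to `checksum`, which the caller should print or
// otherwise use, so the compiler has a reason to keep the computation.
inline double timeDot(std::vector<double>& x, const std::vector<double>& y,
                      int reps, double& checksum)
{
    assert(x.size() == y.size() && !x.empty() && reps > 0);
    const int n = static_cast<int>(x.size());

    const auto t0 = std::chrono::steady_clock::now();
    for (int r = 0; r < reps; ++r)
    {
        // Change the input on every repetition so that one call cannot
        // simply be reused for all repetitions.
        x[0] = static_cast<double>(r);
        checksum += dotScalar(x.data(), y.data(), n);
    }
    const auto t1 = std::chrono::steady_clock::now();

    return std::chrono::duration<double>(t1 - t0).count() / reps;
}
```

Calling timeDot with two vectors of 1M doubles and printing both the returned time and the checksum gives a number that can be compared against the SSE version measured the same way.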
I have seen a lot of people on the web complaining about the same problem. I will also try the BLAS implementations from ATLAS and Intel MKL to compare against their SSE runtimes.
Did you guys change something about FPU performance on the new processors? It seems the FPU is much faster than the SSE arithmetic units.
Thanks.
Deb


