I was trying these two code snippets. My arrays are all 16-byte aligned:

#include <pmmintrin.h>  // SSE3; the intrinsics below only need SSE2 (emmintrin.h)

// Assumes x and y are 16-byte aligned and n is a multiple of 4.
inline void vecDotSSE(double * s, double * x, double * y, int n)
{
    int ii;
    __m128d XMM0 = _mm_setzero_pd();  // accumulator 0
    __m128d XMM1 = _mm_setzero_pd();  // accumulator 1
    __m128d XMM2, XMM3, XMM4, XMM5;
    for (ii = 0; ii < n; ii += 4)
    {
        XMM2 = _mm_load_pd(x + ii);
        XMM3 = _mm_load_pd(x + ii + 2);
        XMM4 = _mm_load_pd(y + ii);
        XMM5 = _mm_load_pd(y + ii + 2);
        XMM2 = _mm_mul_pd(XMM2, XMM4);
        XMM3 = _mm_mul_pd(XMM3, XMM5);
        XMM0 = _mm_add_pd(XMM0, XMM2);
        XMM1 = _mm_add_pd(XMM1, XMM3);
    }
    XMM0 = _mm_add_pd(XMM0, XMM1);
    // Horizontal sum: broadcast the high lane, add, then store the low lane.
    XMM1 = _mm_shuffle_pd(XMM0, XMM0, _MM_SHUFFLE2(1, 1));
    XMM0 = _mm_add_pd(XMM0, XMM1);
    _mm_store_sd(s, XMM0);
}

inline void vecDot(double * s, double * x, double * y, int n)
{
    int i;
    *s = 0.;
    for (i = 0; i < n; ++i)
    {
        *s += x[i] * y[i];
    }
}

My compile flags:

g++ -Wall -O3 -msse3

These are my runtime numbers on vectors of size 1M:

SSE : 0.0263 s

Non-SSE : 1.87996e-07 s

Does that even make sense?

I have seen a lot of people on the web complaining about the same problem. I will also try BLAS from ATLAS and Intel MKL to compare against their SSE runtimes.

Did you guys change something about FPU performance on the new processors? It seems the x87 FPU is much faster than the SSE arithmetic units.

Thanks.

Deb