Hello, the performance of my application depends heavily on summing two vectors (stored as aligned double arrays); that is, I need a fast vecA += vecB. Since SSE does not have an instruction for +=, the only option is vecA = vecA + vecB. I have two versions of this function:
inline void addToDoubleVectorSSE(const double * what, const double * toWhat, volatile double * dest, const unsigned int len)
{
    __m128d * _what = (__m128d*)what;
    __m128d * _toWhat = (__m128d*)toWhat;
    __m128d * _toWhatBase = (__m128d*)toWhat;
    __m128d _dest1;
    __m128d _dest2;
#ifdef FAST_SSE
    for ( register unsigned int i = 0; i < len; i+= 4, _what += 2, _toWhat += 2, _toWhatBase+=2 )
    {
        _toWhatBase = _toWhat;
        _dest1 = _mm_add_pd( *_what, *_toWhat ); //line A
        _dest2 = _mm_add_pd( *(_what+1), *(_toWhat+1)); //line B
        *_toWhatBase = _dest1;
        *(_toWhatBase+1) = _dest2;
    }
#else
    for ( register unsigned int i = 0; i < len; i+= 4 )
    {
        _toWhatBase = _toWhat;
        _dest1 = _mm_add_pd( *_what++, *_toWhat++ );
        _dest2 = _mm_add_pd( *_what++, *_toWhat++ );
        *_toWhatBase++ = _dest1;
        *_toWhatBase++ = _dest2;
    }
#endif
}
FAST_SSE should take advantage of the independence of lines A and B, and hence should provide a performance gain.
Scenario 1: Assume arrays double *a, *b, *c, each 1000 elements long. Calling addToDoubleVectorSSE(a,b,c,1000), say, 10K times indeed shows that the FAST_SSE version runs approx. 25-30 percent faster.
Scenario 2: Assume double **a, **b, **c, where each of a, b, c consists of 1000 arrays, each array (a[i], b[i], c[i]) being 1000 elements long. Calling addToDoubleVectorSSE(a[i],b[i],c[i],1000) over i=0...999, say, 10K times makes the performance gain of FAST_SSE disappear.
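To make the two scenarios concrete, here is a minimal sketch of the access pattern. It is not the pastebin code itself: addInPlace, workingSetBytes and runScenario are names made up for this illustration, and unaligned loads/stores are used only so that the sketch works with std::vector storage (my real arrays are aligned). Since the function above only reads what and reads/writes toWhat (dest is not used), one sweep touches two arrays per row, i.e. about 16 kB in scenario 1 and about 16 MB in scenario 2.

```cpp
#include <cassert>
#include <cstddef>
#include <emmintrin.h>   // SSE2 intrinsics
#include <vector>

// Hypothetical stand-in for addToDoubleVectorSSE: b[i] += a[i], with two
// independent 128-bit adds per iteration. Assumes len is a multiple of 4.
inline void addInPlace(const double* a, double* b, std::size_t len) {
    assert(len % 4 == 0);
    for (std::size_t i = 0; i < len; i += 4) {
        __m128d s0 = _mm_add_pd(_mm_loadu_pd(a + i),     _mm_loadu_pd(b + i));
        __m128d s1 = _mm_add_pd(_mm_loadu_pd(a + i + 2), _mm_loadu_pd(b + i + 2));
        _mm_storeu_pd(b + i, s0);
        _mm_storeu_pd(b + i + 2, s1);
    }
}

// Bytes touched by one sweep: `rows` pairs of arrays, `len` doubles each.
inline std::size_t workingSetBytes(std::size_t rows, std::size_t len) {
    return 2 * rows * len * sizeof(double);
}

// Scenario 1 corresponds to a.size() == 1, scenario 2 to a.size() == 1000.
inline void runScenario(const std::vector<std::vector<double>>& a,
                        std::vector<std::vector<double>>& b,
                        int repeats) {
    for (int r = 0; r < repeats; ++r)
        for (std::size_t i = 0; i < a.size(); ++i)
            addInPlace(a[i].data(), b[i].data(), a[i].size());
}
```

In scenario 1 the same two arrays are reused on every repeat, whereas in scenario 2 each row is touched only once per repeat before roughly 16 MB of other data passes through the cache.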
The question is whether the performance loss can somehow be mitigated. I understand that cache misses are probably going to be the problem. In the first scenario, all arrays a, b, c are small enough to remain in L2, which is not the case in scenario 2. Is there, e.g., a way to tell the compiler something like "In two lines of code, I'm going to need arrays a[i], b[i], c[i], so if you can, prefetch them to L2"? Or is there any other workaround?
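For illustration, the kind of hint I have in mind might look like the sketch below. It assumes the _mm_prefetch intrinsic with _MM_HINT_T1 (a hint to fetch into L2 and outer cache levels) and 64-byte cache lines; addAndPrefetchNext is a made-up name, and unaligned loads/stores are used only to keep the sketch general. I have not measured whether this helps: the hardware prefetcher may already cover sequential access, and the loop may simply be limited by memory bandwidth.

```cpp
#include <cassert>
#include <cstddef>
#include <emmintrin.h>   // SSE2; also provides _mm_prefetch via <xmmintrin.h>

// Hypothetical helper: b[i] += a[i] for the current row, while hinting that
// the next row (nextA, nextB) will be needed soon. Pass nullptr for both
// next pointers on the last row. Assumes len is a multiple of 4.
inline void addAndPrefetchNext(const double* a, double* b,
                               const double* nextA, const double* nextB,
                               std::size_t len) {
    assert(len % 4 == 0);
    const std::size_t doublesPerLine = 8;  // assumption: 64-byte cache lines
    for (std::size_t i = 0; i < len; i += 4) {
        // Issue one prefetch per cache line of the next row.
        if (nextA && nextB && i % doublesPerLine == 0) {
            _mm_prefetch(reinterpret_cast<const char*>(nextA + i), _MM_HINT_T1);
            _mm_prefetch(reinterpret_cast<const char*>(nextB + i), _MM_HINT_T1);
        }
        __m128d s0 = _mm_add_pd(_mm_loadu_pd(a + i),     _mm_loadu_pd(b + i));
        __m128d s1 = _mm_add_pd(_mm_loadu_pd(a + i + 2), _mm_loadu_pd(b + i + 2));
        _mm_storeu_pd(b + i, s0);
        _mm_storeu_pd(b + i + 2, s1);
    }
}
```

The caller would pass a[i+1] and b[i+1] as the next-row pointers while processing row i. A prefetch is only a hint and does not change the result of the computation.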
Any hint is much appreciated, Daniel.
P.S. The sample benchmark code can be downloaded from http://pastebin.com/Z1pQ6Sdp


