I need to efficiently compute the element sum of a double-precision vector (a[0]+a[1]+...+a[n-1]). Is there a routine in MKL for this? Unfortunately, the BLAS ?asum routines compute the sum of the magnitudes.
Intel compiler optimizations do this effectively.
I am using Intel 9.0, so I gather you are suggesting just a simple "for" loop. Any specific optimization directives I should use? This does seem like a case where multithreading/parallelization would help as well...
A loop such as
  double sum = 0;
  for (int i = 0; i < n; ++i)
    sum += a[i];
(with sum as a local variable declared with the same type as the elements of a) should optimize easily. For example, on Xeon or P4, use the options
  icc -O -xW
or, for an SSE3 machine, -xP. -O1 may be superior to -O2 for loops of moderate length.
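For reference, here is a minimal sketch of that loop as a self-contained function. The name sum_elements and its signature are hypothetical, chosen only for illustration. The accumulator is a local double, matching the element type as suggested above, and the loop increments the index i.

```cpp
#include <cstddef>

// Hypothetical helper (not an MKL or BLAS routine): returns a[0] + a[1] + ... + a[n-1].
// The accumulator is a local variable of the same type as the elements of a.
double sum_elements(const double* a, std::size_t n)
{
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i)
        sum += a[i];
    return sum;
}
```

Called as sum_elements(a, n), this returns 0.0 for an empty vector. Whether the compiler vectorizes it depends on the options used, such as the ones mentioned above.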
As a little test, I tried this on a Pentium D with /Qopenmp and OMP_NUM_THREADS=2 and saw 100 percent CPU usage. Very nice...
The build log window did say that the OpenMP-defined loop was parallelized:
const double *data=s.data();
#pragma omp parallel for reduction(+:result)
for (int i=0;i
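The snippet above is cut off in the post. What follows is a rough, complete sketch of the same kind of OpenMP reduction, not the original code. It assumes s is a std::vector<double>; the function name parallel_sum and the loop bound taken from s.size() are assumptions made for illustration.

```cpp
#include <vector>

// Hypothetical helper: sums the elements of s using an OpenMP reduction.
// Assumes s is a std::vector<double> whose size fits in an int.
double parallel_sum(const std::vector<double>& s)
{
    const double* data = s.data();
    const int n = static_cast<int>(s.size());
    double result = 0.0;

    // Each thread accumulates into its own private copy of result;
    // OpenMP combines the private copies with + when the loop finishes.
    #pragma omp parallel for reduction(+:result)
    for (int i = 0; i < n; ++i)
        result += data[i];

    return result;
}
```

This needs OpenMP enabled at compile time (/Qopenmp as used above, or -fopenmp with GCC). Without it the pragma is ignored and the loop simply runs serially. Because the threads add their partial sums in a different order than a serial loop would, the result can differ from the serial sum in the last few bits.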