I ran McCalpin's STREAM benchmark on the 3.4 GHz Xeon and got 4759 MB/s for Triad. The original Triad function was inlined and operated directly on the global arrays. I added a similar function that takes the arrays as arguments instead of using the globals directly. The new Triad function reached only 3400 MB/s. Why did I lose so much bandwidth? The opt-report indicates that both functions were inlined.
#define N 2000000
static double *a;
static double *b;
static double *c;
int main()
{
    ...
    times[3][k] = mysecond();
    tuned_STREAM_Triad(a,b,c,scalar); // original function, inlined 4759 MB/s
    times[3][k] = mysecond() - times[3][k];
    times[4][k] = mysecond();
    tuned_STREAM_Triad_Arg(a,b,c,scalar); // new function, inlined 3400 MB/s
    times[4][k] = mysecond() - times[4][k];
    return 0;
}
// Original version: the parameters are ignored and the loop uses the globals a, b, c.
void tuned_STREAM_Triad(double* aa,double* bb,double* cc,double scalar)
{
    int j;
#pragma omp parallel for
    for (j=0; j<N; j++)
        a[j] = b[j]+scalar*c[j];
}
// New version: the loop uses the pointers passed in as arguments.
void tuned_STREAM_Triad_Arg(double* restrict aa,double* bb,double* cc,double scalar)
{
    int j;
#pragma omp parallel for
#pragma ivdep
    for (j=0; j<N; j++)
        aa[j] = bb[j]+scalar*cc[j];
}
Compiled with
icc -openmp -restrict
A couple of comments:
First, can we assume the proper amount of memory was allocated for a, b, and c (N doubles each)?
Second, you might check the disassembly to see whether register pressure causes one or more of the pointers aa, bb, cc to be refetched from memory rather than remaining cached in registers.
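The first check could be sketched roughly as follows. This is only an illustration under assumptions: the post does not show how a, b, and c are allocated, so malloc is assumed here, and the helper names alloc_doubles, triad_arg, and count_mismatches are hypothetical, not part of STREAM. The loop is serial (no OpenMP pragma) so that it only tests correctness, not bandwidth.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

#define N 2000000

/* Hypothetical helper: allocate n doubles and stop if the request fails,
   so a too-small or missing allocation cannot go unnoticed. */
static double *alloc_doubles(size_t n)
{
    assert(n > 0);
    double *p = malloc(n * sizeof *p);
    if (p == NULL) {
        fprintf(stderr, "allocation of %zu doubles failed\n", n);
        exit(EXIT_FAILURE);
    }
    return p;
}

/* Same loop body as tuned_STREAM_Triad_Arg, with the length passed in
   explicitly and without the OpenMP and ivdep pragmas. */
static void triad_arg(double *restrict aa, const double *bb,
                      const double *cc, double scalar, size_t n)
{
    for (size_t j = 0; j < n; j++)
        aa[j] = bb[j] + scalar * cc[j];
}

/* Count the elements of aa that differ from the expected value. */
static size_t count_mismatches(const double *aa, size_t n, double expected)
{
    size_t bad = 0;
    for (size_t j = 0; j < n; j++)
        if (aa[j] != expected)
            bad++;
    return bad;
}
```

For example, after allocating a, b, and c with alloc_doubles(N), filling b with 2.0 and c with 1.0, and calling triad_arg(a, b, c, 3.0, N), count_mismatches(a, N, 5.0) should return 0.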
Jim Dempsey