Stream benchmark performance
I ran McCalpin's STREAM benchmark on the 3.4 GHz Xeon and got 4759 MB/s for Triad. In the original code the Triad
function is inlined and operates directly on the global arrays. I added a similar function that takes the arrays as
arguments instead of using the global arrays directly. The bandwidth for the new Triad function was only 3400 MB/s.
Why did I lose so much bandwidth? The opt-report indicated that both functions were inlined.
#define N 2000000
static double *a;
static double *b;
static double *c;
int main()
{
    ...
    times[3][k] = mysecond();
    tuned_STREAM_Triad(a,b,c,scalar);      // original function, inlined 4759 MB/s
    times[3][k] = mysecond() - times[3][k];
    times[4][k] = mysecond();
    tuned_STREAM_Triad_Arg(a,b,c,scalar);  // new function, inlined 3400 MB/s
    times[4][k] = mysecond() - times[4][k];
    return 0;
}
/* Original version: works on the globals a, b, c; the pointer arguments are unused. */
void tuned_STREAM_Triad(double* aa,double* bb,double* cc,double scalar)
{
    int j;
#pragma omp parallel for
    for (j=0; j<N; j++)
        a[j] = b[j]+scalar*c[j];
}
void tuned_STREAM_Triad_Arg(double* restrict aa,double* bb,double* cc,double scalar)
{
    int j;
#pragma omp parallel for
#pragma ivdep
    for (j=0; j<N; j++)
        aa[j] = bb[j]+scalar*cc[j];
}
Compiled with:
    icc -openmp -restrict