Unexpected slowing

I've been having fun with section() and replace(), but have stumbled on an unexpected slowdown. The ArBB function below, simplified (to protect the guilty) but still exhibiting the behavior, is like a red-black Jacobi relaxation iteration. Alternate non-overlapping "halves" of the state vector "A" get updated.

void tick_tock(dense<f32> &A, const dense<f32> &B, const usize &N)
{
    A = replace(A, 0*N, 2*N, 1,
        section(A, 2*N, 2*N, 1)*section(B, 2*N, 2*N, 1) +
        section(A, 3*N, 2*N, 1)*section(B, 0*N, 2*N, 1)
    );
    A = replace(A, 3*N, 1*N, 1,
        section(A, 0*N, 1*N, 1)*section(B, 0*N, 1*N, 1) +
        section(A, 1*N, 1*N, 1)*section(B, 4*N, 1*N, 1)
    );
}
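For readers without ArBB at hand, here is a minimal plain C++ sketch of what the two statements compute. It assumes section() copies a strided range and replace() returns a copy of the container with a strided range overwritten; section_model, replace_model, mul_add2, and tick_tock_model are hypothetical stand-ins written for this sketch, not ArBB calls. B is sized 5*N here so that every range stays in bounds, which is also an assumption of the sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Vec = std::vector<float>;

// Hypothetical stand-in for section(): copy `len` elements starting at `first`.
Vec section_model(const Vec& src, std::size_t first, std::size_t len, std::size_t stride) {
    Vec out(len);
    for (std::size_t i = 0; i < len; ++i) out[i] = src.at(first + i * stride);
    return out;
}

// Hypothetical stand-in for replace(): copy of `src` with the range overwritten by `vals`.
Vec replace_model(Vec src, std::size_t first, std::size_t len, std::size_t stride, const Vec& vals) {
    assert(vals.size() == len);
    for (std::size_t i = 0; i < len; ++i) src.at(first + i * stride) = vals[i];
    return src;
}

// Element-wise a*b + c*d, the arithmetic inside each replace() call.
Vec mul_add2(const Vec& a, const Vec& b, const Vec& c, const Vec& d) {
    Vec out(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) out[i] = a[i] * b[i] + c[i] * d[i];
    return out;
}

// Scalar model of tick_tock(): A holds 5 segments of N elements each.
void tick_tock_model(Vec& A, const Vec& B, std::size_t N) {
    // First statement: segments 0-1 are rebuilt from segments 2-4.
    A = replace_model(A, 0 * N, 2 * N, 1,
        mul_add2(section_model(A, 2 * N, 2 * N, 1), section_model(B, 2 * N, 2 * N, 1),
                 section_model(A, 3 * N, 2 * N, 1), section_model(B, 0 * N, 2 * N, 1)));
    // Second statement: segment 3 is rebuilt from the freshly updated segments 0-1.
    A = replace_model(A, 3 * N, 1 * N, 1,
        mul_add2(section_model(A, 0 * N, 1 * N, 1), section_model(B, 0 * N, 1 * N, 1),
                 section_model(A, 1 * N, 1 * N, 1), section_model(B, 4 * N, 1 * N, 1)));
}
```

Calling tick_tock_model(A, B, N) with A laid out as five N-element segments shows the data dependence: the second statement reads exactly what the first one just wrote.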

Then the main code:

#define NSEG 100000
dense<f32> adat = repeat(dense<f32>::parse("{ 0, 0, 1, 0, 0}"), NSEG, false);
dense<f32> bdat = repeat(dense<f32>::parse("{0.5,0.5,0.5,0.5 }"), NSEG, false);
int flops = ((2 + 1)*2 + (2 + 1)*1)*NSEG;
double ttime = 0.;
list<double> times;
const closure<void (dense<f32> &, const dense<f32> &, const usize &)> tt = capture(tick_tock);

#define ITER 1000
tt(adat, bdat, NSEG);   // untimed warm-up call
for (int i = ITER; i-- > 0; )
{
    double rtime;
    {
        const scoped_timer timer(rtime, scoped_timer::unit_us);
        tt(adat, bdat, NSEG);
    }
    times.push_back(rtime);
    ttime += rtime;
}
times.sort();
times.resize(20);       // keep the 20 shortest timings
times.sort();
for (list<double>::reverse_iterator iter = times.rbegin(); iter != times.rend(); iter++)
    printf("** time = %9.1f us\n", *iter);
double low = *times.begin();
printf("** avg. = %9.1f us, low = %9.1f us, flops = %d, Gfps = %6.3f\n",
    ttime/ITER, low, flops, flops/(low*1e3));

On my lowly Core 2 Duo (Win7) with ARBB_OPT_LEVEL = O2, we get about 2 Gfps, which is quite impressive on a 2GHz machine using one core. FYI, using O3 yields a 50% boost from the 2nd core. Note that NSEG is sized to put the total data, adat & bdat, at 3.6MB, thus under the 4MB L2 cache size.
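The arithmetic behind those figures can be checked with a few lines of plain C++. This sketch assumes 4-byte elements (which is what makes 5 + 4 segments of NSEG elements come to 3.6MB) and timings in microseconds; working_set_bytes, flops_per_call, and gflops are hypothetical helper names.

```cpp
#include <cstddef>

// Bytes held by adat (5 segments) plus bdat (4 segments), NSEG elements per segment.
constexpr std::size_t working_set_bytes(std::size_t nseg, std::size_t segs_a,
                                        std::size_t segs_b, std::size_t elem_bytes) {
    return (segs_a + segs_b) * nseg * elem_bytes;
}

// Same flop count as the post: 2 multiplies + 1 add per output element,
// over 2 segments in the first statement and 1 segment in the second.
constexpr long flops_per_call(long nseg) {
    return ((2 + 1) * 2 + (2 + 1) * 1) * nseg;
}

// Gflop/s from a flop count and a time in microseconds:
// flops / (us * 1e-6 s) / 1e9 == flops / (us * 1e3).
constexpr double gflops(long flops, double time_us) {
    return flops / (time_us * 1e3);
}

static_assert(working_set_bytes(100000, 5, 4, 4) == 3600000, "3.6MB, under a 4MB L2");
static_assert(flops_per_call(100000) == 900000, "900,000 flops per tick_tock() call");
```

By this formula, about 2 Gfps corresponds to roughly 450 us per call, so an 8x slowdown would put one call in the neighborhood of 3.6 ms.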

Now comes the trouble...

Example #1: simulate "unrolling the loop" by duplicating the two statements in tick_tock()

A = replace(A, 0*N, 2*N, 1,...
A = replace(A, 3*N, 1*N, 1,...
A = replace(A, 0*N, 2*N, 1,...
A = replace(A, 3*N, 1*N, 1,...

and the execution time jumps by 8x!

Example #2: iterate within tick_tock() and, note this, just "loop" once

i32 i;
_for(i = 0, i < 1, i++) {
    A = replace(A, 0*N, 2*N, 1,...
    A = replace(A, 3*N, 1*N, 1,...
} _end_for

and the execution time jumps by 6x! Allowing a second pass through the loop gets the jump back to 8x.

However, if we create a global flag
arbb::boolean flag = true;
and enclose the two statements in an always-true if-block

_if(flag) {
    A = replace(A, 0*N, 2*N, 1,...
    A = replace(A, 3*N, 1*N, 1,...
} _end_if

keeping them as a unit, we can repeat the examples above without the unexpected slowing!!! FYI, using O3 and looping several times gets satisfyingly close to a 100% boost from the 2nd core.

Baffled,
- paul

Paul,

Many thanks for sharing your insight with us. It is a very well-constructed test case. Please allow us some time to reproduce and digest your observations. I'll come back to you later. Thanks for your patience.

Zhang

My pleasure. No worries.

I can proceed with this using the dummy if() blocks that seem to keep ArBB "between the lines." I've used this fix-up before -- can't recall the context -- but glad to present it now. My opinion is ArBB "gets ahead of itself" and perhaps creates temporary containers that, combined, overflow L2, or there is some other data contention issue ... but those are just my myths since it's a black box to me!

- paul

BTW, the stuff about keeping a list of execution times is from observing there is a spread of these. It seems the best approach is to run a bunch, sort, and use the shortest time as a "favorable winds" result. In practice, I reduce the timings to 3 significant digits

times.push_back(rtime <= 0. ? 0. :
(int)(rtime*pow(10., 3 - ceil(log10(rtime))) + 0.5)*pow(10., ceil(log10(rtime)) - 3)); // 3 sig. figs.

then find the lowest 3 timings (out of 1,000), ignoring zero (which happens for some shorter executions)

int count = 0;
double ltime = 0.;
for (list<double>::iterator iter = times.begin(); iter != times.end(); iter++)
{
    if (*iter == 0.)
        continue;
    if (*iter == ltime)
    {
        if (++count >= 3)
            break;
        continue;
    }
    if (*iter > ltime)
    {
        ltime = *iter;
        count = 1;
    }
}

I'm sure there's a clever case-statement way of sifting the results, but I'm too lazy to make it.
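One possible way of sifting the results, sketched with only the standard library: round each timing to 3 significant figures with the same formula as above, count occurrences in an ordered map, and take the smallest non-zero value seen at least 3 times. round3 and lowest_repeated are hypothetical names. Unlike the loop above, this version does not need the list to be sorted first, and it returns 0 when no value repeats often enough.

```cpp
#include <cmath>
#include <cstddef>
#include <map>
#include <vector>

// Round a positive timing to 3 significant figures; non-positive values become 0.
double round3(double t) {
    if (t <= 0.) return 0.;
    const double e = std::ceil(std::log10(t));
    return std::floor(t * std::pow(10., 3 - e) + 0.5) * std::pow(10., e - 3);
}

// Smallest non-zero rounded timing that occurs at least `min_count` times,
// or 0 if no timing repeats that often.
double lowest_repeated(const std::vector<double>& raw_us, std::size_t min_count = 3) {
    std::map<double, std::size_t> histogram;   // keys iterate in ascending order
    for (double t : raw_us) {
        const double r = round3(t);
        if (r > 0.) ++histogram[r];            // ignore zero timings
    }
    for (const auto& entry : histogram)
        if (entry.second >= min_count) return entry.first;
    return 0.;
}
```

Feeding it the vector of raw microsecond timings from the benchmark loop gives a single "favorable winds" figure that is less sensitive to one lucky outlier than taking the bare minimum.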

- paul
