By Arch D. Robison (Intel), Added

I recently updated my video game Frequon Invaders. It's a free download from http://home.comcast.net/~arch.robison/frequon.html , which is strictly my own product, not Intel's. In doing the update, I optimized it for Intel® Core™2 Duo processor, and ran into a tale of dependence breaking that I'll tell here.

Frequon Invaders has a very unusual display for a video game - a Discrete Fourier Transform (DFT). For that matter, not many video games explicitly involve complex numbers either. The DFT consumes most of the cycles and thus has critical impact on the game's frame rate. Here's a sketch of the DFT:

foreach scanline {

array sum = all zeros;

foreach frequon {

complex s = ...;

complex v =...;

for( i=0; i<pixelsPerScanline; ++i ) {

sum[i] += s;

s *= v;

}

}

color-map sum into the scanline

}

Because the += and *= are on complex numbers, each inner loop iteration requires 4 real additions and 4 real multiplications. There are 1-14 frequons. Screens usually have ~1000 scanlines and ~1000-2000 pixels per scanline. Choosing the pixel loop as the inner loop has the advantage of permitting a long inner loop with excellent cache reuse. It essentially uses the L1 cache to accumulate the array sum.

With the scalar code, the frame rate is annoyingly slow. The first step for speeding it up is to vectorize it with SSE. But note the *= in the inner loop that induces a dependence between loop iterations. The dependence prevents vectorization. Taking a mathematical look, we see that successive iterations are summing s, sv, sv^{2}, sv^{3}, sv^{4},.... We can't break the dependence, but we can stretch it by change s to a 4-element vector that is initially (s, sv, sv^{2}, sv^{3}), and changing v to be a 4-element vector v4 that consists of (v^{4},v^{4},v^{4},v^{4}). The inner loop now looks like this:

for( i=0; i<pixelsPerScanline; i+=4 ) {

sum[i..i+3] += s4

s4 *= v4;

}

For simplicity, I assume pixelsPerScanline is a multiple of 4. Each complex 4-element vector is stored with the real parts in one SSE register and the imaginary parts in another SSE register. On my machine, the vectorized version runs ~4.4x faster than the scalar code. I don't know where the superlinear extra .4 came from; I must have tickled the microarchitecture in a lucky way.

But I was nonetheless depressed, because the total speed was only 4.9 Gigaflop on a single core, which works out to a ~2.6 flops per clock. In theory, the Intel® Core™ microarchitecture can deliver up to 8 flops per clock (a single-precision SSE add and mul per clock). See the Intel® 64 and IA-32 Architectures Optimization Reference Manual. The information is mysteriously buried in Table 2-2 of Section 2.1.3.1. I would have thought it would be in Appendix C where the timing info for other processors is.

I started puzzling over why I came up so short of 8 flops per clock, particularly since I was hand-writing assembly code. Then it hit me: I had forgotten about instruction parallelism. The line "s4*=v4" has a latency of 10 clocks, because it requires four 4-cycle multiplies and two 3-cycle additions. They can overlap to a large degree, but if you play with it, you find there is a 10 clock latency that instruction scheduling cannot hide.

The solution is to stretch the dependence further, by computing 16 elements at a time, so that there are enough floating-point operations to keep the machine busy while the *= is being computed. I did this by creating vectors v8 = (v^{8},v^{8},v^{8},v^{8}), v12=(v^{12},v^{12},v^{12},v^{12}) and v16=(v^{16},v^{16},v^{16},v^{16}). Now the code looks like (assuming pixelsPerScanline is a multiple of 16):

for( i=0; i<pixelsPerScanline; i+=16 ) {

sum[i..i+3] += s4

sum[i+4..i+7] += s4*v4;

sum[i+8..i+11] += s4*v8;

sum[i+12..i+15] += s4*v12.

s4 *= v16;

}

The performance more than doubled, yielding ~10.4 Gigaflop on a single core. That's ~5.6 flops per clock. I played around some more, but couldn't get significantly closer to the 8 flops per clock limit; the need for register copying becomes a limiting factor.

The result is that in the worst case (14 frequons on my 1680x1050 wide screen display), I'm still well over my 30 frames-per-sec target. So even though it would be trivial, I haven't needed to take advantage of multiple cores for this game. Vector parallelism and instruction parallelism were enough.

## Comments (2)

TopAnonymous said on

Hmm....Certain things are so non-trivial about Intel...You just cant cut it easily...

Anyway, Your blog was a good read! Thanks!

Anonymous said on

Effective use of parallelism ALWAYS involves making sure that there's enough work to hide the latency/overhead. This basic principle is the essential nature of selecting the appropriate "grainsize" in your other blog entry. Compilers do it for us to a large degree and the Intel program of trying to improve the parallel programing tool environment is an attempt to extend that to make it easier to beneift from multiple core technologies. But sometimes it takes very careful analysis and a reasonably deep understanding to expose where the "work" and "overhead" lie to allow "best" performance.

## Add a Comment

Top(For technical discussions visit our developer forums. For site or software product issues contact support.)

Please sign in to add a comment. Not a member? Join today