I have a question about the peak FP performance of my Core i7
920.
I have an application that does a lot of MAC operations (basically a
convolution), and even with multi-threading and SSE instructions it
falls short of the peak FP performance of the CPU by a factor of ~8x.
While trying to find the reason for this, I ended up with a
simplified code snippet that runs on a single thread, does not use SSE
instructions, and performs equally badly:
{
data[i] += other_data[i] * other_data2[i];
}
If I'm correct (data, other_data and other_data2 are all FP arrays, and the loop runs 49335264 iterations), this piece of code requires:
49335264 * 2 = 98670528 FLOPs
It executes in ~150 ms (I'm very sure this timing is correct, since C
timers and the Intel VTune Profiler give me the same result).
This means the performance of this code snippet is:
98670528 / (150 * 10^-3) / 10^9 = 0.66 GFLOP/s
whereas the peak performance of this CPU should be 2 * 3.2 = 6.4 GFLOP/s (2 FP units, 3.2 GHz processor), right?
Is there any explanation for this huge gap? I first thought the application might be memory limited, but the numbers don't seem to support that:
The peak stream bandwidth of my CPU is ~16.4 GB/s,
right? So let's say every iteration requires 3 FP reads and 1 FP write,
i.e. 4 accesses of 4 bytes each, or 16 bytes of bandwidth. This would require
49335264 * 16 = 789364224 bytes of traffic to main memory for the entire
loop (assuming nothing is cached), which runs in ~150 ms. This would
mean I use 789364224 / (150 * 10^-3) / 10^9 = 5.26 GB/s. So I would say I
don't hit this bandwidth ceiling?
I also tried changing the operation within the loop to "data[i] += 2.0 * 5.0" (so the two other arrays are no longer read) to test whether this would improve the performance, but this yields the exact same performance.
Thanks a lot in advance, I could really use your help!


