With the help of John McCalpin's comments (http://software.intel.com/en-us/forums/topic/456184), I'm finally starting to understand why Sandy Bridge performs as it does on memory benchmarks such as STREAM. As I understand it, each outstanding L1 cache miss occupies a Line Fill Buffer (LFB). Since each Sandy Bridge core has only 10 LFBs, there can be at most 10 memory requests in flight at any time. By Little's Law (Concurrency = Bandwidth x Latency, or equivalently Bandwidth = Concurrency / Latency), if we assume a fixed latency this limit on concurrency puts an upper limit on memory bandwidth.
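Rearranged into the form I'll use for the calculations below, that's:

    Latency = Concurrency / Bandwidth = (bytes in flight) / (bytes per second)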
To test this theory, I started doing measurements on the simplest benchmark I could think of: a single-threaded, single-stream read. To avoid being confused by compiler optimizations, I wrote the inner loop using x64 SSE inline assembly, and verified that the generated code matched my expectation. Once I realized that each load request actually loads two cache lines (128 bytes), the measured read bandwidth of ~16 GB/s made more sense:
128 bytes/LFB * 10 LFBs = 1280 bytes in flight, and 1280 bytes / 16 GB/s ≈ 75 ns latency, which seems plausible.
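For reference, here's a minimal sketch of the kind of inner loop I'm timing, written with SSE intrinsics rather than the raw assembly I actually used (the function name and the XOR accumulation are just illustrative ways to keep the loads from being optimized away):

    #include <stddef.h>
    #include <stdint.h>
    #include <emmintrin.h>  /* SSE2 */

    /* Stream through `len` bytes (a multiple of 64) from 16-byte-aligned `buf`,
     * XOR-ing into an accumulator so the compiler can't drop the loads. */
    __m128i read_stream(const uint8_t *buf, size_t len)
    {
        __m128i acc = _mm_setzero_si128();
        for (size_t i = 0; i < len; i += 64) {
            /* four 16-byte loads = one 64-byte cache line per iteration */
            acc = _mm_xor_si128(acc, _mm_load_si128((const __m128i *)(buf + i)));
            acc = _mm_xor_si128(acc, _mm_load_si128((const __m128i *)(buf + i + 16)));
            acc = _mm_xor_si128(acc, _mm_load_si128((const __m128i *)(buf + i + 32)));
            acc = _mm_xor_si128(acc, _mm_load_si128((const __m128i *)(buf + i + 48)));
        }
        return acc;
    }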
When I reduced the array so that it fit in L3, the measured bandwidth was about twice this, at ~32 GB/s. At first this was confusing, because plugging it into the formula would imply that L3 has a latency of ~35 ns, rather than the ~30 cycles (~8 ns) I expected. But since transfers from the L3 ring bus are 32 bytes rather than 128, I was also able to make this work:
32 bytes * 10 = 320 bytes in flight, and 320 bytes / 32 GB/s ≈ 9 ns, which is close enough to make sense.
Reducing the array further so that it fit in L1, I measured about twice the L3 bandwidth: ~64 GB/s. This also seems to fit, since each SSE vector is 16 bytes and reading a vector from L1 on Sandy Bridge should take about 7 cycles:
16 bytes * 10 = 160 bytes in flight, and 160 bytes / 64 GB/s ≈ 2 ns ≈ 7 cycles, which seems remarkably close.
The fits seem almost too good to be true. I'm left with lots of questions.
First, does this summary seem correct? Am I missing something major, like other limits on concurrency? For example, is 10 the right number of LFBs? Is my assumption about 32-byte L3 transfers correct? Is the L3 -> register bandwidth actually constrained by the number of stops on the L3 ring bus rather than by the LFBs? And is it really the case that although you can issue two vector loads per cycle, you'll never be able to sustain 32 bytes/cycle on Sandy Bridge even if the data is all in L1?
Then, my next questions are about how the LFBs are used. Is it correct that they are not consumed by hardware prefetches? That they are used for all L1 cache misses, and not just L3 cache misses? Are the LFBs also consumed by software prefetches, or can software prefetches be used to skirt the concurrency limit (a rough test for this is sketched below)? Are there any other approaches that can be used to get data from RAM to L3, or L3 to RAM, without using up these buffers?
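Since I don't know the answer to the software-prefetch question, the test I have in mind is the loop above with an explicit prefetch some distance ahead: if software prefetches don't occupy LFBs, sweeping the distance should let measured bandwidth climb past what 10 buffers allow. The 512-byte distance is an arbitrary starting point, not a recommendation:

    #include <stddef.h>
    #include <stdint.h>
    #include <emmintrin.h>  /* SSE2 */
    #include <xmmintrin.h>  /* SSE, for _mm_prefetch */

    #define PREFETCH_AHEAD 512  /* bytes ahead; arbitrary, worth sweeping */

    /* Same streaming read as the sketch above, plus a software prefetch.
     * Prefetching a little past the end of the buffer is architecturally
     * harmless, since prefetch doesn't fault. */
    __m128i read_stream_prefetch(const uint8_t *buf, size_t len)
    {
        __m128i acc = _mm_setzero_si128();
        for (size_t i = 0; i < len; i += 64) {
            _mm_prefetch((const char *)(buf + i + PREFETCH_AHEAD), _MM_HINT_T0);
            acc = _mm_xor_si128(acc, _mm_load_si128((const __m128i *)(buf + i)));
            acc = _mm_xor_si128(acc, _mm_load_si128((const __m128i *)(buf + i + 16)));
            acc = _mm_xor_si128(acc, _mm_load_si128((const __m128i *)(buf + i + 32)));
            acc = _mm_xor_si128(acc, _mm_load_si128((const __m128i *)(buf + i + 48)));
        }
        return acc;
    }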
Finally, is there an ordering of RAM accesses that would reduce the latency further? On John's blog (http://blogs.utexas.edu/jdm4372/2010/11/09/optimizing-amd-opteron-memory...) he describes trying to read from already-open RAM pages, but I'm not sure how much of that is AMD-specific. Is there an ordering for Intel that maximizes open-page accesses? The best information I've found for Intel is also from John (http://software.intel.com/en-us/forums/topic/393131), but I'm not sure how to apply it. The experiment I have in mind is sketched below.
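To make that last question concrete: I'd read the same buffer once sequentially (consecutive lines should mostly hit already-open DRAM pages) and once in an order that jumps to a different 4 KB region on every load (which should force a page open each time), and compare. Both the 4 KB region size and the assumption that contiguous virtual addresses stay within one DRAM page are guesses on my part:

    #include <stddef.h>
    #include <stdint.h>

    /* 4 KB is a guess at the DRAM page size; virtual contiguity may not
     * imply physical contiguity, so treat any result with suspicion. */
    #define REGION 4096

    /* For each 64-byte offset within a region, walk across all regions,
     * so consecutive loads land in different (presumably closed) pages.
     * Every cache line of `buf` is still touched exactly once. */
    uint64_t read_page_hostile(const uint8_t *buf, size_t len)
    {
        uint64_t acc = 0;
        size_t nregions = len / REGION;
        for (size_t off = 0; off < REGION; off += 64)
            for (size_t r = 0; r < nregions; r++)
                acc ^= *(const volatile uint64_t *)(buf + r * REGION + off);
        return acc;
    }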