When the hardware prefetches cache lines on Sandy Bridge, does it prefetch one cache line at a time or two?
I believe Sandy Bridge only prefetches one line at a time, but it may be hard to really observe this. For instance, the processor speculatively executes instructions. If you have a tight loop which walks through an array, even if you turn off the prefetchers, the processor will try to execute instructions ahead of the current miss. So there may be multiple cache misses outstanding regardless of whether prefetch is on or off.
What are you trying to measure or understand, and why? The answers to these questions will help us reply to you better.
There are four hardware prefetchers on systems with Sandy Bridge cores, and they all have different properties. The most detailed descriptions that I know of are in the Intel Optimization Reference Manual (document 248966-028, July 2013), in the section describing data prefetching in the Sandy Bridge microarchitecture. It looks like the two L1 prefetchers only fetch one line at a time, and the "L2 spatial" prefetcher that completes line pairs probably also only fetches one line at a time. For the fourth prefetcher (the "L2 streamer"), the document says:

* "The streamer may issue two prefetch requests on every L2 lookup. The streamer can run up to 20 lines ahead of the load request."

If the streamer can issue two prefetch requests for every L2 lookup, then (if my spreadsheet is correct) it should take 20 loads into a stream to get to the point where the two L2 streamer prefetches are 19 and 20 lines ahead of the current L2 lookup. Since none of the prefetchers on Sandy Bridge cross 4 KiB page boundaries (64 cache lines), this is pretty close to the maximum practical amount of prefetching if you are limited to two prefetches per access and you don't want "holes" in the pattern of prefetched lines.

The section of the Optimization Reference Manual covering Ivy Bridge says that it includes a "next page prefetcher", but there are not many details. Prefetching across 4 KiB page boundaries probably only makes sense for the core prefetcher, since a virtual-to-physical address translation is needed to find out where the next 4 KiB (virtual) page is mapped in the physical address space. Enabling an L1 hardware prefetcher to access the TLB is not too scary. Allowing these "next page prefetches" to trigger page table walks is more problematic:

* It does not take many false positives tying up the page table walker to overcome the performance benefit of helping the prefetches ramp up more quickly.
* On the other hand, *not* allowing "next page prefetches" to trigger page table walks would limit their usefulness to fairly small data set sizes.

The only processors that I know of that allow hardware prefetches across 4 KiB page boundaries are the IBM POWER6 and later processors. These have an L1 hardware prefetcher that can see whether large pages are being used. (This is possible because the page size is a function of the "segment", so the page size info is available to the core in the translation cache entries.) If large pages are being used, the core hardware prefetcher can cross 4 KiB boundaries, but it still cannot cross large page boundaries.
Thanks a lot John. This was very informative.
Thanks Pat. We are moving to Sandy Bridge and I am building up the knowledge base.