When the hardware prefetches cache lines on Sandy Bridge, does it prefetch one cache line at a time or two?
I believe Sandy Bridge only prefetches one line at a time, but this may be hard to observe directly. For instance, the processor speculatively executes instructions. If you have a tight loop that walks through an array, the processor will try to execute instructions ahead even if you turn off the prefetchers. So there may be multiple cache misses outstanding regardless of whether prefetching is on or off.
What are you trying to measure or understand and why? The answer to these questions will help us better reply to you.
There are four hardware prefetchers on systems with Sandy Bridge cores, and they all have different properties. The most detailed descriptions that I know of are in the Intel Optimization Reference Manual (document 248966-028, July 2013). Section 188.8.131.52 describes the prefetchers in the Sandy Bridge architecture.
It looks like the two L1 prefetchers only fetch one line at a time, and the "L2 spatial" prefetcher that completes line pairs probably also only fetches one line at a time. For the fourth prefetcher (the "L2 streamer"), the document says:
* "The streamer may issue two prefetch requests on every L2 lookup. The streamer can run up to 20 lines ahead of the load request."
If the streamer can issue two prefetch requests for every L2 lookup, then if my spreadsheet is correct it should take 20 loads into a stream to get to the point where the two L2 streamer prefetches are 19 and 20 lines ahead of the current L2 lookup. Since none of the prefetchers on Sandy Bridge cross 4 KiB page boundaries (64 cache lines), this is pretty close to the maximum practical amount of prefetching if you are limited to two prefetches per access and you don't want "holes" in the pattern of prefetched lines.
Section 184.108.40.206 of the Optimization Reference Manual says that Ivy Bridge includes a "next page prefetcher", but there are not many details. Prefetching across 4 KiB page boundaries probably only makes sense for the core prefetcher, since a virtual to physical address translation is needed to find out where the next 4 KiB (virtual) page is mapped in the physical address space. Enabling an L1 hardware prefetcher to access the TLB is not too scary. Allowing these "next page prefetches" to trigger page table walks is more problematic:
* It does not take many false positives tying up the page table walker to overcome the performance benefit of helping the prefetches ramp up more quickly.
* On the other hand, *not* allowing "next page prefetches" to trigger page table walks would limit their usefulness to fairly small data set sizes.
The only processors that I know of that allow hardware prefetches across 4 KiB page boundaries are the IBM Power6 and later processors. These have an L1 hardware prefetcher that can see if large pages are being used. (This is possible because the page size is a function of the "segment", so the page size info is available to the core in the translation cache entries.) If large pages are being used, the core hardware prefetcher can cross 4 KiB boundaries, but it still cannot cross large page boundaries.
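The streamer ramp-up estimate in this post (20 loads to get 19 and 20 lines ahead) can be reproduced with a toy model. This is only a sketch under stated assumptions, not a description of the real hardware: it assumes the first load of a stream merely starts tracking it, every later L2 lookup issues up to two prefetches for the next lines not yet prefetched, and prefetches stop at the 20-line distance limit and at the end of the 4 KiB page. The function names and the `first_trigger` parameter are made up for illustration.

```python
LINES_PER_PAGE = 64        # 4 KiB page / 64-byte cache lines
MAX_DISTANCE = 20          # "up to 20 lines ahead of the load request"
PREFETCHES_PER_LOOKUP = 2  # "two prefetch requests on every L2 lookup"


def simulate_streamer(num_loads, first_trigger=2):
    """Simulate sequential loads to lines 0, 1, 2, ... within one page.

    Returns a list of (load_number, demand_line, prefetched_lines).
    ASSUMPTION: prefetching starts at load number `first_trigger`.
    """
    frontier = None  # highest line prefetched so far
    history = []
    for n in range(1, num_loads + 1):
        demand = n - 1
        issued = []
        if n >= first_trigger:
            if frontier is None or frontier < demand:
                frontier = demand
            for _ in range(PREFETCHES_PER_LOOKUP):
                nxt = frontier + 1
                # Stop at the page boundary or the distance limit.
                if nxt >= LINES_PER_PAGE or nxt - demand > MAX_DISTANCE:
                    break
                issued.append(nxt)
                frontier = nxt
        history.append((n, demand, issued))
    return history


def loads_to_reach_full_distance():
    """First load number whose last prefetch is MAX_DISTANCE lines ahead."""
    for n, demand, issued in simulate_streamer(LINES_PER_PAGE):
        if issued and issued[-1] - demand == MAX_DISTANCE:
            return n
    return None


if __name__ == "__main__":
    for n, demand, issued in simulate_streamer(22):
        print(n, demand, issued)
```

Under these assumptions the prefetch distance grows by one line per load, because each load consumes one line while two new ones are requested. Changing `first_trigger` shifts the result by the same amount, so the exact count depends on how many accesses the real streamer needs before it starts prefetching.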
Thanks a lot John. This was very informative.
Thanks Pat. We are moving to Sandy Bridge systems and I am building up our knowledge base.
We have an Intel Xeon E5-2697 v2 machine, based on the Ivy Bridge microarchitecture. Can someone tell me how I can disable/enable the hardware prefetchers on it?
The BIOS often offers a setting to disable or enable hardware prefetchers.
If your BIOS does not have such settings, or if you don't want to reboot your system, the hardware prefetchers can also be manipulated via model-specific registers (MSRs). In particular, with MSR 0x1A4 you can disable prefetchers by setting bit 0 for the L2 streamer, bit 1 for the L2 adjacent cache line prefetcher, bit 2 for the L1 (DCU) streamer prefetcher, and bit 3 for the L1 (DCU) IP prefetcher. Details can be found in this article, but essentially you need to set the corresponding bits in this register to disable the prefetchers.
The most convenient way to read and write MSRs is with msr-tools (the rdmsr and wrmsr utilities).
Please note that MSRs are specific to cores. So, you need to disable prefetching on all cores or otherwise make sure that you run your test program on the core where you have disabled the prefetcher.
Don't forget to enable the prefetchers again before doing other performance tests. :)
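As a rough illustration of the MSR 0x1A4 bit layout described above, here is a minimal sketch that builds and decodes the disable mask. The dictionary keys and function names are made up for this example, and bits 2 and 3 are taken to be the two L1 (DCU) prefetchers. The read/write helpers assume a Linux system with the `msr` kernel module loaded and root privileges, which exposes each core's MSRs as `/dev/cpu/<n>/msr`; they are shown for completeness and are not called here.

```python
import os
import struct

MSR_PREFETCH_CONTROL = 0x1A4

# Bit positions in MSR 0x1A4; a set bit DISABLES that prefetcher.
# The key names are illustrative labels, not official identifiers.
PREFETCHER_BITS = {
    "l2_streamer": 0,
    "l2_adjacent_line": 1,
    "l1_streamer": 2,
    "l1_ip": 3,
}


def disable_mask(names):
    """Return the MSR value that disables the named prefetchers."""
    mask = 0
    for name in names:
        mask |= 1 << PREFETCHER_BITS[name]
    return mask


def disabled_prefetchers(value):
    """Decode an MSR value into the list of disabled prefetchers."""
    return [name for name, bit in PREFETCHER_BITS.items()
            if value & (1 << bit)]


def read_msr(cpu, reg=MSR_PREFETCH_CONTROL):
    """Read a 64-bit MSR on one logical CPU (Linux msr driver, root)."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_RDONLY)
    try:
        return struct.unpack("<Q", os.pread(fd, 8, reg))[0]
    finally:
        os.close(fd)


def write_msr(cpu, value, reg=MSR_PREFETCH_CONTROL):
    """Write a 64-bit MSR on one logical CPU (Linux msr driver, root)."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_WRONLY)
    try:
        os.pwrite(fd, struct.pack("<Q", value), reg)
    finally:
        os.close(fd)


def write_msr_all_cpus(value, reg=MSR_PREFETCH_CONTROL):
    """Apply the same value on every logical CPU, since MSRs are per core."""
    for cpu in range(os.cpu_count()):
        write_msr(cpu, value, reg)
```

For example, `disable_mask(PREFETCHER_BITS)` gives 0xF (all four prefetchers off) and writing 0 turns them all back on. A more careful version would read the current value first and only change the low four bits.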
If you have a tight or small loop of up to (IIRC) 52 uops, then the LSD (Loop Stream Detector) will hold this loop entirely in its buffer, though of course that covers only the instructions and not the data touched by the loop's memory loads. Bear in mind that the HW prefetchers can probably follow linear memory accesses. I am not sure how a prefetcher will behave when the array index is randomly calculated.
If the array index is random, the "spatial" or adjacent-line prefetcher would still start up and bring in the paired cache line for reads. This may unnecessarily saturate memory bandwidth or cause cache capacity evictions when all hyperthreads are in use. The "streaming" prefetcher would rarely trigger accidentally if the indexing is truly random; if the indexing is not random, it may be beneficial, although it will bring in some cache lines that are never reached because the loop terminates or the sequence of accesses that triggered it ends. We have seen applications (limited to 1 thread per physical core) prefetch 30% more cache lines than are actually used and still get a big benefit from streaming prefetch.
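To get a feel for the extra traffic the adjacent-line prefetcher can add under random indexing, here is a small sketch. It is a toy model with loud assumptions: every demanded line also pulls in the other line of its 128-byte aligned pair, nothing is ever evicted, and the streamer never triggers. The function names are made up for illustration.

```python
import random


def buddy_line(line):
    """Other cache line of the same 128-byte aligned pair."""
    return line ^ 1


def traffic_with_pair_completion(num_lines, num_accesses, seed=0):
    """Count lines demanded vs. lines fetched when every miss is paired.

    ASSUMPTION: infinite cache, uniform random indexing, and the paired
    line is always fetched along with the demanded one.
    Returns (lines_demanded, lines_fetched).
    """
    rng = random.Random(seed)
    demanded = set()
    fetched = set()
    for _ in range(num_accesses):
        line = rng.randrange(num_lines)
        demanded.add(line)
        fetched.add(line)
        fetched.add(buddy_line(line))
    return len(demanded), len(fetched)


if __name__ == "__main__":
    used, moved = traffic_with_pair_completion(1 << 20, 1000)
    print(f"demanded {used} lines, fetched {moved} lines")
```

With sparse random accesses over a large array, almost every paired line is wasted, so the fetched count approaches twice the demanded count. When the accesses eventually cover the whole array, the paired lines get used anyway and the overhead disappears.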
Apparently, what are now called "streamers" (based on detected strides within a page) used to be called hardware prefetch, and what are now called "spatial" used to be called adjacent or 2nd sector and distinguished from "hardware prefetch."
I haven't seen it discussed expertly, but I suppose the spatial prefetch becomes useless at some point for data sequences which trigger streamers, but may have helped in the startup phase and after crossing to a new page (if the stride is not too large).
Sandy Bridge combined Loop Stream Detection with uop caching which seems to smooth over obvious jumps up and down in performance depending on loop unrolling in earlier CPUs. The LSD limit is big enough now to cover unroll by 4 for the simplest loops. I see the 2016 beta Vector Advisor complaining sometimes about too much unrolling on certain loops but no evidence that it actually hurts, other than when it contributes to spending significant time in the remainder loop (which Advisor also tells you about).
I haven't determined whether the occasional performance improvements from increasing data alignment to 32 bytes (even when not using instructions that want more than 16-byte alignment) are associated with prefetcher action. It's possible that 32-byte alignment reduces the number of prefetched cache lines.