TLB, large pages and prefetching

Hi all,

I have a couple of questions regarding the interaction of TLB, large pages and software prefetching:

1) As far as I understood from the documentation, for Nehalem and Sandy Bridge when using 2MB pages, there are only 32 entries available in DTLB1 and second level DTLB2 is not used. Can somebody confirm this?

2) When using 2MB pages, TLB miss handling (i.e. the page walk) is cheaper in 64-bit Linux because the walk traverses 3 levels of page tables instead of 4. Essentially, a page walk requires 3 memory accesses instead of 4 to read the relevant page directory and page table entries. Does this reasoning make sense? Any ideas about the caching issues of page directory and page table entries?

3) Software prefetching becomes much more effective with 2MB pages in comparison to regular 4KB pages. What can be the reason for this observation? I always thought software prefetch instructions do not cross page boundaries, but apparently the "Optimization Reference Manual" page 213 says that "In Intel Core microarchitecture, software PREFETCH instructions can prefetch beyond page boundaries and can perform one-to-four page walks". But there is no information regarding the situation in Nehalem and Sandy Bridge. Any ideas about this?

Thanks,

3) It's hardware prefetch which stops at page boundaries. Software prefetch can cross pages, but you generally tune it to minimize the effect of in-page cache misses. It's generally impractical to prefetch far enough ahead to overcome the TLB miss overhead, so you can expect to see better performance when using a small enough number of huge pages.

Hi Tim,

Thanks for the reply. Can you suggest any reference for detailed information on how software prefetches are handled? I am still confused:

"Optimization Reference Manual" page 213 says:

Software PREFETCH operations work the same way as do load from memory operations, with the following exceptions:
• Software PREFETCH instructions retire after virtual to physical address translation is completed.
• If an exception, such as page fault, is required to prefetch the data, then the software prefetch instruction retires without prefetching data.

On the other hand, page 368 says:

There are cases where a PREFETCH will not perform the data prefetch. These include:
• PREFETCH causes a DTLB (Data Translation Lookaside Buffer) miss. This applies to Pentium 4 processors with CPUID signature corresponding to family 15, model 0, 1, or 2. PREFETCH resolves DTLB misses and fetches data on Pentium 4 processors with CPUID signature corresponding to family 15, model 3.

The second statement is probably only valid for Pentium 4, but where can I find up-to-date information about newer processors (i.e. Nehalem, Sandy Bridge)?

What you said implies that prefetches cause page walks in case of a TLB miss and update the TLB with the new mapping after the page walk. Essentially, a software prefetch fills both the TLB and the caches. Is that correct?

Thanks,

p.s. I'm still looking for comments on questions 1 and 2.

(1) The number and type of DTLB entries for Nehalem/Westmere and Sandy Bridge are described in the Intel SW Optimization Guide (document 248966, I am using revision 026). According to Table 2-9 on page 2-20, you are correct that the DTLB has only 32 entries for 2MB pages and that the STLB has zero entries for 2MB pages.

(2) Using 2MB pages does remove one level of indirection from the hierarchical translation, but the processor contains a cache of higher-level translation entries, so not every TLB miss requires a multi-level table walk. I don't recall seeing any published details on the implementation of this cache, but there might be something out there.

Each level of the hierarchical address translation has the same overall structure -- a 4kB page contains 512 entries, each of which points to a 4kB page in the next lower level of the translation. So each level of the translation accounts for 9 address bits. For 4kB pages, there are 12 untranslated bits and 4*9=36 translated bits, matching the expected 48-bit virtual address range supported by the processor. 2MB pages use only the top three 9-bit translations (27 bits) plus an untranslated 21 bits inside each page. 1GB pages use only the top two 9-bit translations (18 bits) plus an untranslated 30 bits inside each page. 4MB pages are a special case that is not much used any more.
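
To make the bit accounting above concrete, here is a minimal Python sketch that splits a 48-bit virtual address into its 9-bit table indices and the untranslated page offset. The function name and the level labels in LEVEL_NAMES are my own choices for illustration, not something taken from the manuals.

```python
# Sketch: split a 48-bit x86-64 virtual address into 9-bit table indices
# plus the untranslated offset. Names here are illustrative only.

INDEX_BITS = 9                      # 512 entries per 4 kB table -> 9 bits per level
LEVEL_NAMES = ["PML4", "PDPT", "PD", "PT"]


def split_virtual_address(vaddr, page_size_bits=12):
    """Return ({level: index}, offset) for 4 kB (12), 2 MB (21) or 1 GB (30) pages."""
    if page_size_bits not in (12, 21, 30):
        raise ValueError("page_size_bits must be 12, 21 or 30")
    levels = (48 - page_size_bits) // INDEX_BITS   # 4, 3 or 2 translated levels
    offset = vaddr & ((1 << page_size_bits) - 1)   # bits that are not translated
    indices = {}
    for i in range(levels):
        shift = 48 - INDEX_BITS * (i + 1)          # 39, 30, 21, 12
        indices[LEVEL_NAMES[i]] = (vaddr >> shift) & 0x1FF
    return indices, offset


if __name__ == "__main__":
    vaddr = (1 << 39) | (2 << 30) | (3 << 21) | (4 << 12) | 5
    for bits in (12, 21, 30):
        print(bits, split_virtual_address(vaddr, bits))
```

With 2MB pages the "PT" index disappears and its 9 bits become part of the offset; with 1GB pages the "PD" index goes the same way.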

Ignoring the possibility of caching higher-order translations for a moment:
(1) Using 4kB pages allows mapping 256kB (64 entries) with the DTLB and 2MB (512 entries) with the STLB.
(2) Using 2MB pages allows mapping 64MB (32 entries) with the DTLB, while the STLB is unused.
(3) Using 1GB pages allows mapping 4GB (4 entries) with the DTLB, while the STLB is unused.
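
The reach numbers in the list above are just entries times page size. A small Python sketch of that arithmetic, using the entry counts quoted in this post (the helper name tlb_reach is made up for illustration):

```python
# Sketch: TLB "reach" = number of entries * page size.
# Entry counts are the ones quoted in the post above.

KIB, MIB, GIB = 1 << 10, 1 << 20, 1 << 30


def tlb_reach(entries, page_size):
    """Bytes of address space that `entries` TLB entries can map."""
    return entries * page_size


CASES = [
    ("4 kB pages, DTLB, 64 entries", 64, 4 * KIB),
    ("4 kB pages, STLB, 512 entries", 512, 4 * KIB),
    ("2 MB pages, DTLB, 32 entries", 32, 2 * MIB),
    ("1 GB pages, DTLB, 4 entries", 4, 1 * GIB),
]

if __name__ == "__main__":
    for label, entries, page_size in CASES:
        print(f"{label}: {tlb_reach(entries, page_size) // KIB} KiB")
```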

The caching of the upper-order bits in the page translation makes virtual address locality beneficial. For example, a single entry in the top-level translation table provides the mapping for all virtual addresses that match in the upper 9 address bits -- so it covers a 39-bit virtual address range.
If you additionally cache a single entry from the second-level translation table, you can match all addresses that match in the top 18 bits, covering 30 bits of virtual address range (1 GB).
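
As a rough sketch of the same point, the address range covered by one cached entry at each level follows directly from the 12 + 9 + 9 + 9 + 9 bit split. The level numbering (1 = top-level table, 4 = bottom-level page table) is my own convention for this example.

```python
# Sketch: virtual address range covered by a single entry at each level.
# Level 1 is the top-level table, level 4 is the bottom-level page table.


def entry_coverage_bits(level):
    """Number of virtual address bits covered by one entry at `level` (1..4)."""
    if level not in (1, 2, 3, 4):
        raise ValueError("level must be 1, 2, 3 or 4")
    return 12 + 9 * (4 - level)


if __name__ == "__main__":
    for level in (1, 2, 3, 4):
        bits = entry_coverage_bits(level)
        print(f"level {level}: {bits} bits = {1 << bits} bytes")
```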

"Dr. Bandwidth"

I believe that the conditions under which the page table walker uses the hierarchical caching structures are unclear to everyone....

Based on a variety of performance measurements for contiguous (or nearly contiguous) accesses, it is apparent that TLB misses are sufficiently inexpensive that one must conclude that almost all levels of the hierarchical page translation are cached with very high cache hit rates.

In one specific set of tests, the time spent performing table walks was very close to 14 cycles plus the cache hit latency for the level of the cache where the Page Miss Handler found the Page Table Entry.  This used the DTLB_LOAD_MISSES.WALK_DURATION counter to determine the total time required for the table walks and the PAGE_WALKER_LOADS.* performance counter events to show where the PTEs were found.  In this particular set of tests most of the PTEs were found in the L1, so the test is a stronger bound on the (assumed constant) overhead than on the latency for loading the PTEs from each level of the memory hierarchy.

It would require a fair bit of analysis to develop a testing methodology to try to extract the parameters of the caching mechanisms for the upper layers of the hierarchical address translation hardware.  One would probably need to start by carefully directed testing to confirm the published parameters (size & associativity) of the DTLB and STLB, then more testing to understand how the STLB is "shared" between 4KiB and 2MiB translations in the Haswell core.  Once that is all in place, it might be possible to design tests to attempt to overflow the caching mechanisms for the higher levels of the address translation mechanism, monitoring the results by both timing and by DRAM accesses.

"Dr. Bandwidth"

I also read the text as saying that the address translation caching structures only cache the top 3 levels of the translation. Most of the information that is in the PTEs will be serviced from TLB hits, and the remainder should have good spatial locality in the "normal" data cache hierarchy. At higher levels of the translation hierarchy, entries can still be in the "normal" caches (depending on configuration settings that are really confusing to me -- or perhaps it is just the documentation of the configuration settings that is confusing!) but the interval between accesses to adjacent entries (in the same cache line) is large enough that they are unlikely to still be in the cache(s). For example, there are 8 PTEs in a cache line, so an STLB miss will bring in a cache line containing 8 entries and put it in the L1 cache. This is precisely enough to map the 32 KiB of the L1 cache, so you should be able to access the entire L1 Data Cache before you evict the cache line containing the PTEs. This is consistent with my measurement of an L1 hit rate of 87.5% (7/8) for the PAGE_WALKER_LOADS.DTLB_L1 event.
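
The 7/8 figure can be reproduced with a toy model. The Python sketch below assumes a purely sequential sweep over 4 KiB pages, one PTE load per page, 8-byte PTEs, 64-byte cache lines, and that a PTE line stays cached once it has been loaded. Those are simplifying assumptions for illustration, not a description of how the Page Miss Handler really behaves.

```python
# Toy model: sequential sweep over 4 KiB pages, one PTE load per page.
# Assumes 8-byte PTEs, 64-byte lines, and no eviction of PTE lines.

PTE_SIZE = 8        # bytes per page table entry
CACHE_LINE = 64     # bytes per cache line
PAGE_SIZE = 4096    # bytes per small page


def pte_line_hit_rate(num_pages):
    """Fraction of PTE loads that find their cache line already loaded."""
    seen_lines = set()
    hits = 0
    for page in range(num_pages):
        line = (page * PTE_SIZE) // CACHE_LINE   # cache line holding this PTE
        if line in seen_lines:
            hits += 1
        else:
            seen_lines.add(line)                 # first touch: counts as a miss
    return hits / num_pages


if __name__ == "__main__":
    ptes_per_line = CACHE_LINE // PTE_SIZE
    print("PTEs per line:", ptes_per_line)
    print("bytes mapped per PTE line:", ptes_per_line * PAGE_SIZE)
    print("hit rate:", pte_line_hit_rate(1024))
```

One miss per line of 8 PTEs gives the 7/8 hit rate, and the 8 pages mapped by one line add up to 32 KiB.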

My tests only showed a very small number of DTLB_LOAD_MISSES.PDE_CACHE_MISS events.  This is not surprising given the relatively small address range I was using (so no strain on capacity) and the contiguous access patterns (which should not trigger any type of conflict misses). 

We are going to start to monitor DTLB_LOAD_MISSES.WALK_DURATION on some of our production systems at TACC and if we find any significant applications that spend more than ~10% of their time in TLB walks, then I will dig into this further.  My guess is that we won't find any codes that spend more than a few percent of their time in TLB walks, but I have been wrong before....

"Dr. Bandwidth"

The discussion of the meaning of the PCD and PWT bits in the various upper levels of the translation entries is confusing to me, since the text is all full of caveats about the bits having different meanings in different modes of operation.

For example, in Section 2.5 of Volume 3 of the SWDM, the text says that for CR3, the PCD and PWT bits control whether the memory reference that accesses the top-level page table entry (PML4) is cached. The details are in Section 4.9.2, which appears to say that:

  1. If CR4.PCIDE=1 (process context identifiers are enabled), then the type is taken from element 0 of the PAT Table.
  2. If the paging mode is IA32e (64-bit), but CR4.PCIDE is 0 (process context identifiers are not enabled), then the type is taken from element 2*PCD+PWT of the PAT Table, so it could be any of 0,1,2,3

The PAT Table is held in the IA32_PAT MSR (0x277), with the types defined in Tables 11-10, 11-11, and 11-12.

Table 11-12 says that the power-on/reset values of PAT Table entries 0,1,2,3 correspond to WB, WT, UC-, and UC types, respectively.   But I noticed that Linux changes these -- on my systems (Haswell and Sandy Bridge) I see:

rdmsr -p 0 -x -0 0x277
0007010600070106

Reading from right to left, the four entries are:

  • PAT0 = 0x06 = WB (WriteBack)
  • PAT1 = 0x01 = WC (Write Combining)
  • PAT2 = 0x07 = UC-
  • PAT3 = 0x00 = UC

So this says that PML4 entries are always loaded as cached accesses if Process Context Identifiers are enabled (as they are on my Haswell systems). However, if Process Context Identifiers are disabled (as they are on my Sandy Bridge systems), then the memory type used to access the PML4 entry can be PAT0, PAT1, PAT2, or PAT3 (that is, WB, WC, UC-, or UC, respectively), depending on the PCD and PWT bits in CR3.
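
For anyone who wants to redo this decoding on their own system, here is a small Python sketch that splits an IA32_PAT value into its eight entries and applies the index rule from Section 4.9.2 as I read it above. The function names are made up for this example, and the MSR value is the one from my rdmsr output.

```python
# Sketch: decode an IA32_PAT MSR value and pick the PAT index used for the
# PML4 access, following the reading of Section 4.9.2 given in this post.

PAT_TYPES = {0x00: "UC", 0x01: "WC", 0x04: "WT", 0x05: "WP", 0x06: "WB", 0x07: "UC-"}


def decode_pat(msr_value):
    """Return the 8 PAT entries (PAT0 first) as memory-type names."""
    entries = []
    for i in range(8):
        code = (msr_value >> (8 * i)) & 0x07     # low 3 bits of each byte
        entries.append(PAT_TYPES.get(code, "reserved"))
    return entries


def pml4_pat_index(pcide, pcd, pwt):
    """PAT index for the PML4 access: 0 if CR4.PCIDE=1, else 2*PCD+PWT from CR3."""
    return 0 if pcide else 2 * pcd + pwt


if __name__ == "__main__":
    pat = decode_pat(0x0007010600070106)
    print(pat)
    for pcd in (0, 1):
        for pwt in (0, 1):
            print(pcd, pwt, pat[pml4_pat_index(False, pcd, pwt)])
```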

Of course these are subject to being overridden by stricter MTRR values, but since the page tables live in the main part of system memory, I will assume that this won't happen.  

I don't know any easy way to get the contents of the CR3 register for a currently running process so that I can examine the PCD and PWT bits.

The same process is repeated for PDPT, PDE, and PTE entries (based on the PCD and PWT bits in the "next level up" PML4, PDPT, and PDE entries, respectively).   These don't appear to be influenced by enabling Process Context Identifiers.  

So I don't know what is really happening without looking at the PCD and PWT bits in CR3 (on the systems where Process Context Identifiers are disabled) and at the PCD and PWT bits in the PML4 and PDPT entries.   I know that the PCD and PWT bits in the PDE entries must map to the WB memory type because the PAGE_WALKER_LOADS.* event routinely finds most of the PTEs in the caches.  

Putting the higher-level page translation entries in the caches has three effects: (1) it reduces latency if you need to reload the same value before it gets flushed from the cache, (2) it reduces latency if you need to load adjacent values (in the same cache line) before the line gets flushed from the cache, and (3) it displaces a data (or instruction) cache line from the cache.    I would guess that (3) is not a problem.   I would also guess that (1) and (2) are not likely to happen very often for PML4 and PDPT entries if there is a non-stupid specialized cache for these entries.    So we have a combination of not much upside and not much downside in most cases.  There is probably some discussion of this in the Linux kernel, but I find it nearly impossible to work through all the levels of macros and configuration options to get to the actual behavior for my system.

For PDEs, I think caching makes sense -- a cache line contains 8 PDEs, each of which points to a cache line that will hold 8 PTEs, so the cache line holding PDEs helps to map 8*8*4=256KiB.  This is large enough to be useful and small enough that you should be able to access it all before the cache line holding the PDE gets evicted from the L2 cache.

"Dr. Bandwidth"
