TLB, large pages and prefetching

Hi all,

I have a couple of questions regarding the interaction of TLB, large pages and software prefetching:

1) As far as I understand from the documentation, on Nehalem and Sandy Bridge only 32 entries are available in DTLB1 when using 2MB pages, and the second-level DTLB2 is not used. Can somebody confirm this?

2) When using 2MB pages, TLB miss handling (i.e. the page walk) is cheaper in 64-bit Linux because the walk traverses 3 levels of the page table hierarchy instead of 4. Essentially, a page walk requires 3 memory accesses instead of 4 to read the relevant page directory and page table entries. Does this reasoning make sense? Any ideas about how page directory and page table entries are cached?

3) Software prefetching becomes much more effective with 2MB pages than with regular 4KB pages. What could be the reason for this observation? I always thought software prefetch instructions do not cross page boundaries, but apparently the "Optimization Reference Manual" page 213 says that "In Intel Core microarchitecture, software PREFETCH instructions can prefetch beyond page boundaries and can perform one-to-four page walks". But there is no information regarding the situation on Nehalem and Sandy Bridge. Any ideas about this?



3) It's hardware prefetch which stops at page boundaries. Software prefetch can cross pages, but you generally tune it to minimize the effect of in-page cache misses. It's generally impractical to prefetch far enough ahead to overcome the TLB miss overhead, so you can expect to see better performance when using a small enough number of huge pages.

Hi Tim,

Thanks for the reply. Can you suggest any reference with detailed information on how software prefetches are handled? I am still confused:

"Optimization Reference Manual" page 213 says:

Software PREFETCH operations work the same way as do load from memory operations, with the following exceptions:
• Software PREFETCH instructions retire after virtual to physical address translation is completed.
• If an exception, such as page fault, is required to prefetch the data, then the software prefetch instruction retires without prefetching data.

On the other hand, page 368 says:

There are cases where a PREFETCH will not perform the data prefetch. These include:
• PREFETCH causes a DTLB (Data Translation Lookaside Buffer) miss. This applies to Pentium 4 processors with CPUID signature corresponding to family 15, model 0, 1, or 2. PREFETCH resolves DTLB misses and fetches data on Pentium 4 processors with CPUID signature corresponding to family 15, model 3.

The second statement is probably only valid for Pentium 4, but where can I find up-to-date information about newer processors (i.e. Nehalem, Sandy Bridge)?

What you said implies that, on a TLB miss, a prefetch triggers a page walk and the TLB is updated with the new mapping once the walk completes. Essentially, a prefetch brings data into both the TLB and the caches. Is that correct?


P.S. I'm still looking for comments on questions 1 and 2.

(1) The number and type of DTLB entries for Nehalem/Westmere and Sandy Bridge are described in the Intel SW Optimization Guide (document 248966, I am using revision 026). According to Table 2-9 on page 2-20, you are correct that the DTLB has only 32 entries for 2MB pages and that the STLB has zero entries for 2MB pages.

(2) Using 2MB pages does remove one level of indirection from the hierarchical translation, but the processor contains a cache of higher-level translation entries, so not every TLB miss requires a multi-level table walk. I don't recall seeing any published details on the implementation of this cache, though there might be something out there.

Each level of the hierarchical address translation has the same overall structure -- a 4kB page contains 512 entries, each of which points to a 4kB page in the next lower level of the translation. So each level of the translation accounts for 9 address bits. For 4kB pages, there are 12 untranslated bits and 4*9=36 translated bits, matching the expected 48-bit virtual address range supported by the processor. 2MB pages use only the top three 9-bit translations (27 bits) plus an untranslated 21 bits inside each page. 1GB pages use only the top two 9-bit translations (18 bits) plus an untranslated 30 bits inside each page. 4MB pages are a special case that is not much used any more.
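
To make the bit arithmetic concrete, here is a minimal Python sketch of the split described above. The function name split_virtual_address and the constants are made up for illustration, and the sketch assumes the 48-bit, 9-bits-per-level layout stated in this post. It only models the address decomposition, not what the hardware page walker actually does.

```python
# Illustrative only: models the address split, not the hardware page walker.
PAGE_SHIFT = {"4K": 12, "2M": 21, "1G": 30}  # untranslated offset bits per page size
INDEX_BITS = 9    # each 4kB table holds 512 entries, so 9 bits per level
VA_BITS = 48      # virtual address width assumed in the post


def split_virtual_address(va, page_size="4K"):
    """Return (table indices from the top level down, offset inside the page)."""
    if not 0 <= va < (1 << VA_BITS):
        raise ValueError("address outside the 48-bit range")
    shift = PAGE_SHIFT[page_size]
    levels = (VA_BITS - shift) // INDEX_BITS  # 4, 3 or 2 table lookups
    offset = va & ((1 << shift) - 1)
    indices = []
    for level in range(levels):
        # Top-level index sits in bits 47..39, the next in bits 38..30, and so on.
        low_bit = VA_BITS - INDEX_BITS * (level + 1)
        indices.append((va >> low_bit) & ((1 << INDEX_BITS) - 1))
    return indices, offset


if __name__ == "__main__":
    va = 0x7F1234567890  # arbitrary example address
    for size in ("4K", "2M", "1G"):
        indices, offset = split_virtual_address(va, size)
        print(size, len(indices), "levels", indices, hex(offset))
```

The number of indices returned is the number of table levels a full walk would have to read for that page size, which is where the "one level fewer" for 2MB pages comes from.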

Ignoring the possibility of caching higher-order translations for a moment:
(1) Using 4kB pages allows mapping 256kB (64 entries) with the DTLB and 2MB (512 entries) with the STLB.
(2) Using 2MB pages allows mapping 64MB (32 entries) with the DTLB, while the STLB is unused.
(3) Using 1GB pages allows mapping 4GB (4 entries) with the DTLB, while the STLB is unused.
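
The three reach figures above are just entries times page size. A small Python sketch that reproduces them, using only the entry counts quoted in this post (the table and function names are hypothetical, and the counts are not checked against any particular processor here):

```python
KB, MB, GB = 1 << 10, 1 << 20, 1 << 30

# page label -> (page size in bytes, DTLB entries, STLB entries), as quoted in the post
TLB_ENTRIES = {
    "4kB": (4 * KB, 64, 512),
    "2MB": (2 * MB, 32, 0),
    "1GB": (1 * GB, 4, 0),
}


def tlb_reach(page_label):
    """Return (DTLB reach, STLB reach) in bytes for the given page size."""
    page_bytes, dtlb_entries, stlb_entries = TLB_ENTRIES[page_label]
    return page_bytes * dtlb_entries, page_bytes * stlb_entries


if __name__ == "__main__":
    for label in TLB_ENTRIES:
        dtlb_reach, stlb_reach = tlb_reach(label)
        print(f"{label}: DTLB reach {dtlb_reach // KB} kB, STLB reach {stlb_reach // KB} kB")
```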

The caching of the upper-order bits in the page translation makes virtual address locality beneficial. For example, a single entry in the top-level translation table provides the mapping for all virtual addresses that match in the upper 9 address bits -- so it covers a 39-bit virtual address range.
If you additionally cache a single entry from the second-level translation table, you can match all addresses that match in the top 18 bits, covering 30 bits of virtual address range (1 GB).
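
As a rough illustration of that coverage argument, the Python sketch below computes how many virtual address bits one entry at each level spans, and checks whether two addresses would share the same entry. The names are invented for this example, and it says nothing about how the processor's translation cache is really organized, only about the address ranges involved.

```python
def entry_coverage_bits(level, index_bits=9, offset_bits=12, levels=4):
    """Address bits spanned by one entry at `level` (1 = top-level table)."""
    if not 1 <= level <= levels:
        raise ValueError("level out of range")
    return offset_bits + index_bits * (levels - level)


def share_entry(va_a, va_b, level):
    """True if both addresses are translated through the same entry at `level`."""
    bits = entry_coverage_bits(level)
    return (va_a >> bits) == (va_b >> bits)


if __name__ == "__main__":
    for level in range(1, 5):
        bits = entry_coverage_bits(level)
        print(f"level {level}: one entry covers 2**{bits} bytes")
```

With these defaults, level 1 gives the 39-bit range and level 2 the 30-bit (1 GB) range mentioned above.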

John D. McCalpin, PhD "Dr. Bandwidth"
