I have a couple of questions regarding the interaction of TLB, large pages and software prefetching:
1) As far as I understood from the documentation, for Nehalem and Sandy Bridge when using 2MB pages, there are only 32 entries available in DTLB1 and second level DTLB2 is not used. Can somebody confirm this?
2) When using 2MB pages, TLB miss handling cost, i.e. page walks are cheaper in 64-bit Linux due to 3-level page table directories instead of 4-level. Essentially, page walks require 3 times memory access instead 4 to read relevant page directory and page table entries. Does this reasoning make sense? Any ideas about the caching issues of page directory and page table entries?
3) Software prefetching becomes much more effective with 2MB pages in comparison to 4KB regular pages. What can be the reason for this observation? I always thought software prefetch instructions does not cross page boundaries but apparently the "Optimization Reference Manual" page 213 says that "In Intel Core microarchitecture, software PREFETCH instructions can prefetch beyond page boundaries and can perform one-to-four page walks". But there is no information regarding the situation in Nehalem and Sandy Bridge. Any ideas about this?