Urgent: How are memory pages assigned to CPU cache sets?

Hi
I’m looking at how memory pages are assigned to CPU cache sets. I know that blocks of RAM are assigned to certain lines within the cache sets, but I want to understand the mechanism that maps pages to the cache. For example, when a process owns a number of memory pages, how do those pages interact with the CPU cache?
Regards
Younis

Taylor Kidd (Intel):

Younis,

Is your question about the Intel Xeon Phi coprocessor or just in general?

If about the coprocessor, I would appreciate a little more detail as to the context of your question.

Regards
--
Taylor
 

Hi Taylor,

My question is general; I want to know how memory pages are mapped to CPU cache sets.

Regards,

Younis

John D. McCalpin:
Best Reply

The question is only barely relevant to the Xeon Phi because of the hybrid cache coherence protocol and the remapping of addresses required to make room for the ECC bits.

The L1 caches use a very standard 32 KiB, 8-way associative configuration with 64 Byte cache lines.  The cache is therefore 4KiB "tall" and 8 "ways" wide, so that the 64 cache lines in each 4KiB virtual memory page are mapped 1:1 to the 64 congruence classes of the cache.  Although some low-level details are probably unique, this cache configuration is usually referred to as "virtually indexed, physically tagged".
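
A minimal sketch of that index arithmetic (in C), using just the 32 KiB / 8-way / 64-Byte numbers above; nothing here is Xeon Phi specific:

    #include <stdint.h>
    #include <stdio.h>

    /* 32 KiB, 8-way, 64 Byte lines => 32768 / 64 / 8 = 64 congruence classes.
     * With 4 KiB pages the set index comes entirely from page-offset bits,
     * so the virtual and physical addresses give the same answer.          */
    #define LINE_SIZE 64u
    #define L1_SIZE   (32u * 1024u)
    #define L1_WAYS   8u
    #define L1_SETS   (L1_SIZE / LINE_SIZE / L1_WAYS)   /* = 64 */

    static unsigned l1_set(uintptr_t addr)
    {
        return (addr / LINE_SIZE) % L1_SETS;            /* address bits 6..11 */
    }

    int main(void)
    {
        /* The 64 cache lines of one 4 KiB page hit each of the 64 sets once. */
        for (uintptr_t off = 0; off < 4096; off += LINE_SIZE)
            printf("page offset %4u -> L1 set %2u\n",
                   (unsigned)off, l1_set(off));
        return 0;
    }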

Beyond the L1 cache only physical addresses are used.  The Xeon Phi Software Developer's Guide (document 328207, revision 002, June 2013) says that the L2 cache is 512 KiB, 8-way associative (page 18).  512 KiB is 8 Ki cache lines, or 1024 congruence classes of 8 cache lines each.   This is the equivalent of being 16 pages "tall" and 8 "ways" wide, though it is important to remember that the virtual to physical address mapping means that 4 KiB virtual memory pages can be mapped to any of these 16 "slots" ("slot" == a page-aligned set of 64 congruence classes), so contiguous 4 KiB virtual pages are not typically mapped to contiguous slots.  With 2 MiB pages you are guaranteed that contiguous virtual addresses map to contiguous congruence classes, and accessing every cache line in a 2 MiB page is guaranteed to touch each congruence class exactly 32 times (i.e., exactly 4 passes through the entire cache).
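
A sketch of the same arithmetic for the L2 (same caveat: just the numbers quoted above; the "slot" here is the page-aligned group of 64 congruence classes described in the paragraph):

    #include <stdint.h>
    #include <stdio.h>

    /* 512 KiB, 8-way, 64 Byte lines => 1024 congruence classes, i.e. 16
     * page-sized "slots" of 64 classes each.  The index is taken from the
     * physical address, so which slot a 4 KiB virtual page lands in depends
     * on the physical page frame the OS happened to give it.              */
    #define LINE_SIZE 64u
    #define PAGE_SIZE 4096u
    #define L2_SIZE   (512u * 1024u)
    #define L2_WAYS   8u
    #define L2_SETS   (L2_SIZE / LINE_SIZE / L2_WAYS)    /* = 1024 */
    #define L2_SLOTS  (L2_SETS * LINE_SIZE / PAGE_SIZE)  /* = 16   */

    static unsigned l2_set(uint64_t paddr)  { return (paddr / LINE_SIZE) % L2_SETS; }
    static unsigned l2_slot(uint64_t paddr) { return (paddr / PAGE_SIZE) % L2_SLOTS; }

    int main(void)
    {
        /* Walking all 32768 lines of a (physically contiguous) 2 MiB page
         * touches each of the 1024 congruence classes exactly 32 times.   */
        unsigned touches[L2_SETS] = { 0 };
        for (uint64_t paddr = 0; paddr < 2u * 1024 * 1024; paddr += LINE_SIZE)
            touches[l2_set(paddr)]++;
        printf("class 0 touched %u times; slot of frame 0x5000 = %u\n",
               touches[0], l2_slot(0x5000));
        return 0;
    }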

Beyond the L2, the system is quite different from standard designs:

  • Every (physical) cache line address is mapped to one of 64 Distributed Tag Directories (DTDs) that are distributed around the Xeon Phi ring.  (Most are co-located with processors, but there are a few standalone DTDs to bring the total to 64.)   

    • The mapping of physical address to DTD is not documented, but experiments have shown that the 64 cache lines in each page map to DTDs all around the ring.  (I have not yet been able to decide if the mapping is 1:1 -- more experiments are needed.)
    • There is some evidence that the mapping is different for different 4KiB pages, but again more experiments are needed to get a clear picture.  (It can be helpful to use a different mapping for each page, since the use of page-aligned structures means that the "0" offset on each page is used significantly more frequently than the highest addresses in the page.  If the zero offset cache line of every 4KiB page mapped to the same DTD, it could become a significant hot spot.)  A purely illustrative sketch of such a per-page mapping follows this list.
  • On an L2 cache miss, the request for the cache line is sent from the core to the appropriate DTD for that physical address.
    • The protocol is discussed on pages 18-21 of the Xeon Phi SW Developer's Guide.
    • If the data is present in one or more L2 caches, the DTD is responsible for creating the appropriate coherence requests -- such as invalidations or various types of interventions, etc.
    • If the data is not present in any L2 caches, the DTD remaps the physical address to a DRAM address and sends the request to the appropriate memory controller.
  • Because of the variable distance between a core and the DTD used for coherence control, cache-to-cache intervention latency can vary by a factor of up to three between adjacent cores depending on the physical address used.  Optimum selection of addresses for synchronization constructs could significantly improve the performance of barriers and collective operations.

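The DTD hash itself is not documented, so the following is purely illustrative and is not the real mapping: it just shows what a 1:1, per-page-perturbed assignment of the 64 lines in a 4 KiB page to the 64 DTDs could look like, to make the "hot spot" argument above concrete.

    #include <stdint.h>

    #define LINE_SIZE 64u
    #define N_DTDS    64u

    /* Hypothetical only -- the real hash is undocumented.  XOR-ing the
     * line-within-page index with some page-frame bits keeps the mapping
     * 1:1 within each page, while sending the offset-0 line of different
     * pages to different DTDs instead of piling them onto one directory. */
    static unsigned hypothetical_dtd(uint64_t paddr)
    {
        unsigned line_in_page = (unsigned)(paddr / LINE_SIZE) % 64u; /* bits 6..11 */
        unsigned page_frame   = (unsigned)(paddr >> 12);             /* bits 12+   */
        return (line_in_page ^ (page_frame % N_DTDS)) % N_DTDS;
    }
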
The mapping of physical addresses to Memory Controller/Channel/Bank Group/Sub-Bank/Row/Column is non-trivial.  The specifics of the mapping are not documented, but it is possible to observe some characteristics.  These notes apply to the Xeon Phi SE10P with ECC enabled -- that is the only supported configuration for the 6880 Xeon Phi SE10P Coprocessors at TACC.

  • The bits used for error detection and correction are stored in the normal memory space, so physical addresses must be remapped to avoid whatever DRAM addresses are used to hold this data.

    • It is clear from several places (including /proc/meminfo) that exactly 1/32 of the memory is reserved for ECC bits.

      • Although this is a much lower overhead than the usual 1/8 used in DIMMs, it is enough to support SECDED (Single Error Correction, Double-Error Detection) if you operate on the entire cache line, rather than operating on one 64-bit transfer at a time (as is done with standard DIMM-based memory).
  • Using the Xeon Phi memory controller read and write counters, I was able to determine that contiguous physical addresses are assigned to DRAM channels in 62-cache-line blocks. 
    • The remaining 2 cache lines per 4KiB DRAM page are presumably used to hold the ECC bits -- this matches the 1/32 overhead expected.
  • Consecutive 62-cache-line blocks are mapped using a fixed permutation to the 16 DRAM channels, repeating that permutation all the way from physical address 0 to the top of memory.  (A sketch of this block-to-channel arithmetic follows this list.)
    • I have not (yet) been able to determine the bank and sub-bank mapping, but presumably a set of 256 62-cache-line-blocks will map to all 16 banks/sub-banks on each of the 16 DRAM channels.

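To make the last two bullets concrete, here is the block-to-channel arithmetic they imply, in C.  The real 16-entry permutation is not documented, so the perm[] table below is only a placeholder, not measured data:

    #include <stdint.h>
    #include <stdio.h>

    #define LINE_SIZE   64u
    #define BLOCK_LINES 62u                        /* usable lines per 4 KiB DRAM page */
    #define BLOCK_BYTES (BLOCK_LINES * LINE_SIZE)  /* 3968 Bytes                       */
    #define N_CHANNELS  16u

    /* Placeholder permutation -- the actual fixed permutation is undocumented. */
    static const unsigned perm[N_CHANNELS] =
        { 0, 5, 10, 15, 4, 9, 14, 3, 8, 13, 2, 7, 12, 1, 6, 11 };

    static unsigned channel_of(uint64_t paddr)
    {
        uint64_t block = paddr / BLOCK_BYTES;       /* which 62-line block          */
        return perm[block % N_CHANNELS];            /* repeating fixed permutation  */
    }

    int main(void)
    {
        /* Since 4096 > 3968, every 4 KiB page straddles a block boundary,
         * so its first and last cache lines land on two different channels. */
        uint64_t page = 10u * 4096u;
        printf("first line -> channel %u, last line -> channel %u\n",
               channel_of(page), channel_of(page + 4096u - LINE_SIZE));
        return 0;
    }
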
From the point of view of the user, this means that each 4KiB page maps to at least 2 DRAM pages, with 2 to 62 cache lines at the beginning of the page mapping to one DRAM page and the remaining 62 to 2 cache lines mapped to a different DRAM page.  Since the virtual to physical address mapping is fixed within a 4KiB page, the physical addresses at this DRAM page crossing are contiguous, which guarantees that the 2 DRAM pages used are on different DRAM channels.   Going from one 4KiB virtual page to the next, however, results in a pseudo-random change to the higher-order address bits, so there are no constraints on the DRAM page transition.  One would expect that 1/16 of the transitions between 4 KiB pages would go to a DRAM page on the same DRAM channel.  Of those 1/16, 3/4 would map to rows in different bank groups, 3/16 would map to rows in different sub-banks of the same bank group, and 1/16 (1/256 of the total) would map to a different row in the same bank of the same bank group on the same DRAM channel.

With 2 MiB pages, contiguous virtual addresses map to contiguous physical addresses for the full 2 MiB range.  Given the assumptions above, the repeat period for the DRAM mapping is 256 blocks of 62 contiguous cache lines, or 31/32 of 1 MiB, so contiguous accesses to all addresses in a 2 MiB page should result in traversing the entire DRAM mapping a bit more than 2 times.  Going from one 2 MiB page to the next introduces the same potential for bank conflicts as seen with 4 KiB pages, but since it only happens 1/512th as often it should not be frequent enough to introduce performance problems.
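
Spelling out the arithmetic in the last two paragraphs (under the same assumptions: 62-line blocks, 16 channels, 4 bank groups of 4 sub-banks per channel):

    #include <stdio.h>

    int main(void)
    {
        /* Repeat period: 256 blocks x 62 lines x 64 Bytes = 992 KiB = 31/32 MiB,
         * so one 2 MiB page walks the whole channel/bank pattern ~2.06 times.  */
        const double period = 256.0 * 62 * 64;
        printf("repeat period = %.0f Bytes = %.5f MiB\n", period, period / (1 << 20));
        printf("passes per 2 MiB page = %.3f\n", 2.0 * (1 << 20) / period);

        /* 4 KiB page transitions: 1/16 stay on the same channel; of those,
         * 3/4 change bank group, 3/16 change sub-bank within the group, and
         * 1/16 (1/256 overall) hit a different row in the very same bank.   */
        printf("same channel: %.4f, same bank overall: %.5f\n",
               1.0 / 16, 1.0 / 256);
        return 0;
    }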

In summary -- part of the mapping is standard and easy, while part of the mapping is non-standard and complex.  It is not clear to me that having this information is useful from a bandwidth perspective (though I am still trying), but it definitely has potential to be useful for pieces of code that depend on cache-to-cache intervention latency (such as synchronization constructs).

John D. McCalpin, PhD "Dr. Bandwidth"
Taylor Kidd (Intel):

John,

Thank you for the detailed reply. I'm sure that Younis will be able to apply it to his current Xeon Phi efforts.

Regards
---
Taylor

 

Thank you so much John.

Regards,

Younis
