Can somebody explain the exact meaning of the term "non-snoop" for both reads and writes?
I am trying to get a better understanding of the PCIe I/O events in the CBo that are named "non-snoop read" and "non-snoop write".
In what context is this question asked? Can you describe your problem in more detail? Where exactly do these terms appear? I am probably not saying anything new when I note that snooping is part of the cache coherence protocol.
>>...In what context is this question asked?..
I think this is cache related (however, I could be wrong), and I agree that more technical details are needed.
I am assuming that this is in the context of a Xeon E5-2600 series processor, where the uncore performance monitors are described in Intel document 327043 ("Intel Xeon Processor E5-2600 Product Family Uncore Performance Monitoring Guide").
In section 2.3 (CBo events), Table 2-13 lists a filter for PCIe non-snoop read and non-snoop write operations. These are probably associated with accesses to addresses that are mapped by an MTRR (or by the default memory type) as uncacheable. Since the processors are not allowed to cache these addresses, they are not required to snoop accesses to those addresses. This applies to most memory-mapped IO regions, so it is not surprising to see this in the context of PCIe events in the CBo.
In section 2.4 ("Home Agent" events), Table 2-44 describes a performance counter event related to directory lookups and remote snoops. Intel processors supporting 4 (or more) sockets have a "directory" that keeps track of cache lines that might be cached in another chip. If this "directory" indicates that a particular cache line has not been read by another chip, then local accesses to the line don't need to initiate a global snoop.
>>...I am assuming that this is in the context of a Xeon E5-2600 series processor...
I wouldn't assume any particular hardware, because it could be related to:
- Intel® Core™ i7 Mobile Extreme Edition Processor Series
- Intel® Core™ i5 and i7 Mobile Processor Series
- Intel® Celeron® Mobile Processor Series
- Intel® Xeon® Processor E5-1600 / E5-2600 / E5-4600 Product Families
- and so on
In every datasheet (!) for these series I found references to snooping. For example, here is a short quote from the Intel® Core™ i5 and i7 Mobile Processor Series datasheet:
Snooped accesses from PCI Express or DMI to this region are snooped on processor caches.
Non-snooped accesses from PCI Express or DMI to this region are always sent to DRAM.
Snooping is part of cache coherency protocol implementations. If cache coherency is not required for a correctly functioning application/driver, one can issue non-snooped transactions that do not snoop (potentially more recent) copies of the data in the caches, but instead read/write memory directly.
The PCI express protocol includes a "no snoop required" attribute in the transaction descriptor.
For a PCIe non-snooped read, the request can go directly to the DRAM controller to obtain the data. The processor caches do not need to be snooped, and the PCIe device does not need to wait for a snoop response before using the data. This reduces the latency for obtaining the data, which can increase sustained read bandwidth in the common case where the hardware supports only a limited number of concurrent read transactions.
For a PCIe non-snooped store, the request can go directly to the DRAM controller to store the data. The processor caches do not need to be snooped to invalidate any copies of that cache line. This reduces the amount of time that the buffer handling the store is occupied, so that a fixed number of buffers can deliver higher throughput.
These "no snoop required" transactions are typically "safe" for accesses to address ranges for which processor caching is prohibited.
An example use case is a GPU that needs to "borrow" extra memory from the processor(s) for "spill" and "restore" traffic. Only the GPU will be accessing that memory, so it does not need to look in the processor caches to see if any of them has modified copies of the cache lines. The improvement in bandwidth due to the elimination of snooping can improve graphics frame rates.
The PCIe read/write bandwidth improvement is typically the primary reason to implement the "no snoop required" functionality.
Secondary benefits include reduction in snooping traffic on the processor caches, reducing coherence traffic on the chip-to-chip links in multi-chip systems, and reducing overall power consumption.
Since the original message mentioned PCIe-related CBo events, it seems reasonable to assume that the counters are referring to PCIe transactions with the "no snoop required" bit set.
What would be the net effect, from the CPU perspective, if PCIe memory read/write operations with the "No Snoop" bit not set are issued to a Windows memory buffer allocated as non-cached (MmAllocatePagesForMdlEx / MmNonCached)?
I am not entirely sure that I am parsing the question properly, but it sounds like you are asking about one of the cases that I addressed above -- i.e., "address ranges for which processor caching is prohibited". In this case a PCIe memory read with the "no snoop" bit set may not interact with the processor core or caches at all -- the PCIe controller can send the read request to the memory controller that owns the target address, and the memory controller can return the data directly to the PCIe controller. Similarly, a PCIe memory write with the "no snoop" bit set may not interact with the processor core or caches at all -- the PCIe controller can send the write transaction (with its data) directly to the memory controller, with no need to have the corresponding addresses snooped by the caches.
BUT, it is important to understand that this attribute can be ignored. The PCIe specification contains this attribute because there are cases in which the software knows that an address cannot be held in a processor cache, so it can't be dirty there (for a PCIe read from memory), and it does not need to be invalidated there (for a PCIe write to memory). The specification does not say that the snoop is *prohibited* -- only that it is not *necessary* for correctness on a system that has cache properties that are similar to those that the authors of the PCIe specification envisioned when they included this feature.
Conversely, the absence of the "no snoop" bit does not mean that a snoop is absolutely required. On x86 systems, for example, cacheability is controlled by both the MTRRs and the PAT. There are only a few MTRRs, so the PCIe controller, the Home Agent, the QPI controller, etc., can all have copies, and can all quickly check to see if an address is marked as uncacheable in the MTRRs. If an address is in a range that is declared uncacheable by the MTRRs, then any "box" that handles the transaction can safely use transactions that do not generate snoops -- even if the "no snoop" bit has not been set by the software. The PAT entries, on the other hand, are referenced through the page tables, which are not typically accessed by IO transactions. So if an address is in a range that is marked uncacheable by the PAT (but not by an MTRR), then the various boxes will not know that the page cannot be cached, and so must take the more conservative approach of using snooped transactions if the "no snoop" bit is not set.
Finally, not all systems are configured in the way that the PCIe spec envisioned. For example, a PCIe "no snoop" memory read or write may not need to snoop an L3 cache, but the transaction may be sent to the L3 controller anyway -- perhaps because the L3 controller is responsible for part of the overall address translation or ordering flow. In this case the L3 controller would "see" the "no snoop" transactions, but would not need to perform an L3 directory snoop. Intel has not disclosed enough detail about its on-chip protocols to fully understand the transactions required for various operations. The information in the uncore performance monitoring guide is often helpful in guessing what Intel's implementation does, but it is seldom definitive.
Perfect elaboration, thank you!
Hi John D. McCalpin,
Could you please explain the differences between:
>> non-snoop and snoop latencies in Latency Tolerance Reporting messages in PCIe?
>>> As well as the use of the Requirement bit?
Attached is an LTR message snapshot taken from a PCIe trace.
I have never worked with the Latency Tolerance Reporting feature of PCI Express, so I don't have any insights into how it is used. It looks like an optional facility with optional conformance to the reported values, so it is hard to know how seriously it is used. This class of definition looks like it is most useful for dealing with large changes in requirements due to power-state/power-saving-mode changes in the host and PCIe devices, and less useful for dealing with fine-grained latency differences.