PCIe performance counters

Hi All,

Can someone please explain the difference between:

     PCIeWiLF  - PCIe Write transfer (non-allocating) (full cache line)
     PCIeItoM  - PCIe Write transfer (allocating) (full cache line)

Or point me to relevant documentation.

 

Thanks in advance,

Roman

 


I have also been unable to find documentation on these two sub-events, but I suspect that they are related to Intel's "Data Direct IO" functionality first introduced in the Xeon E5 processor series.   With this feature, IO DMA traffic is written directly to the LLC instead of being written to memory, so when a core is interrupted to handle the IO, the data is available at much lower latency and higher bandwidth.  This makes the most sense for network interface traffic, where the packets are small enough for memory latency to be a non-trivial overhead and also small enough that you don't need to worry about overflowing the LLC.  The "Data Direct IO" feature is enabled by default on Xeon E5 processors.

Reference: http://www.intel.com/content/dam/www/public/us/en/documents/technology-b...

With this background, the "PCI Write Transfer (allocating) (full cache line)" event seems like a reasonable description of a DMA write with Data Direct IO operational -- it "allocates" into the LLC.

It is less clear how to interpret the other event "PCI Write Transfer (non-allocating) (full cache line)".   I have two ideas:

  1. The Intel documentation (both the reference above and other docs) says "Currently, Intel DDIO affects only local sockets".  This might mean that the data is always put in the LLC of the socket to which the IO device is attached, or it might mean that the data is put in the LLC of the socket to which the device is attached *if* that socket is the "home" for the addresses being used.  In the latter case, PCIe DMA writes to addresses "homed" on the remote socket would be written to memory (on the remote socket) and not put in any LLC.
  2. Alternatively, it is possible that there is a mechanism that could be used to disable DDIO for certain PCIe devices.  One could imagine that extremely large block IO (for a Lustre filesystem operating over InfiniBand, for example) might displace too much useful data from the LLC for DDIO to be appropriate for those transactions.  If such a mechanism exists (and I have seen no documentation on this topic), the corresponding PCIe DMA write transactions would fit in the "non-allocating" category.

Of course this is all just speculation, and these events might have nothing to do with DDIO.

It would be delightful if Intel decided to document these features in more detail.

John D. McCalpin, PhD
"Dr. Bandwidth"

Maybe the PCIe specification can be helpful in your case?

http://komposter.com.ua/documents/PCI_Express_Base_Specification_Revisio...

It is unlikely that this issue would be directly addressed by the PCIe specification -- it is a processor uncore counter that says something about the PCIe implementation on the Xeon E5 processor.   The PCIe spec says very little about caches or cache lines, though there is a "hint" field that can be programmed to "steer" a system-memory transaction toward a processor or cache, but these "hints" are optional, and there is no requirement that a system interpret them in any particular way.

There is more overlap between the PCIe spec and the uncore counters with regard to the "no snoop required" bit, but again the PCIe spec does not require any particular behavior -- a processor is allowed to snoop requests with the "no snoop required" bit set, and a processor is allowed to refrain from snooping requests with the "no snoop required" bit cleared if it can be guaranteed (e.g., by an MTRR attribute) that the transaction is to an address that cannot be cached.

John D. McCalpin, PhD
"Dr. Bandwidth"

 

You are right. This question will not be addressed by the PCIe specification. I simply did not pay enough attention to the thread title.

 

Hello Roman,

Looking at the PCM code, both are write events: PCI devices writing to memory (that is, the application reading from a disk/network/PCIe device). I'm guessing that PCIeWiLF results in a write to memory but not a transfer into the CPU's cache. The 'iLF' part of the name is curious... perhaps it means that the data is copied to a LineFill buffer and not copied to the cache (if that is possible).

And I'm guessing that PCIeItoM results in a write to memory and copying the line to cache. This is along the lines of Dr. McCalpin's answer above.
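
If you want to watch these two sub-events yourself, the pcm-pcie utility that ships with PCM counts them per socket. Below is a minimal sketch of doing the same thing through the PCM library; the method names programPCIeCounters() and getPCIeCounterData() and their argument lists are my recollection of the PCM sources, so treat them as assumptions and check cpucounters.h in your PCM version for the exact signatures.

    // Sketch: sample the PCIeItoM (allocating) and PCIeWiLF (non-allocating)
    // PCIe write events via the PCM library. Method names/signatures are
    // assumptions based on the pcm-pcie sources -- verify against cpucounters.h.
    #include <cstdio>
    #include <unistd.h>
    #include "cpucounters.h"   // from the PCM distribution

    int main()
    {
        PCM *m = PCM::getInstance();

        // Program the uncore CBo counters for one PCIe event at a time
        // (pcm-pcie rotates through the event codes in a similar way).
        m->programPCIeCounters(PCM::PCIeItoM);
        sleep(1);
        for (uint32 s = 0; s < m->getNumSockets(); ++s)
            std::printf("socket %u PCIeItoM: %llu\n", s,
                        (unsigned long long)m->getPCIeCounterData(s, 0));

        m->programPCIeCounters(PCM::PCIeWiLF);
        sleep(1);
        for (uint32 s = 0; s < m->getNumSockets(); ++s)
            std::printf("socket %u PCIeWiLF: %llu\n", s,
                        (unsigned long long)m->getPCIeCounterData(s, 0));

        return 0;
    }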

I'll see if I can find someone who knows more definitively. Is your question just because you are curious or is it related to a problem you are trying to solve?

Pat

I was thinking that PCIeWiLF could represent, for example, a write to non-cacheable memory, perhaps by a display driver. AFAIK primitive 3D data such as vertices that will be used only once will not be cached.

Hi All,

First of all thanks for the answers.

I probably need to tell more about my setup...

 

I have two servers connected through IB (Mellanox HCAs).

I've tried to test two different scenarios:

1) RDMA write to RAM --> causes the PCIeItoM counter to increase

2) RDMA write directly to the prefetchable BAR of another PCIe device --> causes the PCIeWiLF counter to increase

So your explanations seem in line with my results. A write to RAM goes to the cache, while a write to the other PCIe device skips the cache (and goes to a LineFill buffer???)
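
For reference, the RDMA write is posted the same way in both cases; the only difference is the remote address/rkey, which comes from registering either a DRAM buffer or a mapping of the other device's prefetchable BAR. A rough sketch with standard libibverbs (names here are illustrative, error handling omitted):

    // Sketch: posting the RDMA WRITE used in both experiments (libibverbs).
    // Only remote_addr/rkey differ: in case 1 they describe a DRAM buffer,
    // in case 2 a mapping of the peer device's prefetchable BAR.
    #include <infiniband/verbs.h>
    #include <cstdint>
    #include <cstring>

    static int post_rdma_write(ibv_qp *qp, void *local_buf, uint32_t len,
                               uint32_t lkey, uint64_t remote_addr, uint32_t rkey)
    {
        ibv_sge sge;
        std::memset(&sge, 0, sizeof(sge));
        sge.addr   = (uintptr_t)local_buf;
        sge.length = len;
        sge.lkey   = lkey;

        ibv_send_wr wr, *bad_wr = nullptr;
        std::memset(&wr, 0, sizeof(wr));
        wr.opcode              = IBV_WR_RDMA_WRITE;
        wr.sg_list             = &sge;
        wr.num_sge             = 1;
        wr.send_flags          = IBV_SEND_SIGNALED;
        wr.wr.rdma.remote_addr = remote_addr;  // DRAM buffer or BAR mapping
        wr.wr.rdma.rkey        = rkey;

        return ibv_post_send(qp, &wr, &bad_wr);
    }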

Thanks again,

Roman

 

I had not considered peer-to-peer operations when I was thinking about how to interpret these two events.

Since these are LLC CBo events, I think it still makes sense to interpret the word "allocating" to mean that the PCIe write was placed in the cache.   Intel's DDIO documentation says that by default all DMA writes to memory will be written to the L3 cache, so your first results (RDMA writes to DRAM) increment the "allocating" counter as expected.

The second experiment (RDMA write to prefetchable BAR on another PCIe device) is not writing to system memory, so it should not be allocated in the L3 cache, and your results show that this case increments the "non-allocating" counter. 

It is not immediately obvious why the L3 CBo should even take note of peer-to-peer PCIe write transactions.  These transactions clearly pass by on the ring, but in general the PCIe BAR address ranges cannot be cached, so there will be no need to invalidate any lines in the L3 on writes to those address ranges. 

Hypothesis #1: Maybe the CBo counts these events just because it can.  The DDIO functionality means that it has to be able to cache PCIe DMA writes to system memory, so being able to count PCIe DMA writes that it does not have to cache is an obvious extension.  Otherwise you would have to count PCIe DMA writes at the R2PCIe agent and subtract off the allocating writes from the CBo to get a count of the non-allocating writes, and that seems fairly inconvenient.

Hypothesis #2: Maybe the CBo counts these events because there are some circumstances in which the hardware can support (limited) caching of PCIe BAR address ranges (probably not with the Write-Back memory type, but Write-Through and Write-Protect seem plausible), so it might as well count transactions that could (if the MTRRs were different) require L3 tag access to invalidate cached copies of those lines.

 

If your setup contains multiple sockets, it would be interesting to see if the behavior is different when doing PCIe RDMA writes to DRAM buffers allocated on the socket with the IB card attached versus RDMA writes to DRAM buffers allocated on the other socket.   It would also be interesting to see how the counts change when the PCIe peer-to-peer DMA is same-socket vs cross-socket.
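
A minimal sketch of how the buffer placement for that experiment could be controlled, assuming libnuma and libibverbs are available (the library calls are standard, but the overall snippet is only illustrative):

    // Sketch: place the RDMA target buffer on a chosen socket before
    // registering it, so local-socket vs. remote-socket DMA writes can be
    // compared. Error handling kept minimal.
    #include <numa.h>
    #include <infiniband/verbs.h>
    #include <cstdio>

    static ibv_mr *reg_buffer_on_node(ibv_pd *pd, size_t bytes, int node)
    {
        if (numa_available() < 0) {
            std::fprintf(stderr, "libnuma not available\n");
            return nullptr;
        }
        // Bind the allocation to the requested NUMA node (socket); the pages
        // are actually placed there when ibv_reg_mr() pins them.
        void *buf = numa_alloc_onnode(bytes, node);
        if (!buf)
            return nullptr;

        // Register the buffer so the remote HCA can RDMA-write into it.
        ibv_mr *mr = ibv_reg_mr(pd, buf, bytes,
                                IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);
        if (!mr)
            numa_free(buf, bytes);
        return mr;
    }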

John D. McCalpin, PhD
"Dr. Bandwidth"
