How are MSR/PCM counters collected?

Hi all,

Please help me clear up one point of uncertainty about how MSR/PCM counters are collected.

As I understand it, I can collect counters either through perf or through the Linux msr driver, which allows reading and writing the MSR registers.
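For concreteness, a minimal sketch of the msr-driver path, assuming the msr module is loaded (modprobe msr) and the process has root or CAP_SYS_RAWIO; MSR 0x10 (the TSC) is only a placeholder:

```c
/* Minimal sketch: read one MSR through the Linux msr driver.
 * Assumes the msr module is loaded (modprobe msr) and the process
 * has root or CAP_SYS_RAWIO.  MSR 0x10 (the TSC) is only an example. */
#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    int cpu = 0;                 /* which CPU's MSR file to open */
    uint32_t msr = 0x10;         /* IA32_TIME_STAMP_COUNTER      */
    char path[64];
    uint64_t value;

    snprintf(path, sizeof(path), "/dev/cpu/%d/msr", cpu);
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    /* The driver uses the file offset as the MSR address; each
     * read returns the 8-byte register value. */
    if (pread(fd, &value, sizeof(value), msr) != sizeof(value)) {
        perror("pread");
        return 1;
    }
    printf("cpu %d msr 0x%x = 0x%llx\n", cpu, msr, (unsigned long long)value);
    close(fd);
    return 0;
}
```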

Do the collected counters show the number of events for each thread (counters bound to a thread), or the total number of events occurring in the device (without binding to a thread, for example the total number of load or store events of all threads of one processor/core)?

If MSR counters are bound to a thread, does this mean that on a context switch the OS (or hardware) saves the MSR registers to some buffer, and then loads them back when the context is restored?

Such saving and restoring should, in theory, add overhead to profiling.

Where am I wrong?

Sorry for my English.

Patrick Fay (Intel)

Hello Black,

Sorry to delay answering your question.

I'm not entirely sure whether you are asking about PCM specifically or about perfmon MSRs in general, but here is an attempt at an answer:

1) Do counters show events per thread or events per device? It depends on the counter. The 'scope' of the counter tells you what area of the chip the counter can access. Some counters (like the general cpu counters) count events at the cpu level (or they can count at the core level). So to read these counters you would need to bind to the particular cpu whose counter you want to read. Other counters have 'processor' scope. That is, the value in the counter is the same no matter which cpu you read it from. Other counters might have uncore scope, or ring scope, etc.

2) when you switch context on a cpu, do you have to save the MSRs and load the MSRs back when you switch back in? This is an area with lots of discussion. Generally the OS doesn't save off the perfmon (hw performance counter related) MSRs and so the OS doesn't reload them. In general, if the perfmon counters are in use, they just stay running regardless of which thread gets swapped in/out of the cpu. You can tell the counters to only count ring0 events or ring3 events or both but you can't tell the counter to "only count for my thread".
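To make the ring0/ring3 filtering concrete: on Intel CPUs the USR and OS bits of the IA32_PERFEVTSELx MSRs select which rings are counted. A sketch, assuming PMC0 is free (the NMI watchdog may already own it):

```c
/* Sketch: program PMC0 via the msr driver to count instructions
 * retired (event 0xC0, umask 0x00) in ring 3 only.  Addresses and
 * bits are from the Intel SDM: IA32_PERFEVTSEL0 = 0x186,
 * IA32_PMC0 = 0xC1; USR = bit 16, OS = bit 17, EN = bit 22.
 * On perfmon v2+ the counter must also be enabled in
 * IA32_PERF_GLOBAL_CTRL (0x38F), omitted here for brevity. */
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>

#define IA32_PERFEVTSEL0 0x186
#define IA32_PMC0        0x0C1
#define EVTSEL_USR (1ULL << 16)   /* count ring 3 (user) events   */
#define EVTSEL_OS  (1ULL << 17)   /* count ring 0 (kernel) events */
#define EVTSEL_EN  (1ULL << 22)   /* enable the counter           */

int main(void)
{
    int fd = open("/dev/cpu/0/msr", O_RDWR);
    if (fd < 0) return 1;

    /* Note there is no "only my thread" bit: the counter runs for
     * whatever the OS schedules on this cpu. */
    uint64_t evtsel = 0xC0 | EVTSEL_USR | EVTSEL_EN;   /* ring 3 only */
    if (pwrite(fd, &evtsel, sizeof(evtsel), IA32_PERFEVTSEL0) != 8)
        return 1;

    /* ... run the code of interest on cpu 0, then read the count ... */
    uint64_t count = 0;
    pread(fd, &count, sizeof(count), IA32_PMC0);

    close(fd);
    return 0;
}
```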

Hope this helps,

Pat

Thanks, Pat.

I assumed as much, but wanted to hear a professional opinion.

 

iliyapolak

>>>they just stay running regardless of which thread gets swapped in/out of the cpu.>>>

For a heavily loaded system it could skew the results. I was thinking about boosting the priority of the currently executing thread (being profiled) to realtime priority in order to keep it pinned to the core until the measurement is over.

Patrick Fay (Intel)

Quote:

iliyapolak wrote:

>>>they just stay running regardless of which thread gets swapped in/out of the cpu.>>>

For a heavily loaded system it could skew the results. I was thinking about boosting the priority of the currently executing thread (being profiled) to realtime priority in order to keep it pinned to the core until the measurement is over.

Hello iliyapolak,

When you say 'it could skew the results', if by "it" you mean "the running counters" then this is not something to worry about. I don't think anyone has been able to show/measure any extra overhead from having the counters running. Any extra overhead comes from reading the counters. But the overhead of utilities like PCM is pretty low (probably 1-20 milliseconds per iteration... but it has been a while since I checked it).

For utilities like PCM which run in 'counting mode' (where you just read the counters after sleeping for 1 second or so), if the system is heavily loaded then usually the worst that happens is that PCM won't run exactly when you want it to... so you don't get exactly 1 second intervals for instance.
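A minimal sketch of this 'counting mode' using perf_event_open(2) rather than raw MSRs; pid = -1 with a fixed cpu counts everything scheduled on that cpu, which requires root or a permissive perf_event_paranoid setting:

```c
/* Minimal "counting mode" sketch with perf_event_open(2).
 * glibc has no wrapper, so the syscall is invoked directly.
 * pid = -1, cpu = 0 counts every thread that runs on CPU 0,
 * matching the "counters just keep running" behaviour above. */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

int main(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_INSTRUCTIONS;
    attr.disabled = 1;

    int fd = syscall(SYS_perf_event_open, &attr, -1, 0, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    sleep(1);                            /* the ~1 second interval */
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t count;
    read(fd, &count, sizeof(count));
    printf("instructions on CPU 0 in ~1s: %llu\n",
           (unsigned long long)count);
    close(fd);
    return 0;
}
```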

Utilities like VTune, which run in 'sampling mode' (where you take a performance monitoring interrupt (PMI) each time a counter overflows), can induce a lot of overhead if you sample too frequently. Usually 1000 PMI/second has very small perturbation of a system. Usually when I run something like VTune (or perf in sampling mode) I measure performance of my app with and without sampling to make sure I'm not modifying the performance of my app more than I intend.
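A sketch of the corresponding 'sampling mode' setup at roughly 1000 samples/second (approximately what perf record -F 1000 arranges); decoding the PERF_RECORD_SAMPLE records from the ring buffer is omitted:

```c
/* Sketch: configure sampling at ~1000 samples/second on this
 * process.  Samples land in an mmap'd ring buffer; decoding
 * them is not shown here. */
#define _GNU_SOURCE
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

int main(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CPU_CYCLES;
    attr.freq = 1;                 /* sample_freq is in Hz; the    */
    attr.sample_freq = 1000;       /* kernel auto-tunes the period */
    attr.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_TID;
    attr.exclude_kernel = 1;       /* avoids needing extra privilege */

    /* pid = 0, cpu = -1: sample this process on whatever CPU it uses */
    int fd = syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
    if (fd < 0) return 1;

    /* 1 metadata page + 2^3 data pages for the sample ring buffer */
    void *ring = mmap(NULL, (1 + 8) * 4096, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);
    if (ring == MAP_FAILED) return 1;

    /* ... run the workload, then walk PERF_RECORD_SAMPLE records ... */
    close(fd);
    return 0;
}
```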

Pat

iliyapolak

No, I was not talking about the overhead of the measurement. I meant that the results will not be accurate because the counters are not pinned to a particular thread or to the currently executing thread. So when the thread is swapped out, the counter state is not saved by the OS, and the next ready thread will cause the counter to keep incrementing.

Patrick Fay (Intel)

Yes, if one doesn't keep track of which cpu one is reading the counters from, then one can get garbage results.

PCM and other utilities handle this by pinning to a specific cpu before they read the counters. This way we know, if we need to, say, get the difference of the current and previous values of the counter, that we are subtracting the correct cpu's counter.
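A sketch of that pinning step on Linux, using sched_setaffinity; CPU 2 is an arbitrary example target:

```c
/* Sketch: pin the reading thread to one CPU with sched_setaffinity
 * before taking counter readings, so before/after deltas come from
 * the same CPU's counter.  CPU 2 is an arbitrary example. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    int cpu = 2;
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);

    /* pid 0 = the calling thread */
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    /* From here on, counter reads execute on CPU 2, so subtracting
     * the previous reading subtracts the same CPU's counter. */
    printf("pinned to cpu %d\n", cpu);
    return 0;
}
```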

iliyapolak

Probably done by calling SetProcessAffinityMask on Windows.
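A sketch of that Windows variant, using the call mentioned above (SetThreadAffinityMask would be the per-thread equivalent):

```c
/* Sketch: the Windows equivalent of the pinning above.  Restricts
 * the whole process to logical processor 0; SetThreadAffinityMask
 * is the per-thread form. */
#include <windows.h>
#include <stdio.h>

int main(void)
{
    DWORD_PTR mask = 1;   /* bit 0 = logical processor 0 */
    if (!SetProcessAffinityMask(GetCurrentProcess(), mask)) {
        fprintf(stderr, "SetProcessAffinityMask failed: %lu\n",
                GetLastError());
        return 1;
    }
    /* Counter reads issued from here on run on processor 0. */
    return 0;
}
```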

Sorry again for the delayed reply.

The question arose when I started thinking about whether profiling accounts for the noise of the operating system, device drivers, and system applications.

 

Thanks for the interesting answers.

iliyapolak

Hi Black S,

What do you mean by "noise of the operating system"?

I mean that if you accumulate the total number of memory loads and stores, or of double-precision operations, then the operating system and system services also contribute to the total number of events. It is clear that the percentage is very small.

John D. McCalpin

One does have to be careful with using MSRs to access performance-related information because the overheads can be relatively large and the standard access mechanism (at least in Linux: /dev/cpu/*/msr/) has no API for reading lists of target registers with a single call to the driver. 

It is easy enough to run a case multiple times with different (known) amounts of "work" and subtract the counts to estimate the overheads, but it would be a lot of work to get a solid understanding of the intrinsic variability of the overhead in terms of all of the performance events that you might want to measure. PCI configuration space accesses and general MMIO accesses are possibly even worse than MSR accesses in terms of overhead, but I have been afraid to measure these.

One result that I recall is an average overhead of something like 7 microseconds to read an MSR on the same chip where my process is running (using code based on "rdmsr.c" from msrtools-1.2) and 10 microseconds to read an MSR on the other chip in a two-socket system.

Note that each Xeon E5-2600 family processor chip has 83 performance counters defined in the uncore (if I added the numbers in Table 1-1 of the Xeon E5-2600 series Uncore Performance Monitoring guide correctly), with Table 1-2 showing that 41 of these are in MSR space and Table 1-3 showing that the remaining 42 are in PCI configuration space. Not all of the counters are likely to be useful in any single measurement scenario, but it is very easy to imagine wanting to read all 32 CBo counters and all 16 of the programmable IMC counters at once. With the existing kernel interface it would probably take O(1000 microseconds) per chip -- corresponding to about 25 million aggregate core cycles (8 cores * 3.1 GHz). That is an unpleasant amount of overhead for any methodology except whole-program monitoring.
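A sketch of how one might reproduce this kind of per-read estimate, timing repeated pread() calls against the msr device (the 7/10 microsecond figures above will of course vary by system):

```c
/* Sketch: time repeated pread() calls on /dev/cpu/0/msr to estimate
 * the per-read driver overhead.  MSR 0x10 (TSC) is a convenient
 * always-present register; expect numbers in the microsecond range. */
#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>
#include <time.h>

int main(void)
{
    int fd = open("/dev/cpu/0/msr", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    enum { N = 100000 };
    uint64_t value;
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < N; i++)
        pread(fd, &value, sizeof(value), 0x10);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double us = (t1.tv_sec - t0.tv_sec) * 1e6 +
                (t1.tv_nsec - t0.tv_nsec) / 1e3;
    printf("average msr read: %.2f us\n", us / N);
    close(fd);
    return 0;
}
```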

Even building a dedicated kernel module to retrieve all of these counters in a single call would not provide a mechanism that anyone could reasonably call "lightweight" (though I will probably have to do it just to find out how bad it is).

So instead of being able to install in-line instrumentation in my codes when I need to access uncore counters, I have to build a specialized test code that I hope does the same thing as the application, but does it a programmable number of times so that I can apply whole-program monitoring to a set of extended executions. Obviously this requires a lot of work and is only practical if I already understand what the target code is doing. It would be much nicer if the uncore performance counters could be mapped to core performance counters and then read in user-space. My measurements of RDPMC overhead are in the 10's of cycles in user space -- much more practical than the 10's of thousands of cycles for driver calls to get MSR or PCI configuration space values.
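A sketch of such a user-space read; RDPMC faults unless CR4.PCE is set (on Linux see /sys/bus/event_source/devices/cpu/rdpmc) and the selected counter has already been programmed:

```c
/* Sketch: user-space counter read with RDPMC.  This faults with #GP
 * unless CR4.PCE is set and the counter is already programmed. */
#include <stdio.h>
#include <stdint.h>

static inline uint64_t rdpmc(uint32_t counter)
{
    uint32_t lo, hi;
    /* ECX selects the counter; bit 30 set would select a
     * fixed-function counter instead of a general-purpose one. */
    __asm__ volatile("rdpmc" : "=a"(lo), "=d"(hi) : "c"(counter));
    return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
    uint64_t before = rdpmc(0);     /* general-purpose PMC0 */
    /* ... short region of interest ... */
    uint64_t after = rdpmc(0);
    printf("PMC0 delta: %llu\n", (unsigned long long)(after - before));
    return 0;
}
```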

John D. McCalpin, PhD "Dr. Bandwidth"
iliyapolak

>>>the operating system and system services also contribute to the total number of events. It is clear that the percentage is very small>>>

Yes, that is true.
