Data Linear Address Tracing - Counter Reset Endless Loop

Hello everyone,

I'm trying to generate a simple memory trace with the Data Linear Address profiling facility of the new Haswell processor generation, but when I try to reach a high precision it appears to run into an endless loop in which it counts its own routine.

The hardware counter is set to count the event MEM_UOPS_RETIRED.ALL_LOADS:

  • UMask = 0x82
  • EventSelect =  0xD0
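
For reference, here is a minimal sketch of how these two fields would be packed into an IA32_PERFEVTSELx value, assuming the standard architectural layout (event select in bits 0-7, unit mask in bits 8-15, USR in bit 16, OS in bit 17, EN in bit 22). The post does not say which control bits were actually set, so the USR/OS/EN choice and the helper name `perfevtsel` are assumptions for illustration only.

```python
# Bit positions of the architectural IA32_PERFEVTSELx layout.
USR_BIT = 1 << 16     # count while in user mode
OS_BIT = 1 << 17      # count while in kernel mode
ENABLE_BIT = 1 << 22  # enable the counter


def perfevtsel(event_select, umask, usr=True, kernel=True, enable=True):
    """Pack event select and unit mask into a PERFEVTSEL value (hypothetical helper)."""
    value = (event_select & 0xFF) | ((umask & 0xFF) << 8)
    if usr:
        value |= USR_BIT
    if kernel:
        value |= OS_BIT
    if enable:
        value |= ENABLE_BIT
    return value


# Values exactly as given in the post above.
print(hex(perfevtsel(0xD0, 0x82)))
```

Writing the value to the MSR itself needs ring-0 access and is left out here.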

Then, ideally, the PEBS events are enabled to collect address details and store them in my buffer. For this I would program PEBS.CounterReset to 0xFFFFFFFFFFFE, which leads to the endless loop described above. When I set PEBS.CounterReset to 0xFFFFFFFFFFE0 it works, but it only samples every 32nd event occurrence, which doesn't provide the accuracy I'm trying to achieve.

  • 0xFFFFFFFFFFFE (endless PEBS loop)
  • 0xFFFFFFFFFFE0 (works as expected)
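
To make the relation between the reset value and the sampling period explicit: the counter counts up and overflows, so the reset value is the negated period in two's complement. The sketch below assumes a 48-bit counter width, which matches the 12-hex-digit values above; the function names are made up for this example.

```python
COUNTER_WIDTH = 48  # assumed counter width in bits (matches the values in the post)


def counter_reset_for_period(period, width=COUNTER_WIDTH):
    """Reset value that makes the counter overflow after `period` events."""
    if not 0 < period < (1 << width):
        raise ValueError("period must fit in the counter width")
    return (1 << width) - period


def period_for_counter_reset(reset, width=COUNTER_WIDTH):
    """Inverse mapping: number of events until overflow for a given reset value."""
    if not 0 <= reset < (1 << width):
        raise ValueError("reset value must fit in the counter width")
    return (1 << width) - reset


print(hex(counter_reset_for_period(32)))         # 0xffffffffffe0
print(period_for_counter_reset(0xFFFFFFFFFFFE))  # 2
```

So 0xFFFFFFFFFFFE corresponds to a sample on every 2nd event and 0xFFFFFFFFFFE0 to a sample on every 32nd event.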

Is this the right approach to generate a simple memory trace of a running process? Is this a bug for the specific event? Are there any workarounds?

 >>>high precision it runs into an endless loop of counting its own routine as it appears>>>

What does "counting its own routine" mean? Are you talking about the CPU's internal microcoded counter routine?

I can't speak for Intel, but I don't think that this is how the data linear address facility is intended to be used.

The data linear address facility (like the other PEBS facilities) is intended for (pseudo-)random sampling, while providing more detailed information about the specific loads or stores being sampled. Each sample generates an interrupt into the kernel, which then has to extract the information from the PEBS registers and send it back to the user-mode routine. This is a very heavy-weight operation, not even counting the stack unwinding that the user-mode library will want to do so that it can attribute the information to the correct call stack.

If you want to generate a trace of all of the linear addresses used by a program, it would be much more efficient to instrument the binary using something like Pin (http://software.intel.com/en-us/articles/pin-a-dynamic-binary-instrument...).  

John D. McCalpin, PhD
"Dr. Bandwidth"

Hello Tobias,

Dr. McCalpin is right. With any sampling mode monitoring, if you increase the frequency of sampling, you start to just "measure yourself measuring". Or as Heisenberg noted, the act of measuring something changes the thing being measured.

The PEBS data usage info is helpful for profiling memory usage and identifying hot spots, false sharing, etc but isn't really suited to 'generating a simple memory trace'. Maybe if you tell us more detail on what exactly you are trying to do, then we can make more useful suggestions. 'Pin' is probably the right route but 'a simple memory trace' can mean so many different things...

Pat

Thank you all for your replies so far.

Quote:

Patrick Fay (Intel) wrote:

The PEBS data usage info is helpful for profiling memory usage and identifying hot spots, false sharing, etc but isn't really suited to 'generating a simple memory trace'. Maybe if you tell us more detail on what exactly you are trying to do, then we can make more useful suggestions. 'Pin' is probably the right route but 'a simple memory trace' can mean so many different things...

I'm trying to generate a memory trace for a real-time application running in its own kernel, so Pin doesn't seem to be suitable for my use case, right?

The goal is to perform a locality analysis of my application that is independent of the microarchitecture it is running on, so simply counting the cache misses is not enough. That is the reason why I tried to use PEBS events, an approach that was covered in a 2009 master's thesis, "Generating Miss Rate Curves with Low Overhead Using Existing Hardware" (Tom Walsh).
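
To illustrate what such a microarchitecture-independent locality analysis could compute from an address trace, here is a minimal sketch of the classic LRU stack-distance approach. This is a generic textbook technique, not necessarily the method used in the thesis; the 64-byte line size, the fully associative LRU cache model, and all names are assumptions for this example.

```python
from collections import OrderedDict


def reuse_distances(trace, line_size=64):
    """LRU stack distance per access; None marks the first touch of a line."""
    stack = OrderedDict()  # cache lines, most recently used last
    distances = []
    for address in trace:
        line = address // line_size
        if line in stack:
            lines = list(stack.keys())
            # Number of distinct lines touched since the last access to this line.
            distances.append(len(lines) - 1 - lines.index(line))
            stack.move_to_end(line)
        else:
            distances.append(None)
            stack[line] = True
    return distances


def miss_rate_curve(distances, cache_sizes):
    """Miss rate of a fully associative LRU cache for each size (in lines)."""
    total = len(distances)
    return {
        size: sum(1 for d in distances if d is None or d >= size) / total
        for size in cache_sizes
    }


# Tiny made-up trace that cycles over three cache lines.
trace = [0, 64, 128, 0, 64, 128, 0]
distances = reuse_distances(trace)
print(distances)
print(miss_rate_curve(distances, [1, 2, 3, 4]))
```

The linear search makes this quadratic in the worst case, so it is only meant to show the idea. It also needs every access, or at least a sample from which the distances can be estimated.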

Is there a better, more suitable approach to generate this kind of memory trace?

A close analogy to the scenario described in the previous posts is measuring the run time (ignoring context switches and core throttling) of a very short piece of code with the RDTSC instruction, whose latency is greater than that of the measured code. From the hardware's point of view the code runs in the "shadow" of the RDTSC instructions, so an exact timing in CPU cycles cannot be obtained this way.

I suppose the same happens with the PEBS measurement described in the first post: each sample is itself a lengthy and complex operation in terms of CPU cycles, so as the sampling frequency increases, the minuscule details of the measured operation tend to be lost, overshadowed by the latency of the measurement code.
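
A rough back-of-the-envelope model of this effect: if each sample costs a fixed number of cycles and the workload needs some number of cycles per counted event, the share of time spent in measurement grows quickly as the sampling period shrinks. The cycle counts below are purely illustrative assumptions, not measured values.

```python
def sampling_overhead_fraction(period, cycles_per_event, cycles_per_sample):
    """Fraction of all cycles spent handling samples (simple illustrative model)."""
    useful_cycles = period * cycles_per_event
    return cycles_per_sample / (useful_cycles + cycles_per_sample)


# Hypothetical numbers: one load every 2 cycles, 2000 cycles per sample.
for period in (2, 32, 100_000):
    fraction = sampling_overhead_fraction(period, 2, 2000)
    print(f"period {period:>6}: {fraction:.1%} of cycles spent sampling")
```

With these assumed costs, a period of 2 or even 32 leaves almost all cycles in the sampling path, while a period in the range typically used for statistical sampling keeps the overhead small.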

 

Hello Tobias,

Is this resolved? It is kind of an old thread now.

Pat
