pcm.x versus integrating pcm into our code

matt_garman's picture

We have a multithreaded Linux application, but we are only concerned with the performance of one thread.  We use the "isolcpus" kernel parameter to isolate all but one core per package from the scheduler.  This should give us a set of "pristine" cores that aren't subject to scheduler-induced jitter.  Using the pthread_setaffinity_np() call, we bind the critical thread to one of these isolated cores.  No other process or thread is allowed to run on that core.
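For reference, the pinning step can be sketched roughly as follows. This is a minimal sketch, not our actual code: it assumes Linux with glibc, and it uses sched_setaffinity() with a pid of 0 (which on Linux acts on the calling thread) in place of pthread_setaffinity_np(pthread_self(), ...). The names pin_current_thread_to() and first_allowed_cpu() are hypothetical helpers made up for this illustration.

```cpp
#include <cassert>
#include <sched.h>

// Pin the calling thread to a single CPU.
// Returns 0 on success, -1 on failure (errno is set by sched_setaffinity).
int pin_current_thread_to(int cpu) {
    assert(cpu >= 0 && cpu < CPU_SETSIZE);
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    // pid 0 means "the calling thread" on Linux.
    return sched_setaffinity(0, sizeof(set), &set);
}

// Return the lowest-numbered CPU the calling thread is currently allowed
// to run on, or -1 if the affinity mask cannot be read.
int first_allowed_cpu() {
    cpu_set_t set;
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) != 0) return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
        if (CPU_ISSET(cpu, &set)) return cpu;
    }
    return -1;
}
```

In the real setup the argument to pin_current_thread_to() would be the number of the isolated core; first_allowed_cpu() is only there so the pinning can be checked by reading the mask back.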

Now, we've looked at PCM info for this particular core+thread in two ways: (1) by running our program under pcm.x, and (2) by integrating the pcm code directly into our program.

This is how we accomplished (2): the critical thread is basically an event-driven loop.  Before we enter the loop, we declare a fixed-size array of structs.  Each struct is just a set of ints and doubles that hold the values returned by the PCM functions (e.g. getL3CacheHitRatio(), getRelativeFrequency(), etc.).  Our event function looks roughly like this:

1. state1 = getCoreCounterState(core) // where core is the isolated core to which we've pinned this thread

2. process event

3. state2 = getCoreCounterState(core)

4. fill current struct by calling getL3CacheHitRatio(state1, state2), getL2CacheHitRatio(state1, state2), etc.

5. increment struct index

When the program ends, we print out the contents of all these structs, then have a script which processes all the data, giving us basic statistics (mean, median, min, max, std deviation) on each of the counters.

We would expect the result to be similar to simply running our program under pcm.x, but that's not what we're seeing.

My question is: is there an error in our approach to the way we're using the PCM functionality?  If not, why might we see different results doing this versus using pcm.x?

Thanks!

Roman Dementiev (Intel)'s picture

Matt,

which numbers are you comparing, and how? In (1) (running your program under pcm.x), pcm.x prints each metric for the whole duration of your program. In (2), you seem to compute the statistics (mean, median, min, max, std deviation) over many small periods of time during the execution of your program.

Could you give an example? I was thinking along the lines of "the average of averages is not the average"...
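To illustrate the "average of averages" effect with a ratio metric such as a cache hit ratio, here is a small sketch. The Interval struct and the function names are hypothetical and are not PCM types; the point is only that averaging per-interval ratios weights every interval equally, while a whole-run ratio weights each interval by how many accesses it contained.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Raw counts for one measured interval (hypothetical, not a PCM type).
struct Interval {
    std::uint64_t hits;
    std::uint64_t misses;
};

// Hit ratio of a single interval; 0.0 if the interval saw no accesses.
double hit_ratio(const Interval& iv) {
    const std::uint64_t total = iv.hits + iv.misses;
    return total == 0 ? 0.0 : static_cast<double>(iv.hits) / static_cast<double>(total);
}

// Unweighted mean of the per-interval ratios (the "average of averages").
double mean_of_ratios(const std::vector<Interval>& intervals) {
    assert(!intervals.empty());
    double sum = 0.0;
    for (const Interval& iv : intervals) sum += hit_ratio(iv);
    return sum / static_cast<double>(intervals.size());
}

// Ratio over the whole run: add up the raw counts first, then divide once.
double overall_ratio(const std::vector<Interval>& intervals) {
    Interval total{0, 0};
    for (const Interval& iv : intervals) {
        total.hits += iv.hits;
        total.misses += iv.misses;
    }
    return hit_ratio(total);
}
```

With two made-up intervals, {90 hits, 10 misses} and {100 hits, 900 misses}, the per-interval ratios are 0.9 and 0.1, so mean_of_ratios() gives 0.5, while overall_ratio() gives 190/1100, roughly 0.17. A whole-run figure corresponds to the second calculation.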

Thanks,
Roman

matt_garman's picture

The stats (mean, median, etc) aren't computed per-iteration; they are computed across all iterations.

So, for example, say there are three iterations (there are usually 100s, if not 1000s, but for the sake of illustration...). Then we'd have three values for L3HitRatio, three values for EXEC, three values for L2CacheMisses, etc. IOW, three values for all of the PCM counters.

At the end of program execution, we print three lines (one per iteration), and each line has all the PCM values that were collected during that iteration.

A script then reads the printed output from the program, and computes stats (mean, median, etc) at the "column" level. I.e., those stats are computed for all three of the L3HitRatio values; all three of the EXEC values, the three L2CacheMisses, etc.
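The column-level computation amounts to something like the sketch below. This is only an illustration of the idea, not the actual script: Summary and summarize() are hypothetical names, and the standard deviation here is assumed to be the population form (dividing by n), which may differ from what the real script does.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Summary statistics for one "column", i.e. one counter across all iterations.
struct Summary {
    double mean;
    double median;
    double min;
    double max;
    double stddev;  // population standard deviation (divides by n)
};

// Takes the column by value so it can be sorted without touching the caller's data.
Summary summarize(std::vector<double> column) {
    assert(!column.empty());
    std::sort(column.begin(), column.end());
    const std::size_t n = column.size();

    double sum = 0.0;
    for (double x : column) sum += x;
    const double mean = sum / static_cast<double>(n);

    double squared_dev = 0.0;
    for (double x : column) squared_dev += (x - mean) * (x - mean);

    // For an even count, the median is the midpoint of the two middle values.
    const double median = (n % 2 == 1)
        ? column[n / 2]
        : 0.5 * (column[n / 2 - 1] + column[n / 2]);

    return Summary{mean, median, column.front(), column.back(),
                   std::sqrt(squared_dev / static_cast<double>(n))};
}
```

One call per counter, e.g. summarize(l3_hit_ratios) where l3_hit_ratios holds one value per iteration, gives the per-column figures described above.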

Clear as mud? :)

So for some PCM values, such as L3HitRatio, the metric is somewhat of an "average of averages", but for other absolute values (e.g. L2CacheMisses), it's not.

I certainly don't expect exactly the same numbers, but do expect roughly similar numbers.

Roman Dementiev (Intel)'s picture

Matt,

For the metrics that are not an "average of averages", what differences do you see: 1.00-1.05x, <1.5x, or >2x?
What is the running time of a single iteration? Reading the PMU state from the core has a measurable overhead if your iteration time is too short. On my Linux server system it takes 20-80 microseconds for PCM to read a core PMU state. This involves several calls to the OS driver, sometimes context switches, executing the rdmsr instruction in the kernel, etc. For very short application code this can introduce noise and can offset performance metrics (for example, you might always see more or fewer cache misses per instruction).
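One way to check this on your own system is to time the counter read by itself and compare it with the duration of a single iteration. The helper below is a rough sketch with a hypothetical name (mean_call_cost_us); it times any callable, so in your program you would pass a lambda that calls getCoreCounterState(core).

```cpp
#include <cassert>
#include <chrono>

// Mean wall-clock cost, in microseconds, of one call to `read_counters`,
// estimated from `reps` back-to-back calls.
template <typename F>
double mean_call_cost_us(F&& read_counters, int reps) {
    assert(reps > 0);
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    for (int i = 0; i < reps; ++i) {
        read_counters();
    }
    const std::chrono::duration<double, std::micro> elapsed = clock::now() - start;
    return elapsed.count() / static_cast<double>(reps);
}
```

Since each iteration reads the state twice (before and after the event), roughly twice this cost is added per iteration. If that is not small compared with the time spent processing one event, the per-iteration metrics will be distorted by the measurement itself.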

Thanks,
Roman
