We have a multithreaded Linux application, but we are only concerned with the performance of one thread. We use the "isolcpus" kernel parameter to isolate all but one core per package from the scheduler, which should give us a set of "pristine" cores that aren't subject to scheduler-induced jitter. Using the pthread_setaffinity_np() call, we bind the critical thread to one of these isolated cores; no other process or thread is allowed to run on that core.
Now, we've looked at PCM info for this particular core+thread in two ways: (1) by running our program under pcm.x, and (2) by integrating the pcm code directly into our program.
This is how we accomplished (2): the critical thread is basically an event-driven loop. Before we enter the loop, we declare a fixed array of a struct. The struct is just a bunch of ints or doubles that hold the values returned by the PCM functions (e.g. getL3CacheHitRatio(), getRelativeFrequency(), etc). So our event function looks kind of like this:
1. state1 = getCoreCounterState(core) // where core is the isolated core to which we've pinned this thread
2. process event
3. state2 = getCoreCounterState(core)
4. fill current struct by calling getL3CacheHitRatio(state1, state2), getL2CacheHitRatio(state1, state2), etc.
5. increment struct index
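In code, the loop body is roughly the sketch below. The PCM types and calls (CoreCounterState, getCoreCounterState(), getL3CacheHitRatio(), getL2CacheHitRatio()) are stubbed out here so the sketch stands alone; in the real program they come from the PCM headers.

```cpp
#include <cstddef>

// Assumption: stand-ins for the PCM API so this sketch is self-contained.
// In the real program these come from the PCM sources (cpucounters.h).
struct CoreCounterState {};
static CoreCounterState getCoreCounterState(int /*core*/) { return {}; }
static double getL3CacheHitRatio(const CoreCounterState&, const CoreCounterState&) { return 0.0; }
static double getL2CacheHitRatio(const CoreCounterState&, const CoreCounterState&) { return 0.0; }

// One record per processed event, filled from the before/after counter states.
struct EventStats {
    double l3_hit_ratio;
    double l2_hit_ratio;
    // ... one field per PCM-derived metric
};

constexpr std::size_t kMaxEvents = 1 << 20;
static EventStats stats[kMaxEvents];  // fixed array declared before the loop
static std::size_t stats_idx = 0;

void handle_event(int core) {
    CoreCounterState before = getCoreCounterState(core);  // step 1
    // step 2: process the event (elided)
    CoreCounterState after = getCoreCounterState(core);   // step 3
    if (stats_idx < kMaxEvents) {
        // step 4: fill the current struct from the two snapshots
        stats[stats_idx].l3_hit_ratio = getL3CacheHitRatio(before, after);
        stats[stats_idx].l2_hit_ratio = getL2CacheHitRatio(before, after);
        ++stats_idx;                                      // step 5
    }
}
```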
When the program ends, we print out the contents of all these structs, then run a script that processes the data, giving us basic statistics (mean, median, min, max, standard deviation) for each counter.
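The post-processing amounts to something like the helper below (a hypothetical sketch of what our script computes for each counter's samples; in our setup this actually lives in an external script, not the program):

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Basic statistics over one counter's samples.
struct Summary { double mean, median, min, max, stddev; };

Summary summarize(std::vector<double> v) {
    std::sort(v.begin(), v.end());
    const std::size_t n = v.size();

    Summary s{};
    s.min = v.front();
    s.max = v.back();
    s.median = (n % 2) ? v[n / 2] : 0.5 * (v[n / 2 - 1] + v[n / 2]);

    double sum = 0.0;
    for (double x : v) sum += x;
    s.mean = sum / n;

    double sq = 0.0;
    for (double x : v) sq += (x - s.mean) * (x - s.mean);
    s.stddev = std::sqrt(sq / n);  // population standard deviation
    return s;
}
```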
We would expect the result to be similar to simply running our program under pcm.x, but that's not what we're seeing.
My question is: is there an error in the way we're using the PCM functionality? If not, why might we see different results this way versus running under pcm.x?