Intel Xeon X5365 8-core Processor PAPI performance analysis anomaly

I am a student analyzing the performance of libraries using hardware counters.

I am using an Intel Xeon X5365 8-core processor. I read the hardware counter values using PAPI code written in C.

My test is simple code that initializes 8192 integers in order to fill the 32 kB L1 cache (64-byte line size, so 16 integers per line * 512 cache lines = 8192 integers).
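The sizing arithmetic above can be checked with a small sketch. The helper names `cache_lines` and `elems_to_fill` are hypothetical and chosen only for illustration, and a 4-byte `int` is assumed, as the post implies.

```c
#include <assert.h>
#include <stddef.h>

/* Number of lines in a cache of cache_bytes with line_bytes per line. */
size_t cache_lines(size_t cache_bytes, size_t line_bytes) {
    assert(line_bytes > 0);
    return cache_bytes / line_bytes;
}

/* Number of elements of elem_bytes each that exactly span the cache. */
size_t elems_to_fill(size_t cache_bytes, size_t elem_bytes) {
    assert(elem_bytes > 0);
    return cache_bytes / elem_bytes;
}
```

With a 32 kB cache (32 * 1024 bytes), `cache_lines(32 * 1024, 64)` gives 512 lines and `elems_to_fill(32 * 1024, 4)` gives 8192 integers, i.e. 16 integers per 64-byte line.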

The number of L1 cache misses after initialization, measured using PAPI, is 93. Assuming that hardware prefetching reduces the compulsory misses from 512 to 93, this may be correct.

Then I access the same array of 8192 values again in a separate loop. Ideally this should give me zero L1 cache misses, or at most a few caused by program code being swapped in, since most of my 8192 integers are already in the L1 cache.

But surprisingly, the number of L1 cache misses after the reaccess loop is huge! It is 652, even though the data is already in the cache.

Can someone explain why this is happening? The same measurement on an AMD processor gives close to the expected number of L1 cache misses. Does Intel perform cache data allocation differently?
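For reference, the measurement described above might look roughly like the following sketch. It is not the original code: the use of the `PAPI_L1_DCM` preset, the `USE_PAPI` compile switch, and the function names `init_pass`, `reaccess_pass`, and `measure` are all assumptions made for illustration. Without `USE_PAPI` the sketch only runs the two loops and leaves both miss counts at -1.

```c
#include <stddef.h>
#ifdef USE_PAPI
#include <papi.h>
#endif

#define N 8192  /* 32 kB of 4-byte ints, as in the post */

static volatile int data[N];

/* First pass: write every element (compulsory misses expected here). */
void init_pass(void) {
    for (size_t i = 0; i < N; i++)
        data[i] = (int)i;
}

/* Second pass: read every element again and return the sum. */
long reaccess_pass(void) {
    long sum = 0;
    for (size_t i = 0; i < N; i++)
        sum += data[i];
    return sum;
}

/* Runs both passes and stores one miss count per pass.
 * Returns 0 on success, -1 if a PAPI call fails.
 * Without USE_PAPI the counts stay at -1. */
int measure(long long *init_misses, long long *reaccess_misses, long *sum) {
    *init_misses = -1;
    *reaccess_misses = -1;
#ifdef USE_PAPI
    int es = PAPI_NULL;
    long long v[1];
    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) return -1;
    if (PAPI_create_eventset(&es) != PAPI_OK) return -1;
    /* Assumption: the post does not say which preset event was used. */
    if (PAPI_add_event(es, PAPI_L1_DCM) != PAPI_OK) return -1;

    if (PAPI_start(es) != PAPI_OK) return -1;
    init_pass();
    if (PAPI_stop(es, v) != PAPI_OK) return -1;
    *init_misses = v[0];

    if (PAPI_reset(es) != PAPI_OK) return -1;
    if (PAPI_start(es) != PAPI_OK) return -1;
    *sum = reaccess_pass();
    if (PAPI_stop(es, v) != PAPI_OK) return -1;
    *reaccess_misses = v[0];
#else
    init_pass();
    *sum = reaccess_pass();
#endif
    return 0;
}
```

To collect real counts, the sketch would be built with something like `-DUSE_PAPI -lpapi` on a machine where PAPI is installed. The counts also include whatever the start and stop calls themselves touch, so they are not a pure measure of the loops.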


Here is the answer from a performance tuning and hardware expert:

Don't use PAPI. The generic events defined in the package do not count what you think they do (nor what the authors think they do).

In fact the entire concept of generic events is intrinsically flawed. Processor architectures are not interchangeable.

I have never seen a correct analysis based on PAPI.

Use PTU (the Intel Performance Tuning Utility).
