I am a student analyzing the performance of libraries using hardware counters.
I am using an 8-core Intel Xeon X5365 machine and reading the hardware counter values with PAPI from code written in C.
My test code is simple: it initializes an array of 8192 integers, a size chosen to exactly fill the 32 kB L1 data cache (64-byte line size, so 16 four-byte integers per line * 512 cache lines = 8192 integers).
The number of L1 cache misses measured with PAPI after initialization is 93. This seems plausible if hardware prefetching reduces the compulsory misses from the expected 512 to 93.
I then re-access the same array of 8192 values in a separate loop. Ideally this should give zero L1 cache misses, or at most a handful caused by program code swapping, since most of the 8192 integers are already in the L1 cache.
But surprisingly, the number of L1 cache misses after the re-access loop is much higher: 652, even though the data should already be in the cache.
Can someone explain why this is happening? The same measurement on an AMD processor gives an L1 cache miss count close to what I expect. Does Intel allocate data in the cache differently?