I am running a blocked MM code on a Haswell server.
Performance counter stats for 'taskset -c 0 binaries/matmul/matmul_tiled_sse_128.12.1536':
17,354,925,885 cycles [25.03%]
47,163,342,001 instructions # 2.72 insns per cycle [30.03%]
15,629,057,840 L1-dcache-loads [30.03%]
2,125,485,010 L1-dcache-load-misses # 13.60% of all L1-dcache hits [30.03%]
1,179,936,318 r53e124 [30.03%]
22,469,151 r532124 [30.03%]
49,919,592 r504f2e [19.99%]
4,875,407 r50412e [20.19%]
189,994,023 LLC-prefetches [20.17%]
Specifically, I want to see how blocking affects L1, L2, and L3 references/misses.
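For reference, the kernel is a standard blocked multiply; a minimal scalar sketch is below. The tile size (128) and matrix dimension (1536) are guesses read off the binary name, and the real binary presumably uses SSE intrinsics rather than plain C, so this only shows the access pattern:

```c
#include <stddef.h>

/* Minimal blocked matmul sketch: C += A * B for n x n row-major
 * matrices, processed tile x tile so each working block of A, B,
 * and C stays cache-resident.  tile = 128 and n = 1536 are assumed
 * from the binary name "matmul_tiled_sse_128.12.1536". */
void matmul_tiled(const double *A, const double *B, double *C,
                  size_t n, size_t tile)
{
    for (size_t ii = 0; ii < n; ii += tile)
        for (size_t kk = 0; kk < n; kk += tile)
            for (size_t jj = 0; jj < n; jj += tile)
                /* multiply one tile x tile block */
                for (size_t i = ii; i < ii + tile && i < n; i++)
                    for (size_t k = kk; k < kk + tile && k < n; k++) {
                        double a = A[i * n + k];
                        for (size_t j = jj; j < jj + tile && j < n; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```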
I used perf list and selected the following events for L2:
Umask-00 : 0x21 : PMU : [DEMAND_DATA_RD_MISS] : None : Demand Data Read requests that miss L2 cache
Umask-01 : 0x41 : PMU : [DEMAND_DATA_RD_HIT] : None : Demand Data Read requests that hit L2 cache
Umask-12 : 0xe1 : PMU : [ALL_DEMAND_DATA_RD] : None : Any data read request to L2 cache
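To map those umasks onto the raw codes I passed to perf: as far as I understand, perf's rXXXXXX raw encoding on Intel follows the IA32_PERFEVTSELx MSR layout (bits 0-7 event select, bits 8-15 unit mask, flag bits such as USR/OS above that). A tiny decoder (my own helper, not part of perf) to sanity-check the codes:

```c
#include <stdint.h>

/* Decode a perf raw event code (rXXXXXX), assuming it follows the
 * IA32_PERFEVTSELx layout: bits 0-7 = event select, bits 8-15 =
 * unit mask, bits 16-23 = flags (USR, OS, edge, INT, EN, ...). */
struct pmu_event {
    unsigned event; /* event select byte */
    unsigned umask; /* unit mask byte    */
    unsigned flags; /* bits 16..23       */
};

struct pmu_event decode_raw(uint32_t raw)
{
    struct pmu_event e;
    e.event = raw & 0xff;
    e.umask = (raw >> 8) & 0xff;
    e.flags = (raw >> 16) & 0xff;
    return e;
}
```

So r53e124 decodes to event 0x24 with umask 0xe1 (ALL_DEMAND_DATA_RD above) and r532124 to event 0x24 with umask 0x21 (DEMAND_DATA_RD_MISS); perf also accepts the symbolic form `cpu/event=0x24,umask=0xe1/`, which avoids hand-packing these bits.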
However, the numbers do not make sense to me.
First, the number of L2 data reads (r53e124) is lower than the number of L1-dcache-load-misses. I checked l1-icache-misses as well, but the sum of L1 misses still exceeds L2 reads by a large amount. One possible reason is L1 miss coalescing: the processor issues many L1 misses in quick succession that all fall in the same cache line, so they get merged into a single request to L2. Since this is a matrix multiplication code, such access patterns are to be expected. Is this the right way to explain these numbers?
Second, L3 references (r504f2e) are much higher than L2 misses (r532124). I can't think of any reason for this.
Am I thinking in the right direction? Have I chosen the right hardware counters?