I've been trying to understand how the bus and memory performance counters actually work. My first question is about BUS_TRANS_BRD. According to its definition, it counts the number of "burst read" transactions, including L1 data cache read misses, L2 hardware prefetches and IFU misses. This is not clear enough to me. Does it count only those L1 requests that also miss in L2? If so, why does a multi-threaded program running on two cores that share an L2 cache produce almost no L2 cache misses (L2_LINES_IN) but a lot of BUS_TRANS_BRDs:
L2_LINES_IN is about 1000,
BUS_TRANS_BRD is about 100 million.
If, on the other hand, this counter simply counts L1 data read misses, then why does a single-threaded program with a small data set (about 100K, large enough not to fit in L1) not produce any BUS_TRANS_BRDs?
My second question is about L2_LINES_IN and MEM_LOAD_RETIRED:L2_LINE_MISS. According to their definitions, L2_LINES_IN counts the number of lines allocated in L2 and L2_LINE_MISS counts the number of loads that missed L2. Based on these definitions, L2_LINES_IN should always be at least as large as L2_LINE_MISS, because whenever a load misses the cache, a line must be allocated in L2. However, in my multi-threaded application, with two threads sharing an L2 cache, there are no cache misses (L2_LINES_IN) because the data is small and fits in L2, yet there are plenty of MEM_LOAD_RETIRED:L2_LINE_MISS events (90 million). How can this be?
My multi-threaded program allocates a shared integer array (of length 25, or 40K in another test) and then spawns two threads that swap numbers within this shared array. Before each swap, a thread locks the items involved (spinlocks, one lock per array item) and then performs the swap.
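To make the access pattern concrete, here is a minimal sketch of a program of the kind described above. It is not the actual test code: all names (`run_swaps`, `swap_worker`, and so on) are hypothetical, it assumes POSIX spinlocks (`pthread_spin_lock`, available on Linux/glibc), it assumes the pair to swap is picked pseudo-randomly, and it assumes the two locks are taken in index order to avoid deadlock. None of those details are stated in the description.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* State shared by both threads (hypothetical layout). */
typedef struct {
    int *data;                 /* shared integer array */
    pthread_spinlock_t *locks; /* one spinlock per array item */
    size_t n;                  /* number of items */
    long iters;                /* swaps attempted per thread */
} shared_t;

/* Per-thread state: pointer to shared data plus a private RNG state. */
typedef struct {
    shared_t *sh;
    unsigned state;
} worker_t;

/* Small linear congruential generator; the choice of RNG is an assumption. */
static unsigned next_rand(unsigned *s) {
    *s = *s * 1664525u + 1013904223u;
    return *s >> 8;
}

static void *swap_worker(void *arg) {
    worker_t *w = arg;
    shared_t *sh = w->sh;
    for (long k = 0; k < sh->iters; k++) {
        size_t i = next_rand(&w->state) % sh->n;
        size_t j = next_rand(&w->state) % sh->n;
        if (i == j) continue;
        /* Always lock the lower index first so the threads cannot deadlock. */
        if (i > j) { size_t t = i; i = j; j = t; }
        pthread_spin_lock(&sh->locks[i]);
        pthread_spin_lock(&sh->locks[j]);
        int tmp = sh->data[i];
        sh->data[i] = sh->data[j];
        sh->data[j] = tmp;
        pthread_spin_unlock(&sh->locks[j]);
        pthread_spin_unlock(&sh->locks[i]);
    }
    return NULL;
}

/* Runs two swapping threads over an n-item array; returns the final sum. */
long run_swaps(size_t n, long iters) {
    shared_t sh = { malloc(n * sizeof(int)),
                    malloc(n * sizeof(pthread_spinlock_t)), n, iters };
    assert(sh.data && sh.locks);
    for (size_t i = 0; i < n; i++) {
        sh.data[i] = (int)i;
        pthread_spin_init(&sh.locks[i], PTHREAD_PROCESS_PRIVATE);
    }
    worker_t w[2] = { { &sh, 1u }, { &sh, 2u } };
    pthread_t tid[2];
    for (int t = 0; t < 2; t++)
        pthread_create(&tid[t], NULL, swap_worker, &w[t]);
    for (int t = 0; t < 2; t++)
        pthread_join(tid[t], NULL);
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += sh.data[i];
        pthread_spin_destroy(&sh.locks[i]);
    }
    free(sh.locks);
    free(sh.data);
    return sum;
}
```

Calling `run_swaps(25, 1000000)` from `main` would correspond to the 25-item test. Since swaps only permute the values 0 to 24, the returned sum should be 300 if the locking is correct. With 4-byte integers, 25 items occupy only 100 bytes, so with 64-byte cache lines both threads end up writing to the same few lines (array and locks alike).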
Bus transactions and cache miss