In our software we have a routine that performs a calculation over 520-byte data blocks. The logic is simple: a giant loop selects one block from a pool, in random rather than sequential order, runs the calculation over it, then returns to the top and selects the next block.
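To make the access pattern concrete, here is a minimal sketch of the loop. It is only an illustration: the names `process_pool` and `BLOCK_SIZE` are hypothetical, the byte sum is a stand-in for the real calculation, and the real routine is not written in Python.

```python
import random

BLOCK_SIZE = 520  # bytes per block, as described above


def process_pool(pool: bytes, iterations: int, seed: int = 0) -> int:
    """Visit blocks in random order and run a stand-in calculation on each."""
    n_blocks = len(pool) // BLOCK_SIZE
    rng = random.Random(seed)
    total = 0
    for _ in range(iterations):
        idx = rng.randrange(n_blocks)        # random, not sequential, selection
        start = idx * BLOCK_SIZE
        block = pool[start:start + BLOCK_SIZE]
        total += sum(block)                  # placeholder for the real calculation
    return total
```

The point is that consecutive iterations touch unrelated addresses, so whatever lies just past the end of one block is unlikely to be needed next.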
We suspect that execution is suboptimal and that the HW prefetcher wastes memory bandwidth. At 520 bytes, a block is big enough to trigger HW prefetching, but too short for the prefetching to be useful, and nothing stops the prefetcher from requesting memory beyond the block boundaries.
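A rough back-of-the-envelope check of that suspicion, as a sketch under stated assumptions: 64-byte cache lines, blocks packed back to back in a 64-byte-aligned pool, and a deliberately simplistic prefetcher model that fetches `extra_lines` whole lines past the end of each block. The real prefetcher behavior is exactly what I do not know, so `extra_lines` is a hypothetical parameter, not a measured value.

```python
CACHE_LINE = 64   # assumption: 64-byte cache lines
BLOCK_SIZE = 520  # block size from the description above


def lines_touched(start: int, size: int = BLOCK_SIZE, line: int = CACHE_LINE) -> int:
    """Number of cache lines covered by the bytes [start, start + size)."""
    first = start // line
    last = (start + size - 1) // line
    return last - first + 1


def overfetch_bytes(start: int, extra_lines: int = 0,
                    size: int = BLOCK_SIZE, line: int = CACHE_LINE) -> int:
    """Bytes brought into the cache that the block itself does not contain.

    extra_lines models a prefetcher running past the block end (assumption).
    """
    return (lines_touched(start, size, line) + extra_lines) * line - size


# Blocks packed at a 520-byte stride always span 9 lines (576 bytes).
print({lines_touched(i * BLOCK_SIZE) for i in range(64)})   # {9}
print(overfetch_bytes(0))                  # 56: demand fetches alone
print(overfetch_bytes(0, extra_lines=1))   # 120: one line prefetched past the end
```

Under these assumptions even a single useless prefetched line per block more than doubles the over-fetched bytes, which is why I want to measure it rather than guess.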
I would like to verify this hypothesis. To do that, I need to measure the number of memory requests that were initiated by the HW prefetcher but never demanded by the actual code, i.e. requests whose prefetched values are never used by any instruction in the calculation.
Is there a way to do so?