I'm experimenting with a few different approaches to software prefetching and would appreciate some information or advice. For example, how can I count the occurrences of load/prefetch instruction issue stalls caused by full load buffers in the micro-architecture (is LD_BLOCKS.ALL_BLOCK the right event)? I presume prefetch requests go through the load buffers; is that correct? I'd like to measure the number of L1 (*), L2 (L2_DATA_RQSTS.DEMAND.*) and LLC (*) misses due to demand loads in order to determine which software prefetching scheme is better. I am happy to use PAPI, raw MSR approaches, or an Intel Amplifier based method.
In short, I'm tuning software prefetching and am looking for a performance-counter approach to identify potential issues.
I know I can probably turn off hardware prefetching in the BIOS, but I have not tried this yet in my runs.
Advice on the specific events (*) to count, and/or on calculations that combine multiple events, would be really useful. I'd like to separate demand misses from prefetch misses so that I can tune my scheme.
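To make the question concrete, a minimal sketch of a software-prefetched loop might look like the following. It is only an illustration, not the actual code being tuned: the function name `sum_with_prefetch` and the `distance` parameter are hypothetical, and it assumes GCC or Clang, which provide the `__builtin_prefetch` builtin.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example: sum an array while prefetching `distance`
 * elements ahead of the current index. `distance` is the tuning knob
 * whose effect the performance counters are meant to reveal. */
uint64_t sum_with_prefetch(const uint32_t *data, size_t n, size_t distance)
{
    assert(data != NULL || n == 0);

    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        /* Only prefetch addresses that lie inside the array. */
        if (i + distance < n) {
            /* Second argument 0 = prefetch for read;
             * third argument 3 = high temporal locality hint. */
            __builtin_prefetch(&data[i + distance], 0, 3);
        }
        sum += data[i]; /* the demand load whose misses are of interest */
    }
    return sum;
}
```

Running the same loop with several `distance` values gives the variants whose demand-load miss counts would then be compared against each other.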