Is there an in-depth explanation of how the timing of performance counters (especially cache miss counters) interacts with the surrounding code? Maybe there is a specific section in Appendix B of the Optimization Reference Manual that I have missed so far?
(PMC configured to count L1D cache misses)
    rdpmc                // store eax
    mov xmm0, [esi]      // read from [esi]
    mov xmm1, [edi]      // read from [edi]
    rdpmc
Now assume that esi and edi both point to the same location, which is initially not in L1. What difference between the two L1 PMC readings will be observable?
And why? IMHO there are a lot of things (pipelining, out-of-order execution, stalling) that can influence the result. Is this documented?
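To make the ordering concern concrete: rdpmc is not a serializing instruction, so without extra fencing the two loads are not guaranteed to execute between the two counter reads. Below is a minimal sketch of how I would bracket the loads, assuming GCC/Clang inline asm on x86-64, counter 0 already programmed for L1D misses, and user-mode rdpmc enabled. The names count_between, read_pmc0, and fake_counter are made up for illustration; fake_counter is only a stub so the bracketing logic can be exercised without PMC access.

```c
#include <stdint.h>

typedef uint64_t (*counter_fn)(void);

/* Order the counter reads against the loads.
   On non-x86 targets this is only a compiler barrier (assumption). */
static inline void fence(void) {
#if defined(__x86_64__)
    __asm__ volatile("lfence" ::: "memory");
#else
    __asm__ volatile("" ::: "memory");
#endif
}

#if defined(__x86_64__)
/* Reads PMC 0. Faults unless the OS permits user-mode rdpmc. */
static inline uint64_t read_pmc0(void) {
    uint32_t lo, hi;
    __asm__ volatile("rdpmc" : "=a"(lo), "=d"(hi) : "c"(0));
    return ((uint64_t)hi << 32) | lo;
}
#endif

/* Stub counter for trying the logic without hardware counters. */
static uint64_t fake_value;
static uint64_t fake_counter(void) { return fake_value++; }

/* Returns the counter delta observed across two loads. */
static uint64_t count_between(counter_fn read_counter,
                              const volatile uint32_t *p,
                              const volatile uint32_t *q) {
    fence();
    uint64_t before = read_counter();
    fence();                 /* loads must not start before this read */
    uint32_t a = *p;         /* read from [esi] */
    uint32_t b = *q;         /* read from [edi] */
    fence();                 /* loads must complete before the next read */
    uint64_t after = read_counter();
    fence();
    (void)a; (void)b;
    return after - before;
}
```

With real counters this would be called as count_between(read_pmc0, p, p). Even with the fences, I do not know whether the second load to the same line is counted as a separate miss while the first is still outstanding, which is exactly what I am asking about.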
Thanks for your help