Can't see my previous message, so reposting it...
I'm interested in the Branch Trace Store (BTS) feature of the Pentium 4 processor, but it seems to be very slow. On my computer (CPUID signature 0F29), "loop $" executes about 380 times slower with BTS active, even when the buffers reside entirely in L1 cache (interrupts disabled, Hyper-Threading turned off). Is there a way to make it faster?
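To make clear what the processor has to write per branch, here is a rough sketch of the 32-bit BTS record layout as I understand it from the manual (three dwords: source, target, and a flags dword whose bit 4 marks a predicted branch). The struct, macro, and function names are my own, and the layout is an assumption to be checked against the documentation rather than a definitive description.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed layout of one BTS record in 32-bit mode: three dwords, 12 bytes. */
struct bts_record {
    uint32_t from;   /* address the branch was taken from */
    uint32_t to;     /* address the branch went to */
    uint32_t flags;  /* bit 4 set: branch was predicted (assumption) */
};

#define BTS_FLAG_PREDICTED (1u << 4)

/* Number of records between the buffer base and the current BTS index. */
static size_t bts_record_count(const struct bts_record *base,
                               const struct bts_record *index)
{
    return (size_t)(index - base);
}

/* Count how many of the first n records have the "predicted" bit set. */
static size_t bts_count_predicted(const struct bts_record *buf, size_t n)
{
    size_t hits = 0;
    for (size_t i = 0; i < n; i++) {
        if (buf[i].flags & BTS_FLAG_PREDICTED)
            hits++;
    }
    return hits;
}
```

With "loop $" every iteration except the last is a taken branch, so under this layout each iteration should append one 12-byte record to the buffer.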
For comparison, I tried to implement the same functionality using single-stepping on branches. By my measurements, it takes over 1000 clock ticks just to invoke (and return from) the debug exception handler, and about as much again for the 3 necessary MSR accesses (in total 4 times slower than with BTS). Why do these apparently simple things work so slowly?
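For anyone comparing the two setups, this is a small sketch of the control bits involved, assuming the Pentium 4 MSR_DEBUGCTLA layout (MSR 1D9H with LBR, BTF, TR, BTS, BTINT in bits 0 to 4) and TF as EFLAGS bit 8. The helper names are hypothetical, and the actual WRMSR and the debug exception handler (ring 0 code) are left out. As far as I can tell, the processor clears BTF and LBR when it raises the debug exception, so the handler has to re-arm them each time, which would account for part of the MSR traffic.

```c
#include <stdint.h>

/* Assumed Pentium 4 (family 0FH) debug control MSR and bit positions. */
#define MSR_DEBUGCTLA   0x1D9u
#define DEBUGCTLA_LBR   (1u << 0)  /* record last branches */
#define DEBUGCTLA_BTF   (1u << 1)  /* single-step on branches */
#define DEBUGCTLA_TR    (1u << 2)  /* send branch trace messages */
#define DEBUGCTLA_BTS   (1u << 3)  /* store trace messages in the BTS buffer */
#define DEBUGCTLA_BTINT (1u << 4)  /* interrupt when the buffer threshold is hit */

#define EFLAGS_TF       (1u << 8)  /* trap flag */

/* Value to write to MSR_DEBUGCTLA to log branches into the BTS buffer. */
static uint32_t debugctl_for_bts(uint32_t current)
{
    return current | DEBUGCTLA_TR | DEBUGCTLA_BTS;
}

/* Value to write to MSR_DEBUGCTLA for single-stepping on branches. */
static uint32_t debugctl_for_branch_step(uint32_t current)
{
    return current | DEBUGCTLA_LBR | DEBUGCTLA_BTF;
}

/* EFLAGS image with TF set; BTF only changes what TF traps on. */
static uint32_t eflags_with_tf(uint32_t eflags)
{
    return eflags | EFLAGS_TF;
}
```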
Other questions concerning BTS:
Does it make sense to use the WC (write-combining) memory type for large BTS buffers in order to avoid cache pollution?
The CPUID instruction reference mentions a CPL Qualified Debug Store feature (DS-CPL flag). Is it implemented on any processors?