We've had some trouble understanding performance counter results
on P4/Xeon (collected with VTune 2.0 on Linux). Maybe this is just a
series of misunderstandings of the documentation, but anyway:
1) The IA32 Architecture Optimization document says that the P4's
hardware counter "2nd Level Cache Read Misses" has bugs that can cause
miscounting by a factor of two. Since measurements of the same code with
the same data size deliver reproducible counts in VTune, this bug must
occur only under specific circumstances. Is anything known about what
those circumstances are? Some algorithms seem to give reliable counts,
while others are obviously miscounted. It would be great if a correct
result could be derived from the measurements plus some assumptions or
estimates.
2) If data is loaded that is not in the 2nd Level Cache, the cache loads
two cache lines from memory. Is that counted as one or two events
for "2nd Level Cache Read Misses"? And which counters count L2 cache
3) "2nd Level Cache Load Misses Retired" counts the loads that missed in
the L2 cache, and "2nd Level Cache Read Misses" counts the memory load
misses as seen by the bus queue (VTune Reference). Can it be assumed
that, apart from some error due to instruction loads and so on, the
difference between the two is a measure of 2nd Level Cache Write
Misses? If not, how else could write misses be determined?
4) In the P4 Architecture Optimization document it is noted that
P4's event "2nd Level Cache Load Misses Retired" 'is known to undercount
when loads are apart'. Could you explain why/when this occurs,
and give an estimate of the factor by which it undercounts
for some specific code example?
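
To make questions 1) and 3) concrete, this is roughly the arithmetic we
have in mind. It is only a sketch of what we are asking about, not a
confirmed method: the function names are our own, the numbers in the
usage note are made up, and treating the factor-of-two error as possible
in either direction is an assumption on our part.

```python
def write_miss_estimate(read_misses_bus, load_misses_retired):
    """Naive estimate from question 3: bus-level L2 read misses minus
    retired L2 load misses. Clamped at zero, because miscounting in
    either counter can make the raw difference negative."""
    return max(0, read_misses_bus - load_misses_retired)


def read_miss_bounds(raw_count, factor=2):
    """Range of plausible true counts if the raw counter value may be
    off by up to `factor` in either direction (our assumption)."""
    return raw_count // factor, raw_count * factor


def write_miss_range(read_misses_bus, load_misses_retired, factor=2):
    """Propagate the read-miss uncertainty into the write-miss estimate."""
    low, high = read_miss_bounds(read_misses_bus, factor)
    return (write_miss_estimate(low, load_misses_retired),
            write_miss_estimate(high, load_misses_retired))
```

With invented counts of 1,000,000 read misses and 800,000 retired load
misses, the naive estimate is 200,000 write misses, but the factor-of-two
uncertainty widens that to anywhere between 0 and 1,200,000. That is why
we would like to know when the miscounting actually happens.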
Can anyone help us?