RDTSC Latency

I've been reading forum threads about RDTSC accuracy. For the purposes of discussion, let's assume our assembly code looks like this:
main:
rdtsc
mov esi, eax
rdtsc
sub eax, esi
; eax now has time difference between the first and second rdtsc

I'm curious why the latency of this instruction is as long as 100 cycles on P4 and 60 cycles on Xeon 51xx architectures. If RDTSC is not serializing (as the Intel Software Developer's Manual states), why should it take that long? Some potential explanations I gleaned from reading the other posts are:
(1) long sequence of microops of RDTSC
http://software.intel.com/en-us/forums//topic/52330
(2) resolution being limited by bus speed
http://software.intel.com/en-us/forums//topic/52330
(3) "synchronization of pipeline" (but I thought the manual expressly said that was not happening?)
http://software.intel.com/en-us/forums//topic/52482

Any thoughts as to which of these is dominant? Or perhaps these are not the real reasons? I'd appreciate any help on this.

Thanks in advance...

We forwarded this question to several of our engineering contacts. One replied:

rdtsc does not have this sort of granularity.
For best results, make sure you have at least ~1,000 clocks worth of instructions between consecutive rdtsc calls.

The refs cited below are accurate -- i.e., rdtsc is multiple uops, and it does not serialize the machine, but it is serializing with respect to itself. The resolution on newer machines (including those with Core 2 Duo processors) is a multiple of bus clocks as well, just as you noted.

All that said, you will have more repeatable results if you have a large number of clocks between successive calls -- I recommend somewhere in the neighborhood of at least 1000.
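
To make the "at least 1000 clocks" advice concrete, here is a rough sketch in C of how one might size a repeat count so that the timed interval is long enough. The helper names (reps_needed, avg_ticks_per_rep) and the MIN_TICKS constant are made up for illustration, and the threshold is simply the ~1,000 figure suggested above, not a documented hardware limit.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed threshold, taken from the ~1,000-clock suggestion above. */
#define MIN_TICKS 1000u

/* Hypothetical helper: how many back-to-back repetitions of a code
   region are needed so that the timed interval spans at least
   MIN_TICKS, given a rough estimate of the ticks one repetition takes. */
uint64_t reps_needed(uint64_t est_ticks_per_rep)
{
    if (est_ticks_per_rep == 0)
        return MIN_TICKS;            /* no estimate: assume 1 tick each */
    if (est_ticks_per_rep >= MIN_TICKS)
        return 1;                    /* one repetition is already long enough */
    /* Ceiling division: smallest reps with reps * estimate >= MIN_TICKS. */
    return (MIN_TICKS + est_ticks_per_rep - 1) / est_ticks_per_rep;
}

/* Hypothetical helper: average ticks per repetition, given two 64-bit
   counter readings taken around `reps` repetitions of the region. */
uint64_t avg_ticks_per_rep(uint64_t start, uint64_t end, uint64_t reps)
{
    assert(reps > 0 && end >= start);
    return (end - start) / reps;
}
```

For a region estimated at about 60 ticks, reps_needed(60) comes out to 17, so the two counter reads would bracket 17 repetitions and the total would be divided by 17 afterwards. The average still includes the cost of the two counter reads themselves, spread across the repetitions.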

Another nitpick: the timestamp counter is a 64-bit quantity. Subtracting only the lower 32 bits is not a safe technique. The code given can occasionally give wild and incorrect results when the counts wrap past 2^32.
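
As a small illustration of the 64-bit point, the sketch below (in C rather than assembly, with made-up function names) combines the two halves that RDTSC returns in EDX:EAX and compares a full 64-bit difference with the low-32-bit-only difference that `sub eax, esi` computes. The 32-bit result is only the true difference modulo 2^32, so it misreports any interval of 2^32 ticks or more.

```c
#include <assert.h>
#include <stdint.h>

/* RDTSC returns the counter in two registers: EDX holds the high
   32 bits and EAX the low 32 bits. Combine them into one value. */
uint64_t tsc_combine(uint32_t edx_high, uint32_t eax_low)
{
    return ((uint64_t)edx_high << 32) | eax_low;
}

/* Elapsed ticks using the full 64-bit readings. */
uint64_t tsc_delta64(uint64_t start, uint64_t end)
{
    assert(end >= start);   /* assumes the counter was not reset in between */
    return end - start;
}

/* What subtracting only the low halves yields: the true difference
   reduced modulo 2^32. */
uint32_t tsc_delta32(uint64_t start, uint64_t end)
{
    return (uint32_t)((uint32_t)end - (uint32_t)start);
}
```

For example, with start = 0 and end = 0x100000005 (just over 2^32 ticks later), tsc_delta64 gives 0x100000005 while tsc_delta32 gives only 5.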

==

Lexi S.

Intel Software Network Support

http://www.intel.com/software