Here's a link to a related post: Problem with rdtsc on Pentium D processor
Also, here is another variation we recently received on the same basic question above, included here to increase the probability of this solution coming up in keyword searches:
I am trying to use the RDTSC instruction to time my high-performance code. I was trying to figure out how many cycles the RDTSC instruction itself takes (this isn't documented anywhere, as far as I can tell). I have a small bit of assembly code that demonstrates a problem I'm having. It compiles and runs fine with both the Intel and GNU compilers (3.3, 4.0, etc.).
When I compile this and run it under Cygwin (on Windows XP) on an AMD 4200, I get
./a.exe
ticks per rdtsc 6
which isn't 1 or 2, but I can live with 6 clock ticks for a seldom-called op.
If I compile and run this under Mac OS X (new Apple MacBook Pro, Intel Core 2), I get 65?!?!
If I compile and run this on SUSE Linux on a Xeon processor, I get 85?!?! (The Intel and GNU compilers agree on this.)
I'm not even adding any serializing instructions. Does that look right to anyone?
Can anyone verify they get the same results on their x86 machines?
The code:
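#include <stdio.h>

int main(void)
{
    unsigned long long int t0, t1;
    int result;
    unsigned int ret0[2];
    unsigned int ret1[2];

    /* Starting timestamp: RDTSC returns the low half in EAX, the high half in EDX. */
    __asm__ __volatile__("rdtsc" : "=a"(ret0[0]), "=d"(ret0[1]));

    /* 512 loop iterations x 16 RDTSC instructions = 8192 executions. */
    __asm__ __volatile__(
        "xorl %%ecx, %%ecx\n"
        "1:\n"
        "rdtsc\n"
        "rdtsc\n"
        "rdtsc\n"
        "rdtsc\n"
        "rdtsc\n"
        "rdtsc\n"
        "rdtsc\n"
        "rdtsc\n"
        "rdtsc\n"
        "rdtsc\n"
        "rdtsc\n"
        "rdtsc\n"
        "rdtsc\n"
        "rdtsc\n"
        "rdtsc\n"
        "rdtsc\n"
        "addl $16, %%ecx\n"
        "cmpl $8192, %%ecx\n"
        "jne 1b"
        : /* no outputs */
        : /* no inputs */
        : "eax", "ecx", "edx", "cc"); /* registers and flags the loop overwrites */

    /* Ending timestamp. */
    __asm__ __volatile__("rdtsc" : "=a"(ret1[0]), "=d"(ret1[1]));

    /* Join the halves with shifts instead of casting the array pointers. */
    t0 = ((unsigned long long int)ret0[1] << 32) | ret0[0];
    t1 = ((unsigned long long int)ret1[1] << 32) | ret1[0];
    result = (int)((t1 - t0) / 8192);
    printf("ticks per rdtsc %d\n", result);
    return result;
}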
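The closing arithmetic of the listing (joining the EAX and EDX halves into one 64-bit count, then dividing the elapsed ticks by the 8192 executed instructions) can be checked on its own, without any inline assembly. The following is a minimal sketch of that step only; the helper names combine_tsc and ticks_per_rdtsc are hypothetical and do not appear in the original post.

```c
#include <stdint.h>

/* Hypothetical helper: join the two 32-bit halves that RDTSC leaves in
   EDX:EAX (hi:lo) into a single 64-bit tick count. */
static uint64_t combine_tsc(uint32_t lo, uint32_t hi)
{
    return ((uint64_t)hi << 32) | lo;
}

/* Hypothetical helper: average ticks per instruction over `count`
   back-to-back executions. The subtraction is unsigned, so the result
   is still correct if the counter wraps once between the two readings. */
static uint64_t ticks_per_rdtsc(uint64_t t0, uint64_t t1, uint64_t count)
{
    return (t1 - t0) / count;
}
```

With the figures from the post, a difference of 6 * 8192 ticks between the two readings gives ticks_per_rdtsc(t0, t1, 8192) == 6. The result is an average across the whole loop, so it includes the loop's add/compare/branch overhead as well.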
I've run this on various AMD and Intel machines. Most of the AMD machines return in 6 to 8 cycles; most of the Intel machines I've tried return in 60 to 80 cycles, sometimes as high as 100 cycles. It would be nice if there were some way to query the time register faster. This makes performance measuring and tuning a sketchy affair on Intel chips. Is the rdtsc instruction serializing for some reason (draining the pipelines...)?
Our engineers agree this question is also addressed by the solution in the first Q&A above.
==
Lexi S.
Intel(R) Software Network Support
http://www.intel.com/software