Q&A: RDTSC to measure performance of small # of FP calculations

Intel Software Network Support
August 15, 2007 6:20 PM PDT
#3

Here's a link to a related post:  Problem with rdtsc on Pentium D processor

Also, here is another variation we recently received on the same basic question above, included here to increase the probability of this solution coming up in keyword searches:

I am trying to use the RDTSC instruction to time my high-performance code. I was trying to figure out how many cycles the RDTSC instruction takes (this isn't documented anywhere, as far as I can tell). I have a small bit of assembly code that demonstrates a problem I'm having. It compiles and runs fine with both Intel and GNU compilers (3.3, 4.0, etc.).

When I compile this and execute it under Cygwin (running on Windows XP) on an AMD 4200, I get

./a.exe
ticks per rdtsc 6
 
That isn't 1 or 2, but I can live with 6 clock ticks to process a seldom-called op.

If I compile and run this under Mac OS X (new Apple MacBook Pro, Intel Core 2), I get 65?!?!

If I compile and run this on SUSE Linux on a Xeon processor, I get 85?!?! (The Intel and GNU compilers agree on this.)

I'm not even putting in any serializing instructions. Does that look right to anyone?

Can anyone verify they get the same results on their x86 machines?

The code:

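#include <stdio.h>

int main(void)
{
  unsigned long long int t0, t1;
  int result;
  unsigned int ret0[2];
  unsigned int ret1[2];
  __asm__ __volatile__("rdtsc" : "=a"(ret0[0]), "=d"(ret0[1]));
  /* 512 iterations x 16 RDTSCs = 8192 reads. RDTSC overwrites EAX and EDX,
     so they are listed as clobbers along with the ECX loop counter. */
  __asm__ __volatile__("xorl %%ecx, %%ecx\n"
"1:\n"
    "rdtsc\n"
    "rdtsc\n"
    "rdtsc\n"
    "rdtsc\n"
    "rdtsc\n"
    "rdtsc\n"
    "rdtsc\n"
    "rdtsc\n"
    "rdtsc\n"
    "rdtsc\n"
    "rdtsc\n"
    "rdtsc\n"
    "rdtsc\n"
    "rdtsc\n"
    "rdtsc\n"
    "rdtsc\n"
    "addl    $16, %%ecx\n"
    "cmpl    $8192, %%ecx\n"
    "jne    1b"
    : : : "eax", "ecx", "edx", "cc");
  __asm__ __volatile__("rdtsc" : "=a"(ret1[0]), "=d"(ret1[1]));

  /* Combine EDX:EAX into 64 bits without pointer type-punning. */
  t0 = ((unsigned long long int)ret0[1] << 32) | ret0[0];
  t1 = ((unsigned long long int)ret1[1] << 32) | ret1[0];
  result = (int)((t1 - t0) / 8192);
  printf("ticks per rdtsc %d\n", result);
  return result;
}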

I've run this on various AMD and Intel machines. Most of the AMD machines return in 6 to 8 cycles; most of the Intel machines I've tried return in 60 to 80 cycles, sometimes as high as 100 cycles. It would be nice if there were some way to query the time register faster. This makes performance measuring and tuning a sketchy affair on Intel chips. Is the rdtsc instruction serializing for some reason (draining the pipelines...)?

Our engineers agree this question is also addressed by the solution in the first Q&A above.
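
On the serializing question: Intel's architecture manuals describe RDTSC as not a serializing instruction, and they suggest executing LFENCE immediately before it when the read must wait for earlier instructions. A common way to cope with the cost of the read itself is to measure it and subtract it from each timed region. The following is a minimal sketch of that idea, not necessarily the solution from the first Q&A. It assumes GCC or Clang on x86 with SSE2 (for the `__rdtsc()` and `_mm_lfence()` intrinsics from `<x86intrin.h>`); the names `read_tsc_fenced`, `collect_tsc_samples`, and `min_delta` are hypothetical helpers made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__) || (defined(__i386__) && defined(__SSE2__))
#include <x86intrin.h>   /* __rdtsc() and _mm_lfence() on GCC and Clang */

/* Hypothetical helper: issue LFENCE, then read the time-stamp counter.
   Whether LFENCE orders RDTSC on non-Intel processors is an assumption
   to verify against that vendor's documentation. */
static inline uint64_t read_tsc_fenced(void)
{
    _mm_lfence();
    return __rdtsc();
}

/* Hypothetical helper: fill samples[0..n-1] with back-to-back readings. */
void collect_tsc_samples(uint64_t *samples, size_t n)
{
    for (size_t i = 0; i < n; i++)
        samples[i] = read_tsc_fenced();
}
#endif

/* Smallest difference between consecutive samples. Taking the minimum
   rather than the mean discards readings inflated by interrupts or
   cache misses. Requires at least two samples. */
uint64_t min_delta(const uint64_t *samples, size_t n)
{
    assert(n >= 2);
    uint64_t best = samples[1] - samples[0];
    for (size_t i = 2; i < n; i++) {
        uint64_t d = samples[i] - samples[i - 1];
        if (d < best)
            best = d;
    }
    return best;
}
```

To use it, one would collect a few thousand samples with `collect_tsc_samples`, take `min_delta` over them as the per-read overhead, and subtract that value from the elapsed count of each timed region. For a handful of FP operations the remaining figure can still be within the noise, so timing many repetitions of the region and dividing is usually more trustworthy.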

==

Lexi S.

Intel(R) Software Network Support

http://www.intel.com/software
