precision of CPU_Time and System_Clock

Iliya,

I just looked at my sources and I found the following comment:
...
// Overhead of Sleep( 0 ): Debug~=1562 clocks / Release~=1525 clocks
...
So, it is clear that the CPU will do something during that period of time. Wouldn't it be better to discuss all that C/C++ stuff in another thread in a different forum?

iliyapolak,

There are a number of attributes of the timing routines I have investigated, including:
- How fast it runs: The number of processor cycles a call to the timing routine takes.
- How precise it is: How frequently the returned time measure is updated. This indicates how useful this timing can be for short duration events.
- How accurate it is: The accuracy of the reported time over a longer period. I have not concentrated on this aspect of performance.

My interest in how many processor cycles the call takes has not been concerned with what happens in the timing routine when it is called. Your discussion with Sergey about Kernel scheduler etc, which I understand is what is taking place in the timer routine, does not have a significant effect on the way I use these routines.

Over the last 20 years, processor rates have improved by over 1,000 times, from 1 MHz to 3 GHz. Unfortunately the precision of some timers has not matched this improvement, to the extent that they now give poor performance for what program developers require of them.
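
To make the precision point concrete, here is a minimal sketch (illustrative only, not my full test program) that compares the resolution SYSTEM_CLOCK reports with how often its count actually changes:

      program tick_interval
         implicit none
         integer :: rate, c0, c1, c2
         ! resolution the intrinsic claims to have
         call system_clock (count_rate=rate)
         write (*,*) 'reported count_rate      =', rate, ' counts per second'
         ! measure how often the count actually changes
         call system_clock (c0)
         do                                   ! wait for the next change
            call system_clock (c1)
            if (c1 /= c0) exit
         end do
         do                                   ! time one full update interval
            call system_clock (c2)
            if (c2 /= c1) exit
         end do
         write (*,*) 'observed update interval =', real(c2-c1)/real(rate), ' seconds'
      end program tick_interval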

The purpose of my post has been to:
- Highlight the poor performance of the standard Fortran intrinsics available in ifort,
- Identify there are better alternatives for SYSTEM_CLOCK, which I hope could be adopted into ifort, and
- Point out that I have not been able to locate a better routine for CPU_TIME.

I was hoping that someone in this Forum might know a suitable routine and be able to provide a simple Fortran code example for ifort on how to use it. I remain hopeful someone might be able to help.
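
For reference, the kind of alternative I have in mind for SYSTEM_CLOCK is QueryPerformanceCounter. A minimal sketch of calling it from ifort is below (assuming a 64-bit build, where the default C calling convention matches the Win32 API; a 32-bit build would need STDCALL attributes or the IFWIN module instead):

      module qpc_timer
         use, intrinsic :: iso_c_binding
         implicit none
         interface
            ! sketch: assumes a 64-bit Windows build, so no STDCALL decoration is needed
            function QueryPerformanceCounter (count) bind(c, name='QueryPerformanceCounter') result(ok)
               import :: c_int64_t, c_int
               integer(c_int64_t), intent(out) :: count
               integer(c_int) :: ok
            end function QueryPerformanceCounter
            function QueryPerformanceFrequency (freq) bind(c, name='QueryPerformanceFrequency') result(ok)
               import :: c_int64_t, c_int
               integer(c_int64_t), intent(out) :: freq
               integer(c_int) :: ok
            end function QueryPerformanceFrequency
         end interface
      contains
         function qpc_seconds () result (t)
            real(8) :: t
            integer(c_int64_t) :: count, freq
            integer(c_int)     :: ok
            ok = QueryPerformanceFrequency (freq)   ! counts per second
            ok = QueryPerformanceCounter (count)    ! current count
            t  = real(count,8) / real(freq,8)
         end function qpc_seconds
      end module qpc_timer

An elapsed interval is then just the difference between two calls to qpc_seconds.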

John

Hi John,

I am not questioning your findings; I only asked as a matter of interest.

Yes, I agree with you that a Fortran developer should not be concerned with the internal implementation of a timing routine; it is not their task. As for the precision of the system timers, I think that the low precision could be directly related (in some cases) to the multimedia requirements of the modern OS and to system management (thread scheduling).

Quote:

Sergey Kostrov wrote:

Iliya,

I just looked at my sources and I found the following comment:
...
// Overhead of Sleep( 0 ): Debug~=1562 clocks / Release~=1525 clocks
...
So, it is clear that the CPU will do something during that period of time. Wouldn't it be better to discuss all that C/C++ stuff in another thread in a different forum?

Yes, that is true. I think that at the time of the call to the sleep function the calling thread could be put immediately into a standby state, or it could run for some minuscule period until the scheduling decision is made. What I have been able to understand is that on a multiprocessor system the scheduler database is locked while the next runnable thread is found. So during the long processing time of Sleep() the database is locked and no other CPU can make a scheduling decision.

If you are interested I can create a new thread for this discussion, but which IDZ forum should I choose for it?

I might be covering old ground here - but you mention your use of OMP.  On ifort the implementation of OMP_GET_WTIME uses QueryPerformanceCounter.

There are differences in the requirements between SYSTEM_CLOCK and OMP_GET_WTIME in terms of their standard definitions - OMP_GET_WTIME is more relaxed in some ways (it is a thread specific wall time), so that might be part of the reason for the different implementation.  (I see mention of system bugs on the QueryPerformanceCounter msdn page that would be problematic for SYSTEM_CLOCK.) 
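
For example, a minimal sketch (built with /Qopenmp; OMP_GET_WTICK reports the resolution of OMP_GET_WTIME):

      program wtime_demo
         use omp_lib
         implicit none
         integer :: rate
         real(8) :: t0, t1
         call system_clock (count_rate=rate)
         print *, 'SYSTEM_CLOCK resolution  :', 1.0d0/rate, ' seconds'
         print *, 'OMP_GET_WTIME resolution :', omp_get_wtick(), ' seconds'
         t0 = omp_get_wtime()
         ! ... code being timed ...
         t1 = omp_get_wtime()
         print *, 'elapsed                  :', t1 - t0, ' seconds'
      end program wtime_demo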

Further, Intel's docs ascribe a particular meaning to the zero SYSTEM_CLOCK time.  I suspect if they were to change their implementation from using GetLocalTime to QueryPerformanceCounter they might have to lose that meaning.  Not sure.  If that was the case, that could annoy some users relying on the previously documented behaviour.

Again, this might have already been covered (or be obvious from your table) but CPU_TIME is implemented by calling GetProcessTimes and summing the user and kernel time.  Given its definition I don't see how CPU_TIME could be implemented differently; then given the way the Windows scheduler works and the possibility for the program to have multiple threads on multiple processors, I think it is unrealistic to expect GetProcessTimes to have better precision than it does.
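
For what it's worth, a minimal sketch of that equivalence (an illustration only, not the actual runtime source; it assumes a 64-bit build so the default C calling convention matches the Win32 API) is:

      program cpu_time_vs_getprocesstimes
         use, intrinsic :: iso_c_binding
         implicit none
         ! FILETIME: a 64-bit count of 100-nanosecond intervals, split into two DWORDs
         type, bind(c) :: filetime_t
            integer(c_int32_t) :: lo, hi
         end type filetime_t
         interface
            ! sketch: assumes a 64-bit Windows build, so no STDCALL decoration is needed
            function GetCurrentProcess () bind(c, name='GetCurrentProcess') result (h)
               import :: c_intptr_t
               integer(c_intptr_t) :: h
            end function GetCurrentProcess
            function GetProcessTimes (h, tCreate, tExit, tKernel, tUser) &
                     bind(c, name='GetProcessTimes') result (ok)
               import :: c_intptr_t, c_int, filetime_t
               integer(c_intptr_t), value    :: h
               type(filetime_t), intent(out) :: tCreate, tExit, tKernel, tUser
               integer(c_int)                :: ok
            end function GetProcessTimes
         end interface
         type(filetime_t) :: tc, tx, tk, tu
         real    :: t_intrinsic
         real(8) :: x
         integer :: i

         x = 0.0d0
         do i = 1, 50000000                     ! burn some CPU time
            x = x + sqrt(real(i,8))
         end do
         call cpu_time (t_intrinsic)
         if (GetProcessTimes (GetCurrentProcess(), tc, tx, tk, tu) /= 0) then
            write (*,'(a,f10.4,a)') ' CPU_TIME      :', t_intrinsic, ' s'
            write (*,'(a,f10.4,a)') ' user + kernel :', seconds(tu) + seconds(tk), ' s'
         end if
         if (x < 0.0d0) write (*,*) x           ! stop the loop being optimised away
      contains
         real(8) function seconds (ft)
            type(filetime_t), intent(in) :: ft
            integer(8), parameter :: mask32 = 4294967295_8
            integer(8) :: ticks
            ! reassemble the 64-bit count of 100 ns units from the two DWORDs
            ticks = ishft(iand(int(ft%hi,8), mask32), 32) + iand(int(ft%lo,8), mask32)
            seconds = real(ticks,8) * 100.0d-9
         end function seconds
      end program cpu_time_vs_getprocesstimes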

(The reason that GetTickCount is pretty snappy cycle wise is that the tick count is available in user space - no kernel mode transition there.)

>>>(The reason that GetTickCount is pretty snappy cycle wise is that the tick count is available in user space - no kernel mode transition there.)>>>

Yes, that's true. I have found a possible implementation of GetTickCount, and this function accesses the SharedUserData structure in its caller's process address space, hence the very fast execution time. I was simply confused by the existence of KeGetTickCount, which is used by drivers.

Thanks for the valuable information.

>>...I can create new thread for this discussion,but which IDZ forum to choose for it?..

Since this is not related to Intel software it would be nice to create it in:

Watercooler Catchall
software.intel.com/en-us/forums/watercooler-catchall

Quote:

Sergey Kostrov wrote:

>>...I can create new thread for this discussion,but which IDZ forum to choose for it?..

Since this is not related to Intel software it would be nice to create it in:

Watercooler Catchall
software.intel.com/en-us/forums/watercooler-catchall

The threading forum is not necessarily restricted to Intel software if it concerns Intel platforms.

>>...Over the last 20 years, processor rates have improved by over 1,000 times, from 1 MHz to 3 GHz. Unfortunately the precision
>>of some timers has not matched this improvement, to the extent that they now give poor performance for what program
>>developers require of them...

Here are results of three tests ( implemented in C with inline assembler ) on different CPUs:

Intel(R) Core i7-3840QM 2.80GHz ( 4 cores / Ivy Bridge )
...
Test-Case 1 - Overhead of RDTSC instruction
...
RDTSC Overhead Value: 24.000 clock cycles
...

Intel(R) Atom(TM) CPU N270 1.60GHz ( 2 cores / Atom )
...
Test-Case 1 - Overhead of RDTSC instruction
...
RDTSC Overhead Value: 24.000 clock cycles
...

Intel(R) Pentium(R) 4 CPU 1.60GHz ( 1 core / Pentium )
...
Test-Case 1 - Overhead of RDTSC instruction
...
RDTSC Overhead Value: 84.000 clock cycles
...

More detailed results, with a screenshot attached...

...
Test-Case 1 - Overhead of RDTSC instruction
REAL TIME
TIME CRITICAL
RDTSC Overhead Value: 24.000 cycles

Test-Case 2 - Switching CPUs at runtime
Switched to CPU1 - Previous Thread AM: 255 - Error Code: 0
Switched to CPU1 - Previous Thread AM: 16 - Error Code: 0 - Thread Affinity: 1
Switched to CPU2 - Previous Thread AM: 1 - Error Code: 0 - Thread Affinity: 2
Switched to CPU3 - Previous Thread AM: 2 - Error Code: 0 - Thread Affinity: 4
Switched to CPU4 - Previous Thread AM: 4 - Error Code: 0 - Thread Affinity: 8
Switched to CPU5 - Previous Thread AM: 8 - Error Code: 0 - Thread Affinity: 16
Switched to CPU6 - Previous Thread AM: 16 - Error Code: 0 - Thread Affinity: 32
Switched to CPU7 - Previous Thread AM: 32 - Error Code: 0 - Thread Affinity: 64
Switched to CPU8 - Previous Thread AM: 64 - Error Code: 0 - Thread Affinity: 128

Test-Case 3 - Retrieving RDTSC values for CPUs - 1
RDTSC for CPU1 : 40122001028576
RDTSC for CPU2 : 40122001036608
RDTSC Difference: 8032 ( RDTSC2 - RDTSC1 )
dwThreadAMPrev1 : 128 ( Processing Error if 0 )
dwThreadAMPrev2 : 1 ( Processing Error if 0 )

Test-Case 4 - Retrieving RDTSC values for CPUs - 2
Threads 1 and 2 created
RDTSC values ( in CPU clocks ):

Iteration Thread 1 Thread 2 Difference
00 40135961623344 40135961623763 -419
01 40135961623372 40135961623815 -443
02 40135961623400 40135961623851 -451
03 40135961623440 40135961623907 -467
04 40135961623468 40135961623935 -467
05 40135961623496 40135961623963 -467
06 40135961623544 40135961624003 -459
07 40135961623568 40135961624031 -463
08 40135961623596 40135961624055 -459
09 40135961623624 40135961624083 -459
10 40135961623664 40135961624123 -459
11 40135961623688 40135961624151 -463
12 40135961623716 40135961624183 -467
13 40135961623764 40135961624235 -471
14 40135961623800 40135961624263 -463
15 40135961623836 40135961624291 -455

Statistics:
Thread 1 started at 40135961623316
Thread 2 started at 40135961623511
Difference -195

Thread 1 completed at 40135961623896
Thread 2 completed at 40135961624335
Difference -439

dwThreadAMPrev[0]: 255 ( Processing Error if 0 )
dwThreadAMPrev[1]: 255 ( Processing Error if 0 )
...

Attachments:
rdtscoverhead.jpg (124.15 KB)

>>>Threading forum is not necessarily restricted to Intel software if it concerns Intel platforms>>>

Tim, do you mean the Threading Building Blocks forum?

By the way, did you read that article?

Nanosecond-precision Test
Web-link: zeromq.org/results:more-precise-0mq-tests

>>...Over the last 20 years, processor rates have improved by over 1,000 times, from 1 MHz to 3 GHz. Unfortunately the precision
>>of some timers has not matched this improvement, to the extent that they now give poor performance for what program
>>developers require of them.

I think that John is talking mainly about the system timers like RTC.

iliyapolak,

Of the 6 timers I tested, 4 are updated 64 times per second, which includes both Fortran intrinsics. That is once every 42 million processor cycles, which contrasts dramatically with the CPU clocks reported in Sergey's recent posts. The following table summarises the performance I have obtained from the program I attached recently, where:

Routine : is the name of the timing routine being tested ( Fortran intrinsic or API routine)
Elapse : is the duration of the test
Ticks : is the number of changes of time reported during the test
Calls : is the number of times the routine has been called during the test,
Time1 : is the time between tick cycles, timed with QueryPerformanceCounter
Time2 : is the time between tick cycles, timed with test routine
inc : is the average increment in timing value, reported by the test routine
ticks/sec : is the average ticks per second (time changes per second)
cpu : is the estimate of how long the routine takes to return the time value, in processor cycles

Note : my processor runs at 2.67 GHz, so the slowest (CPU_Time) returns the estimate in 0.15 microseconds, or 6.8 million calls per second. This is not a significant overhead for polling the time at a frequency of more than 64 cycles per second.
RDTSC looks like a much better timer for elapsed time estimates.

  Test  Routine               Elapse     Ticks      Calls     time1     time2       inc ticks/sec   cpu
    1  system_clock           5.070       324   40845007 1.565E-02 1.565E-02     15647        64   331
    2  CPU_Time               5.101       326   34510251 1.565E-02 1.565E-02     15648        64   394
    3  GetTickCount           6.287       402  887790725 1.564E-02 1.564E-02        15        64    18
    4  QueryPerformanceCoun   5.632  14625772  283723396 3.851E-07 3.853E-07         1   2595206    52
    5  GetProcessTimes        5.038       322   34090182 1.565E-02 1.565E-02    156486        64   394
    6  RDTSC API              5.289 459827584  459827584 1.150E-08 1.149E-08        30  87008169    30 
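
For reference, one simple way to estimate the sort of number in the cpu column is sketched below (not the attached program's exact method; the 2.67 GHz value is my processor's nominal rate and is illustrative):

      program call_overhead
         implicit none
         integer(8), parameter :: calls  = 20000000_8
         real(8),    parameter :: cpu_hz = 2.67d9     ! nominal processor clock (illustrative)
         integer(8) :: i, n
         real(8)    :: t0, t1, per_call
         call cpu_time (t0)             ! coarse (15.6 ms) but adequate over a multi-second loop
         do i = 1, calls
            call system_clock (n)       ! the routine whose call overhead is being estimated
         end do
         call cpu_time (t1)
         per_call = (t1 - t0) / real(calls,8)
         write (*,'(a,f8.1,a)') ' system_clock overhead ~', per_call*cpu_hz, ' cycles per call'
      end program call_overhead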

Quote:

iliyapolak wrote:

>>>Threading forum is not necessarily restricted to Intel software if it concerns Intel platforms>>>

Tim do you mean Threading Building Blocks forum?

http://software.intel.com/en-us/forums/threading-on-intel-parallel-archi...

Quote:

TimP (Intel) wrote:

Quote:

iliyapolak wrote:

>>>Threading forum is not necessarily restricted to Intel software if it concerns Intel platforms>>>

Tim do you mean Threading Building Blocks forum?

http://software.intel.com/en-us/forums/threading-on-intel-parallel-archi...

Thanks Tim.

This forum seems a good place for my thread:)

>>>Nanosecond-precision Test
Web-link: zeromq.org/results:more-precise-0mq-tests>>>

Great article. Thanks for the link.

In the article ( link is posted above ) there is a number for RDTSC overhead on Intel Pentium 4 CPU:
...
Intel(R) Pentium(R) 4 3 GHz ... 33 ns

and if I normalize my result:

Intel(R) Pentium(R) 4 1.60GHz ... 84 clock cycles

to theirs ( converting from clock cycles to ns and normalizing for GHz ), then this is what I get:

Normalized: ( 84 / 1,600,000,000 ) / ( 3.0 / 1.6 ) = ( 52.5 ns ) / ( 1.875 ) = 28 ns

There is a ~15% difference between the two measurements. Unfortunately, it is not clear how exactly they measured these numbers ( I could post my codes and they are very simple ).

Hi John,

I think that the precision of those 4 timers (64 Hz) is directly related to real-time multimedia needs and to OS thread scheduling. For example, the quantum interval on Windows client versions runs for two timer intervals, ~20 ms (it can be measured with the Clockres tool), so at that granularity the resolution of RDTSC is not needed. I think that RDTSC is the better option when a small portion of code, like a loop, is being measured.

Quote:

Sergey Kostrov wrote:

In the article ( link is posted above ) there is a number for RDTSC overhead on Intel Pentium 4 CPU:
...
Intel(R) Pentium(R) 4 3 GHz ... 33 ns

and if I normalize my result:

Intel(R) Pentium(R) 4 1.60GHz ... 84 clock cycles

to theirs ( converting from clock cycles to ns and normalizing for GHz ), then this is what I get:

Normalized: ( 84 / 1,600,000,000 ) / ( 3.0 / 1.6 ) = ( 52.5 ns ) / ( 1.875 ) = 28 ns

There is a ~15% difference between the two measurements. Unfortunately, it is not clear how exactly they measured these numbers ( I could post my codes and they are very simple ).

RDTSC is so much different on CPUs of recent years than it was on the P4 that I don't see much relevance in comparisons with such old data.  Evidently, on recent CPUs, RDTSC measures baseline clock ticks and multiplies them by the nominal multiplier, so it can be expected not to increment within a baseline clock interval.  RDTSC may access a local core clock so as not to have an access latency dependent on which core accesses it, but of course if you take care to return to the same core for consistency you have greater overhead.  The latter might account for the behavior of Windows QueryPerformanceCounter.

Intel CPUs which have an ETC counter shared by all cores tell you that latency is many times that of RDTSC.

>>...RDTSC is so much different on CPUs of recent years than it was on P4 that I don't see much relevance in comparisons
>>with such old data...

It actually doesn't matter, because John ( in Fortran ) and I ( in C/C++ ) have already developed several simple techniques to measure intervals with the best possible accuracy. There is nothing wrong with using 3 different generations of Intel CPUs in my tests.

John, I wonder if you experienced something like this?

Measured Overhead values of RDTSC instruction are Not the same during a test. Take a look at a log I've recorded:

[ Pentium 4 1.6GHz CPU with Microsoft C++ compiler 32-bit ]
...
RDTSC Overhead Value: 84.000 clock cycles
RDTSC Overhead Value: 84.000 clock cycles
...
RDTSC Overhead Value: 88.000 clock cycles
RDTSC Overhead Value: 88.000 clock cycles
RDTSC Overhead Value: 88.000 clock cycles
RDTSC Overhead Value: 88.000 clock cycles
RDTSC Overhead Value: 84.000 clock cycles
RDTSC Overhead Value: 84.000 clock cycles
...
RDTSC Overhead Value: 84.000 clock cycles
RDTSC Overhead Value: 88.000 clock cycles
RDTSC Overhead Value: 88.000 clock cycles
RDTSC Overhead Value: 88.000 clock cycles
RDTSC Overhead Value: 88.000 clock cycles
RDTSC Overhead Value: 88.000 clock cycles
RDTSC Overhead Value: 88.000 clock cycles
RDTSC Overhead Value: 84.000 clock cycles
...
RDTSC Overhead Value: 84.000 clock cycles
RDTSC Overhead Value: 88.000 clock cycles
RDTSC Overhead Value: 84.000 clock cycles
...
RDTSC Overhead Value: 84.000 clock cycles
RDTSC Overhead Value: 88.000 clock cycles
RDTSC Overhead Value: 84.000 clock cycles
...
RDTSC Overhead Value: 84.000 clock cycles
RDTSC Overhead Value: 88.000 clock cycles
RDTSC Overhead Value: 84.000 clock cycles
...
RDTSC Overhead Value: 84.000 clock cycles
RDTSC Overhead Value: 88.000 clock cycles
RDTSC Overhead Value: 88.000 clock cycles
RDTSC Overhead Value: 88.000 clock cycles
RDTSC Overhead Value: 84.000 clock cycles
RDTSC Overhead Value: 88.000 clock cycles
RDTSC Overhead Value: 88.000 clock cycles
RDTSC Overhead Value: 88.000 clock cycles
RDTSC Overhead Value: 84.000 clock cycles
...
Note: It happens even after I boosted priority of the thread to realtime and used a critical section around a piece of code that does all measurements. I think this is expected because Windows interrupts my processing to do something else.

Here is another set of results:

Test-Case 1.3 measures how many CPU clock cycles are spent on the assignment of a value returned by the RDTSC instruction to a 64-bit variable; when the Final RDTSC Overhead Value is calculated this overhead is taken into account, that is, subtracted.

[ Tests on Pentium system ]

[ Pentium 4 1.6GHz CPU with Microsoft C++ compiler 32-bit ]
...
Test-Case 1.1 - Overhead of RDTSC instruction
Min RDTSC Overhead Value : 84.000 clock cycles

Test-Case 1.2 - Overhead of RDTSC instruction
Total Delta Value : 847384512 clock cycles
Avg RDTSC Overhead Value : 84.738 clock cycles

Test-Case 1.3 - Overhead of Assignment of a Value from RDTSC instruction
Min Overhead of Assignment: 4.000 clock cycles
Final RDTSC Overhead Value: 80.000 clock cycles
...

[ Pentium 4 1.6GHz CPU with Intel C++ compiler 32-bit ]
...
Test-Case 1.1 - Overhead of RDTSC instruction
Min RDTSC Overhead Value : 84.000 clock cycles

Test-Case 1.2 - Overhead of RDTSC instruction
Total Delta Value : 791481152 clock cycles
Avg RDTSC Overhead Value : 79.148 clock cycles

Test-Case 1.3 - Overhead of Assignment of a Value from RDTSC instruction
Min Overhead of Assignment: 4.000 clock cycles
Final RDTSC Overhead Value: 80.000 clock cycles
...

The Average RDTSC Overhead Value is 79.148 clock cycles, and this is further proof that the actual overhead is about ~80 clock cycles instead of 84 clock cycles.

[ Tests on Atom system ]

[ Atom N270 1.6GHz CPU with Microsoft C++ compiler 32-bit ]

Test-Case 1.1 - Overhead of RDTSC instruction
Min RDTSC Overhead Value : 24.000 clock cycles

Test-Case 1.2 - Overhead of RDTSC instruction
Total Delta Value : 301894880 clock cycles
Avg RDTSC Overhead Value : 30.189 clock cycles

Test-Case 1.3 - Overhead of Assignment of a Value from RDTSC instruction
Min Overhead of Assignment: 0.000 clock cycles
Final RDTSC Overhead Value: 24.000 clock cycles

[ Atom N270 1.6GHz CPU with Intel C++ compiler 32-bit ]

Test-Case 1.1 - Overhead of RDTSC instruction
Min RDTSC Overhead Value : 24.000 clock cycles

Test-Case 1.2 - Overhead of RDTSC instruction
Total Delta Value : 306208064 clock cycles
Avg RDTSC Overhead Value : 30.621 clock cycles

Test-Case 1.3 - Overhead of Assignment of a Value from RDTSC instruction
Min Overhead of Assignment: 0.000 clock cycles
Final RDTSC Overhead Value: 24.000 clock cycles

[ Tests on Ivy Bridge system ]

[ Ivy Bridge 2.8GHz CPU with Microsoft C++ compiler 32-bit ]

Test-Case 1.1 - Overhead of RDTSC instruction
Min RDTSC Overhead Value : 24.000 clock cycles

Test-Case 1.2 - Overhead of RDTSC instruction
Total Delta Value : 642290240 clock cycles
Avg RDTSC Overhead Value : 64.229 clock cycles

Test-Case 1.3 - Overhead of Assignment of a Value from RDTSC instruction
Min Overhead of Assignment: 0.000 clock cycles
Final RDTSC Overhead Value: 24.000 clock cycles

[ Ivy Bridge 2.8GHz CPU with Intel C++ compiler 32-bit ]

Test-Case 1.1 - Overhead of RDTSC instruction
Min RDTSC Overhead Value : 24.000 clock cycles

Test-Case 1.2 - Overhead of RDTSC instruction
Total Delta Value : 463084704 clock cycles
Avg RDTSC Overhead Value : 46.308 clock cycles

Test-Case 1.3 - Overhead of Assignment of a Value from RDTSC instruction
Min Overhead of Assignment: 0.000 clock cycles
Final RDTSC Overhead Value: 24.000 clock cycles

Regarding Overhead of Assignment.

>>...
>>Min Overhead of Assignment: 4.000 clock cycles
>>...

4 clock cycles means that I managed to measure the time of some processing and it was done in 2.5 nano-seconds (!) on the Pentium 4 1.6GHz CPU.

Sergey,

If you are asking a question on observed variability of:
- call duration, or
- variability in the time parameter increment between successive changes,
I have noted significant variability in my measures of both. For example, in the following code fragment, modified from my earlier attachment, the variation in the change in NUM can be calculated as a mean and standard deviation, and I have reported when the d_num value deviates significantly from the average. These instances occur with all the timing routines, and some of the deviations from the average can be many standard deviations.
I think there are multiple reasons for this, related to the system running many processes.

      do n = 1,n_lim
          call use_system_clock (time, num)
          if (num == last_num) cycle
          last_num = num         ! last tick
          i        = i+1
          list(i)  = num
          sec(i)   = time
          if (i == size(list)) exit
       end do 
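
and then, after the loop, the increments can be summarised along these lines (a sketch continuing the fragment above; d_num, mean and sd are assumed to be declared as real(8) and the 4-sigma threshold is illustrative):

          ! statistics on the increments between successive ticks
          d_num(1:i-1) = real(list(2:i) - list(1:i-1), 8)
          mean = sum(d_num(1:i-1)) / real(i-1,8)
          sd   = sqrt(sum((d_num(1:i-1) - mean)**2) / real(i-1,8))
          do n = 1, i-1
             if (abs(d_num(n) - mean) > 4.0d0*sd) &        ! report the outriders
                write (*,*) 'outrider at tick', n, ': increment', d_num(n), ', mean', mean
          end do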

What I do not know is whether these variations affect the long-term accuracy of the time measure. In the case of elapsed time, it would depend on how the timer is calibrated and what the ultimate reference clock is. RDTSC could be a reference clock, but there might be others. Any attempt to identify better accuracy is probably not justified, as the elapsed-time performance measure of a process is probably more affected by other background processes, so it is all part of the noise.
In the case of CPU time, this is an accumulation of the estimate of when the process is running. This has been more relevant when trying to identify lost time, such as I/O interrupt delays when I have tried to better manage buffered I/O performance. It is becoming less significant than elapsed time.

The optimisation problems I am attempting to solve are not at the few-processor-cycles level, but they are certainly shorter than 42 million processor cycles, which is the precision you can get from the Fortran intrinsics.

The short answer is: I have seen the variability of performance in a number of areas, and I think most is due to shared processes. I am not convinced this variability is a significant problem.

John

PS: I hope this post gets delivered soon, as my last post was held over for review, making it less relevant to the later posts when it was finally released. Not sure what made it qualify for being stopped?

Also, why can't this Windows forum accept Windows cut and paste?

>>...What I do not know, is if these variations effect the long term accuracy of the time measure...

All observed variations happen because our test applications are executed in a multi-threaded environment, and you rightly assumed that measurements are '...affected by other background processes, so it is all part of the noise...'.

>>... I have seen the variability of performance in a number of areas, and I think most is due to shared processes...

Could you provide a couple of examples?

My point of view is as follows: I wouldn't be worried if a 15-minute computation executed for 15,000 nanoseconds longer. But it is a real problem if a really small computation, for example one usually executed in 0.005 seconds, is in some cases executed 2.5 times slower, or so.

>>>I think there are multiple reasons for this, related to the system running many processes.>>>

I think that such behaviour can be related to thread context switching. This may explain the variation in your measurements. Try to run your thread at a higher priority level to diminish the frequency of thread context switches and observe how the variation changes.

>>>Also, why can't this windows forum accept windows cut and paste ?>>>

I have never experienced such an issue in my browser. Btw, I am using Firefox.

>>>I think this is expected because Windows interrupts my processing to do something else.>>>

I think that this is related to servicing interrupts and to running scheduler code itself.

Sergey,

Attached is an example of testing variability of System_Clock. (The version I had was from another compiler, so hopefully this version works with ifort. I don't have ifort at home. If not I will resubmit on Monday.)
This can be adapted for testing CPU_Time also.

John

Attachments:
cpu-time-ifort.f90 (8.66 KB)

Thanks John for the updated sources.

>>Regarding Overhead of Assignment.
>>
>>>>...
>>>>Min Overhead of Assignment: 4.000 clock cycles
>>>>...

That doesn't look right and it has to be around 1 clock cycle ( for a regular MOV instruction ). Am I wrong?

[ Intel(R) Pentium(R) 4 CPU 1.60GHz ]

[ Microsoft C++ compiler - DEBUG ]

Test-Case 1.1 - Overhead of RDTSC instruction
Min RDTSC Overhead Value : 84.000 clock cycles

Test-Case 1.2 - Overhead of RDTSC instruction
Total Delta Value : 791947072 clock cycles
Avg RDTSC Overhead Value : 79.195 clock cycles

Test-Case 1.3 - Overhead of Assignment of a Value from RDTSC instruction
Min Overhead of Assignment: 1.273 clock cycles
Final RDTSC Overhead Value: 82.727 clock cycles

I finally have very consistent results for RDTSC Overhead ( Latency ) obtained with Intel C++ compiler.

[ Intel(R) Pentium(R) 4 CPU 1.60GHz ]

[ Intel C++ compiler - DEBUG ]

Test-Case 1.1 - Overhead of RDTSC instruction
Min RDTSC Overhead Value : 84.000 clock cycles

Test-Case 1.2 - Overhead of RDTSC instruction
Total Delta Value : 885178944 clock cycles
Avg RDTSC Overhead Value : 88.518 clock cycles

Test-Case 1.3 - Overhead of Assignment of a Value from RDTSC instruction
Min Overhead of Assignment: 1.091 clock cycles
Final RDTSC Overhead Value: 82.909 clock cycles

[ Intel C++ compiler - RELEASE ]

Test-Case 1.1 - Overhead of RDTSC instruction
Min RDTSC Overhead Value : 84.000 clock cycles

Test-Case 1.2 - Overhead of RDTSC instruction
Total Delta Value : 791483712 clock cycles
Avg RDTSC Overhead Value : 79.148 clock cycles

Test-Case 1.3 - Overhead of Assignment of a Value from RDTSC instruction
Min Overhead of Assignment: 1.191 clock cycles
Final RDTSC Overhead Value: 82.809 clock cycles

Notes:

- Results for RDTSC Overhead Value differ by ~0.12% for tests in DEBUG and RELEASE configurations

- Results for Overhead of Assignment differ by ~8.5% - ~9% for tests in DEBUG and RELEASE configurations ( accuracy of measurements decreases for really small time intervals, like a couple of nano-seconds )

I will also provide some additional information on how I've measured the Overhead of Assignment. It is tricky because the Intel and Microsoft C++ compilers don't generate the same number of CPU instructions, as the disassembled code demonstrates.

There is no latency number for the RDTSC instruction in the following manual.

Intel(R) 64 and IA-32 Architectures Optimization Reference Manual
Order Number: 248966-026
April 2012

Page 767
Table C-16a. General Purpose Instructions (Contd.)
...
Throughput of RDTSC instruction:

~28 - 06_2A, 06_2D
~31 - 06_25/2C/1A/1E/1F/2E/2F
~31 - 06_17, 06_1D
...

Note 1:

CPUID Signature Values of DisplayFamily_DisplayModel:

06_3AH - Microarchitecture Ivy Bridge
06_2AH - Microarchitecture Sandy Bridge
06_2DH - Microarchitecture Sandy Bridge ( Xeon )
06_25H - Microarchitecture Westmere
06_2CH - Microarchitecture Westmere
06_1AH - Microarchitecture Nehalem
06_1EH - Microarchitecture Nehalem
06_1FH - Microarchitecture Nehalem
06_2EH - Microarchitecture Nehalem
06_2FH - Microarchitecture Westmere
06_17H - Microarchitecture Enhanced Intel Core
06_1DH - Microarchitecture Enhanced Intel Core

Note 2:

Throughput - The number of clock cycles required to wait before the issue
ports are free to accept the same instruction again. For many instructions, the
throughput of an instruction can be significantly less than its latency.

Sergey,

I updated the program testing variability. As the SYSTEM_CLOCK and CPU_TIME intrinsics only tick at 64 cycles per second, I also tested QueryPerformanceCounter, which has a much higher rate, so that the variability is more noticeable. The intrinsics do still show some variability in the elapsed time test.

The results for my processor show the accuracy of the 3 timing routines tested ( using RDTSC as the reference) are:

 CPU_TIME Variability Test
 Calls per second         =   6629241.52383241    
 Cycles per second        =   64.1021553426797    
 cpu_time accuracy        =  1.560009947643979E-002  seconds
   average RDTSC ticks per cycle   41599668.8534031  
   standard deviation (in ticks)   135195.923755431   
   variability                    3.249927883605512E-003

 System_Clock Variability Test
 Calls per second         =   8672321.84910597    
 Cycles per second        =   64.1081552551243   
 System_Clock accuracy    =  1.559863945578231E-002  seconds  
  average RDTSC ticks per cycle   41599764.0136054  
  standard deviation (in ticks)   80004.3281514126    
  variability                    1.923191874964643E-003

 Query_Perform Variability Test
 Calls per second         =   46191928.3624866    
 Cycles per second        =   2589364.74907529    
 Query_Perform accuracy   =  3.861951084168883E-007  seconds 
  average RDTSC ticks per cycle   1029.87607752155   
  standard deviation (in ticks)   1072.71121520382   
  variability                     1.04159251643689    

The key variability measures which measure the accuracy of when each routine ticks over are:
 cpu_time   :   standard deviation (in ticks)   135,195.     ( 51 microseconds)
 System_Clock :  standard deviation (in ticks)   80,004.  ( 30 microseconds)
 Query_Perform :  standard deviation (in ticks)   1,072.  ( 0.4 microseconds)

This test shows there is some variation in QueryPerformanceCounter, but it is less effective for CPU_Time and System_Clock in showing the outriders (significant variation in the time between ticks), due to their long tick duration. However it does show a large variation in their tick rate in comparison to Query_Perform, when measured as time.  All this is based on the assumption of the accuracy of RDTSC.
QueryPerformanceCounter displays the effect of other system interruptions in the reporting of its tick interval.

The purpose of this test was to try and estimate the reliability and accuracy of the tick intervals.

John

 

Attachments:
cpu-time-ifort.f90 (18.34 KB)
cpu-time-ifort.log (27.32 KB)
