Minimal Averaged Delta of Intel RDTSC and RDTSCP instructions

[ Abstract ]

Time-Interval Measurements using TSC

On Intel CPUs, the Time Stamp Counter ( TSC ) is a special 64-bit register that increments every
clock cycle. Two instructions, RDTSC and RDTSCP, read the value of TSC into General Purpose
Registers ( GPR ). Intel doesn't provide any information on latencies of these two instructions;
however, throughputs for both instructions are given in the Intel 64 and IA-32 Architectures
Optimization Reference Manual.

[ List of Abbreviations ]

CPU - Central Processing Unit
ILP - Instruction Level Parallelism
TSC - Time Stamp Counter ( number of clock cycles since the CPU is powered on )
GPR - General Purpose Registers
NIL - Native Internal Latency
OEL - Observed External Latency
MAD - Minimal Averaged Delta ( number of clock cycles between two calls to RDTSC or RDTSCP instructions )
DOU - Degree of Uncertainty ( unknowns related to Superscalar processing with ILP )
ATV - Absolute TSC Value
DTV - Difference TSC Value
UTV - Uncorrected TSC Value
CTV - Corrected TSC Value

[ Details ]

There are two points of view among Software Engineers and Computer Scientists on whether the
latency of RDTSC or RDTSCP instructions, which is not officially documented, needs to be taken
into account when making precise time measurements.

Here is a list of terms that will be used:

- a Native Internal Latency ( NIL ) for RDTSC and RDTSCP instructions
- an Observed External Latency ( OEL ) for RDTSC and RDTSCP instructions
- a Minimal Averaged Delta ( MAD ) for RDTSC and RDTSCP instructions
- a Degree of Uncertainty ( DOU ) of Instruction Level Parallelism of a CPU with a superscalar
architecture

NIL of RDTSC or RDTSCP instructions is a minimal number of clock cycles needed to move
a 64-bit value of TSC to EDX:EAX or RDX:RAX GPRs before the value becomes available
for an external program.

OEL is a minimal difference between two TSC values returned by two calls of RDTSC or RDTSCP
instructions that are not interrupted by the OS. It is calculated as follows:

OEL = ( TSC2 - TSC1 ) * DOU

where

TSC1 = READ_TSC
TSC2 = READ_TSC

When DOU is set to 1.0 it is assumed that there is no Instruction Level Parallelism and
instructions are executed one after another. Only positive numbers are valid for DOU and
0.0 value of DOU is excluded. DOU is a very empirical number because some instructions are
designed for out-of-order execution by a CPU.

MAD is a number of clock cycles it takes to execute one RDTSC or RDTSCP instruction in
a series of calls to RDTSC or RDTSCP instructions. A series of calls of the same instruction
needs to be executed in order to fill a CPU pipeline and to retire non RDTSC or RDTSCP
instructions. MAD is calculated as follows:

MAD = ( ( TSC2 - TSC1 - SAVE_TSC1_LATENCY ) / NumOfInstructionsToFillPipeline ) * DOU

where

TSC1 = READ_TSC
TSC2 = READ_TSC
SAVE_TSC1_LATENCY is a latency of MOV instruction to save EAX or RAX GPRs

Note: EDX or RDX registers are not saved, to improve accuracy of measurements. Only the low
32 bits of TSC are compared, so it is possible that the value in EAX or RAX wraps around
between the two reads.
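
As a worked instance of the OEL and MAD formulas above, here is a minimal C sketch. The helper names ( compute_oel, compute_mad, tsc_delta32 ) are hypothetical and are not part of the tests; the sample numbers in the note below are assumed for illustration only. The last helper shows that when only the low 32 bits are kept, unsigned subtraction still gives the right delta across a single wrap-around.

```c
#include <stdint.h>

/* OEL = ( TSC2 - TSC1 ) * DOU */
static double compute_oel(uint64_t tsc1, uint64_t tsc2, double dou)
{
    return (double)(tsc2 - tsc1) * dou;
}

/* MAD = ( ( TSC2 - TSC1 - SAVE_TSC1_LATENCY ) / NumOfInstructionsToFillPipeline ) * DOU */
static double compute_mad(uint64_t tsc1, uint64_t tsc2,
                          uint64_t save_tsc1_latency,
                          unsigned int num_instructions, double dou)
{
    return ((double)(tsc2 - tsc1 - save_tsc1_latency) / num_instructions) * dou;
}

/* Delta of the low 32 bits only ( EAX ). Unsigned arithmetic is modulo 2^32,
   so one wrap-around between the two reads does not corrupt the result. */
static uint32_t tsc_delta32(uint32_t eax1, uint32_t eax2)
{
    return eax2 - eax1;
}
```

With DOU = 1.0, an assumed raw delta of 800 clock cycles, 10 instructions and a 1-cycle MOV, compute_mad gives ( 800 - 1 ) / 10 = 79.90 clock cycles. tsc_delta32( 0xFFFFFF00, 0x00000220 ) also gives 800 even though EAX wrapped between the reads.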

Whether the NIL of RDTSC or RDTSCP instructions is about 1-2 clock cycles on 32-bit and 64-bit
CPUs is a very speculative matter. However, it is clear that Intel CPU microcode should read
and move the TSC value to GPRs as fast as possible.

A set of properties of a NIL could be as follows:

- NIL is always a constant for a given CPU architecture
- NIL cannot be estimated externally because it is not clear how many micro-ops of a CPU are
needed to complete RDTSC or RDTSCP instructions
- NIL cannot be measured externally because it is always hidden and is a part of MAD

An OEL is always higher than NIL for a given CPU and could be equal to MAD when DOU is 1.0.

A series of tests, implemented in C with some portions of the code in inline assembler, was
completed and MAD values were calculated.

[ Pseudo-code of Tests ]

A pseudo-code of tests to evaluate MAD of RDTSC or RDTSCP instructions is as follows:

SET_PRIORITY_TO_REALTIME
TSC1 = READ_TSC
SAVE_TSC1 ;; Its latency is SAVE_TSC1_LATENCY
;; Fill CPU pipeline
RDTSC() ;; 1
RDTSC() ;; 2
RDTSC() ;; 3
RDTSC() ;; 4
RDTSC() ;; 5
RDTSC() ;; 6
RDTSC() ;; 7
RDTSC() ;; 8
RDTSC() ;; 9
RDTSC() ;; 10
;;
TSC2 = READ_TSC
MAD = ( ( TSC2 - TSC1 - SAVE_TSC1_LATENCY ) / 10 ) * DOU
SET_PRIORITY_TO_NORMAL

where
DOU = 1.0
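
The pseudo-code above could be sketched in C roughly as follows. This is not the actual test source: it assumes GCC or Clang on x86 ( the __rdtsc intrinsic from x86intrin.h ), uses full 64-bit TSC values instead of EAX only, and leaves out the priority changes because they are OS-specific. The names measure_mad_rdtsc, HAVE_RDTSC, NUM_RDTSC_CALLS and the 1-cycle SAVE_TSC1_LATENCY are assumptions made for the example.

```c
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>              /* __rdtsc() on GCC and Clang */
#define HAVE_RDTSC 1
#else
#define HAVE_RDTSC 0
#endif

#define NUM_RDTSC_CALLS   10        /* instructions used to fill the CPU pipeline */
#define SAVE_TSC1_LATENCY 1         /* assumed latency of the MOV that saves TSC1 */
#define DOU               1.0       /* no Instruction Level Parallelism assumed */

/* Returns MAD in clock cycles, or -1.0 when not built for x86. */
static double measure_mad_rdtsc(void)
{
#if HAVE_RDTSC
    uint64_t tsc1, tsc2;

    /* SET_PRIORITY_TO_REALTIME is OS-specific and omitted in this sketch. */
    tsc1 = __rdtsc();               /* TSC1 = READ_TSC */

    /* Fill CPU pipeline. Check the disassembly to confirm that the
       compiler kept all ten instructions. */
    (void)__rdtsc();                /* 1 */
    (void)__rdtsc();                /* 2 */
    (void)__rdtsc();                /* 3 */
    (void)__rdtsc();                /* 4 */
    (void)__rdtsc();                /* 5 */
    (void)__rdtsc();                /* 6 */
    (void)__rdtsc();                /* 7 */
    (void)__rdtsc();                /* 8 */
    (void)__rdtsc();                /* 9 */
    (void)__rdtsc();                /* 10 */

    tsc2 = __rdtsc();               /* TSC2 = READ_TSC */

    return ((double)(tsc2 - tsc1 - SAVE_TSC1_LATENCY) / NUM_RDTSC_CALLS) * DOU;
#else
    return -1.0;
#endif
}
```

Calling measure_mad_rdtsc() several times in a row and printing each result would produce a series similar in form to the outputs listed further down; the first call is likely to be the noisiest.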

[ Computer Systems used for evaluations ]

** Dell Precision Mobile M4700 **

Intel Core i7-3840QM ( 2.80 GHz )
Ivy Bridge / 4 cores / 8 logical CPUs / ark.intel.com/products/70846
32GB RAM
320GB HDD
NVIDIA Quadro K1000M ( 192 CUDA cores / 2GB memory )
Windows 7 Professional 64-bit SP1
Size of L3 Cache = 8MB ( shared between all cores for data & instructions )
Size of L2 Cache = 1MB ( 256KB per core / shared for data & instructions )
Size of L1 Cache = 256KB ( 32KB per core for data & 32KB per core for instructions )
Display resolution: 1366 x 768

** Dell Dimension 4400 **

Intel Pentium 4 ( 1.60 GHz / 1 core )
1GB RAM
Seagate 20GB HDD ( * )
Seagate 3TB HDD ( ** )
EVGA GeForce 6200 Video Card 512MB DDR2 AGP 8x
Windows XP Professional 32-bit SP3
Size of L2 Cache = 256KB
Size of L1 Cache = 8KB
Display resolution: 1440 x 990

( * ) Seagate Barracuda 20GB IDE Hard Disk Drive
ST320011A
3.5" 7200 Rpm 2MB Cache IDE Ultra ATA100 / ATA-iV/6
Average Rotational Latency : 4.17 ms
Average Seek Times Read : 9.0ms
Average Seek Times Write : 10.0ms
Maximum Internal Transfer Rate : 69.4MB/sec
Average External Transfer Rate : 100MB/sec ( Read and Write )
Maximum External Transfer Rate : 150MB/sec ( Read )
Note: Barracuda ATA IV Family

( ** ) Seagate Barracuda 3TB SATA Hard Disk Drive
ST3000DM001
3.5" 7200 Rpm 64MB Cache SATA III ( 6Gb/sec )
Average Rotational Latency : 4.16 ms
Average Seek Times Read : 8.5ms
Average Seek Times Write : 9.5ms
Maximum Internal Transfer Rate : 268MB/sec
Average External Transfer Rate : 156MB/sec ( Read and Write )
Maximum External Transfer Rate : 210MB/sec ( Read )

[ List of tests ]

Four tests were run for every CPU, each with several different C++ compilers:

[ Sub-Test002.01.A - RDTSC ] - pure C language

[ Sub-Test002.01.B - RDTSC ] - C language with inline assembler

[ Sub-Test002.01.C - RDTSCP ] - pure C language

[ Sub-Test002.01.D - RDTSCP ] - C language with inline assembler

Four possible use cases for __rdtscp intrinsic function need to be considered. The function is
declared as follows:

...
extern unsigned __int64 __ICL_INTRINCC __rdtscp( unsigned int * );
...

Note: Let's denote uiTscValue as 1st value, and iRetValue as 2nd value.

Use Case 1 - 1st value used / 2nd value used:

...
unsigned int iRetValue = 0;
unsigned __int64 uiTscValue = __rdtscp( &iRetValue );
...

C++ compiler should generate ordered MOV instructions to save 1st value and 2nd value
at some addresses.

Use Case 2 - 1st value used / 2nd value not used:

...
unsigned __int64 uiTscValue = __rdtscp( NULL );
...

C++ compiler should not generate MOV instructions to save 2nd value at NULL address. Currently,
Intel C++ compiler tries to save 2nd value to NULL address and Access Violation exception is generated.

Use Case 3 - 1st value not used / 2nd value used:

...
unsigned int iRetValue = 0;
__rdtscp( &iRetValue );
...

C++ compiler should not generate MOV instructions to save 1st value at some address.

Use Case 4 - 1st value not used / 2nd value not used:

...
__rdtscp( NULL );
...

C++ compiler should not generate MOV instructions to save 1st value and 2nd value at some addresses.
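
Until the NULL cases are handled by the compiler, one possible workaround is to always hand the intrinsic a valid address and copy the 2nd value out only on request. The sketch below assumes GCC or Clang on x86 ( cpuid.h and x86intrin.h ); has_rdtscp, read_tscp and HAVE_X86 are hypothetical names for this example, not part of the tests. It also checks CPUID.80000001H:EDX bit 27 first, since CPUs without RDTSCP ( such as the Pentium 4 above ) would fault on the instruction.

```c
#include <stdint.h>
#include <stddef.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>                  /* __get_cpuid() on GCC and Clang */
#include <x86intrin.h>              /* __rdtscp() */
#define HAVE_X86 1
#else
#define HAVE_X86 0
#endif

/* Returns 1 if the CPU reports RDTSCP support, otherwise 0. */
static int has_rdtscp(void)
{
#if HAVE_X86
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid(0x80000001u, &eax, &ebx, &ecx, &edx))
        return 0;                   /* extended leaf not available */
    return (int)((edx >> 27) & 1u);
#else
    return 0;
#endif
}

/* Reads TSC with RDTSCP. aux_out may be NULL when the 2nd value is not needed.
   Call only when has_rdtscp() returned 1. */
static uint64_t read_tscp(unsigned int *aux_out)
{
#if HAVE_X86
    unsigned int aux = 0;           /* always a valid address for the intrinsic */
    uint64_t tsc = __rdtscp(&aux);  /* 1st value: TSC, 2nd value: IA32_TSC_AUX */

    if (aux_out != NULL)
        *aux_out = aux;
    return tsc;
#else
    (void)aux_out;
    return 0;
#endif
}
```

With such a wrapper, Use Case 2 becomes read_tscp( NULL ) and Use Case 1 becomes read_tscp( &iRetValue ). The extra store to a local variable may add a little to the measured delta, so it is a trade-off rather than a replacement for a compiler fix.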

[ CPU: Pentium 4 - Microsoft C++ compiler - 32-bit ]

[ Sub-Test002.01.A - RDTSC ] - Started
TSC Minimal Averaged Delta is 80.00 clock cycles
TSC Minimal Averaged Delta is 80.00 clock cycles
TSC Minimal Averaged Delta is 80.00 clock cycles
TSC Minimal Averaged Delta is 80.00 clock cycles
TSC Minimal Averaged Delta is 80.00 clock cycles
TSC Minimal Averaged Delta is 80.00 clock cycles
TSC Minimal Averaged Delta is 80.00 clock cycles
TSC Minimal Averaged Delta is 80.00 clock cycles
TSC Minimal Averaged Delta is 80.00 clock cycles
TSC Minimal Averaged Delta is 80.00 clock cycles
[ Sub-Test002.01.A - RDTSC ] - Completed

[ Sub-Test002.01.B - RDTSC ] - Started
TSC Minimal Averaged Delta is 79.90 clock cycles
TSC Minimal Averaged Delta is 79.90 clock cycles
TSC Minimal Averaged Delta is 79.90 clock cycles
TSC Minimal Averaged Delta is 79.90 clock cycles
TSC Minimal Averaged Delta is 79.90 clock cycles
TSC Minimal Averaged Delta is 79.90 clock cycles
TSC Minimal Averaged Delta is 79.90 clock cycles
TSC Minimal Averaged Delta is 79.90 clock cycles
TSC Minimal Averaged Delta is 79.90 clock cycles
TSC Minimal Averaged Delta is 79.90 clock cycles
Latency of 'MOV ecx, eax' instruction is 1 clock cycle(s)
[ Sub-Test002.01.B - RDTSC ] - Completed

[ Sub-Test002.01.C - RDTSCP ] - Not Supported

[ Sub-Test002.01.D - RDTSCP ] - Not Supported

[ CPU: Pentium 4 - Borland C++ compiler - 32-bit ]

[ Sub-Test002.01.A - RDTSC ] - Started
TSC Minimal Averaged Delta is 80.00 clock cycles
TSC Minimal Averaged Delta is 80.00 clock cycles
TSC Minimal Averaged Delta is 80.00 clock cycles
TSC Minimal Averaged Delta is 80.00 clock cycles
TSC Minimal Averaged Delta is 80.00 clock cycles
TSC Minimal Averaged Delta is 80.00 clock cycles
TSC Minimal Averaged Delta is 80.00 clock cycles
TSC Minimal Averaged Delta is 80.40 clock cycles
TSC Minimal Averaged Delta is 80.00 clock cycles
TSC Minimal Averaged Delta is 80.00 clock cycles
[ Sub-Test002.01.A - RDTSC ] - Completed

[ Sub-Test002.01.B - RDTSC ] - Not Supported

[ Sub-Test002.01.C - RDTSCP ] - Not Supported

[ Sub-Test002.01.D - RDTSCP ] - Not Supported

[ CPU: Pentium 4 - Intel C++ compiler - 32-bit ]

[ Sub-Test002.01.A - RDTSC ] - Started
TSC Minimal Averaged Delta is 81.20 clock cycles
TSC Minimal Averaged Delta is 80.00 clock cycles
TSC Minimal Averaged Delta is 80.00 clock cycles
TSC Minimal Averaged Delta is 80.00 clock cycles
TSC Minimal Averaged Delta is 80.00 clock cycles
TSC Minimal Averaged Delta is 80.00 clock cycles
TSC Minimal Averaged Delta is 80.00 clock cycles
TSC Minimal Averaged Delta is 80.00 clock cycles
TSC Minimal Averaged Delta is 80.00 clock cycles
TSC Minimal Averaged Delta is 80.00 clock cycles
[ Sub-Test002.01.A - RDTSC ] - Completed

[ Sub-Test002.01.B - RDTSC ] - Started
TSC Minimal Averaged Delta is 80.30 clock cycles
TSC Minimal Averaged Delta is 79.90 clock cycles
TSC Minimal Averaged Delta is 79.90 clock cycles
TSC Minimal Averaged Delta is 79.90 clock cycles
TSC Minimal Averaged Delta is 79.90 clock cycles
TSC Minimal Averaged Delta is 79.90 clock cycles
TSC Minimal Averaged Delta is 79.90 clock cycles
TSC Minimal Averaged Delta is 79.90 clock cycles
TSC Minimal Averaged Delta is 79.90 clock cycles
TSC Minimal Averaged Delta is 79.90 clock cycles
Latency of 'MOV ecx, eax' instruction is 1 clock cycle(s)
[ Sub-Test002.01.B - RDTSC ] - Completed

[ Sub-Test002.01.C - RDTSCP ] - Not Supported

[ Sub-Test002.01.D - RDTSCP ] - Not Supported

[ CPU: Pentium 4 - MinGW C++ compiler - 32-bit ]

[ Sub-Test002.01.A - RDTSC ] - Started
TSC Minimal Averaged Delta is 80.00 clock cycles
TSC Minimal Averaged Delta is 80.00 clock cycles
TSC Minimal Averaged Delta is 80.00 clock cycles
TSC Minimal Averaged Delta is 80.00 clock cycles
TSC Minimal Averaged Delta is 80.00 clock cycles
TSC Minimal Averaged Delta is 80.00 clock cycles
TSC Minimal Averaged Delta is 80.00 clock cycles
TSC Minimal Averaged Delta is 80.00 clock cycles
TSC Minimal Averaged Delta is 80.00 clock cycles
TSC Minimal Averaged Delta is 80.00 clock cycles
[ Sub-Test002.01.A - RDTSC ] - Completed

[ Sub-Test002.01.B - RDTSC ] - Not Supported

[ Sub-Test002.01.C - RDTSCP ] - Not Supported

[ Sub-Test002.01.D - RDTSCP ] - Not Supported

[ CPU: Pentium 4 - Watcom C++ compiler - 32-bit ]

[ Sub-Test002.01.A - RDTSC ] - Started
TSC Minimal Averaged Delta is 80.40 clock cycles
TSC Minimal Averaged Delta is 80.00 clock cycles
TSC Minimal Averaged Delta is 80.00 clock cycles
TSC Minimal Averaged Delta is 80.00 clock cycles
TSC Minimal Averaged Delta is 80.00 clock cycles
TSC Minimal Averaged Delta is 80.00 clock cycles
TSC Minimal Averaged Delta is 80.00 clock cycles
TSC Minimal Averaged Delta is 80.00 clock cycles
TSC Minimal Averaged Delta is 80.00 clock cycles
TSC Minimal Averaged Delta is 80.00 clock cycles
[ Sub-Test002.01.A - RDTSC ] - Completed

[ Sub-Test002.01.B - RDTSC ] - Not Supported

[ Sub-Test002.01.C - RDTSCP ] - Not Supported

[ Sub-Test002.01.D - RDTSCP ] - Not Supported

[ CPU: Ivy Bridge - Microsoft C++ compiler - 32-bit ]

[ Sub-Test002.01.A - RDTSC ] - Started
TSC Minimal Averaged Delta is 25.00 clock cycles
TSC Minimal Averaged Delta is 25.80 clock cycles
TSC Minimal Averaged Delta is 26.20 clock cycles
TSC Minimal Averaged Delta is 27.40 clock cycles
TSC Minimal Averaged Delta is 28.20 clock cycles
TSC Minimal Averaged Delta is 26.60 clock cycles
TSC Minimal Averaged Delta is 28.20 clock cycles
TSC Minimal Averaged Delta is 26.60 clock cycles
TSC Minimal Averaged Delta is 28.60 clock cycles
TSC Minimal Averaged Delta is 28.20 clock cycles
[ Sub-Test002.01.A - RDTSC ] - Completed

[ Sub-Test002.01.B - RDTSC ] - Started
TSC Minimal Averaged Delta is 27.10 clock cycles
TSC Minimal Averaged Delta is 26.70 clock cycles
TSC Minimal Averaged Delta is 26.70 clock cycles
TSC Minimal Averaged Delta is 27.10 clock cycles
TSC Minimal Averaged Delta is 26.70 clock cycles
TSC Minimal Averaged Delta is 27.50 clock cycles
TSC Minimal Averaged Delta is 26.70 clock cycles
TSC Minimal Averaged Delta is 27.10 clock cycles
TSC Minimal Averaged Delta is 26.70 clock cycles
TSC Minimal Averaged Delta is 26.70 clock cycles
Latency of 'MOV ecx, eax' instruction is 1 clock cycle(s)
[ Sub-Test002.01.B - RDTSC ] - Completed

[ Sub-Test002.01.C - RDTSCP ] - Not Supported

[ Sub-Test002.01.D - RDTSCP ] - Not Supported

[ CPU: Ivy Bridge - Microsoft C++ compiler - 64-bit ]

[ Sub-Test002.01.A - RDTSC ] - Started
TSC Minimal Averaged Delta is 26.20 clock cycles
TSC Minimal Averaged Delta is 26.20 clock cycles
TSC Minimal Averaged Delta is 26.60 clock cycles
TSC Minimal Averaged Delta is 26.20 clock cycles
TSC Minimal Averaged Delta is 25.80 clock cycles
TSC Minimal Averaged Delta is 26.20 clock cycles
TSC Minimal Averaged Delta is 26.20 clock cycles
TSC Minimal Averaged Delta is 26.20 clock cycles
TSC Minimal Averaged Delta is 26.20 clock cycles
TSC Minimal Averaged Delta is 25.80 clock cycles
[ Sub-Test002.01.A - RDTSC ] - Completed

[ Sub-Test002.01.B - RDTSC ] - Not Supported

[ Sub-Test002.01.C - RDTSCP ] - Not Supported

[ Sub-Test002.01.D - RDTSCP ] - Not Supported

[ CPU: Ivy Bridge - Borland C++ compiler - 32-bit ]

[ Sub-Test002.01.A - RDTSC ] - Started
TSC Minimal Averaged Delta is 25.80 clock cycles
TSC Minimal Averaged Delta is 28.30 clock cycles
TSC Minimal Averaged Delta is 27.00 clock cycles
TSC Minimal Averaged Delta is 27.00 clock cycles
TSC Minimal Averaged Delta is 25.00 clock cycles
TSC Minimal Averaged Delta is 27.00 clock cycles
TSC Minimal Averaged Delta is 25.00 clock cycles
TSC Minimal Averaged Delta is 27.00 clock cycles
TSC Minimal Averaged Delta is 27.00 clock cycles
TSC Minimal Averaged Delta is 27.00 clock cycles
[ Sub-Test002.01.A - RDTSC ] - Completed

[ Sub-Test002.01.B - RDTSC ] - Not Supported

[ Sub-Test002.01.C - RDTSCP ] - Not Supported

[ Sub-Test002.01.D - RDTSCP ] - Not Supported

[ CPU: Ivy Bridge - Borland C++ compiler - 64-bit ]

[ Sub-Test002.01.A - RDTSC ] - Not Supported

[ Sub-Test002.01.B - RDTSC ] - Not Supported

[ Sub-Test002.01.C - RDTSCP ] - Not Supported

[ Sub-Test002.01.D - RDTSCP ] - Not Supported

[ CPU: Ivy Bridge - Intel C++ compiler - 32-bit ]

[ Sub-Test002.01.A - RDTSC ] - Started
TSC Minimal Averaged Delta is 29.00 clock cycles
TSC Minimal Averaged Delta is 25.40 clock cycles
TSC Minimal Averaged Delta is 32.60 clock cycles
TSC Minimal Averaged Delta is 29.60 clock cycles
TSC Minimal Averaged Delta is 28.60 clock cycles
TSC Minimal Averaged Delta is 26.20 clock cycles
TSC Minimal Averaged Delta is 37.00 clock cycles
TSC Minimal Averaged Delta is 27.00 clock cycles
TSC Minimal Averaged Delta is 28.20 clock cycles
TSC Minimal Averaged Delta is 25.80 clock cycles
[ Sub-Test002.01.A - RDTSC ] - Completed

[ Sub-Test002.01.B - RDTSC ] - Started
TSC Minimal Averaged Delta is 27.10 clock cycles
TSC Minimal Averaged Delta is 27.10 clock cycles
TSC Minimal Averaged Delta is 27.10 clock cycles
TSC Minimal Averaged Delta is 27.10 clock cycles
TSC Minimal Averaged Delta is 27.10 clock cycles
TSC Minimal Averaged Delta is 27.10 clock cycles
TSC Minimal Averaged Delta is 26.70 clock cycles
TSC Minimal Averaged Delta is 27.10 clock cycles
TSC Minimal Averaged Delta is 27.10 clock cycles
TSC Minimal Averaged Delta is 27.10 clock cycles
Latency of 'MOV ecx, eax' instruction is 1 clock cycle(s)
[ Sub-Test002.01.B - RDTSC ] - Completed

[ Sub-Test002.01.C - RDTSCP ] - Started
TSC Minimal Averaged Delta is 33.40 clock cycles
TSC Minimal Averaged Delta is 34.20 clock cycles
TSC Minimal Averaged Delta is 34.20 clock cycles
TSC Minimal Averaged Delta is 33.40 clock cycles
TSC Minimal Averaged Delta is 33.80 clock cycles
TSC Minimal Averaged Delta is 33.80 clock cycles
TSC Minimal Averaged Delta is 34.20 clock cycles
TSC Minimal Averaged Delta is 33.40 clock cycles
TSC Minimal Averaged Delta is 34.20 clock cycles
TSC Minimal Averaged Delta is 33.40 clock cycles
[ Sub-Test002.01.C - RDTSCP ] - Completed

[ Sub-Test002.01.D - RDTSCP ] - Not Supported

[ CPU: Ivy Bridge - Intel C++ compiler - 64-bit ]

[ Sub-Test002.01.A - RDTSC ] - Started
TSC Minimal Averaged Delta is 25.00 clock cycles
TSC Minimal Averaged Delta is 27.00 clock cycles
TSC Minimal Averaged Delta is 27.00 clock cycles
TSC Minimal Averaged Delta is 25.00 clock cycles
TSC Minimal Averaged Delta is 25.00 clock cycles
TSC Minimal Averaged Delta is 25.00 clock cycles
TSC Minimal Averaged Delta is 27.00 clock cycles
TSC Minimal Averaged Delta is 25.00 clock cycles
TSC Minimal Averaged Delta is 25.00 clock cycles
TSC Minimal Averaged Delta is 25.00 clock cycles
[ Sub-Test002.01.A - RDTSC ] - Completed

[ Sub-Test002.01.B - RDTSC ] - Not Supported

[ Sub-Test002.01.C - RDTSCP ] - Started
TSC Minimal Averaged Delta is 33.80 clock cycles
TSC Minimal Averaged Delta is 33.80 clock cycles
TSC Minimal Averaged Delta is 33.80 clock cycles
TSC Minimal Averaged Delta is 33.40 clock cycles
TSC Minimal Averaged Delta is 33.40 clock cycles
TSC Minimal Averaged Delta is 33.80 clock cycles
TSC Minimal Averaged Delta is 33.80 clock cycles
TSC Minimal Averaged Delta is 33.40 clock cycles
TSC Minimal Averaged Delta is 33.40 clock cycles
TSC Minimal Averaged Delta is 33.80 clock cycles
[ Sub-Test002.01.C - RDTSCP ] - Completed

[ Sub-Test002.01.D - RDTSCP ] - Started
TSC Minimal Averaged Delta is 34.70 clock cycles
TSC Minimal Averaged Delta is 34.70 clock cycles
TSC Minimal Averaged Delta is 34.70 clock cycles
TSC Minimal Averaged Delta is 34.70 clock cycles
TSC Minimal Averaged Delta is 34.70 clock cycles
TSC Minimal Averaged Delta is 34.70 clock cycles
TSC Minimal Averaged Delta is 34.70 clock cycles
TSC Minimal Averaged Delta is 34.30 clock cycles
TSC Minimal Averaged Delta is 34.70 clock cycles
TSC Minimal Averaged Delta is 34.30 clock cycles
Latency of 'MOV rcx, rax' instruction is 1 clock cycle(s)
[ Sub-Test002.01.D - RDTSCP ] - Completed

[ CPU: Ivy Bridge - MinGW C++ compiler - 32-bit ]

[ Sub-Test002.01.A - RDTSC ] - Not Supported

[ Sub-Test002.01.B - RDTSC ] - Not Supported

[ Sub-Test002.01.C - RDTSCP ] - Not Supported

[ Sub-Test002.01.D - RDTSCP ] - Not Supported

[ CPU: Ivy Bridge - MinGW C++ compiler - 64-bit ]

[ Sub-Test002.01.A - RDTSC ] - Started
TSC Minimal Averaged Delta is 28.20 clock cycles
TSC Minimal Averaged Delta is 27.00 clock cycles
TSC Minimal Averaged Delta is 25.00 clock cycles
TSC Minimal Averaged Delta is 25.00 clock cycles
TSC Minimal Averaged Delta is 25.00 clock cycles
TSC Minimal Averaged Delta is 25.00 clock cycles
TSC Minimal Averaged Delta is 27.00 clock cycles
TSC Minimal Averaged Delta is 27.00 clock cycles
TSC Minimal Averaged Delta is 25.00 clock cycles
TSC Minimal Averaged Delta is 25.00 clock cycles
[ Sub-Test002.01.A - RDTSC ] - Completed

[ Sub-Test002.01.B - RDTSC ] - Not Supported

[ Sub-Test002.01.C - RDTSCP ] - Not Supported

[ Sub-Test002.01.D - RDTSCP ] - Not Supported

[ CPU: Ivy Bridge - Watcom C++ compiler - 32-bit ]

[ Sub-Test002.01.A - RDTSC ] - Started
TSC Minimal Averaged Delta is 25.00 clock cycles
TSC Minimal Averaged Delta is 25.40 clock cycles
TSC Minimal Averaged Delta is 27.00 clock cycles
TSC Minimal Averaged Delta is 24.60 clock cycles
TSC Minimal Averaged Delta is 27.00 clock cycles
TSC Minimal Averaged Delta is 27.00 clock cycles
TSC Minimal Averaged Delta is 25.40 clock cycles
TSC Minimal Averaged Delta is 25.40 clock cycles
TSC Minimal Averaged Delta is 26.20 clock cycles
TSC Minimal Averaged Delta is 27.00 clock cycles
[ Sub-Test002.01.A - RDTSC ] - Completed

[ Sub-Test002.01.B - RDTSC ] - Not Supported

[ Sub-Test002.01.C - RDTSCP ] - Not Supported

[ Sub-Test002.01.D - RDTSCP ] - Not Supported

[ CPU: Ivy Bridge - Watcom C++ compiler - 64-bit ]

[ Sub-Test002.01.A - RDTSC ] - Started
TSC Minimal Averaged Delta is 26.60 clock cycles
TSC Minimal Averaged Delta is 27.00 clock cycles
TSC Minimal Averaged Delta is 25.40 clock cycles
TSC Minimal Averaged Delta is 25.40 clock cycles
TSC Minimal Averaged Delta is 25.40 clock cycles
TSC Minimal Averaged Delta is 25.40 clock cycles
TSC Minimal Averaged Delta is 25.40 clock cycles
TSC Minimal Averaged Delta is 27.00 clock cycles
TSC Minimal Averaged Delta is 25.40 clock cycles
TSC Minimal Averaged Delta is 27.00 clock cycles
[ Sub-Test002.01.A - RDTSC ] - Completed

[ Sub-Test002.01.B - RDTSC ] - Not Supported

[ Sub-Test002.01.C - RDTSCP ] - Not Supported

[ Sub-Test002.01.D - RDTSCP ] - Not Supported

Examples of disassembled code for RDTSC and RDTSCP instructions will be posted later.

[ An example of disassembled codes for a test with RDTSC instruction - 32-bit ]

...
0024AA47 rdtsc
0024AA49 mov ecx, eax
0024AA4B rdtsc
0024AA4D rdtsc
0024AA4F rdtsc
0024AA51 rdtsc
0024AA53 rdtsc
0024AA55 rdtsc
0024AA57 rdtsc
0024AA59 rdtsc
0024AA5B rdtsc
0024AA5D rdtsc
0024AA5F sub eax, ecx
...

[ An example of disassembled codes for a test with RDTSCP instruction - 64-bit ]

...
000000013F652A81 rdtscp
000000013F652A84 mov rbx, rax
000000013F652A87 rdtscp
000000013F652A8A rdtscp
000000013F652A8D rdtscp
000000013F652A90 rdtscp
000000013F652A93 rdtscp
000000013F652A96 rdtscp
000000013F652A99 rdtscp
000000013F652A9C rdtscp
000000013F652A9F rdtscp
000000013F652AA2 rdtscp
000000013F652AA5 sub rax, rbx
...

Interesting results.

Agner Fog's manuals give a different result: the RDTSC throughput he reports is a bit higher than your latency results.

Unfortunately, he did not provide any data about how many CPU clock cycles the RDTSC latency itself consumes.

Why don't you serialize the uops of RDTSC execution?

AFAIK RDTSC is not a serializing instruction, so in theory several of them can execute at the same time and their pipelined execution can at least partially overlap.

>>...Agner Fog's manuals provide different result for RDTSC throughput a bit higher than your results of latency.

That is possible because it looks like he used a CPU of a different generation. Please post those RDTSC and RDTSCP numbers for review, along with the CPU information.

>>Why do not you serialize uop of RDTSC execution?
>>
>>Afaik RDTSC is not serializing instruction so in theory multiple of them can be executed at the same
>>time and at least partially overlap pipelined execution.

That is why I tried to fill a CPU pipeline with at least 10 RDTSC or RDTSCP instructions.
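
For comparison, a fenced read could be sketched as below. This is not what the tests above do; it is only an illustration of the serialized variant being asked about. It assumes GCC or Clang on x86-64 and relies on the Intel SDM guidance that LFENCE before RDTSC makes RDTSC wait for earlier instructions, and LFENCE after it keeps later instructions from starting first. The names read_tsc_fenced, fenced_delta and HAVE_FENCED_TSC are hypothetical.

```c
#include <stdint.h>

#if defined(__x86_64__)
#include <x86intrin.h>              /* __rdtsc(), _mm_lfence() */
#define HAVE_FENCED_TSC 1
#else
#define HAVE_FENCED_TSC 0
#endif

/* Reads TSC between two LFENCE instructions. Returns 0 when not built for x86-64. */
static uint64_t read_tsc_fenced(void)
{
#if HAVE_FENCED_TSC
    uint64_t tsc;

    _mm_lfence();                   /* earlier instructions complete first */
    tsc = __rdtsc();
    _mm_lfence();                   /* later instructions wait for RDTSC */
    return tsc;
#else
    return 0;
#endif
}

/* Clock cycles between two fenced reads; includes the cost of the fences. */
static uint64_t fenced_delta(void)
{
    uint64_t t1 = read_tsc_fenced();
    uint64_t t2 = read_tsc_fenced();

    return t2 - t1;
}
```

The delta measured this way includes the fences themselves, so it is expected to be larger than MAD and is not directly comparable with the numbers above.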

I am posting here RDTSC reciprocal throughput result as stated by Agner Fog.

CPU Arch:  Ivy Bridge ,  RDTSC Reciprocal Throughput: 27 CPU clock cycles.

Reference p. 175

http://www.agner.org/optimize/instruction_tables.pdf

 

>>>That is why I tried to fill a CPU pipeline with at least 10 RDTSC or RDTSCP instructions.>>>

I am still puzzled by the probable, at least partial, hardware-level pipelined execution of those 10 RDTSC instructions. I will try to find some information on Google Patents which may shed some light on a proposed (patented) implementation of the RDTSC instruction.

 

I have found an Intel patent titled "Apparatus for monitoring the performance of a microprocessor", but it gives no clear information about pipelined reads of the TSC.

Link to the aforementioned patent:

https://patents.google.com/patent/US5657253A/en?q=time+stamp+counter&ass...

>>Have you tried this experiment on v4 or v3 cpus? In particular E5-2699 v3 and E5-2699 v4?

Here are results of my tests for Intel Xeon Phi Processor 7210:
http://ark.intel.com/products/94033/Intel-Xeon-Phi-Processor-7210-16GB-1...

Intel Xeon Phi Processor 7210 ( 16GB, 1.30 GHz, 64 core )
Processor name : Intel(R) Xeon Phi(TM) 7210
Packages (sockets) : 1
Cores : 64
Processors (CPUs) : 256
Cores per package : 64
Threads per core : 4

[ Output for RDTSC instruction ]

...
Access Time to TSC: 36.40 clock cycles
Access Time to TSC: 37.70 clock cycles
Access Time to TSC: 36.40 clock cycles
Access Time to TSC: 36.40 clock cycles
Access Time to TSC: 36.40 clock cycles
Access Time to TSC: 36.40 clock cycles
Access Time to TSC: 36.40 clock cycles
Access Time to TSC: 36.40 clock cycles
Access Time to TSC: 36.40 clock cycles
Access Time to TSC: 36.40 clock cycles
Access Time to TSC: 36.40 clock cycles
Access Time to TSC: 37.70 clock cycles
Access Time to TSC: 36.40 clock cycles
Access Time to TSC: 36.40 clock cycles
Access Time to TSC: 36.40 clock cycles
Access Time to TSC: 36.40 clock cycles
...
