Synchronizing Time Stamp Counter

Hello everyone!

I have to synchronize time between processors in a multicore system, i.e. I have to calculate the TSC differences of all processors relative to one of them.
I tried rdtsc(), but it returns the TSC of the current processor. Is there any way to get the TSC of a specific processor? Or maybe I can specify a processor ID somewhere and read the corresponding time stamp counter value?

Thanks in advance,
Roman

Hi Roman,

There is no IA instruction that directly returns the TSC of a core that you specify as a parameter. Operating systems usually implement various tricks: they execute rdtsc on all cores and use low-latency thread synchronization (spinning on signal variables) to estimate the differences between the processors' TSCs.

Best regards,
Roman
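
One way to picture the "spinning on signal variables" idea is a ping-pong exchange between two cores. The sketch below is not taken from any particular operating system: the function name and the three timestamps are hypothetical, and it assumes that the signal takes about the same time in each direction and that the local TSC does not wrap between the two local reads.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: estimates (remote TSC - local TSC) from one exchange.
 * t_send:   local TSC read just before signalling the remote core
 * t_remote: TSC read by the remote core when it observes the signal
 * t_recv:   local TSC read when the remote core's reply is observed
 * Assumption: the signal latency is symmetric, so the remote read
 * happened at the midpoint of the local round trip. */
int64_t estimate_tsc_offset(uint64_t t_send, uint64_t t_remote, uint64_t t_recv)
{
    assert(t_recv >= t_send);
    uint64_t round_trip = t_recv - t_send;
    uint64_t local_midpoint = t_send + round_trip / 2;
    return (int64_t)(t_remote - local_midpoint);
}
```

For example, t_send = 1000, t_remote = 6100 and t_recv = 1200 give a local midpoint of 1100 and an estimated offset of 5000 ticks. Repeating the exchange and keeping the sample with the shortest round trip would probably reduce the noise.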

Hi everybody,

>>[ Roman Oderov ]I have to synchronize time between processors in a multicore system i.e. I have to calculate TSC differences of all
>>processors relative to one of them...
>>...
>>[ Roman Dementiev ] there is no IA instruction that directly returns TSC from the core that you can specify as a parameter...

However, if you use a Windows OS there are a couple of Win32 API functions that could help you:

- GetCurrentThread
- SetThreadPriority
- SetThreadAffinityMask
- Sleep

Here is what I would try:

- [Step00] Let's say you have 2 CPUs ( CPU1 and CPU2 )
- [Step01] Declare a static / global 'Array' of two 64-bit values
- [Step02] Initialize array values with 0
- [Step03] Create a new thread
- [Step04] Set the thread priority to 'Normal'
- [Step05] Set the thread affinity to CPU1 with SetThreadAffinityMask
- [Step06] Call Sleep( 0 )
- [Step07] Set the thread priority to 'Time Critical'
- [Step08] Use inline assembler and call RDTSC and store the value in 'Array[0]'
- [Step09] Set the thread affinity to CPU2 with SetThreadAffinityMask
- [Step10] Call Sleep( 0 )
- [Step11] Use inline assembler and call RDTSC and store the value in 'Array[1]'
- [Step12] Calculate a difference between 'Array[0]' and 'Array[1]'

Here are some additional notes:

- the overhead of steps [Step08], [Step09], [Step10] and [Step11] has to be evaluated
- it is very important to call Sleep( 0 ) after a call to SetThreadAffinityMask
- run as many tests as possible and use the average difference; it should not exceed the accuracy threshold ( in nano-seconds ) defined in your specs

Best regards,
Sergey
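
The last two notes can be sketched in portable C, leaving out the Win32 calls themselves. This is only an illustration: the function names are hypothetical, the overhead of [Step08]..[Step11] is assumed to have been measured separately, and the TSC frequency used for the conversion to nano-seconds is an assumed input.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mean of (raw difference - overhead) over n test runs, in TSC ticks.
 * raw_diff[i] is Array[1] - Array[0] from run i; overhead_ticks is the
 * separately measured cost of [Step08]..[Step11] (an assumed input). */
double mean_corrected_diff(const int64_t *raw_diff, size_t n, int64_t overhead_ticks)
{
    assert(raw_diff != NULL && n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i)
        sum += (double)(raw_diff[i] - overhead_ticks);
    return sum / (double)n;
}

/* Returns 1 if the mean, converted to nano-seconds for an assumed TSC
 * frequency tsc_hz, stays within threshold_ns in absolute value. */
int mean_within_threshold(double mean_ticks, double tsc_hz, double threshold_ns)
{
    assert(tsc_hz > 0.0);
    double ns = mean_ticks * 1e9 / tsc_hz;
    if (ns < 0.0)
        ns = -ns;
    return ns <= threshold_ns;
}
```

With raw differences of 1024, 1000 and 1048 ticks and an overhead of 24 ticks the corrected mean is 1000 ticks, which is 500 nano-seconds at an assumed 2 GHz TSC frequency.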

Sergey, thank you for your answer. This is exactly the method I've chosen for my project.

>>>- an overhead for steps [Step08], [Step09], [Step10] and [Step11] has to be evaluated>>>
In an environment as unpredictable as Windows you also have to pay attention to the time-critical kernel components that manage context
switching and scheduling; these components will always postpone the currently running thread. Moreover, hardware interrupts and their ISRs and DPCs will
run at a higher priority than user-mode code. Wouldn't it be a better option to run the timing code in kernel mode at DPC level, as a dummy driver? You can also queue a DPC on another CPU, so you can have some kind of "concurrency".

>>...Moreover hardware interrupts and their ISR and DPC...

What is DPC?

>>However, if you use a Windows OS there are a couple of Win32 API functions that could help you:
>>
>>- GetCurrentThread
>>- SetThreadPriority
>>- SetThreadAffinityMask
>>- Sleep

On a non-Windows OS a similar set of functions has to be used.

>>>What is DPC?>>>
In the Windows kernel architecture DPC stands for "Deferred Procedure Call". DPCs are system-wide kernel objects that are scheduled to perform some
action on behalf of a driver's ISR routine at the DPC interrupt level.

>>In Windows kernel architecture DPC stands for "Deferred procedure calls".These are global system-wide procedure(kernel objects)
>>scheduled to perform some action on behalf of driver'sISR routine at DPC interrupt level.

Thanks, Iliya.

Hi everybody,

>>I have to synchronize time between processors in a multicore system i.e. I have to calculate TSC differences of all processors
>>relative to one of them...

I'll provide C/C++ sources for a test case. Here is a screenshot that demonstrates output:

Attachment: cpusswitchdemo.jpg (84.49 KB)

Hi Sergey!
Looking at your console test picture I can see that the RDTSC overhead is only 24.000 cpi, which is very close to the result measured by Agner Fog.
How many timing tests did you perform?
I can spot a small spike before the start of your testing loop. Could that be the overhead of the CreateThread function, which also includes a context-switching penalty?
If you are interested in more precise profiling and an instruction-level breakdown of the timing, you can use the XPERF tool; the Kernrate tool will track the instruction pointer in kernel space and report the results.

>>By looking at your console test picture I can see that RDTSC overhead is only 24.000 cpi and is very close to
>>the result measured by Agner Fog.

It is good to know that my number looks right.

>>How many timing tests did you perform?

1000000 ( one million )

But, I also tested for 10000000 and 100000000:
...
iNumOfIterations = 1000000;
// iNumOfIterations = 10000000;
// iNumOfIterations = 100000000;
...
and results for RDTSC overhead were very consistent.

>>I can spot small spike before the start of your testing loop could that be a CreateThread function overhead which includes
>>also context switching penalty.

I think this is related to some network transfers.

>>I'll provide C/C++ sources for a test case...

Attached.

Attachment: cpusswitchdemo.txt (4.77 KB)

This is an example of the output when an error happens:
...
Test-Case 2 - Switching CPUs at runtime
Switched to CPU2
Switched to CPU1
Test-Case 3 - Retrieving RDTSC values for CPUs
RDTSC for CPU1 : 10124080002908
RDTSC for CPU2 : 10124080010328
RDTSC Difference: 7420 ( RDTSC2 - RDTSC1 )
dwThreadAMPrev1 : 1 ( Processing Error if 0 )
dwThreadAMPrev2 : 0 ( Processing Error if 0 ) // <= It was a simple verification that error processing works
...

** A question to Roman Dementiev (Intel) **

Is there an Intel document that describes TSC related solutions / issues in a multi-core system?

Best regards,
Sergey

>>>It is good to know that my number looks right.>>>
Good job:)

>>>RDTSC Difference: 7420 ( RDTSC2 - RDTSC1 )>>>
These results could be polluted by drivers' ISR and DPC routines, which run in an arbitrary thread context (even your thread's); some time-critical kernel-mode components
could also postpone your processing loop.
To minimize this dependency, run your tests (not the RDTSC for-loop) 1e4 or 1e5 times and average the results.

Hi Iliya,

>>>>RDTSC Difference: 7420 ( RDTSC2 - RDTSC1 )
>>>>...
>>>>dwThreadAMPrev2 : 0 ( Processing Error if 0 ) // <= It was a simple verification that error processing works

>>>>RDTSC Difference: 7420 ( RDTSC2 - RDTSC1 )
>>
>>These results could be polluted by arbitrary context thread(even your thread)...

'7420' is a wrong number anyway because in that case I tried to switch processing to a CPU that doesn't exist ( CPU #8 ). I simply wanted to see that error processing works.

>>>'7420' is a wrong number anyway because in that case I tried to switch processing to a CPU that doesn't exists ( CPU #8 ). I simply wanted to see that error processing works.>>>

Misunderstood your post:)

I have not found anything in these Intel manuals that says a RDTSC value could be different for CPUs of some multi-core system.

** Intel(R) 64 and IA-32 Architectures Software Developer's Manual **
>> Volume 3A: System Programming Guide, Part 1 <<

...Chapter 7. MULTIPLE-PROCESSOR MANAGEMENT

** Intel(R) 64 and IA-32 Architectures Software Developer's Manual **
>> Volume 3B: System Programming Guide, Part 2 <<

...Chapter 18. DEBUGGING AND PERFORMANCE MONITORING
...
...18.17 COUNTING CLOCKS
...
Time-stamp counter - Measures clock cycles in which the physical processor is not in deep sleep. These ticks cannot be measured on a logical-processor basis.
...
...18.17.3 Incrementing the Time-Stamp Counter

>>>Time-stamp counter - Measures clock cycles in which the physical processor is not in deep sleep. These ticks cannot be measured on a logical-processor basis.>>>

So HT logical cores cannot be sampled with the RDTSC instruction. I think the difference here is between logical HT cores, which have their own GP registers and APIC state, and fully fledged cores, which have their own FPU and SIMD vector units.

But, if I got it right, the invariant TSC in newer processors (17.13.1 in Vol.3) guarantees that TSC values are synchronized. In older processors, though, I still can't rely on the TSCs of different cores without manual synchronization. Am I right?

Let's assume a call to a Win32 API function 'QueryPerformanceCounter' has to be done on a multi-core system. What value is it going to return? A TSC of CPU1, CPU2, CPU3, etc? Let's also assume that I don't set a CPU for execution explicitly.

I'll do another set of tests and I will try to predict a TSC value for a CPU2, for example.

'...Intel guarantees that the time-stamp counter will not wraparound within 10 years after being reset...'

I've calculated that for a CPU with a 3 GHz clock speed the wraparound would take ~194 years.
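
The wraparound figure can be reproduced with a few lines of C. This is a sketch under stated assumptions: the TSC is treated as a full 64-bit counter that starts at 0 and increments once per clock at a constant rate, a year is taken as 365.25 days, and the function name is hypothetical.

```c
#include <assert.h>

/* Years until a 64-bit counter that increments tsc_hz times per second
 * wraps around, assuming it starts at 0 and never changes its rate. */
double tsc_wraparound_years(double tsc_hz)
{
    assert(tsc_hz > 0.0);
    const double ticks = 18446744073709551616.0;              /* 2^64 */
    const double seconds_per_year = 365.25 * 24.0 * 60.0 * 60.0;
    return ticks / tsc_hz / seconds_per_year;
}
```

For tsc_hz = 3.0e9 this gives roughly 194.8 years, which matches the estimate above and is far beyond the 10 years quoted from the manual.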

>>...Well, in older processors I can't still rely on TSC's of different cores without manual synchronization...

What about a RESET signal that sets a TSC to 0? Does it mean that on a multi-core system the RESET signal occurs at different times for different CPUs?

- T0 for CPU0
- T0+some-delay1 for CPU1
- T0+some-delay2 for CPU2, etc?

How is it possible? Could Intel Hardware Engineers clearly explain it?

>>>>Let's assume a call to a Win32 API function 'QueryPerformanceCounter' has to be done on a multi-core system. What value is it going to return? A TSC of CPU1, CPU2, CPU3, etc? Let's also assume that I don't set a CPU for execution explicitly.>>>

Probably the TSC value of the CPU that executes the current thread, which is in turn executing the machine code of "QueryPerformanceCounter". So it could be an arbitrary CPU.

>>Probably TSC value of the CPU which executes current context thread which is in turn executing machine code of
>>"QueryPerformanceCounter" .So it could be an arbitrary CPU.

I agree with that. However, I can't find in Intel manuals any explanations for:

1. How many TSC registers exist on a multi-core system with many logical CPUs? Is there just one, shared between all logical cores? ( in that case the TSCs are synchronized by default )

2. Does every logical CPU have its own independent TSC register? Could different TSCs have different values at some time Tn?

3. What about a case when a system has many physical CPUs and every physical CPU has at least two logical CPUs?

The Intel CPUs we've seen share the TSC resource between hyperthreads and share the bus time clock among cores. Synchronization between sockets depends on the action taken by the OS and on bus clock accuracy. For what it's worth, http://download.intel.com/embedded/software/IA/324264.pdf presents some recommendations for Linux, but the authors detract from their credibility by introducing confusion, such as careless switching between IA-64 and Intel64 terminology.
It's not at all clear how QueryPerformanceCounter is implemented, but it hides some annoying differences among CPU families and covers up synchronization problems, as well as eliminating the question of serialization, at a large cost in overhead.

>>>. How many TSC registers exist on a multi-core system with many logical CPUs?>>>
By writing logical CPU do you mean HT?

>>>These ticks cannot be measured on a logical-processor basis.>>>
You cannot sample HT logical cores.

>>>What about a case when a system has many physical CPUs and every physical CPU has at least two logical CPUs?>>>
If I remember correctly, a logical CPU is an HT logical core with reduced resources. Every HT core has its own APIC and GP registers, but does not have its own vector SIMD units or x87 FPU.

>>>It's not at all clear how QueryPerformanceCounter is implemented,>>>
QueryPerformanceCounter could be disassembled and statically or dynamically analyzed in order to understand its implementation. I suppose that this function could use the HPET timer.

>>...The Intel CPUs we've seen share tsc resource between hyperthreads and share the buss time clock among cores...

Thank you, Tim. This is what I wanted to understand. Unfortunately, Intel's manuals don't describe all TSC related issues in a multi-core environment.

>>>>These ticks cannot be measured on a logical-processor basis.
>>
>>You cannot sample HT logical cores.

I've created another test-case ( #4 ) and source codes will be provided.

Here is a detailed high-level description of the Test-Case #4:

- a computer system with a 32-bit Windows OS has one physical CPU with two logical CPUs
- an Event synchronization object is created in the Non-Signaled state
- two Threads '1' and '2' are created in the Suspended state
- execution of the Threads is Resumed, but as soon as processing starts the Threads wait for 5 seconds until the Event synchronization object changes its state to Signaled
- thread affinity masks are set: Thread '1' is assigned to CPU1 and Thread '2' is assigned to CPU2
- priorities of the current Process and Threads are changed to Real-Time
- after a 5-second delay the state of the Event synchronization object is changed to Signaled
- both threads begin processing ( almost at the same time! ) and they record 16 RDTSC values
- for every RDTSC value an ID ( number of iteration ) is stored as well
- when processing is completed all allocated resources ( handles ) are closed
- if there are no processing errors some statistics are displayed
- even if both threads are executed with Real-Time priorities on different logical CPUs there are always differences in RDTSC values for iterations with the same ID
- the smallest difference I was able to record is ~708 nano-seconds ( 0.708 micro-seconds )
- the smallest average difference I was able to record is ~768.75 nano-seconds ( 0.76875 micro-seconds )

To Roman Oderov:

Roman, I didn't try to synchronize RDTSC values for different CPUs; instead I tried to evaluate delays during the execution of two threads on two different logical CPUs. If you try to execute the Test-Case #4 you will get different numbers. Take into account that it is a non-deterministic test and results are always different.

To Sergey Kostrov:

Thanks for the detailed description!
Yes, I was just going to measure delays.

Source codes of the Test-case #4 attached.

Attachment: cpusswitchdemo.txt (4.74 KB)

Processing report log-file attached.

PS: This is what it looks like:

Application - ScaLibTestApp - WIN32_MSC - Release
Tests: Start
> Test1017 Start <
Sub-Test 59
...
Test-Case 4 - Retrieving RDTSC values for CPUs - 2
Threads 1 and 2 created

Iteration   Thread 1         Thread 2         Difference
00          11613344836800   11613344835972   -828
01          11613344836992   11613344836188   -804
02          11613344837112   11613344836404   -708   <= Smallest difference
03          11613344837232   11613344836500   -732
04          11613344837340   11613344836596   -744
05          11613344837460   11613344836704   -756
06          11613344837580   11613344836824   -756
07          11613344837688   11613344836932   -756
08          11613344837796   11613344837076   -720
09          11613344837904   11613344837172   -732
10          11613344838012   11613344837292   -720
11          11613344838156   11613344837412   -744
12          11613344838300   11613344837520   -780
13          11613344838444   11613344837640   -804
14          11613344838600   11613344837760   -840
15          11613344838744   11613344837868   -876

Statistics:
Thread 1 started at 11613344836644
Thread 2 started at 11613344835096
Difference 1548

Thread 1 completed at 11613344838924
Thread 2 completed at 11613344838012
Difference 912

dwThreadAMPrev[0]: 3 ( Processing Error if 0 )
dwThreadAMPrev[1]: 3 ( Processing Error if 0 )

Test Completed in 19172 ticks
> Test1017 End <
Tests: Completed

Attachment: processingreport.log (1.55 KB)

One more note regarding negative values for differences:

>>...
>>Iteration Thread 1 Thread 2 Difference
>>...
>>02 11613344837112 11613344836404 -708 <= Smallest difference
>>...

A negative value of -708 means that Thread '2' started first and Thread '1' started second. The Windows scheduler starts threads one at a time.
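
The summary figures for this run can be recomputed from the Difference column of the log. In the sketch below the array is copied from the log, while the function names are hypothetical. Since the sign only tells which thread started first, absolute values are used. The logged differences are raw RDTSC deltas, i.e. TSC ticks, so a conversion to nano-seconds needs the TSC frequency, which the thread does not state; the 2 GHz in the usage note is purely an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Differences ( Thread 2 - Thread 1 ) for iterations 00..15, copied from the log. */
static const int64_t log_diff[16] = {
    -828, -804, -708, -732, -744, -756, -756, -756,
    -720, -732, -720, -744, -780, -804, -840, -876
};

static int64_t abs_ticks(int64_t v) { return v < 0 ? -v : v; }

/* Smallest absolute difference, in TSC ticks. */
int64_t smallest_abs_diff(const int64_t *diff, size_t n)
{
    assert(diff != NULL && n > 0);
    int64_t best = abs_ticks(diff[0]);
    for (size_t i = 1; i < n; ++i)
        if (abs_ticks(diff[i]) < best)
            best = abs_ticks(diff[i]);
    return best;
}

/* Mean absolute difference, in TSC ticks. */
double mean_abs_diff(const int64_t *diff, size_t n)
{
    assert(diff != NULL && n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i)
        sum += (double)abs_ticks(diff[i]);
    return sum / (double)n;
}

/* Converts ticks to nano-seconds for an assumed TSC frequency in Hz. */
double ticks_to_ns(double ticks, double tsc_hz)
{
    assert(tsc_hz > 0.0);
    return ticks * 1e9 / tsc_hz;
}
```

For the logged run, smallest_abs_diff( log_diff, 16 ) returns 708 and mean_abs_diff( log_diff, 16 ) returns 768.75, both in ticks. At an assumed 2 GHz, ticks_to_ns( 708.0, 2.0e9 ) gives 354 nano-seconds.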

>>>different logical CPUs.>>>
What do you mean by "Logical CPU"?
I suppose that you are referring to the HT cores of a multicore processor, because logical processors can run threads concurrently when those threads are not accessing the x87 FPU and SIMD vector units. These logical cores (HT) have their own APIC, GP and control register state.

>>What Do you mean by saying "Logical CPU"?
>>I suppose that you are reffering to HT cores of multicore processor...

My development computer has one physical CPU and Windows Task Manager shows two CPUs ( logical ). Is there something wrong here?

To Roman Oderov:

I wonder if you will be able to post results for the Test-Case #4.

Also, in about 2-3 weeks I'll be able to execute these tests on a new computer with a 3rd generation Intel CPU.

Sergey, I'll try to post my results as soon as possible

>>> Is there something wrong here?>>>
No it's ok:)
I was thinking about the newest Sandy Bridge CPUs, which have multiple cores with two HT "units" each. I thought that you had such a CPU.
If you are interested, you can test HT scaling when you have a Sandy Bridge processor. Such a test could verify the inability to scale well
when heavy floating-point calculation is executed on a single hyperthreaded core.

>>>It's not at all clear how QueryPerformanceCounter is implemented,>>>
>>QueryPerformanceCounter could be disassembled and statically or dynamically analyzed in order to understand its implementation. I suppose that this functions could use HPET timer.

You can call QueryPerformanceFrequency (returns counts per second) to have an idea if it is implemented with TSC or HPET.

Thanks,
Roman
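
As a rough illustration of that hint, the value returned by QueryPerformanceFrequency can be compared against a few well-known nominal rates. Everything in this sketch is an assumption rather than documented Windows behaviour: the enum and function names are hypothetical, the ACPI PM timer is taken as 3,579,545 Hz, the HPET as roughly 14.318 MHz (other HPET rates exist), and anything from 500 MHz upwards is treated as a raw TSC-like rate. Windows may also report a scaled or fixed frequency, which this heuristic labels as unknown.

```c
#include <assert.h>
#include <stdint.h>

typedef enum {
    QPC_SRC_UNKNOWN,
    QPC_SRC_PM_TIMER,   /* ACPI power management timer */
    QPC_SRC_HPET,       /* High Precision Event Timer  */
    QPC_SRC_TSC_LIKE    /* rate in the range of a CPU clock */
} qpc_source_guess;

/* Guesses the timer behind QueryPerformanceCounter from the counts per
 * second reported by QueryPerformanceFrequency. Thresholds are assumptions. */
qpc_source_guess guess_qpc_source(int64_t counts_per_second)
{
    assert(counts_per_second > 0);
    if (counts_per_second == 3579545)
        return QPC_SRC_PM_TIMER;
    if (counts_per_second >= 14000000 && counts_per_second <= 15000000)
        return QPC_SRC_HPET;
    if (counts_per_second >= 500000000)
        return QPC_SRC_TSC_LIKE;
    return QPC_SRC_UNKNOWN;
}
```

On Windows the argument would come from the QuadPart member of the LARGE_INTEGER filled in by QueryPerformanceFrequency.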

Hi,

>>But, if I got right, the invariant TSC in newer processors (17.13.1 in Vol.3) guarantees me TSC values been synchronized. Well, in older processors I can't still rely on TSC's of different cores without manual synchronization. Am I right?

not quite...

The time-stamp counter on recent Intel processors is reset to zero each time the processor package has RESET asserted. From that point onwards the invariant TSC will continue to tick constantly across frequency changes, turbo mode and ACPI C-states. All parts that see RESET synchronously will have their TSC's completely synchronized. This synchronous distribution of RESET is required for all sockets connected to a single PCH. For multi-node systems RESET might not be synchronous.

The biggest issue with TSC synchronization across multiple threads/cores/packages is the ability for software to write the TSC. The TSC is exposed as MSR 0x10. Software is able to use WRMSR 0x10 to set the TSC. However, as the TSC continues as a moving target, writing it is not guaranteed to be precise. For example, an SMI (System Management Interrupt) could interrupt the software flow that is attempting to write the time-stamp counter immediately prior to the WRMSR. This could mean that the value written to the TSC is off by thousands to millions of clocks.

hope this helps,
Roman

>>>You can call QueryPerformanceFrequency (returns counts per second) to have an idea if it is implemented with TSC or HPET.>>>
Thank you Roman.

>>...If you are interested you can test HT scaling...

This is what I'm going to do some time later.

>>...For multi-node systems RESET might not be synchronous...

I wonder how VTune gets times on a multi-node system?
Does VTune use 'QueryPerformanceCounter' Win32 API function or 'RDTSC' instruction?

Hi Sergey,

>>...For multi-node systems RESET might not be synchronous...

>>I wonder how VTune gets times on a multi-node system?
>>Does VTune use 'QueryPerformanceCounter' Win32 API function or 'RDTSC' instruction?

I am not a VTune developer, but could you please elaborate on why you are concerned? Which type of VTune analysis would not work if the TSC has a small delta between the sockets? We are probably talking about deltas that are comparable to the delay of just a few remote memory accesses to the other socket.

Roman

>>...I am not VTune developer, but could you please elaborate why are you concerned?

I don't have any concerns and I simply would like to know how VTune gets times. Does VTune use 'QueryPerformanceCounter' Win32 API function or 'RDTSC' instruction?

Sergey,

I recommend that you repost your question, with a reference to this thread, to the Intel VTune forum, which is tracked by VTune developers.

Thanks,
Roman

>>>This is what I'm going to do some time later.>>>
It would be great to see the results.
I bet that for a heavy floating-point load the scaling won't give any advantage. Some speedup will probably be due to the lack of interdependencies between the various instructions being dispatched to various ports.
