Synchronizing Time Stamp Counter


>>>recommend you to repost your question with the reference to this thread to the>>>

Do you think that Intel developers will reveal the exact implementation of the VTune timers?

>>>>recommend you to repost your question with the reference to this thread to the...
>>
>>Do you think that Intel developers will reveal the exact implementation of the VTune timers?

Actually, I don't need details and I simply need a Yes or No answer, like 'Yes, RDTSC is used' or 'No, RDTSC is not used'... Here is a link to my question on the VTune forum:

Forum topic: Does VTune use 'QueryPerformanceCounter' Win32 API function or 'RDTSC' instruction?
Web-link: http://software.intel.com/en-us/forums/topic/335541

I asked this because a few weeks ago I posted a question on the MKL forum and asked about the exact algorithm used to approximate Gamma on the problematic range [0.001, 1.0], and one of the Intel employees refused to reveal the algorithmic implementation.

>>...asked about the exact algorithm used to approximate Gamma on the problematic range [0.001, 1.0], and one of the Intel employees refused to
>>reveal the algorithmic implementation.

I'm not surprised to hear that. In many cases like yours things work only in one direction, that is, for the benefit of the corporation. Iliya, try asking Microsoft to release some sources and you won't get a response at all.

>>>I'm not surprised to hear that. In many cases like yours things work only in one direction, that is, for the benefit of the corporation. Iliya, try asking Microsoft to release some sources and you won't get a response at all.>>>
Yes, that's true. Sometimes a little bit of reverse engineering is the only solution, albeit not the simplest and fastest one :)

Sorry, this is off topic...

>>...asked about the exact algorithm used to approximate Gamma on the problematic range [0.001, 1.0], and one of the Intel employees refused to
>>reveal the algorithmic implementation.

Could you try to ask the same question on a GNU Scientific Library ( GSL ) forum?

>>>Could you try to ask the same question on a GNU Scientific Library ( GSL ) forum?>>>
Good question. I will ask this on their forum. The GSL source code and implementation are open source, so they will probably come up with an exact answer.
Btw, I solved this problem with the help of a minimax polynomial calculated in Mathematica 8.
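For what it's worth, once the coefficients are known, evaluating such an approximation is straightforward. Below is a minimal sketch ( not the actual solution; the function name is made up for illustration and the coefficient values are placeholders, the real ones would come from the Mathematica 8 calculation ) of evaluating a minimax polynomial with Horner's scheme:

#include <cstdio>

// Placeholder coefficients c0..c3 -- substitute the values computed in Mathematica.
const double g_adCoeffs[] = { 1.0, 0.0, 0.0, 0.0 };
const int    g_iDegree    = 3;

double EvaluateMinimaxPoly( double x )
{
    // Horner's scheme: c0 + x*( c1 + x*( c2 + x*c3 ) )
    double r = g_adCoeffs[ g_iDegree ];
    for ( int i = g_iDegree - 1; i >= 0; i-- )
        r = r * x + g_adCoeffs[ i ];
    return r;
}

int main()
{
    printf( "P( 0.5 ) = %.15f\n", EvaluateMinimaxPoly( 0.5 ) );
    return 0;
}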

@Sergey
Can I freely use my own wrappers based on the MKL library?

>>Can I freely use my own wrappers based on the MKL library?

You need to review MKL's license regarding what you can and can't do with the library.

[ To Roman Oderov ]

Any updates? Performance results?

[ to Sergey]

Hi!
I haven't had a lot of time, but here are some results:

P.S. I deliberately haven't changed your code (except for some trivial modifications).

Attachments:
test1-1.png ( 173.24 KB )
test1-2.png ( 157.31 KB )
test1-3.png ( 147.55 KB )

In addition I can submit for consideration a .log file, where 50 consecutive program starts are logged.

Attachments:
cpuswitchdemo1.log ( 125.93 KB )

Hi Roman, Thanks and I'll take a look at your results.

>>...In addition I can submit for consideration a .log file, where 50 consecutive program starts are logged...

These are very interesting results; let me analyze them. Thanks again, Roman!

>>...where 50 consecutive program starts are logged...

Roman, why do you need to launch so many applications? Could you explain, please?

"50 prog starts":
I meant that it was one test program with a loop consisting of 50 iterations. I think it's interesting to know if the values change from time to time and how they differ from each other.

The code:

Attachments:
cpuswitchdemo1.cpp ( 8.02 KB )

>>... it's interesting to know if the values change from time to time and how they differ from each other...

What is next? How are you going to use it in a real-life application?

PS: Thanks for the modified sources; I'll take a look at them some time next week.

I don't know yet... It's necessary to find out how the SetEvent( Event_handle ) function works at the low level, i.e. how much time is spent setting an event and how much time passes before the waiting threads receive it. That could help with more precise timing.

>>...It's necessary to find out how the SetEvent( Event_handle ) function works at the low level, i.e. how much time is spent setting
>>an event and how much time passes before the waiting threads receive it...

You could easily do that: just add a call to RDTSC before the call to the 'SetEvent' function and save the value. Later you can calculate what the difference is for every thread, since you have the TSC values in the g_iRDTSCValueThreadStart array. Don't forget that there is a call to 'WaitForSingleObject' as well:

...
// the worker thread blocks here until the event is signaled
::WaitForSingleObject( g_hEvent, INFINITE );

// record the TSC at the moment the thread actually wakes up
g_iRDTSCValueThreadStart[(*piThreadID)-1] = __rdtsc();
...
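To make the suggestion concrete, here is a small self-contained sketch ( the variable names are illustrative and not taken from the attached cpuswitchdemo1.cpp ): the main thread stamps the TSC right before SetEvent, each worker stamps it right after WaitForSingleObject returns, and the difference approximates the signal-to-wakeup latency in clock cycles. Keep in mind that the raw differences are still affected by any TSC bias between CPUs, as noted below.

#include <windows.h>
#include <intrin.h>     // __rdtsc
#include <cstdio>

const int NUM_THREADS = 2;

HANDLE            g_hEvent = NULL;
unsigned __int64  g_iRDTSCValueSetEvent = 0;                 // stamped right before SetEvent
unsigned __int64  g_iRDTSCValueThreadStart[ NUM_THREADS ];   // stamped right after wakeup

DWORD WINAPI ThreadProc( LPVOID pvParam )
{
    int iThreadIndex = ( int )( INT_PTR )pvParam;            // 0-based index, illustrative only

    ::WaitForSingleObject( g_hEvent, INFINITE );
    g_iRDTSCValueThreadStart[ iThreadIndex ] = __rdtsc();    // wakeup timestamp
    return 0;
}

int main()
{
    g_hEvent = ::CreateEvent( NULL, TRUE, FALSE, NULL );     // manual-reset, initially non-signaled

    HANDLE hThreads[ NUM_THREADS ];
    for ( int i = 0; i < NUM_THREADS; i++ )
        hThreads[ i ] = ::CreateThread( NULL, 0, ThreadProc, ( LPVOID )( INT_PTR )i, 0, NULL );

    ::Sleep( 100 );                                          // let both workers reach the wait

    g_iRDTSCValueSetEvent = __rdtsc();                       // stamp, then release the workers
    ::SetEvent( g_hEvent );

    ::WaitForMultipleObjects( NUM_THREADS, hThreads, TRUE, INFINITE );

    for ( int i = 0; i < NUM_THREADS; i++ )
        printf( "Thread %d wakeup latency: %lld clocks\n", i + 1,
                ( long long )( g_iRDTSCValueThreadStart[ i ] - g_iRDTSCValueSetEvent ) );

    for ( int i = 0; i < NUM_THREADS; i++ )
        ::CloseHandle( hThreads[ i ] );
    ::CloseHandle( g_hEvent );
    return 0;
}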

But we have biased tick values in g_iRDTSCValueThreadStart. So it's important to take into account the differences in tick values on different processors, which we can't rely on yet (we still need to verify them).

Oh, I have another question... Why do we have such different Difference values in the log from one iteration to another? As far as I know, processors shouldn't desynchronize in such a short time... Maybe I'm wrong?

>>...Why do we have such different Difference values in the log from one iteration to another?...

I think this is because the Windows Task Scheduler decides exactly when a thread is started and maintains the thread's execution at times that cannot be controlled. I saw your numbers and they vary from 38 nanoseconds ( the best value ) to hundreds of nanoseconds. So multithreaded processing is absolutely unpredictable even if different threads are executed on different CPUs. It is clear to me that you need to do some kind of calibration at the beginning of processing based on the differences in TSC values.
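One possible way to do such a calibration ( this is only an assumption about the approach, not code from this thread ): bounce a single thread between the two CPUs with SetThreadAffinityMask and read the TSC on each; if the migration cost is roughly symmetric, the offset of CPU 1's counter relative to CPU 0's is approximately b - ( a + c ) / 2. Repeating the measurement and taking the median or minimum would make the estimate more robust.

#include <windows.h>
#include <intrin.h>
#include <cstdio>

long long EstimateTscOffsetCpu1VsCpu0()
{
    HANDLE hThread = ::GetCurrentThread();
    DWORD_PTR dwOldMask = ::SetThreadAffinityMask( hThread, 1 );   // bind to CPU 0
    ::Sleep( 0 );                                                  // give the scheduler a chance to migrate

    unsigned __int64 a = __rdtsc();                                // read on CPU 0

    ::SetThreadAffinityMask( hThread, 2 );                         // bind to CPU 1
    ::Sleep( 0 );
    unsigned __int64 b = __rdtsc();                                // read on CPU 1

    ::SetThreadAffinityMask( hThread, 1 );                         // back to CPU 0
    ::Sleep( 0 );
    unsigned __int64 c = __rdtsc();                                // read on CPU 0 again

    ::SetThreadAffinityMask( hThread, dwOldMask );                 // restore the original affinity

    // If the two migrations cost about the same, ( a + c ) / 2 is the CPU 0
    // timestamp at the moment b was taken on CPU 1.
    return ( long long )b - ( long long )( ( a + c ) / 2 );
}

int main()
{
    printf( "Estimated TSC offset ( CPU1 - CPU0 ): %lld clocks\n",
            EstimateTscOffsetCpu1VsCpu0() );
    return 0;
}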

>>>So multithreaded processing is absolutely unpredictable even if different threads are executed on different CPUs>>>
In order to prevent thread scheduling overhead, it is possible to run code at DPC level on one CPU and put the other CPU into a busy-wait loop.

Hi Iliya,

>>In order to prevent thread scheduling overhead, it is possible to run code at DPC level on one CPU and put the other CPU into a
>>busy-wait loop.

Could you provide as many technical details as possible on how to do it ( that is, prevent thread scheduling overhead ) at DPC level ( Deferred Procedure Call, right? ) on some CPU?

It would be nice to see an example in C/C++ code. However, I understand that DPCs are used in kernel / driver programming on Windows platforms. Is that correct?

Thanks in advance.

>>>Could you provide as many technical details as possible on how to do it ( that is, prevent thread scheduling overhead ) at DPC level ( Deferred Procedure Call, right? ) on some CPU?>>>

I will try to write a simple device driver ( I don't have much experience with drivers, so bear with me :) ) which will serve as an example. Thread scheduling won't be eliminated, because driver code runs in an arbitrary thread context, so DPC-level code like the scheduler and dispatcher will continue their work, but passive-level scheduling will be preempted.
My idea is to run the driver at passive level, elevate IRQL to DPC level, do some work on CPU0 while CPU1 is put into a busy-wait loop, and then do it vice versa. In such a scenario passive-level code ( user threads ) will be preempted until IRQL drops back to passive level.

>>>It would be nice to see an example in C/C++ code.>>>

I think it can be done in the following way.
- Raise the IRQL on the current CPU.
- A DPC will be created and the other CPU's IRQL will be raised.
- Some work will be done on the first CPU and the time needed to complete it will be measured.
- At the same time the second CPU will spin in a busy-wait loop ( which could be implemented as a nop instruction ).
- When the first CPU completes, it will be put into the busy-wait loop and the second CPU will start executing some code.
The following kernel-mode routines will be used ( see the sketch below ):
KeRaiseIrql, KeLowerIrql, KeGetCurrentIrql, KeGetCurrentProcessorNumber, KeNumberProcessors, KeSetTargetProcessorDpc.
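Here is a rough, untested sketch of that idea as WDK kernel code ( names and structure are illustrative only; it assumes the documented behavior of KeRaiseIrql, KeSetTargetProcessorDpc and KeInsertQueueDpc ): CPU 1 is held in a busy-wait loop inside a DPC targeted at it, while the thread on CPU 0 raises its own IRQL to DISPATCH_LEVEL and times the measured work with RDTSC. In a real driver the spin must be kept very short, since holding a CPU at DISPATCH_LEVEL for long is not allowed.

#include <ntddk.h>
#include <intrin.h>

static KDPC          g_SpinDpc;
static volatile LONG g_lHoldCpu1 = 1;        // CPU 1 spins while this is non-zero
static volatile LONG g_lCpu1Spinning = 0;    // set by the DPC when it starts spinning

// Runs on CPU 1 at DISPATCH_LEVEL: just busy-wait ( effectively a nop loop ) until released.
static VOID SpinDpcRoutine( PKDPC Dpc, PVOID Context, PVOID Arg1, PVOID Arg2 )
{
    UNREFERENCED_PARAMETER( Dpc );
    UNREFERENCED_PARAMETER( Context );
    UNREFERENCED_PARAMETER( Arg1 );
    UNREFERENCED_PARAMETER( Arg2 );

    InterlockedExchange( &g_lCpu1Spinning, 1 );
    while ( g_lHoldCpu1 != 0 )
    {
        // busy-wait
    }
}

NTSTATUS DriverEntry( PDRIVER_OBJECT DriverObject, PUNICODE_STRING RegistryPath )
{
    KIRQL oldIrql;
    unsigned __int64 iStart, iStop;

    UNREFERENCED_PARAMETER( DriverObject );
    UNREFERENCED_PARAMETER( RegistryPath );

    // Keep the measuring code on CPU 0.
    KeSetSystemAffinityThread( ( KAFFINITY )1 );

    // Prepare a DPC that will busy-wait on CPU 1 and queue it.
    KeInitializeDpc( &g_SpinDpc, SpinDpcRoutine, NULL );
    KeSetTargetProcessorDpc( &g_SpinDpc, ( CCHAR )1 );
    KeInsertQueueDpc( &g_SpinDpc, NULL, NULL );

    // Wait until CPU 1 is actually spinning.
    while ( g_lCpu1Spinning == 0 )
    {
    }

    // Raise to DISPATCH_LEVEL so this thread is not preempted by passive-level scheduling on CPU 0.
    KeRaiseIrql( DISPATCH_LEVEL, &oldIrql );

    iStart = __rdtsc();
    // ... the work to be measured goes here ...
    iStop = __rdtsc();

    KeLowerIrql( oldIrql );

    // Release CPU 1 and restore the thread's affinity.
    InterlockedExchange( &g_lHoldCpu1, 0 );
    KeRevertToUserAffinityThread();

    DbgPrint( "Measured interval: %llu clocks\n", iStop - iStart );
    return STATUS_SUCCESS;
}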

>>...Also, in about 2-3 weeks I'll be able to execute these tests on a new computer with a 3rd generation Intel CPU...

I am already transitioning to a new 64-bit system with an i7-3840QM CPU, and I hope to complete all setups and installs by the end of November.

This is an example of what I was able to get on Ivy Bridge system:
...
Test-Case 4 - Retrieving RDTSC values for CPUs - 2
Threads 1 and 2 created
RDTSC values ( in CPU clocks ):

Iteration Thread 1 Thread 2 Difference
00 2167050799808 2167050799729 79
01 2167050799883 2167050799813 70
02 2167050799986 2167050799916 70
03 2167050800079 2167050800000 79
04 2167050800182 2167050800093 89
05 2167050800275 2167050800196 79
06 2167050800378 2167050800298 80
07 2167050800471 2167050800392 79
08 2167050800555 2167050800485 70
09 2167050800658 2167050800569 89
10 2167050800742 2167050800662 80
11 2167050800835 2167050800756 79
12 2167050800928 2167050800849 79
13 2167050801012 2167050800933 79
14 2167050801096 2167050801017 79
15 2167050801162 2167050801110 52

Statistics ( in clock cycles ):
Thread 1 started at 2167050799342
Thread 2 started at 2167050799309
Difference 33

Thread 1 completed at 2167050801330
Thread 2 completed at 2167050801260
Difference 70

dwThreadAMPrev[0]: 255 ( Processing Error if 0 )
dwThreadAMPrev[1]: 255 ( Processing Error if 0 )
...
So, in that test case both threads were very well synchronized and, as you can see, the differences in RDTSC values never exceeded 90 clock cycles, with the best value being 52 clock cycles.

Note: Tested on a system with Intel Core i7-3840QM ( Ivy Bridge / 4 cores / 8 logical CPUs / ark.intel.com/compare/70846 )
