precision of CPU_Time and System_Clock

John Campbell:

There have been a number of comments about the precision of the standard timing routines available in ifort.
I have written a simple example which differentiates between the apparent precision of SYSTEM_CLOCK, as reported by Count_Rate, and the actual precision available from the values returned.
I have run this example on ifort Ver 11.1, which I have installed.
It shows that both CPU_TIME and SYSTEM_CLOCK update only 64 times per second, which is very poor precision to make available via the Fortran standard intrinsic routines.
Better precision is available (see QueryPerformanceCounter) and should be provided in these intrinsic routines.
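
The essence of the attached test is to count how many distinct values the clock actually returns in a fixed interval; in outline it is something like this (a sketch of the idea, not the attachment itself):

      program tick_granularity
!
!     Count distinct SYSTEM_CLOCK values over one reported second.
!     If the source only updates 64 times per second, "changes" will
!     be near 64, no matter what Count_Rate claims.
!
      implicit none
      integer*8 :: count, rate, last, first
      integer*4 :: changes
      call system_clock (first, rate)
      last    = first
      changes = 0
      do
         call system_clock (count)
         if (count /= last) then
            changes = changes + 1
            last    = count
         end if
         if (count - first >= rate) exit
      end do
      write (*,*) 'Count_Rate reported :', rate
      write (*,*) 'actual ticks seen   :', changes
      end program tick_granularity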

John

Attachment: elapse.f95 (2.54 KB)
Tim Prince:

CPU_TIME can't be compared against QueryPerformanceCounter, as the latter doesn't separate out process time.
I've never understood the explanations of why Windows SYSTEM_CLOCK could not approach the performance of QueryPerformanceCounter or omp_get_wtime. This situation forces us to write applications that switch timers between Linux and Windows.

Steve Lionel (Intel):

Windows updates the "time of day" clock every 10 ms - this is what SYSTEM_CLOCK uses. Yes, Windows has higher precision timers, but they have drawbacks for general use.

Steve
John Campbell:

Steve,

I was not trying to say that CPU_TIME and SYSTEM_CLOCK are the same, but that both of these Fortran standard routines use timing sources that are updated every 1/64th of a second.
For use as a timing routine with the speed of modern processors, this is very poor precision to provide. Comparing 64 Hz with 3 GHz does not look right to me.
I'm not sure which system routines underlie the Fortran standard intrinsics, but a more accurate solution should be provided.
QueryPerformanceCounter could be a better source for SYSTEM_CLOCK.
GetProcessTimes is the best source I have found for CPU_TIME, although I do not understand why it is limited to 1/64 s accuracy, or whether this 64 Hz tick rate can be varied.
The RDTSC instruction is also a possibility.
Providing a more reliable timer via the Fortran standard routines would be the preferred solution.
As for compatibility between Linux and Windows, I'm sure there are other differences between the two implementations.

I understand that a lot of the problem relates to what is available from the Microsoft API, but providing a fast and accurate timer should be simpler than what is on offer.

John

John Campbell:

Steve,

I was hoping that someone might be able to provide some more information on the available timers.
For elapsed time, there are better timers than the one apparently used, such as QueryPerformanceCounter. I don't know of any drawbacks and would recommend its use for SYSTEM_CLOCK.
However, for CPU time, GetProcessTimes is the best I know of. MSDN is a bit vague about whether the clock rate can be varied from 64 Hz.
Does anyone have experience of overcoming this limitation?

IanH:

You keep saying "best", "better" etc, but doesn't that depend on what you are trying to do?

What are you trying to do?

John Campbell:

Ian,

"Best" relates to:
precision,
call time overhead and
side effects.

From past experience, precision is the most significant: QueryPerformanceCounter effectively provides a high precision of about 10^7 ticks per second, limited mainly by the call time overhead.
Other timers, with a precision of only 64 Hz, I would consider poor.
My reading of MSDN is that GetProcessTimes might have a side effect of slowing things down if the clock rate were changed. Unfortunately, I don't know how to change the rate, or whether it can be done; I may have misread the MSDN documentation.
I have been asking if anyone has any knowledge of this; I was hoping that the Visual Fortran developers might.

I just find it surprising that the best CPU time precision we can get is updated at 64 Hz. It must be accumulated somewhere more frequently than this.

I'm not sure where Steve gets 10 ms (100 Hz). If this is not meant to be roughly the same as the 64 Hz I measure, please let me know where the difference comes from.

Again, if anyone knows how to get CPU time to a higher precision, I'd like to know.
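
The only related knob I have found documented is timeBeginPeriod from the multimedia timer API (winmm.lib), which requests a smaller system timer period; whether it also improves the granularity of CPU time accumulation I have not tested. A minimal, untested sketch of calling it from ifort (the interface is hand-written from the MSDN C prototype, not taken from ifwin):

      subroutine raise_timer_resolution
!
!     Request a 1 ms system timer period via timeBeginPeriod (link
!     with winmm.lib).  MSDN says each successful call should be
!     paired with a timeEndPeriod call before the program exits.
!
      interface
         function timeBeginPeriod (uPeriod)
         !DEC$ ATTRIBUTES STDCALL, DECORATE, ALIAS:'timeBeginPeriod' :: timeBeginPeriod
         integer*4 :: timeBeginPeriod
         integer*4 :: uPeriod
         !DEC$ ATTRIBUTES VALUE :: uPeriod
         end function timeBeginPeriod
      end interface
      if (timeBeginPeriod (1) /= 0) write (*,*) 'timeBeginPeriod failed'
      end subroutine raise_timer_resolution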

John

IanH:

Are you timing your program, or timestamping data, or...?

John Campbell:

Ian,

Thanks for your question. I use it mostly for timing programs, where precision is especially important.
I have a shifted subspace eigensolver I obtained from SAP80, where I time two stages of the solution: the matrix reduction and the load case iterations. I use the timing to estimate the relative duration of each stage, and based on this I estimate the convergence time with or without a shifted reduction. With such crude timer precision, my convergence strategy does not work well with ifort; it does with other compilers.
While ifort's SYSTEM_CLOCK might report a high precision via Count_Rate, the tick reality is much different.
I thought that managing the difference between elapsed time (SYSTEM_CLOCK) and processor time (CPU_TIME) was going to be interesting on a multiprocessor PC with ifort, but I haven't managed to get there yet.

John

John Campbell:

To provide a more accurate elapsed time timer, could someone provide a conversion of the following subroutine so that it will compile and run using ifort. It should return a much more accurate elapsed time than SYSTEM_CLOCK, by using the API routine QueryPerformanceCounter.
I would appreciate your assistance.
The improvement could hopefully be demonstrated by including it in the test program elapse.f95 I attached above.

      SUBROUTINE ELAPSE_SECOND (ELAPSE)
!
!     Returns the total elapsed time in seconds
!     based on QueryPerformanceCounter
!     This is the fastest and most accurate timing routine
!
      real*8,   intent (out) :: elapse
!
      STDCALL   QUERYPERFORMANCECOUNTER   'QueryPerformanceCounter'   (REF):LOGICAL*4
      STDCALL   QUERYPERFORMANCEFREQUENCY 'QueryPerformanceFrequency' (REF):LOGICAL*4
!
      real*8    :: freq  = 1
      logical*4 :: first = .true.
      integer*8 :: start = 0
      integer*8 :: num
      logical*4 :: ll
!      integer*4 :: lute
!
!   Calibrate this time using QueryPerformanceFrequency
      if (first) then
         num   = 0
         ll    = QueryPerformanceFrequency (num)
         freq  = 1.0d0 / dble (num)
         start = 0
         ll    = QueryPerformanceCounter (start)
         first = .false.
!         call get_echo_unit (lute)
!         WRITE (lute,*) 'Elapsed time counter :',num,' ticks per second'
      end if
!
      num    = 0
      ll     = QueryPerformanceCounter (num)
      elapse = dble (num-start) * freq
      return
      end

Repeat Offender:

      SUBROUTINE ELAPSE_SECOND (ELAPSE)
      use ifwin, only: T_LARGE_INTEGER, QueryPerformanceCounter, QueryPerformanceFrequency
!
!     Returns the total elapsed time in seconds
!     based on QueryPerformanceCounter
!     This is the fastest and most accurate timing routine
!
      real*8,   intent (out) :: elapse
!
!      STDCALL   QUERYPERFORMANCECOUNTER 'QueryPerformanceCounter' (REF):LOGICAL*4
!      STDCALL   QUERYPERFORMANCEFREQUENCY 'QueryPerformanceFrequency' (REF):LOGICAL*4
!
      real*8    :: freq  = 1
      logical*4 :: first = .true.
      integer*8 :: start = 0
      integer*8 :: num
      logical*4 :: ll
      type(T_LARGE_INTEGER) :: arg
!      integer*4 :: lute
!
!   Calibrate this time using QueryPerformanceFrequency
      if (first) then
         num   = 0
         ll    = QueryPerformanceFrequency (arg)
         num   = transfer(arg,num)
         freq  = 1.0d0 / dble (num)
         start = 0
         ll    = QueryPerformanceCounter (arg)
         start = transfer(arg,start)
         first = .false.
!         call get_echo_unit (lute)
!         WRITE (lute,*) 'Elapsed time counter :',num,' ticks per second'
      end if
!
      num    = 0
      ll     = QueryPerformanceCounter (arg)
      num    = transfer(arg,num)
      elapse = dble (num-start) * freq
      return
      end

      program MAIN__
      real*8 elapse
      call elapse_second(elapse)
      write(*,*) elapse
      end program MAIN__

John Campbell:

Repeat Offender,

Thanks for your changes to the code. I could not find any reference to the type T_LARGE_INTEGER in the ifort help.
I have attached the updated elapse2.f95 program, which demonstrates the relative precision of QueryPerformanceCounter (2.6 MHz) in comparison to the intrinsic SYSTEM_CLOCK (60 Hz).
I think it ticks the boxes in relation to both precision and call time overhead. I don't know of any side effects when taking this option.

I would recommend this as a better timing solution.

Attachments: elapse2.f95 (4.59 KB), elapse2.tce (2.02 KB)
Les Neilson:

T_LARGE_INTEGER is in ifwinty - at least in my IVF v11 version; I doubt it has been moved.

use ifwin, only: QueryPerformanceCounter, QueryPerformanceFrequency

use ifwinty, only: T_LARGE_INTEGER

works

Les

Repeat Offender:

T_LARGE_INTEGER is ifort-speak for the LARGE_INTEGER union that QueryPerformanceFrequency and QueryPerformanceCounter want as a reference argument. Except for big-endian systems where the companion C processor doesn't have a 64-bit integer type, it's the same as INTEGER*8.

I was surprised to see that your machine/OS is one of those that doesn't use RDTSC as the basis for QueryPerformanceCounter. You can use RDTSC directly from ifort if you wish:

module setup
   use ifwin
   implicit none
   private
   public initialize, rdtsc, cp_rdtsc
   ! Machine code for a bare rdtsc-and-return stub, stored as default
   ! integers; the 64-bit version also combines EDX:EAX into RAX.
   integer, parameter :: code32(1) = [-1866256113]
   integer, parameter :: code64(3) = [-1052233457,155721954,-1869560880]
   interface
      function rdtsc()
         integer(8) rdtsc
      end function rdtsc
   end interface
   pointer (cp_rdtsc,rdtsc)   ! Cray pointer: rdtsc() invokes whatever cp_rdtsc points at
contains
   subroutine initialize
      integer code(*)
      pointer (ap,code)
      integer, parameter :: nbits = bit_size(ap)
      ! Allocate a small executable page and copy the code stub into it
      ap = VirtualAlloc(NULL,12_HANDLE,MEM_COMMIT,PAGE_EXECUTE_READWRITE)
      if(nbits == 32) then
         code(1:size(code32)) = code32
      else
         code(1:size(code64)) = code64
      end if
      cp_rdtsc = ap
   end subroutine initialize
end module setup

program main
   use setup
   implicit none
   integer i
   call initialize
   do i = 1, 10
      write(*,*) rdtsc()
   end do
end program main

John Campbell:

Repeat Offender,

Thanks for the info on LARGE_INTEGER.

Also, thanks for the ifort example of using RDTSC. The problem I have always had with RDTSC is that I don't have ready access to the clock rate. For another compiler, I have written a wrapper which, the first time it is used on a PC, times RDTSC for 10 seconds and then stores the calculated clock rate in a file c:\prosser_speed.ini; I then read from the file on subsequent runs. For most (all?) recent processors, this has been the rated speed of the processor, although I do not know of a direct way to get the RDTSC rate.

While the routines you have discussed are elapsed time counters, are you aware of any more precise way of retrieving the accumulated CPU time of a process, at better than 64 Hz?
As I indicated before, I have been using the elapsed time for selecting alternative solution approaches while the program is running, and I have been contemplating how this approach might be applied to parallel applications. I want to monitor both elapsed and CPU time and see if I can make sense of a strategy based on both times. At 64 Hz, the CPU time precision is not there for the test examples I am using.
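
The calibration step is roughly this (a sketch only, assuming the rdtsc() function from the module above has been set up by calling initialize; a production version would cache the result in a file as described):

      subroutine calibrate_rdtsc (ticks_per_second)
!
!     Estimate the RDTSC tick rate by timing it against SYSTEM_CLOCK
!     for about one second.
!
      use setup, only: rdtsc
      real*8, intent (out) :: ticks_per_second
      integer*8 :: c0, c1, rate, t0, t1
      call system_clock (c0, rate)
      t0 = rdtsc ()
      do
         call system_clock (c1)
         if (c1 - c0 >= rate) exit          ! one reported second
      end do
      t1 = rdtsc ()
      ticks_per_second = dble (t1 - t0) * dble (rate) / dble (c1 - c0)
      end subroutine calibrate_rdtsc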

John

IanH:

My understanding is that the per-thread CPU time counters are only updated at the tick rate that the scheduler uses (which is what's behind the ~60 Hz frequency you are seeing for anything that has some dependence on the scheduler). So, as far as I know - no.

Your elapsed time approach has the issue that on a desktop system it will be influenced by things such as the user moving the mouse (etc.) or other background operating system activities.

When you described the problem you were trying to solve, my initial thought was that there must be some easier way of getting a measure of the computational effort required by a particular step than timing it - an iteration counter or similar.

Repeat Offender:

When I am tuning code to see what is the fastest, the units of time have little meaning to me because I just compare the number of clock cycles and from that determine whether the changes I have made were actually an improvement and if so, whether the improvement is worth the risk and effort of incorporating the changes into the working code. Always do a loop with a few measurements so that you can see the signal and the noise. Even so, sometimes the OS throttles the processor and messes up your measurement.

I have only a minimal amount of experience timing multithreaded code and what I did in that case was RDTSC before starting the threads and again after all threads were done. Timing the progress of individual threads seems like it could get noisy as other processes and threads kick each other out of cache as they move from core to core.

John Campbell:

I think I agree with you both that it is easy to get confused about what you are trying to do and what the timed measure is saying. When the code becomes multi-threaded, you can't be sure which thread is being timed and what else might be happening in other threads.
The system elapsed time clock does offer some simplicity in the definition of the measure. Ian, as you have noted, there can always be problems with other processes running. Virus checkers have long been a problem, as has svchost.exe.

I too have a minimal amount of experience in writing multi-threaded or parallel code. I've been reading about it for many years, and ifort is my first chance to see how it can work. I have been trying to understand how effective it is and what the side effects are. For a long time my Fortran code has taken a sequential approach; the vector instruction set has been a much easier implementation and easier to understand.

We'll keep trying to learn! Thanks for your assistance.

John

John Campbell:

ifwin.mod is a large file which I presume defines the calling interfaces for many API routines.

Is there documentation of the Fortran calling protocols to use with these routines? I am interested in the routines:
 GetTickCount
 GetProcessTimes
 GetCurrentProcess

I would like to know if they are used as subroutines or functions, and the type and kind of each argument.
My apologies if this is a trivial question, but I could not find this information in "C:\Program Files (x86)\Intel\Compiler\11.1\054\Documentation\en_US\compiler_f\main_for.chm" (which I notice is the documentation for the previous version of ifort that I am using - I should install the VS update!)

John


Repeat Offender:

To find out about GetTickCount, for example, I would google "GetTickCount msdn"; the first hit gives me useful documentation, including the C prototype and the fact that you link to it via Kernel32.lib. If you then open %INCLUDE%\kernel32.f90 with a text editor you can search for GetTickCount and see how ifort writes its interface - nonstandard, because ifort doesn't provide a STDCALL companion processor for F2003 interoperability.

Sort of a hacker's method, I suppose, but it works well for me.
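
For GetProcessTimes, for example, the resulting call comes out roughly like this (a sketch, assuming the ifwin declarations follow the MSDN prototypes - check kernel32.f90 rather than trusting my memory). Call it twice and difference the results, just as you would with CPU_TIME:

      subroutine process_cpu_second (cpu)
!
!     Accumulated kernel + user CPU time of this process, in seconds.
!     A FILETIME is a 64-bit count of 100 ns units stored as two
!     32-bit halves, low word first, so on little-endian x86 it can
!     be transferred straight into an integer*8.
!
      use ifwin, only: GetCurrentProcess, GetProcessTimes, T_FILETIME
      real*8, intent (out) :: cpu
      type (T_FILETIME) :: creation, exit_time, kernel, user
      integer*4 :: ok
      integer*8 :: k, u
      ok  = GetProcessTimes (GetCurrentProcess (), creation, exit_time, kernel, user)
      k   = transfer (kernel, k)
      u   = transfer (user, u)
      cpu = dble (k + u) * 1.0d-7
      end subroutine process_cpu_second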

iliyapolak:

Quote:

Repeat Offender wrote:

To find out about GetTickCount, for example, I would google "GetTickCount msdn"; the first hit gives me useful documentation, including the C prototype and the fact that you link to it via Kernel32.lib. If you then open %INCLUDE%\kernel32.f90 with a text editor you can search for GetTickCount and see how ifort writes its interface - nonstandard, because ifort doesn't provide a STDCALL companion processor for F2003 interoperability.

Sort of a hacker's method, I suppose, but it works well for me.

If you are interested in the exact machine code implementation of the GetTickCount function, I would advise using the IDA Pro disassembler. I suppose that this function might indirectly (when acting as a wrapper) access the RTC clock.

iliyapolak:

>>>Your elapsed time approach has the issue that on a desktop system it will be influenced by things such as the user moving the mouse (etc.) or other background operating system activities.>>>

The problem lies in the unpredictable and pseudorandom (from the programmer's point of view) behaviour of the scheduler.

John Campbell:

Thanks for your feedback, especially Repeat Offender. I finally took your example code from last July and adapted my performance time testing to ifort (see attached).
I found the call info I needed in kernel32.f90 and was able to generate a test routine for elapsed time and CPU time info.

Only QueryPerformanceCounter and RDTSC provide timing info at better than 64 ticks per second, while all the Fortran intrinsics are very poor.
SYSTEM_CLOCK (an elapsed time counter) should be changed to use either of these routines. Although it reports a clock rate of 1,000,000, the reality is it should report 64!! (It should be prosecuted for misrepresentation.)

QueryPerformanceCounter and QueryPerformanceFrequency (2.6 million ticks per second) work well.
RDTSC has a tick rate equal to the processor clock rate (2.67 billion ticks per second) and works very well, but might have some problems.
These are both elapsed time counters.

I know of no CPU time accumulator that is updated more than 64 times per second.
I have been prompted to do this review because of the poor performance of the intrinsic routines offered by ifort. Where possible they should be improved.

I have not tested these routines for parallel operation.

John

Attachment: ifort-timing-test.f90 (13.24 KB)
iliyapolak:

>>>QueryPerformanceCounter, and QueryPerformanceFrequency ( 2.6 million cycles per second ) works well rdtsc  has a cycle rate = processor cycle rate (2.67 billion cycles per second ) works very well, but might have some problems. These are both elapsed time counters>>>

You can use RDTSC or the QueryPerformanceCounter/QueryPerformanceFrequency timing functions (instruction); they are very accurate. RDTSC's problem is its high latency, and the CPUID serialization can add even more latency, so for very short loops RDTSC is not recommended. Afaik QueryPerformanceCounter uses the HPET timer; if you are interested, here is an article about the HPET drawbacks: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commit;h=54ff7e595d763d894104d421b103a89f7becf47c

John Campbell:

iliyapolak,

Thanks for your comment. What I have been trying to highlight is the poor performance of the Fortran intrinsic timers, both elapsed and CPU. Those provided have an accuracy of 1/64th of a second, which is a very long time interval for a processor that is typically cycling in excess of 2 GHz. I have identified alternatives for elapsed time, but not for CPU time.

I had trouble understanding the problems you have identified with RDTSC and QueryPerformanceCounter. I'm not sure how significant they are in comparison to the problem of using a timer accurate to 0.015 seconds. Certainly, using a timer that says the elapsed time for a program example is zero is not very helpful.
I'm not sure at what sampling frequency the problems you refer to become an issue. I would expect that being able to sample at, say, 10,000 times per second (which is once every 200,000 processor cycles) is not a high frequency, given what a DO loop can do in 200,000 cycles. The point I am trying to make is that you need to be able to identify activity at better than once every 40 million processor cycles, which is what 1/64th second accuracy provides.

Anyway, the code I have attached demonstrates how to access these more accurate timers. I would recommend that SYSTEM_CLOCK use RDTSC or QueryPerformanceCounter, although I have never found a call to get the clock rate of RDTSC, which would avoid the calibration approach. I actually have a file c:\processor_mhz.txt which stores the value, to avoid the calibration loop.

John

iliyapolak:

Hi John

Thanks for your answer. Unfortunately I do not know Fortran, so I won't be able to offer a helping hand with everything related to programming in Fortran, but I will try to help you with everything related to time measurement on the Windows platform.


Tim Prince:

We've pointed out several times that the OpenMP (omp_get_wtime) and MPI timers are much better than the Windows Fortran intrinsics, while system_clock (with integer(8) arguments) is satisfactory in the Linux implementations of many compilers, including ifort.

This situation presents a dilemma, with the implication that the timer intrinsics (other than possibly the OpenMP timers) can't be made portable between Windows and other operating systems.

On occasion, the Intel compiler team has met such challenges, as when the Microsoft __rdtsc intrinsic was added to Intel C and C++ for Linux as well as Windows.
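
For example, a minimal sketch of the portable approach (compile with /Qopenmp on Windows; omp_get_wtick reports the timer's actual resolution):

      program wtime_demo
!
!     omp_get_wtime as a portable elapsed-time source.
!
      use omp_lib, only: omp_get_wtime, omp_get_wtick
      implicit none
      real*8    :: t0, t1, s
      integer*4 :: i
      t0 = omp_get_wtime ()
      s  = 0
      do i = 1, 10000000                ! some measurable work
         s = s + sqrt (dble (i))
      end do
      t1 = omp_get_wtime ()
      write (*,*) 'checksum       :', s
      write (*,*) 'elapsed (s)    :', t1 - t0
      write (*,*) 'timer tick (s) :', omp_get_wtick ()
      end program wtime_demo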

iliyapolak:

>>>Thanks for your comment. What I have been trying to highlight is the poor performance of the fortran intrinsic timers, both elapsed and CPU>>>

The QueryPerformanceCounter function, which probably acts as a user mode wrapper over the HPET timer, should be able to achieve 3 ms time precision on the Windows 7 platform. The HPET timer supports periodic and aperiodic time measurement, which is signalled by interrupt firing.

>>>I had trouble understanding the problems you have identified with RDTSC>>>

Sorry for not explaining it more clearly. RDTSC itself is no longer recommended on multithreaded and frequency-throttled CPUs. The main reason is that your code can be scheduled to run on a different CPU, and the operating system (ACPI) can lower the frequency of the CPU when the load is not significant, so RDTSC, being derived from the QPI clock, is not reliable enough for time measurement. Microsoft recommends the HPET timer, which is not dependent on the CPU. Moreover, there is also an issue when you are measuring very short code blocks, for example a few assembly instructions: in such a situation the measured code is shadowed by the longer latency of RDTSC and of the CPUID used for serialization. So you need to run your code hundreds or thousands of times in order to amortize the RDTSC and CPUID latency.

iliyapolak:

Here is a good article about the drawbacks of RDTSC and the usage of this instruction to measure the performance of a few instructions.

http://software.intel.com/en-us/forums/topic/306222

John Campbell:

Tim,

I don't agree with your restrictions on the Windows implementation of SYSTEM_CLOCK just because it would be different from Linux. There are many differences between the two operating systems.

The Fortran standard provided SYSTEM_CLOCK and CPU_TIME to give a standard way of accessing these measures of performance. If OpenMP has identified better ways of providing this information, then the Intel implementation of the Fortran intrinsics should be improved. The point of these routines was to provide more standard and convenient coding, while you are suggesting we go back to a non-standard approach. This all takes effort for ifort Fortran users, effort which could be better spent providing improved intrinsic routines.

I am not aware of the standard including multi-thread issues for these intrinsics, but the testing I have described has not approached this problem either.

My reason for investigating all this is that the results from SYSTEM_CLOCK showed that the elapsed time for the routine I tested was zero, which is not a very helpful result.

I hope you might reconsider, so that all other Fortran users of ifort do not have to go to the effort I have.

John

iliyapolak:

>>>I don't agree with your restrictions on the Windows implementation of SYSTEM_CLOCK just because it would be different from Linux. There are many differences between the two operating systems.>>>

I suppose that the hardware timers are the same on both operating systems, and the timer registers as seen by both are also the same.

Tim Prince:

"results from SYSTEM_CLOCK showed that the elapsed time for the routine I tested was zero"

As others pointed out, the actual update interval (as opposed to the count_rate) for SYSTEM_CLOCK on Windows is in the range 0.01 to 1/64 seconds, so you may measure zero time for smaller intervals. On Linux, SYSTEM_CLOCK will resolve intervals as small as microseconds when used correctly. It is non-portable only in this sense of poor resolution on Windows. I don't make the rules, and I agree entirely about the relative inconvenience of timing on Windows.

There are long-established benchmarks, such as the Livermore Fortran Kernels, which perform an analysis to find out how many repetitions are needed to get satisfactory timing accuracy. That benchmark may take half an hour to run with a timer of 1/64 s resolution, or as little as 3 seconds using rdtsc. rdtsc of course exhibits degraded synchronization among CPUs, in spite of the best efforts of the OS.

Benchmarks like lmbench, which want to measure cache timing or other events that may take only a few CPU clock cycles, must use more inconvenient techniques, with a total lack of synchronization between CPUs.
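
The repetition idea, in outline (a sketch only, not LFK itself): grow the repeat count until the measured interval dwarfs the timer granularity.

      subroutine timed_repetitions (kernel, t_per_call)
!
!     Repeat a test kernel until the measured interval reaches one
!     second, so that a ~1/64 s timer granularity contributes under
!     2% error to the result.
!
      external  :: kernel
      real*8, intent (out) :: t_per_call
      integer*8 :: c0, c1, rate, reps, i
      call system_clock (count_rate=rate)
      reps = 1
      do
         call system_clock (c0)
         do i = 1, reps
            call kernel
         end do
         call system_clock (c1)
         if (c1 - c0 >= rate) exit       ! interval spans >= 1 s
         reps = reps * 2                 ! otherwise double and retry
      end do
      t_per_call = dble (c1 - c0) / dble (rate) / dble (reps)
      end subroutine timed_repetitions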

iliyapolak:

Your routine ran too fast to be measured precisely by CPU_TIME. Even a raw Win32 thread can run for a shorter period than the quantum, which is based on the clock interrupt.

John Campbell:

Tim,

Your repeated justification of SYSTEM_CLOCK having such unsuitably poor precision does not convince me, or probably many others.
It is ridiculous that this situation should persist.

SYSTEM_CLOCK should be changed to use either QueryPerformanceCounter or RDTSC, both of which are much more suitable than the existing source, which is probably GetTickCount.

John

iliyapolak:

I completely agree with you. By the way, if SYSTEM_CLOCK is really based on GetTickCount, then the time measurement can be biased by sleep and hibernation states.

iliyapolak:

There are two timing providers on the Windows platform: one is the so-called interrupt time and the other is called system time. It is not clear which one the Fortran timing routines use. For short performance measurements of code execution, the more accurate system time should be applied. System time functionality is represented (accessed) by the user mode QueryPerformanceCounter/QueryPerformanceFrequency pair of functions.

Sergey Kostrov:

>>...Even Win raw thread can run for shorter period than quantum which is based on clock interrupt...

A test case with C/C++ sources, please! I really would like to see your test case that proves it.

iliyapolak:

And how can I predict that my thread will run for less than the quantum period, when the thread executes in an unpredictable environment? If I set the lowest priority, how can I know and ensure that a more privileged thread will be scheduled to run before my thread's quantum tick expires? There is the option of creating another, more privileged thread, but can anyone be sure that this thread will preempt the first thread before the quantum expires? And how can you be sure that a higher priority interrupt will not run and preempt all those threads? Maybe you can shed some light on it.

Calling Sleep(0) on the currently executing thread will stop the thread's execution before its quantum expires.

In pseudocode:

#include <windows.h>
#include <stdio.h>

int main(void)
{
    HANDLE currThHndl;

    currThHndl = GetCurrentThread();

    if (currThHndl == NULL) {
        printf("Error obtaining current thread handle 0x%lx \n", GetLastError());
        ExitProcess(0);
    }
    else
        printf("GetCurrentThread successfully called; current thread pseudo handle is 0x%p \n", currThHndl);

    /* Calling Sleep with a zero argument: if successful, the thread
       relinquishes the remainder of its quantum and the next ready
       thread of highest priority will run. */
    Sleep(0);

    return 0;
}

Btw, the sentence you quoted is taken from the Windows Internals book, 6th edition, and you will agree that this book was written by real experts on the Windows kernel.

iliyapolak:

@Sergey

What makes you think that some internal (kernel mode) OS mechanism or behaviour can be exactly measured or estimated by user mode client code?

Sergey Kostrov:

>>...the sentence you quoted is taken from the Windows Internals book, 6th edition...

Hold on, please. You're always taking something from books and not providing C/C++ sources implemented by you which prove, or disprove, what you've said. Sorry, but I don't see any evidence that you're doing serious programming, and that is why you're quoting someone else's statements. Once again, my question is: could you prove it?

... a raw Win32 thread can run for a shorter period than the quantum, which is based on the clock interrupt...

And what is a raw thread? Or what is not a raw thread? Could you explain it yourself? I'd like to see two C/C++ examples, one of a raw thread and one of a not-raw thread.

Of course, many IDZ users (including me) also quote MSDN, articles and other docs. However, we do practical things, and we cannot be too theoretical all the time, because many IDZ users have practical issues or problems and need practical solutions.

Sergey Kostrov:

>>...What makes you think that some internal (kernel mode) OS mechanism or behaviour can be exactly measured or
>>estimated by user mode client code?

I did not start that discussion, and please don't answer my question with another question until you've answered my initial one. There has to be a dialog when technical issues are discussed.

You forget that in another thread I provided a complete test case measuring the differences in values returned by the RDTSC instruction executed from several threads, with accuracy of several nanoseconds (of course this is not absolutely accurate, but it will satisfy many performance measurement requirements), in a non-deterministic, non-realtime environment like Windows XP or Windows 7. Actually, there are already two threads related to that subject:

Forum Topic: Synchronizing Time Stamp Counter
Web-link: software.intel.com/en-us/forums/topic/332570

Note: A test case is attached to my post dated on Tue, 11/06/2012 - 06:49

and

Forum Topic: TSC Synchronization Across Cores
Web-link: software.intel.com/en-us/forums/topic/388964

iliyapolak:

>>>Once again, my question is: could you prove it?>>>

Is that book not enough for you? Do I need to prove the implementation of the kernel scheduler? Should I ask the authors of that book to prove the thread quantum question? According to your logic, I would need to prove every technical sentence in order to satisfy you.

A raw thread is a native Windows thread; a not-raw thread could be, for example, a Java thread running on the Win platform.

Some code examples, not related to this discussion, are attached.


Attachments:

iliyapolak:

Examples of not-raw threads, in this case Java threads.

Attachments:

Sergey Kostrov:

>>Do I need to prove the implementation of the kernel scheduler?

No.

>>Should I ask the authors of that book to prove the thread quantum question?

I simply want to understand how it could be possible, if it is possible at all, regarding your statement about the execution of a Win32 thread. I'd like to bring clarity to your statement and nothing else (!).

iliyapolak:

Sorry for creating confusion.

I think that the Sleep() function called with arg == 0 can simulate the behaviour where a thread is stopped before its quantum expires. One interesting question arises, related to the exact moment within the quantum interval at which execution is postponed, and how to control it programmatically. If a new thread is created from within the main thread, with its priority raised to high so that it is scheduled to run immediately after creation, then how (inside the thread's function), or better, when, should the Sleep(0) call be executed in order, for example, to stop execution after 1/2 of the quantum has expired?

John Campbell:

I have again reviewed the information I have available on the accuracy of different timing routines for CPU or elapsed time. I have attached an updated set of Fortran calls for the 6 timing routines I have identified. I would recommend these as my best use of the identified API routines; any recommendations for improvement would be appreciated.
The timing test program has been improved to test each routine for about 5 seconds.
RDTSC requires an initialising routine to estimate the returned tick frequency, which on my test machines is the processor clock rate.

I have identified 2 that are good for elapsed time: RDTSC and QueryPerformanceCounter.
All other routines update their time value at 64 ticks per second.
It would be good if there were a more accurate CPU time routine, but I have not found one. I should see what OpenMP uses!
Again, I would recommend that SYSTEM_CLOCK be fixed in ifort so that we can reliably use the Fortran intrinsic routine.

The following table summarises the performance of the 6 routines I have identified.

Routine                 Ticks per  CPU cycles  Notes
                           second    per call 
RDTSC                    88514093          30  ticks at processor rate, accuracy limited by call rate 
QueryPerformanceCounter   2594669          47  possibly more robust than RDTSC 
GetTickCount                   64          14  fast, but poor precision 
system_clock                   64         325   
GetProcessTimes                64         386  poor precision but best identified for CPU 
CPU_Time                       64         387 

Ticks per second : the number of unique time values returned in a second (the best accuracy that can be achieved)
CPU cycles per call : the number of processor cycles per routine call (the call overhead)

John

( I am hoping the plain text preserves the courier font layout of the table)

Attachment: ifort-timing-test.f90 (16.61 KB)
Sergey Kostrov:

>> [John's timing table quoted from the previous post]

Thanks, John! These numbers are really interesting. In 99% of cases GetTickCount satisfies the requirements I have.

Did you take into account that your test was executed in the non-deterministic environment of a Windows operating system? I simply want to say that, in order to make these measurements as accurate as possible, you need to boost the priority of your process to High or Realtime. In that case your process will preempt threads with lower priority currently executing on the system, and they won't affect the accuracy of the measurements.

Also, Patrick Fay (Intel) recommends doing such tests on a different CPU instead of the first one (it is named CPU 0 in the Task Manager).
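
In outline, that preparation looks something like this (a sketch using the usual ifwin declarations; the kind used for the affinity mask result is my assumption):

      subroutine prepare_for_timing
!
!     Boost process priority and pin this thread to CPU 1 (mask = 2),
!     away from CPU 0, before running a timing test.
!
      use ifwin, only: GetCurrentProcess, GetCurrentThread, &
                       SetPriorityClass, SetThreadAffinityMask, &
                       HIGH_PRIORITY_CLASS
      implicit none
      integer*4 :: ok
      integer (INT_PTR_KIND()) :: oldmask
      ok      = SetPriorityClass (GetCurrentProcess (), HIGH_PRIORITY_CLASS)
      oldmask = SetThreadAffinityMask (GetCurrentThread (), &
                                       int (2, INT_PTR_KIND ()))
      end subroutine prepare_for_timing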

iliyapolak:

Thank you, John, great job.

@Sergey, returning to your question: I have a simple multithreaded Win32 program which uses the Sleep() function to stop its currently running thread, so such an action can simulate what I wrote in one of my previous posts. So far I have been unable to relinquish the CPU at a chosen point during the quantum interval, and I do not know if it is possible.

Attachment: threadquantumtestapp.cpp (9.8 KB)
John Campbell:

Sergey,

For elapsed time, RDTSC is the best for me, as it takes 30 processor cycles and gives high precision (88 million ticks per second, which is the call rate). While GetTickCount is faster to call (only 14 processor cycles), it has very poor precision (64 ticks per second), so it is not useful for reporting short elapsed time tests.
I have not tested the accuracy of these timers over a short or long duration. For the types of testing I do, this is not as significant, as there are many external distractions to the meaning of run times, such as other process interruptions. My aim has been to get an indication of the relative elapsed times of different programming approaches.

That's elapsed time; when it comes to CPU time, however, the best available has precision of only 1/64 second. I cannot find anything with better precision.
When it comes to timing processes and OpenMP coding, the elapsed time is what matters, while the CPU time to elapsed time ratio gives an indication of how many threads are effectively running simultaneously.

Unfortunately, I have not achieved very good ratios for the OpenMP programs I have been developing. While I can get multiple threads to run, I am getting clashes in other areas. I'm being told cache clashes are my latest problem, so an effective OpenMP solution using ifort Ver 2011 is a way off.

John

iliyapolak:

>>>While GetTickCount is faster to run ( only 14 processor cycles)>>>

Do you mean the total time needed to execute this call, from the user mode stub through the switch to kernel mode?

Sergey Kostrov:

[ Iliya wrote ]

>>...So far I have been unable to relinquish the CPU at a chosen point during the quantum interval, and I do not know if it is possible...

That's not a problem: a negative result is also a result, because it proves or disproves something. Thanks for the update.
