Q&A: RDTSC to measure performance of small # of FP calculations

Intel Software Network Support

The following is a question received by Intel Software Network Support, followed by the responses provided by our Application Engineering team:


Q. I am developing on a Pentium 4 with Windows XP and the DevCpp/GCC compiler. I need to measure the performance of a small number of floating point calculations (for example, fadd, fsub and so on). The Enhanced Timer is not suitable because of its overhead. Now I am looking for examples of how to measure using processor clocks (the RDTSC instruction). The most frequently referenced information I found was in "Using the RDTSC Instruction for Performance Monitoring", an Intel Corporation document (1997). The source code part of this document (3A) is exactly what I need. To use it, it is necessary to calculate the overhead first (the variable "base") and "warm up" the cache, due to the effects of cache misses and other processes using the same processor. Unfortunately the variable "base" changes all the time, so it is not possible to produce repeatable measurements. I would be grateful for some advice.


A. We forwarded this question to several engineers, and received the following responses:



#1. My first question in response to "I need to measure a small number of floating point operations" would be WHY? If it's only a small number, it can't be performance critical!


If it's a large number (even if only occasionally, in some inner loop), then "overhead" doesn't matter and you can use your timing routines (ETimer, RDTSC, QueryPerfCounter, etc.). You may have to re-harness your code to do this, of course.


#2. Measurements with RDTSC are most credible if the number of clocks between the pair of RDTSC instructions is at least a few thousand, preferably tens of thousands.

Typically, one wraps the interesting code sequence in a loop. Also (for certain OS reasons), you should repeat this multiple times -- the first measurement is usually wrong.

i.e.:

repeat 5 times
    start = RDTSC
    loop 50,000 times
        the small number of FP instructions I want to test
    endloop
    end = RDTSC
end repeat
When you do it this way, loop overhead is pretty incidental, and you can just compute (end-start)/50000 for each iteration and get your performance. I would print each of the 5 trials.

I would expect the first iteration result to be quite different from the following 4 results.


#3. You should also be aware of the caveats around measuring something on a Pentium 4 processor. It will be significantly different on our new cores. I recommend you get a copy of the Intel VTune Performance Analyzer.


#4. Our guess is that you might be reverse-engineering performance for key FP sequences and working out cache latencies and stride/timing semantics using these annotations.

If this is the case, our guess is that you are likely doing this while playing off requisite algorithm/blocking strategies, perhaps even while comparing our u-arch with a competitive one.

More unlikely, but if it turns out that you are tuning small code sequences on an out-of-order (OOO) machine, our recommendation would be to guide you otherwise.

RDTSC on the Pentium 4 processor is noisy, synchronizes the pipeline, and at last check had a latency of ~90-120 clocks on the Pentium 4 (former codename Northwood) implementation.

This would certainly introduce "Heisenberg" uncertainty aspects into your measurements.

Which version of GCC are you using? Hopefully something post 3.3.* and even 4.1.* would be better.

In the end, if you choose to use a "counter", you will be challenged by signal-to-noise issues unless you account for them in the set-up/design of your performance experiments.



Q. In response to #1:
For my scientific project, I need to measure the performance of a small algorithm on different architectures (processors, operating systems and so on). The algorithm contains just additions and subtractions of floating point numbers (4-5 operations). I have already measured it with counters like QueryPerfCounter and got some results. To obtain them, I needed to deal with effects such as loop overhead, the cost of calling QueryPerfCounter through the Windows API, cache refresh and others, which produce a large overhead compared with the operations I need to measure. Even after taking all of these effects into account, the results are unfortunately not precise enough. For this reason, I have decided to measure primarily with RDTSC.


In response to #2:
I have found two methods of measurement in the document "Using the RDTSC Instruction for Performance Monitoring". One of them deals with short stretches of code, such as in my case. To overcome the effects of instruction and data cache misses, the technique of cache warming is applied. Here is the assembler code (it should be repeated 3 times):


CPUID          ; serialize the pipeline
RDTSC          ; read the time-stamp counter into EDX:EAX
mov cyc, eax   ; save the low 32 bits
CPUID          ; serialize again
RDTSC          ; second reading
sub eax, cyc   ; difference = measurement overhead
mov base, eax  ; store it in "base"


Since the variable "base" changes for each measurement, it is impossible to get repeatable results. This is my main problem at the moment.


In response to #3:
Is the Intel VTune Performance Analyzer also suitable for a small number of operations (as in my case)?


In response to #4:
I am using DevC++ 4.9.9.2 with GCC. I would be glad if you could describe these issues in more detail.


A. Our engineers responded:



#1. Here's some additional data covering RDTSC operation. I took the time to dust off previous work and the corresponding diagnostic programs and re-examined them for validity.

First, as to whether executing RDTSC distorts the measurement: it will, for shorter instruction sequences that execute within the "shadow" of an RDTSC execution. Assuming no power/thermal events that affect core clock frequency take place, RDTSC is ~80 clocks on the Pentium 4 microarchitecture and ~65 clocks on the Intel Core microarchitecture. This was the basis of my Heisenberg allusion: once you insert a pair of RDTSC instructions, you essentially cannot measure time spans less than about twice the pipeline "shadow". Even then, one must be cognizant that the recovered precision is in direct proportion to the span's duration relative to twice the pipeline "shadow". This is the lower bound on what time span can be measured using present instruction-based technology.

Second, as to whether there is jitter among pairs of RDTSC used to measure the time span of an instruction sequence: if the code executes purely within the core (e.g. recurrence relations), the Pentium 4 is very faithful here when executing from the trace cache, and no jitter is ever seen (at least by me). On the Core microarchitecture, you will experience jitter, perhaps up to ~25% but typically ~5% of the time span being measured. I attribute this to variance in instruction fetch/decode operations when code is not ideally placed relative to the measurement and control-flow groups of instructions. There is almost always jitter among RDTSC pairs if there are outstanding memory operations in the pipeline: the standard deviation of measured values ranges up to ~30% for short, less iterative sequences and around 5-10% for long, more iterative sequences. This is true on both the Pentium 4 and Core microarchitectures, and use of simple binning is advised, especially when looking for best-case performance.


#2. Regardless of what the PRM says, the sequence:
rdtsc
{a small number of instructions with a cumulative latency less than hundreds or thousands of clocks}
rdtsc

is very unlikely to yield a reliable result. The CPUID step is probably not needed either, although there are other opinions on that point.

The rdtsc instruction is serializing with respect to itself. It is not actually serializing.

Fundamentally, what all the respondents are saying is that this is an out-of-order machine, and the very notion of determining the latency of a 3-instruction sequence is quite slippery. You can get very reliable measurements of larger blocks of code (with a few caveats as noted below). But don't try to measure something small. AND check that your result is repeatable and your measurement stable.


==


Lexi S.


Intel Software Network Support


http://www.intel.com/software



Igor Levicki

I would like to add that from my experience it is possible to measure even single instruction performance (approximate of course due to dependencies and what not) if you repeat it at least 100,000 times in a loop between two RDTSC instructions.

-- Regards, Igor Levicki. If you find my post helpful, please rate it and/or select it as a best answer where it applies. Thank you.
hutch--

In response to the original question, I suggest that on late PIV hardware (Northwood and Prescott core machines) you have little chance of getting reliable timings for a short instruction sequence, for a variety of reasons.



In the Intel staff responses it has already been mentioned that the first iteration is almost always slower than later iterations, but there is another factor that has always affected timings under ring-3 access in 32-bit Windows OS versions. With higher-privileged processes able to interfere with lower-privilege-level operations, you will generally get at least a few percent variation on small samples, and it gets worse as the sample gets smaller.



You can reduce this effect by setting the process priority to high or time-critical, but you will not escape it under ring-3 access. I have found in practice that for real-time testing you need a duration of over half a second before the deviation comes down to within a percent or two.



What I would suggest is that you isolate the code in a separate assembler module and write code of this type:


    push esi
    push edi

    mov esi, large_number    ; loop count, adjusted below
    mov edi, 1

    align 16
  @@:
    ; your code to time here
    sub esi, edi
    jnz @B

    pop edi
    pop esi

Adjust the immediate "large_number" so that the code you are timing runs for over half a second (over 1 second is better), set your process priority high enough to reduce the higher-privilege interference to some extent, and you should start to get timings with around 1% or lower variation.



Two trailing comments. First, the next-generation Intel cores will behave differently, on a scale something like the differences between the PIII and PIV processors, so be careful not to lock yourself into one architecture. Second, as far as I remember, the x87 FP instruction range, while still available on current core hardware, is being replaced by much faster SSE/SSE2/SSE3 instructions, so if your target hardware is late enough to support them, you will probably get a big performance gain if you can use the later instructions.



Regards,



hutch at movsd dot com

http://www.masm32.com


Intel Software Network Support

Here's a link to a related post: Problem with rdtsc on Pentium D processor


Also, here is another variation we recently received on the same basic question, included here to increase the probability of this solution coming up in keyword searches:



I am trying to use the RDTSC instruction to time my high-performance code. I was trying to figure out how many cycles the RDTSC instruction itself takes (this isn't documented anywhere, as far as I can tell). I have a small bit of assembly code that demonstrates a problem I'm having. It compiles and runs fine with both the Intel and GNU compilers (3.3, 4.0, etc.).


When I compile this and execute it under Cygwin (running on Windows XP) on an AMD 4200, I get:


./a.exe
ticks per rdtsc 6

which isn't 1 or 2, but I can live with 6 clock ticks to process a seldom-called op.


If I compile and run this under Mac OS X (a new Apple MacBook Pro with an Intel Core 2), I get 65?!?!


If I compile and run this on SUSE Linux on a Xeon processor, I get 85?!?! (The Intel and GNU compilers agree on this.)


I'm not even putting in serializing instructions. Does that look right to anyone?


Can anyone verify they get the same results on their x86 machines?


The code:


#include <stdio.h>

int main(void)
{
    unsigned long long int t0, t1;
    int result;
    unsigned int ret0[2];
    unsigned int ret1[2];

    __asm__ __volatile__("rdtsc" : "=a"(ret0[0]), "=d"(ret0[1]));
    __asm__ ("xorl %ecx, %ecx\n"
             "L1:\n"
             "rdtsc\n"
             "rdtsc\n"
             "rdtsc\n"
             "rdtsc\n"
             "rdtsc\n"
             "rdtsc\n"
             "rdtsc\n"
             "rdtsc\n"
             "rdtsc\n"
             "rdtsc\n"
             "rdtsc\n"
             "rdtsc\n"
             "rdtsc\n"
             "rdtsc\n"
             "rdtsc\n"
             "rdtsc\n"
             "addl $16, %ecx\n"
             "cmpl $8192, %ecx\n"
             "jne L1");
    __asm__ __volatile__("rdtsc" : "=a"(ret1[0]), "=d"(ret1[1]));

    t0 = *(unsigned long long int*)ret0;
    t1 = *(unsigned long long int*)ret1;
    result = (t1 - t0) / 8192;
    printf("ticks per rdtsc %d\n", result);
    return result;
}


I've run this on various AMD and Intel machines. Most of the AMD machines return in 6 to 8 cycles; most of the Intel machines I've tried return in 60 to 80 cycles, sometimes as high as 100. It would be nice if there were some way to query the time register faster; this makes performance measuring and tuning a sketchy affair on Intel chips. Is the rdtsc instruction serializing for some reason (draining the pipelines...)?


Our engineers agree this question is also addressed by the solution in the first Q&A above.


==


Lexi S.


Intel Software Network Support


http://www.intel.com/software




Intel Software Network Support

Q. I read Mike Stoner's article entitled Portable Performance Measurement Macros for Intel Architecture. I am studying your IAPERF.H file for using the RDTSC instruction to read the time-stamp counter. I noticed that the CPUID instruction immediately precedes the RDTSC. Why? I cannot find these two instructions used together in the IA-32 documentation.


A. The CPUID instruction serializes the processor pipeline so that all of the preceding instructions must retire before it begins execution. Likewise, the following code will not begin execution until the CPUID retires. This is thought to provide a more accurate cycle count on the code being measured. Really, it shouldn't matter very much if you are measuring something that executes for a million cycles or more.


==


Lexi S.


Intel Software Network Support


http://www.intel.com/software



