Q&A: RDTSC to measure performance of small # of FP calculations

hutch--
December 29, 2006 9:49 PM PST
#2 Reply to #1
In response to the original question: on late PIV hardware (Northwood and Prescott core machines) you have little chance of getting reliable timings for a short instruction sequence, for a variety of reasons.

The Intel staff responses have already mentioned that the first iteration is almost always slower than later iterations, but there is another factor that has always affected timings under ring3 access in 32-bit Windows versions. Because higher-privileged processes can interfere with lower-privilege operations, you will generally get at least a few percent variation on small samples, and it gets worse as the sample gets smaller.

You can reduce this effect by setting the process priority to high or time critical, but you will not escape it under ring3 access. I have found in practice that for real-time testing you need a duration of over half a second before the deviation comes down to within a percent or two.
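To make "deviation within a percent or two" concrete, here is a minimal C sketch of one way to measure the variation across repeated runs. The function name `spread_percent` and the choice of (max - min) / min as the measure are assumptions for illustration, not something from this thread. The first sample is skipped because, as noted above, the first iteration is almost always slower.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: percent spread of repeated timings.
 * seconds[0] is treated as a warm-up run and ignored.
 * Returns 100 * (slowest - fastest) / fastest over the remaining samples. */
double spread_percent(const double *seconds, size_t count)
{
    assert(seconds != NULL && count >= 2);

    double lo = seconds[1];
    double hi = seconds[1];
    for (size_t i = 2; i < count; ++i) {
        if (seconds[i] < lo) lo = seconds[i];
        if (seconds[i] > hi) hi = seconds[i];
    }

    assert(lo > 0.0);               /* timings must be positive */
    return 100.0 * (hi - lo) / lo;
}
```

If the result stays above a couple of percent, the sample is probably still too short or something else on the machine is interfering.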

What I would suggest is that you isolate the code in a separate module in an assembler and write code of this type.

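    push esi                ; preserve the registers used by the loop
    push edi

    mov esi, large_number   ; loop counter
    mov edi, 1              ; decrement step kept in a register
  align 16                  ; align the loop entry point
  @@:
    ; your code to time here
    sub esi, edi            ; count down
    jnz @B                  ; repeat until the counter reaches zero

    pop edi                 ; restore the saved registers
    pop esi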

Adjust the immediate "large_number" so that the code you are timing runs for over half a second (over 1 second is better) and set your process priority high enough to reduce the higher-privilege interference to some extent. You should then start to get timings with around 1% variation or lower.
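Picking "large_number" by hand can be tedious, so here is a rough C sketch of how the iteration count could be found automatically by doubling it until a run lasts long enough. Everything in it is an assumption for illustration: the names `calibrate`, `time_run` and `sample_work` are hypothetical, the workload is a stand-in for the code under test, and the standard `clock()` call is used only for portability (it reports processor time on some systems and elapsed time on others).

```c
#include <assert.h>
#include <limits.h>
#include <time.h>

typedef void (*work_fn)(unsigned long iterations);

static volatile double sink;    /* keeps the compiler from discarding the work */

/* Stand-in workload: a dependent chain of FP multiply-adds. */
void sample_work(unsigned long iterations)
{
    double x = 1.0;
    for (unsigned long i = 0; i < iterations; ++i)
        x = x * 1.0000001 + 0.5;
    sink = x;
}

/* Time a single call of fn(iterations), in seconds, using clock(). */
double time_run(work_fn fn, unsigned long iterations)
{
    clock_t start = clock();
    fn(iterations);
    clock_t stop = clock();
    return (double)(stop - start) / CLOCKS_PER_SEC;
}

/* Double the iteration count until one run takes at least target_seconds. */
unsigned long calibrate(work_fn fn, double target_seconds)
{
    assert(fn != NULL && target_seconds > 0.0);

    unsigned long n = 1;
    while (time_run(fn, n) < target_seconds && n <= ULONG_MAX / 2)
        n *= 2;
    return n;
}
```

Calling `calibrate(sample_work, 0.5)` would return a count for which one run lasted at least half a second at calibration time; dividing the measured time by that count then gives a per-iteration figure.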

Two trailing comments. First, the next generation of Intel cores will behave differently, on a scale something like the differences between the PIII and PIV processors, so be careful not to lock yourself into one architecture. Second, as far as I remember, the FP instruction range, while still available on current core hardware, is being replaced by much faster SSE/2/3 instructions, so if your target hardware is late enough to support them, you will probably get a big performance gain if you can use the later instructions.

Regards,

hutch at movsd dot com
http://www.masm32.com


