Latency of a General purpose MOV instruction on Intel CPUs

Hi everybody,

I'd like Intel engineers to confirm that the latency of a general-purpose MOV instruction is 1 clock cycle on all Intel CPUs. For example, I've completed a set of tests on an Intel(R) Pentium(R) 4 CPU 1.60GHz and my numbers are as follows:

[ Intel C++ compiler - DEBUG ]
...
Overhead of Assignment: 1.091 clock cycles
...

[ Intel C++ compiler - RELEASE ]
...
Overhead of Assignment: 1.191 clock cycles
...

The C code with the assignment looks like:

unsigned __int64 uiClockCycles = __rdtsc();

The value returned by the RDTSC instruction is assigned to the uiClockCycles variable with two general-purpose MOV instructions, which means that 2 clock cycles will actually be spent.

Thanks in advance.

I think the two MOV instructions are used to load the high and low parts of the RDTSC value.

>>>>...and it means, that 2 clock cycles will be actually spent.
>>
>>...I think that two mov instructions are used to load high and low part...

I know this because the value returned by the RDTSC instruction is placed in the EDX and EAX registers, and two MOV instructions are needed to load it into a 64-bit variable. I simply wanted to confirm that a general-purpose MOV instruction always executes in 1 clock cycle on any Intel CPU.
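For readers following along, here is a minimal sketch of what those two moves accomplish in 32-bit code: the high half from EDX and the low half from EAX end up as one 64-bit value. The function name combine_tsc is hypothetical and only illustrates the data layout; it is not the code the compiler emits.

```c
#include <stdint.h>

/* Hypothetical helper: combine the two 32-bit halves that RDTSC leaves
   in EDX (high part) and EAX (low part) into one 64-bit counter value. */
uint64_t combine_tsc(uint32_t edx_high, uint32_t eax_low)
{
    return ((uint64_t)edx_high << 32) | (uint64_t)eax_low;
}
```

In 64-bit code a compiler would typically do this combination in registers with a shift and an OR, so the instruction count can differ from the 32-bit case.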

How large a loop counter was needed to measure the latency of the MOV instruction precisely? And how many such measurements did you average?

Here is a new update.

>>...I'd like to hear from Intel engineers that Latency of a General purpose MOV instruction on any Intel CPUs is 1 clock cycle...

Is that true?

I just completed another set of tests and couldn't get a 1 clock cycle latency for the MOV instruction on an Ivy Bridge system with an Intel Core i7-3840QM ( 4 cores / 8 logical CPUs / ark.intel.com/compare/70846 ).

Here are test results:

[ Intel C++ compiler ]
...
Test-Case 1.3 - Overhead of Assignment of a Value from RDTSC instruction
Min Overhead of Assignment: 0.372 clock cycles
Final RDTSC Overhead Value: 23.628 clock cycles
...

[ Microsoft C++ compiler ]
...
Test-Case 1.3 - Overhead of Assignment of a Value from RDTSC instruction
Min Overhead of Assignment: 0.381 clock cycles
Final RDTSC Overhead Value: 23.619 clock cycles
...

Note: '...Overhead of Assignment...' means the latency of the MOV instruction, and as you can see, on the Ivy Bridge system it is less than 1 clock cycle.

These values, 0.372 and 0.381 clock cycles, are very consistent ( the same from test to test! ) for both the Intel and Microsoft C++ compilers.
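The thread does not show the test harness, so the following is only a sketch of one common way a fractional per-assignment figure can arise: time many operations, take the minimum of several runs, subtract the fixed timer overhead, and divide by the operation count. The names min_ticks and cycles_per_op, and the idea of a separately measured timer overhead, are assumptions for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Smallest value in an array of tick deltas. The minimum is the run
   least disturbed by interrupts, cache misses, and similar noise. */
uint64_t min_ticks(const uint64_t *samples, size_t count)
{
    uint64_t best = UINT64_MAX;
    for (size_t i = 0; i < count; ++i) {
        if (samples[i] < best) {
            best = samples[i];
        }
    }
    return best;
}

/* Average cost per operation after removing the fixed timer overhead.
   Returns 0.0 for inputs that make no sense (no ops, or fewer ticks
   than the overhead itself). */
double cycles_per_op(uint64_t ticks, uint64_t timer_overhead, uint64_t ops)
{
    if (ops == 0 || ticks < timer_overhead) {
        return 0.0;
    }
    return (double)(ticks - timer_overhead) / (double)ops;
}
```

Because the result is an average over many operations that may execute in parallel, a value below 1 cycle does not by itself say anything about the latency of a single MOV.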

On the latest architecture, memory moves are executed in parallel by two ports, 2 and 3, but I do not know whether this can explain such a low latency.

Quote:

Sergey Kostrov wrote:
I'd like to hear from Intel engineers that Latency of a General purpose MOV instruction on any Intel CPUs is 1 clock cycle.

You can find this information for specific implementations in the optimization manual [1], in Appendix C.3, Latency and Throughput. IIRC the latency of MOV is 1 clock for all processors. It looks like you are really after the reciprocal throughput, though, since you issue two independent MOVs in your example. The reciprocal throughput is documented as 0.33 for Sandy Bridge/Ivy Bridge, for example (i.e. there are 3 ports available for GPR-to-GPR moves), but may be only 0.5 for older processors.

[1] : Intel® 64 and IA-32 Architectures Optimization Reference Manual, Order Number: 248966-026, April 2012
available here: http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-optimization-manual.html
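To make the latency versus reciprocal throughput distinction concrete, here is a deliberately simplified model. It assumes each port accepts one move per cycle and ignores front-end limits and move elimination; the function names are hypothetical.

```c
#include <stdint.h>

/* Cycles for n moves where each one depends on the result of the
   previous one: the chain is bound by latency. */
double dependent_chain_cycles(uint64_t n, double latency_cycles)
{
    return (double)n * latency_cycles;
}

/* Cycles for n independent moves spread over `ports` execution ports
   that each accept one move per cycle: bound by throughput.
   Rounds up, since a partly filled cycle still costs a whole cycle. */
uint64_t independent_cycles(uint64_t n, uint64_t ports)
{
    if (ports == 0) {
        return 0; /* guard for the sketch; no ports means nothing runs */
    }
    return (n + ports - 1) / ports;
}
```

Under this model, 300 independent moves on 3 ports take 100 cycles, or about 0.33 cycles per move, while 300 dependent moves at a latency of 1 take 300 cycles.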

@bronxzv

You were faster with your answer about the reciprocal throughput :) I wanted to write exactly the same answer :)

Btw, afaik there are only two ports that execute load/store instructions.

Quote:

iliyapolak wrote:
Btw. afaik there are only two ports which are executing load/store instructions.

A load from memory isn't involved in the example at hand; 0.33 is for register-to-register moves (also for 64-bit MMX and 128-bit XMM registers). The store to memory is not on the critical path in the example at hand (as is usual for stores).

Thanks for correcting my error.

>>...load from memory isn't involved in the example at hand, 0.33 is for register to register moves...

The question was about the latency ( for any Intel CPU; unfortunately the Intel® 64 and IA-32 Architectures Optimization Reference Manual doesn't list all microarchitectures ) and not about the throughput.

However, I see that my current test actually measured the throughput of a general-purpose MOV instruction on the Ivy Bridge system. Here is a verification for 32-bit and 64-bit code:

[ Intel C++ compiler - RELEASE - 32-bit ]
...
Test-Case 1.3 - Overhead of Assignment of a Value from RDTSC instruction
Min Overhead of Assignment: 0.372 clock cycles
Final RDTSC Overhead Value: 23.628 clock cycles
...

[ Intel C++ compiler - RELEASE - 64-bit ]
...
Test-Case 1.3 - Overhead of Assignment of a Value from RDTSC instruction
Min Overhead of Assignment: 0.369 clock cycles
Final RDTSC Overhead Value: 23.631 clock cycles
...

Note: '...Min Overhead of Assignment...' needs to be changed to '...Min Throughput of Assignment...'

>>...[1] : Intel® 64 and IA-32 Architectures Optimization Reference Manual, Order Number: 248966-026, April 2012...

I have that Manual and I saw the numbers for MOV instruction. Thanks.

Any comments from Intel engineers?

Quote:

Sergey Kostrov wrote:
>>...[1] : Intel® 64 and IA-32 Architectures Optimization Reference Manual, Order Number: 248966-026, April 2012...

I have that Manual and I saw the numbers for MOV instruction. Thanks.

Any comments from Intel engineers?

As you can see on page C-31 of the optimization manual (written by Intel engineers), the latency was 0.5 for the Pentium 4 with the double-pumped "Fireball" ALU (signature = 0F_2H), so the answer to your question is clearly no: it isn't 1 clock cycle for all Intel CPUs.

Guys, please pause for a moment and let's wait for a comment from Intel engineers. OK?

Sergey,

   I have a suite of 3-4K tests which tell me all the instruction latencies, more precisely than anything found on the internet. I get 1 clk on SB/IB for mov. I also monitor the number eliminated via move elimination, and it appears they can eliminate only 1 move per dispatched set of ops, I believe. More food for thought on this, but it's probably 1 clk.

Perfwise

Here are two more quotes I just found in Intel Manuals:

Intel(R) 64 and IA-32 Architectures Optimization Reference Manual
Order Number: 248966-026
April 2012

C.3.1 Latency and Throughput with Register Operands
...
Processor instruction timing data is implementation specific; it can vary between
model encodings within the same family encoding...
...

On Page 738

Latency:
0F_3H - 1
0F_2H - 0.5

Throughput:
0F_3H - 0.5
0F_2H - 0.5

Notes:
0F_3H - Intel Xeon Processor, Intel Xeon Processor MP, Intel Pentium 4, Pentium D processors
0F_2H - Intel Xeon Processor, Intel Xeon Processor MP, Intel Pentium 4 processors

Intel(R) 64 and IA-32 Architectures Software Developer’s Manual
Volume 3 (3A, 3B & 3C): System Programming Guide
Order Number: 325384-044US
August 2012

CHAPTER 35 MODEL-SPECIFIC REGISTERS (MSRS)
...
Table 35-1. CPUID Signature Values of DisplayFamily_DisplayModel
...
On Page 1151

>>>0F_2H - 0.5>>>

So on processors with this model encoding the latency is 0.5 cycle.

Quote:

iliyapolak wrote:

>>>0F_2H - 0.5>>>

So on this model encoding processor latency is 0,5 cycle.

0F_2H (family 15, model 2) is the P4 Northwood core [2] with its double-pumped ALU. AFAIK ALU latencies were the same in the original P4 Willamette [1], with CPUID signature = 0F_1H (family 15, model 1).

With the P4 Prescott [3] (0F_3H, i.e. family 15, model 3), the double-pumped "Fireball" ALU was replaced by a regular ALU running at core clock, so latencies increased.

[1] http://www.cpu-world.com/CPUs/Pentium_4/TYPE-Desktop%20Pentium%204%20Willamette.html
[2] http://www.cpu-world.com/CPUs/Pentium_4/TYPE-Desktop%20Pentium%204%20Northwood.html
[3] http://www.cpu-world.com/CPUs/Pentium_4/TYPE-Desktop%20Pentium%204%20Prescott.html
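As a side note on the signatures discussed above, the DisplayFamily_DisplayModel pair can be derived from the value that CPUID leaf 1 returns in EAX. The sketch below follows the rule described for the CPUID instruction in the Software Developer's Manual; the struct and function names are hypothetical.

```c
#include <stdint.h>

/* Hypothetical container for the decoded signature. */
typedef struct {
    uint32_t family; /* DisplayFamily */
    uint32_t model;  /* DisplayModel  */
} cpu_signature;

/* Decode DisplayFamily and DisplayModel from the EAX value returned
   by CPUID leaf 1. */
cpu_signature decode_signature(uint32_t eax)
{
    uint32_t base_model  = (eax >> 4)  & 0xF;
    uint32_t base_family = (eax >> 8)  & 0xF;
    uint32_t ext_model   = (eax >> 16) & 0xF;
    uint32_t ext_family  = (eax >> 20) & 0xFF;

    cpu_signature sig;
    /* The extended family is only added when the base family is 0xF. */
    sig.family = (base_family == 0xF) ? base_family + ext_family
                                      : base_family;
    /* The extended model only applies to base families 0x6 and 0xF. */
    sig.model = (base_family == 0x6 || base_family == 0xF)
                    ? (ext_model << 4) + base_model
                    : base_model;
    return sig;
}
```

With this rule, a value with base family 0xF and base model 2 decodes to family 15, model 2, which is the 0F_2H signature used in the manual's tables.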

 

Yes, it makes sense when the double-pumped ALU is taken into account.

Thanks for the interesting links.

Btw, it is interesting how the designers of the double-pumped ALU were able to double the clock of this unit. I think the main reason was the low transistor count needed to implement the ALU, and thus lower heat dissipation.

Feature Request:

Please consider adding latency and throughput numbers for all instructions to

Intel® 64 and IA-32 Architectures Software Developer’s Manual ( A )
Volume 2 ( 2A, 2B & 2C ):
Instruction Set Reference, A-Z

instead of

Intel® 64 and IA-32 Architectures Optimization Reference Manual ( B )

For example, this is how it could look ( on page 72 in A ):

...
AAA - ASCII Adjust After Addition
Opcode  Instruction  Op/En  64-bit Mode  Compat/Leg Mode  Description
37      AAA          NP     Invalid      Valid            ASCII adjust AL after addition.
Latency n1 Throughput n2

Where n1 and n2 are some numbers.

In that case, latency and throughput information for all instructions would be consolidated in one place, and there would be no need to search other Intel manuals. Information for different Intel microarchitectures ( if there are differences ) could also be added in the same way.

Thanks in advance.

Interesting proposition.

>>...Intel® 64 and IA-32 Architectures Optimization Reference Manual ( B )

The chapter on latency and throughput for all instructions in that manual looks very outdated and doesn't include information for some CPUs.
