uops? IA32/Intel64 vs. micro-ops?

Hi All:

I've been doing research on Sun processors such as the Niagaras and on IBM POWER. I'm now switching to Intel; my current machine has a Core 2 Quad processor.

Anyway, one of the most troubling questions I've been trying to answer is:

Since IA32/Intel64 CISC instructions are broken down into RISC-like simpler micro-ops (uops), what are those uops? Does anyone know?

I don't want to know the exact uops used by Intel. I just want to get an idea of how many uops common, medium, and more complex instructions are broken into.

Is there a document that can give me an idea of which instructions break into more than 4 uops (which in turn require the microcode sequencer in some microarchitectures)?

Any comments/answers are appreciated.




Hello, Arrazem.

Instructions that decode into more than 4 uops do not occur in any notable amount in the programs CPUs actually execute. Compilers _very_ rarely produce such instructions automatically; most of them can be characterized as CPU control instructions, often only allowed to execute by the OS in ring 0. On the last few generations of Intel CPUs, most instructions will in fact result in only 1 uop in the decode and retirement domains of the processor.

Sometimes that 1 uop is a "fused" one: the in-order decode and retirement logic improves its bandwidth by fusing some common pairs of uops, most notably for load+operation instructions like ADD EAX, [ESI]. That is still 1 uop for decode and retirement, although in the out-of-order execution back end it gets split into 2 uops (load, add). Sometimes even two consecutive instructions can be fused into a single uop, such as the very frequent pair of CMP followed by Jcc.

I should also add that the CPU's decoders, and the front-end logic in general, are almost never a performance bottleneck, as opposed to, e.g., data locality (as measured by cache hit rate), the ratio of computation to the amount of data it is applied to (limited by memory/cache subsystem bandwidth), and instruction-level parallelism (as measured by IPC, the average rate of instructions executed per cycle).

Best regards,


Thanks, Max, for the reply. I actually know what you mentioned, and it is true that instructions with >4 uops are negligible for most applications (say, 99% of apps).

However, for research purposes sometimes the focus is on that 1% weird application. That's probably the case with me!

I still cannot find any information about uops anywhere! I wish there were a map from each x86 instruction to the uops it gets decoded into.

If you have a document and/or any source that can help with this, I would greatly appreciate it if you would share it with me!

Thanks again, Max


You can try the instruction listings at http://www.agner.org/optimize/
Specifically http://www.agner.org/optimize/instruction_tables.pdf
You can probably find the details in some Intel documentation too, I just don't know where.
It is important to remember that the uops a particular instruction decodes into may differ between CPUs.

As a side note, this reply editor is very annoying. I have to edit HTML if I don't want every new line to be in a new paragraph...

Intel has not officially disclosed uops information - it changes from one CPU to another and would most likely be misused anyway. It would be interesting if you could shed some light on why it is important for you.

