how to turn off out-of-order execution in Intel processor

Hi All,
Our project is to optimize instruction scheduling in gcc, by detecting structural hazards. The algorithm employed requires no out-of-order executions by the processor.

Question: Is there a command or mechanism to turn off out-of-order execution on an Intel processor?
Target Architecture: 686 processor
Working on: Intel Pentium Dual Core processor

Thanking You,
Dhiraj.

I guess your project is trying to do something like what lockstep execution requires. In that case your optimization approach will be the opposite of normal performance tuning on out-of-order processors. Instead of minimizing long dependency chains and breaking a long one into smaller pieces, you would do the reverse: fuse smaller dependent chains into a long one, make long dependency chains even longer, and give memory loads an address or value dependency. You might even consider adding serializing instructions to further impede the out-of-order engine's primary objective.
I'm not aware of a magic knob that turns OOO off, but these are some of the ways your code can intentionally make the OOO engine inefficient, to the point that it can hardly perform any out-of-order operations.
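
To make the chain-fusing idea concrete, here is a toy sketch in Python (not x86, and not a model of any real core). It assumes every instruction takes one cycle and that the hardware renames registers perfectly, so only true read-after-write dependencies count. The function name `critical_path` and the register names are made up for illustration. The longest dependency chain is a lower bound on run time however wide the out-of-order engine is, so once it equals the instruction count there is nothing left to overlap.

```python
def critical_path(program):
    """Length of the longest read-after-write chain, in instructions.

    program: list of (dest_reg, source_regs), in program order.
    Assumes unit latency and ideal renaming (illustrative assumptions).
    """
    depth = {}      # register -> chain length ending at its latest writer
    longest = 0
    for dest, srcs in program:
        # An instruction can start only after its deepest input is ready.
        d = 1 + max((depth.get(s, 0) for s in srcs), default=0)
        depth[dest] = d
        longest = max(longest, d)
    return longest

# Two independent chains of three instructions each.
independent = [
    ("a1", ["a0"]), ("a2", ["a1"]), ("a3", ["a2"]),
    ("b1", ["b0"]), ("b2", ["b1"]), ("b3", ["b2"]),
]

# Same six instructions, but the second chain now also reads the end of
# the first one, which fuses them into a single chain.
fused = [
    ("a1", ["a0"]), ("a2", ["a1"]), ("a3", ["a2"]),
    ("b1", ["b0", "a3"]), ("b2", ["b1"]), ("b3", ["b2"]),
]

print(critical_path(independent), critical_path(fused))
```

In the first program the two chains can overlap; in the fused one every instruction waits for the one before it, which is the behaviour the original question is after.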

If you wish to optimize for Atom, analyze on Atom.

Thank you - Shih and Tim.

Actually, OoOE (out-of-order execution) could be turned off with INSERTS, i.e. inserted instructions that create dependencies: the CPU can't run several mutually dependent instructions simultaneously. Let's see an example:

 --------------------

mov $5, %eax;

mov $7, %rdx;

--------------------

The CPU can pick these two up at once, but we can change the code to:

------------------

mov $5, %eax;

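xor %rax, %rdx;   # %rdx now depends on %rax (operands must be the same size)

xor %rax, %rdx;   # undoes the first xor but keeps the dependency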

mov $7, %rdx;

------------------

This way the inserted instructions force the code into step-by-step execution.

Adding false dependencies using xor (or similar) instructions does not work in newer processors -- the hardware identifies the idiom in the register allocation phase and deletes the instruction from the instruction stream.

This is discussed in several sections of Agner Fog's excellent microarchitecture guide at

http://www.agner.org/optimize/microarchitecture.pdf

Section 8.14 "Breaking Dependency Chains" says that this code elimination is supported by Core 2 and Nehalem microarchitectures, while section 9.9 "Register reallocation and naming" discusses the details of the optimization in Sandy Bridge and Ivy Bridge microarchitectures.  Section 10.8 "Register reallocation and naming" discusses the details of this optimization in the Haswell microarchitecture.  Section 12.6 "Special Cases of Independence" says that a few of these idioms are recognized and eliminated in the Silvermont microarchitecture.

Section 13.8 says that this feature also exists in the VIA Nano processors.  Section 15.15 discusses this feature on the AMD Bulldozer, Piledriver, and Steamroller microarchitectures, while section 16.8 discusses the implementation of this feature on the AMD Bobcat and Jaguar cores.

John D. McCalpin, PhD
"Dr. Bandwidth"

John, such functionality seems absurd: it makes the CPU slower and, needless to say, risks CPU faults thanks to its over-smartness :) And who would prevent us from using other schemes of false dependency? In fact, it makes more sense to handle false dependencies in the compiler. Second, generating the OoOE schedule would also be better handed over to the compiler. In short, the CPU must have as little overhead as possible to run faster.

The dependency breaking mechanisms I was discussing will not apply to the specific example code you provided -- sorry that I did not look at it closely enough.

The cases that are recognized by the hardware include single instructions that are guaranteed to zero a register -- so the future use does not depend on the prior contents.  This means that the hardware does not need to wait until the prior contents are "ready", and can provide the next instruction *after* the zeroing idiom with a new zero-filled physical register.

An example would be attempting to use the output of the Read Time Stamp Counter (rdtsc) instruction as a false input to the address of a subsequent load instruction to ensure that the load was not issued until after the TSC was read:

rdtsc                                    # writes upper 32 bits of TSC into %edx and lower 32 bits of TSC into %eax

[store TSC value to memory]

xor %eax,%eax                     # set %eax to zero

mov (%ebx,%eax,4),%ecx     # load the value at the address in %ebx into %ecx, with an intended false dependency on %eax

In this case, the hardware will recognize that the "xor" instruction zeros the %eax register, so the subsequent move instruction will be assigned a different physical register (zero-filled) for its use of %eax than the physical register used by the rdtsc instruction for its (implicit) %eax register.  Since the physical registers are different, there is no dependency and the "mov" instruction does not need to wait until the rdtsc instruction has completed before it can execute.

In addition to increasing the effective out-of-order capability of the processor, this optimization also allows the processor to eliminate the "xor" instruction entirely -- reducing the number of instructions in the reorder buffer and reducing the pressure on the ALU execution units.
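
The renaming step described above can be sketched with a toy model. This is Python rather than hardware, and it is only an illustration under simple assumptions: every register write gets a fresh physical register, and a recognized zeroing idiom is treated as having no inputs. The names `rename` and `reads_result_of` and the `recognize_zero_idioms` switch are hypothetical, not anything Intel documents.

```python
from itertools import count

def rename(program, recognize_zero_idioms=True):
    """Rename architectural registers to physical ones, in program order.

    program: list of (text, source_regs, dest_regs, is_zero_idiom).
    Returns a list of (text, physical_sources, physical_dests).
    """
    fresh = count()   # endless supply of physical register numbers
    table = {}        # architectural register -> current physical register

    def current(reg):
        # A register read before any write gets an initial physical register.
        if reg not in table:
            table[reg] = next(fresh)
        return table[reg]

    renamed = []
    for text, srcs, dsts, is_zero_idiom in program:
        if is_zero_idiom and recognize_zero_idioms:
            phys_srcs = []            # result is known to be zero: no inputs
        else:
            phys_srcs = [current(r) for r in srcs]
        phys_dsts = []
        for reg in dsts:
            table[reg] = next(fresh)  # every write gets a new physical register
            phys_dsts.append(table[reg])
        renamed.append((text, phys_srcs, phys_dsts))
    return renamed

def reads_result_of(renamed, consumer, producer):
    """True if instruction `consumer` reads a register written by `producer`."""
    return bool(set(renamed[consumer][1]) & set(renamed[producer][2]))

program = [
    ("rdtsc",                  [],             ["eax", "edx"], False),
    ("xor %eax,%eax",          ["eax"],        ["eax"],        True),
    ("mov (%ebx,%eax,4),%ecx", ["ebx", "eax"], ["ecx"],        False),
]

with_idiom = rename(program, recognize_zero_idioms=True)
without_idiom = rename(program, recognize_zero_idioms=False)

# With idiom recognition, neither the xor nor the load reads rdtsc's output.
print(reads_result_of(with_idiom, 1, 0), reads_result_of(with_idiom, 2, 0))
# Without it, the xor waits on rdtsc and the load waits on the xor.
print(reads_result_of(without_idiom, 1, 0), reads_result_of(without_idiom, 2, 1))
```

With recognition switched on, the load's %eax comes from a register that has no link back to rdtsc, so the intended ordering is lost.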

The example provided above of issuing two xor instructions sequentially (which effectively cancel each other out) should work to enforce serialization in current hardware --- but don't be too surprised if a future hardware platform optimizes this away as well....

John D. McCalpin, PhD
"Dr. Bandwidth"

>>The example provided above of issuing two xor instructions sequentially (which effectively cancel each other out) should work to enforce serialization in current hardware --- but don't be too surprised if a future hardware platform optimizes this away as well....

John, 'tis the very road to nowhere: overhead work in the CPU will skyrocket, and there's absolutely no problem writing other approaches to stop OoOE mode. Do we need the CPU to crunch faster, or are we going to chase harmless hackers' tricks ad infinitum??? :)

@John

Sorry for a slightly off-topic question, but I would like to ask whether you know how the mapping of architectural registers to physical registers is performed or managed?

Thanks in advance.

@Evgeney:  It is not my processor design -- I am just reporting what I have read about the Intel processors.   The optimizations have the side effect of eliminating deliberate false dependencies, but of course they are included in the implementation to eliminate undesirable false dependencies.  These are relatively common in architectures (like x86) that have lots of instructions that use implicitly selected registers -- %eax, for example,  is used by a large number of instructions, so it is useful to break false dependencies involving that register name.

@ilyapolak: Mapping of architectural registers to physical registers is described in a fair amount of detail in Agner Fog's excellent microarchitecture document (http://www.agner.org/optimize/microarchitecture.pdf).  Some of that material is based on Intel's public disclosures and some is based on experimentation with microbenchmarks.  Agner Fog has a comprehensive library of microbenchmarks (at the same web site) that he has used to develop models of processor operation and to attempt to disambiguate resource limitations at various stages of the processor pipeline.

John D. McCalpin, PhD
"Dr. Bandwidth"

Thank you @John

I must admit that I did not read all his docs.

Evgeney K., the particular optimization of zeroing the register (and eliminating the dependency) is not so much an overhead as a benefit, both in performance and power consumption. Note that such instructions are caught in early stages of the CPU pipeline (see Agner Fog's document that John referenced), before they reach the execution units.

In order to implement OOO execution, the CPU does have to track dependencies, and there has to be a way to break false dependencies so that efficiency is improved. Instructions with a pre-defined result offer a natural way to do that.

 

>>Our project is to optimize instruction scheduling in gcc, by detecting structural hazards. The algorithm employed requires no out-of-order executions by the processor.

Why not write, or find, code that emulates the Intel64 processor (686 in your case)? Modify the emulation code so that it has the in-order characteristics you want, then tune your compiler's instruction scheduling against the emulated machine, knowing beforehand that the physical Intel processor may apply additional optimizations of its own when the code runs on real hardware.

Introducing serialization into the instruction stream, in addition to possibly being thwarted by the pipeline, also introduces other undesirable effects (possible loss of register(s), instruction cache issues, ??). In a simulated processor environment you can remove these effects from the virtual processor.
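
As a rough idea of what such a virtual processor could look like for scheduling experiments, here is a minimal in-order model in Python. It is a sketch only: the unit names, latencies and unit counts are invented, units are assumed not to be pipelined, and at most one instruction issues per cycle. It reports how many stall cycles come from busy execution units (structural hazards) as opposed to operands that are not ready.

```python
def simulate_in_order(program, unit_counts):
    """Issue instructions strictly in program order, at most one per cycle.

    program: list of (unit, latency, source_regs, dest_reg).
    unit_counts: unit name -> number of copies; units are not pipelined,
    so a copy stays busy for the full latency of the instruction it runs.
    Returns total cycles and stall cycles split by cause.
    """
    free_at = {u: [0] * n for u, n in unit_counts.items()}
    ready_at = {}      # register -> cycle its value becomes available
    next_issue = 0     # earliest cycle the next instruction may issue
    finish = 0
    data_stalls = structural_stalls = 0

    for unit, latency, srcs, dst in program:
        data_ready = max((ready_at.get(r, 0) for r in srcs), default=0)
        copies = free_at[unit]
        idx = min(range(len(copies)), key=lambda i: copies[i])
        unit_ready = copies[idx]

        issue = max(next_issue, data_ready, unit_ready)
        # Waiting for operands is charged first; any further wait for a
        # busy unit is counted as a structural stall.
        data_stalls += max(0, data_ready - next_issue)
        structural_stalls += max(0, unit_ready - max(next_issue, data_ready))

        copies[idx] = issue + latency
        ready_at[dst] = issue + latency
        finish = max(finish, issue + latency)
        next_issue = issue + 1

    return {"cycles": finish, "data_stalls": data_stalls,
            "structural_stalls": structural_stalls}

machine = {"mul": 1, "alu": 1}   # hypothetical: one multiplier, one ALU

back_to_back = [
    ("mul", 3, ["a"], "x"), ("mul", 3, ["b"], "y"),
    ("alu", 1, ["c"], "w"), ("alu", 1, ["d"], "v"),
]
interleaved = [
    ("mul", 3, ["a"], "x"), ("alu", 1, ["c"], "w"),
    ("alu", 1, ["d"], "v"), ("mul", 3, ["b"], "y"),
]

print(simulate_in_order(back_to_back, machine))
print(simulate_in_order(interleaved, machine))
```

Comparing the two orderings shows the kind of signal a scheduler could use: the same four instructions stall on the multiplier in one order and not in the other. A real emulator would need the actual port and latency tables for the target core.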

The following is a suggestion that may work half-way (untested).

Assume you test your instruction scheduling in gcc by producing test code for typical functions, and that this test code does NOT depend on O/S or other system calls. In place of the xor, try using

     out 0x80, eax ; or other 32-bit register just modified, 0x80 is the POST diagnostic port (81, 82, 83 are undefined)

Note: if this causes a GP fault or system trap when running under Windows or Linux, you could boot to Real Mode and run your test there.

Jim Dempsey

www.quickthreadprogramming.com
