-fno-gcse for computed gotos?

A recent patch to Python's interpreter (http://bugs.python.org/issue4753) introduced computed gotos to help branch prediction in the main opcode switching loop. MSVC doesn't support computed gotos, but I noticed that icc did, so I wanted to give it a shot and check out the potential performance gains; I ship an embedded Python, so I don't need to use stock Python Windows builds.

Speedups of ~20% are apparently possible on GCC, but only with a certain set of flags: -fno-crossjumping and -fno-gcse -- these keep the compiler from merging all the gotos back into a single branch, which would defeat the purpose of the patch.
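For reference, a minimal sketch of the labels-as-values dispatch the patch relies on (the opcodes and handlers here are invented and far simpler than the real ceval loop); the point is that every handler ends in its own indirect jump, so the branch predictor gets one entry per opcode instead of one shared one:

#include <stdio.h>

enum { OP_PUSH1, OP_ADD, OP_PRINT, OP_HALT };

int run(const unsigned char *ip)
{
    /* gcc/icc extension: a table of label addresses, indexed by opcode. */
    static void *dispatch[] = { &&do_push1, &&do_add, &&do_print, &&do_halt };
    int stack[16], *sp = stack;

    /* Each handler re-dispatches itself instead of jumping back to one shared switch. */
    #define DISPATCH() goto *dispatch[*ip++]

    DISPATCH();
do_push1: *sp++ = 1;               DISPATCH();
do_add:   sp--; sp[-1] += sp[0];   DISPATCH();
do_print: printf("%d\n", sp[-1]);  DISPATCH();
do_halt:  return 0;
}

int main(void)
{
    const unsigned char prog[] = { OP_PUSH1, OP_PUSH1, OP_ADD, OP_PRINT, OP_HALT };
    return run(prog);
}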

Using pybench, Python's benchmark suite, I tested stock Python 2.6 and a patched interpreter loop with the computed gotos, but saw absolutely no statistically significant performance boost. This was anticipated by some of the people posting on the bug's discussion.

Are there equivalent flags to no-crossjumping or no-gcse for icc? Or maybe some other way I could accomplish the same thing? Looking through the compiler docs I couldn't find anything obvious, but I thought it would be good to ask here as well.

Thanks!


Assuming the Python interpreter produces pseudo-code where dispatch is by way of a code number, you could generate a table of function addresses:

iDispatchCode = GetDispatchCode();
fnDispatch[iDispatchCode]();

Add sanity checks if you wish.

The fnDispatch[] array is filled with the function addresses (including one for invalid syntax)
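A minimal sketch of such a table (handler names are invented for illustration; a real interpreter would pass the VM state into each handler rather than use void functions):

#include <stdio.h>

typedef void (*dispatch_fn)(void);

static void do_load(void)    { printf("LOAD\n");  }
static void do_store(void)   { printf("STORE\n"); }
static void do_invalid(void) { printf("invalid opcode\n"); }

/* One slot per possible code number; unfilled slots stay NULL. */
static dispatch_fn fnDispatch[256] = { do_load, do_store };

static void dispatch_one(int iDispatchCode)
{
    /* Sanity checks, as suggested above. */
    if (iDispatchCode < 0 || iDispatchCode > 255 || fnDispatch[iDispatchCode] == NULL)
        do_invalid();
    else
        fnDispatch[iDispatchCode]();
}

int main(void)
{
    dispatch_one(0);    /* LOAD  */
    dispatch_one(1);    /* STORE */
    dispatch_one(200);  /* falls through to the invalid handler */
    return 0;
}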

Jim Dempsey

Quoting - jimdempseyatthecove

The fnDispatch[] array is filled with the function addresses (including one for invalid syntax)

There's a big performance difference between calling functions through an array of function pointers and jumping to an address with a computed goto--at least enough in the inner loop of the VM that it's already not based on function pointers. (Setting up function arguments and the stack for each opcode is too big a price to bear.) Additionally, the computed goto machinery fits easily "over" the existing switch-based machinery with some macros.

I'm not looking to maintain a new or modified version of Python--just to ship an optimized one :)

Best Reply

My very old Borland C++ V3.0 would optimize dense switch statements to use a jump table and give you exactly what you wanted. MS Visual C++ never did this optimization and used a series of cmp/jne tests (one for each case). I have not examined what the current Intel C++ optimizer does. I suspect from your original post that the optimized release version was using the non-optimal series of cmp/jne tests (one for each case).

When the VM is constructed as an object and the state information is contained in member variables within the object, the handlers can be a series of "void fn(void)" member functions. True, you do incur a push and pop, but this comes with the benefit of being all C++. To avoid the push/pop you can use a jump dispatch table; unfortunately, computed goto is not universally available. If you have a compiler that supports computed goto, then you would likely write the code using conditional compilation directives for portability reasons.
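A bare-bones sketch of that all-C++ arrangement, using an array of pointer-to-member functions (class, opcode, and handler names are invented; a real VM would have on the order of 100 handlers plus an invalid-opcode slot):

#include <stdio.h>

class VM {
public:
    explicit VM(const unsigned char *code) : ip_(code), running_(true) {}
    void run() { while (running_) (this->*dispatch_[*ip_++])(); }
private:
    typedef void (VM::*Handler)();          // the "void fn(void)" member functions

    void doNop()  { printf("nop\n"); }
    void doHalt() { printf("halt\n"); running_ = false; }

    const unsigned char *ip_;               // VM state lives in the object
    bool running_;
    static const Handler dispatch_[2];      // indexed by opcode
};

const VM::Handler VM::dispatch_[2] = { &VM::doNop, &VM::doHalt };

int main()
{
    const unsigned char prog[] = { 0, 0, 1 };   // nop, nop, halt
    VM vm(prog);
    vm.run();
    return 0;
}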

If you are going to write conditional compilation directives you might as well write conditional sections containing ASM statements that provide the dispatch capability you seek. (Note: your destination labels may have to be in ASM blocks preceding the code of interest, i.e. the asm block for the jump cannot see C++ tags/labels, but it can see ASM tags/labels.)

If the optimizer's switch/case is not giving you the jump table, then you could consider inserting some conditionally compiled asm code to get what you want.

Jim

Quoting - jimdempseyatthecove

My very old Borland C++ V3.0 would optimize dense switch statements to use a jump table and give you exactly what you wanted. MS Visual C++ never did this optimization and used a series of cmp/jne tests (one for each case). [...]

I'm pretty sure modern MSVC will create jump tables for switch statements when appropriate, but computed goto is still a win over that because of the CPU's branch prediction.

I was afraid that I'd get the "it's possible in assembly" answer ;)

Branch prediction works well with a very small branch set. For an interpreter such as Python you may have over 100 case statements. Each branch taken, whether predicted or not, causes some disruption in the instruction pipeline. Therefore the best-case use of branch prediction would potentially be

if(iDispatchCode == fee) goto DOfee;
if(iDispatchCode == fi) goto DOfi;
if(iDispatchCode == fo) goto DOfo;
if(iDispatchCode == fum) goto DOfum;
...

As opposed to switch/case, where you would end up with a bunch of branch-overs.
When most of the interpreter's time is spent in a handful of statements, branch prediction might yield some benefit. However, since the interpreter is most often NOT issuing the same instruction over and over, but rather only occasionally issuing a given instruction, the predictions will generally get set to the "false" path above, which means the branch over the goto DO... is taken. If the Intel ICC optimizer is smart it will note that the code branched over is a goto to a label and invert the branch (I haven't looked at the generated code to see if it does this).

The dispatch table route, though, will be able to maintain the MRU dispatched addresses in the L1 data cache, failing that the L2, then the L3 data cache. When the dispatch is NOT to the first couple of addresses, I would expect the L1 data cache to be as fast as the branch predictor at some early index into the table, and faster from there on.

This should be a relatively easy test for you to set up. If you see no improvement in performance then yank the code out. Writing these forum messages took longer than it will take to run the test.

Jim

The asm route:

functionReturningDispatchCode(); // as DWORD -- the return value is left in EAX
__asm {
jmp dword ptr Dispatch[eax*4] ; indexed jump through a table of 32-bit handler addresses
}

or if the Dispatch table is held as a pointer

functionReturningDispatchCode(); // as DWORD -- the return value is left in EAX
__asm {
mov edx, dword ptr Dispatch ; load the table's base address
jmp dword ptr [edx+eax*4]   ; jump through entry eax
}

Jim

Quoting - jimdempseyatthecove

[...]

The asm route:

functionReturningDispatchCode(); // as DWORD
__asm {
jmp dword ptr Dispatch[eax*4]
}

The actual dispatch code is just

goto opcode_targets[*next_instr++];

So I think I'm going to try your __asm approach; I'll let you know how it goes.

Quoting - kevin_watters

The actual dispatch code is just

goto opcode_targets[*next_instr++];

So I think I'm going to try your __asm approach; I'll let you know how it goes.

Actually I want to make clear that the problem is the optimization present in GCC that combines all of the gotos into one goto, defeating branch prediction. Would __asm blocks prevent the icc optimizer from doing this? (I'll try to find out)

The optimizer should not touch your asm blocks.

In typical interpreter code, the preponderance of opcode bytes are not == to the prior byte.
If you were to use an ASM branch-list method, you would organize the list in MRU order (most frequently used op codes first):

__asm {
mov eax,OpCode
cmp eax,OpX
je DoOpX
cmp eax,OpY
je DoOpY
; ... 100 more times
}

In the above, the processor's branch prediction logic will help a lot in reducing the runtime.

*** However, except for the first 1, 2 or 3 op codes, the above technique will be slower than a branch table. And it diminishes the effectiveness of the instruction cache (except for the instructions in the branch list).

The GCC computed goto should be as effective as the _asm implementation.
However, the GCC computed goto is not necessarily available in all C++ implementations.
Therefore you will need some conditional compilation code if you want portability,
where you define or do not define _USE_computed_goto
and/or _USE_ASM_ (some C++ compilers do not permit inline asm statements).
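A rough, self-contained illustration of that conditional-compilation idea (macro and opcode names here are illustrative, not the actual names used by the CPython patch; the computed-goto path requires gcc or icc):

#include <stdio.h>

#ifdef _USE_computed_goto
  #define DISPATCH() goto *targets[*ip++]   /* gcc/icc labels-as-values */
#else
  #define DISPATCH() break                  /* portable fallback: re-enter the switch */
#endif

enum { OP_HELLO, OP_HALT };

int run(const unsigned char *ip)
{
#ifdef _USE_computed_goto
    static void *targets[] = { &&op_hello, &&op_halt };
    DISPATCH();
#endif
    for (;;) {
        switch (*ip++) {
        case OP_HELLO: op_hello: printf("hello\n"); DISPATCH();
        case OP_HALT:  op_halt:  return 0;
        }
    }
}

int main(void)
{
    const unsigned char prog[] = { OP_HELLO, OP_HALT };
    return run(prog);
}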

Jim Dempsey

Quoting - jimdempseyatthecove

The GCC computed goto should be as effective as the _asm implementation. However, the GCC computed goto is not necessarily available in all C++ implementations. [...]

I had benchmarked the wrong code, and using computed gotos IS a 15% win without any extra options. PGO looks like another 15% (at least in pybench) so I'm seeing a very worthwhile speedup. Thanks for the input on this!

If you really want a fast interpreter: since your interpreter compiles to an intermediate code (e.g. strips comments, converts text to dispatch codes and symbol names to symbol numbers, etc.), why not simply store the address of the dispatch target as the dispatch code?

In my old PDP-11 programming days, R5 used to be the interpreter PC (the instruction pointer for the language being interpreted). In this design there was no dispatch routine at all; it was eliminated. The "compiled" interpreter code contained the address of the routine, the argument push list, etc. The end of each routine contained JMP @R5+, which meant jump to the location whose address is contained in the location pointed to by R5, incrementing R5 in the process. I.e. there was no call/return, nor, in your Python case, a jump back to the dispatch code.

The compiled interpreter would be larger but it may run faster.
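In C/C++ terms that is direct-threaded code: the "compiled" stream stores the handler addresses themselves, so each handler ends by jumping straight to the next one and there is no central dispatch. A rough sketch using the gcc/icc labels-as-values extension (names invented for illustration):

#include <stdio.h>

/* Direct threading: the code stream holds handler addresses, not opcode numbers.
   NEXT() is the equivalent of the PDP-11's JMP @R5+: fetch, advance, jump. */
#define NEXT() goto **tp++

int run_threaded(void)
{
    static void *program[] = { &&op_hello, &&op_hello, &&op_halt };
    void **tp = program;            /* plays the role of R5, the threaded-code pointer */

    NEXT();
op_hello: printf("hello\n"); NEXT();
op_halt:  return 0;
}

int main(void) { return run_threaded(); }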

Jim
