Optimizing SDE

I came across SDE while trying to resolve a recent issue: we have third-party executables that were compiled
specifically for the Nehalem processor, but we still need to be able to run them on
(slightly) older processors (Xeon 5420). The specific instruction causing
the problem is 'popcnt'.

Using SDE works, but we've noticed a performance
hit. While it is true that a slightly slower executable is infinitely better
than one which core dumps (thank you!), I thought I would take a few minutes to
ask if anyone might have any better suggestions on how to address the issue
(including possibly using a kernel module to emulate the missing
instructions).

Target operating system: RHEL/CentOS 5.5 64-bit (although the executables
are compiled with -m32)
Third-party executables: built with gcc 4.2.2 using
-march=native, -mtune=native, and -m32

Thanks in advance for any guidance.

Additional note: we can influence the third party, as long as the change is minimal and doesn't cost them (much) in performance.
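
If the third party is open to a small source change, one low-cost option might be a runtime check that uses 'popcnt' only when the CPU reports it and falls back to a portable bit count otherwise. The following is only a sketch under assumptions: the function names (popcount32, popcount32_soft, and so on) are made up for illustration and are not from the third-party code, and <cpuid.h> ships with newer GCC releases and may not be available with gcc 4.2.2.

```c
#include <stdint.h>

/* Portable bit-count fallback; needs no special instructions. */
static unsigned popcount32_soft(uint32_t x)
{
    x = x - ((x >> 1) & 0x55555555u);                 /* 2-bit sums  */
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u); /* 4-bit sums  */
    x = (x + (x >> 4)) & 0x0F0F0F0Fu;                 /* 8-bit sums  */
    return (x * 0x01010101u) >> 24;                   /* add 4 bytes */
}

#if defined(__i386__) || defined(__x86_64__)
#include <cpuid.h> /* GCC/Clang wrapper around the CPUID instruction */

/* Returns 1 if CPUID leaf 1 reports POPCNT support (ECX bit 23). */
static int cpu_has_popcnt(void)
{
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;
    return (ecx >> 23) & 1;
}

/* Issues the hardware instruction; only call when cpu_has_popcnt() is 1. */
static unsigned popcount32_hw(uint32_t x)
{
    uint32_t r;
    __asm__("popcnt %1, %0" : "=r"(r) : "r"(x) : "cc");
    return r;
}
#else
/* Non-x86 build: never take the hardware path. */
static int cpu_has_popcnt(void) { return 0; }
static unsigned popcount32_hw(uint32_t x) { return popcount32_soft(x); }
#endif

/* Hypothetical entry point: probes the CPU once, then dispatches. */
unsigned popcount32(uint32_t x)
{
    static int has_hw = -1; /* -1 means "not probed yet" */
    if (has_hw < 0)
        has_hw = cpu_has_popcnt();
    return has_hw ? popcount32_hw(x) : popcount32_soft(x);
}
```

The cached flag is not synchronized, so two threads may both run the probe on first use; they would store the same value, which should be harmless here. The per-call branch does cost something in a hot loop, so whether it counts as "minimal" would need measuring.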


Hi, that's a good question. I'm not aware of a really good
answer, though.

The performance hit from SDE is primarily because it is based on
an x86-to-x86 JIT called Pin. Pin scans all instructions before
they execute and writes new code, which it inserts into a
software code cache. When Pin encounters an instruction that
requires emulation, SDE requests that Pin insert a branch to an
emulation routine and skip the original instruction. All that
JIT'ing takes time, and the code we generate is generally not as
efficient as the original code. Hence the slowdown that you are
observing.
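
To make the scan-and-rewrite idea concrete, here is a toy sketch. It is not Pin's or SDE's API: the instruction set, the toy_* names, and the "code cache" array are all invented for illustration. A translation pass copies each instruction into the cache and swaps the unsupported one for a call to an emulation routine, so the "host" never sees the instruction it cannot execute.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy instruction set, invented for this sketch. */
enum toy_op { OP_LOAD_IMM, OP_ADD_IMM, OP_POPCNT, OP_CALL_EMUL_POPCNT, OP_HALT };

struct toy_insn { enum toy_op op; uint32_t imm; };

/* Software emulation routine for the unsupported instruction. */
static uint32_t toy_emul_popcnt(uint32_t x)
{
    uint32_t n = 0;
    while (x) { x &= x - 1; n++; } /* clear the lowest set bit each pass */
    return n;
}

/* Toy "host CPU": runs until OP_HALT and returns 0, or returns -1 when it
 * hits OP_POPCNT (standing in for an illegal-instruction fault). */
static int toy_run(const struct toy_insn *code, uint32_t *acc_out)
{
    uint32_t acc = 0;
    for (size_t pc = 0; ; pc++) {
        switch (code[pc].op) {
        case OP_LOAD_IMM:         acc = code[pc].imm;         break;
        case OP_ADD_IMM:          acc += code[pc].imm;        break;
        case OP_CALL_EMUL_POPCNT: acc = toy_emul_popcnt(acc); break;
        case OP_HALT:             *acc_out = acc;             return 0;
        default:                  return -1; /* unsupported instruction */
        }
    }
}

/* Translation pass: copy every instruction into the cache, replacing the
 * unsupported one. Returns the number of instructions written, or 0 if
 * the cache is too small. Every instruction is visited, supported or not. */
static size_t toy_translate(const struct toy_insn *src,
                            struct toy_insn *cache, size_t cap)
{
    for (size_t n = 0; ; n++) {
        if (n == cap)
            return 0;
        cache[n] = src[n];
        if (src[n].op == OP_POPCNT)
            cache[n].op = OP_CALL_EMUL_POPCNT;
        if (src[n].op == OP_HALT)
            return n + 1;
    }
}
```

Running a program directly through toy_run fails at OP_POPCNT, while running the translated copy completes. The translation loop touches every instruction, which mirrors why the JIT cost is paid even for code that needs no emulation.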

SDE replaced an internal tool that would trap on unknown
instructions and branch to an emulation routine. Because SDE
avoids the trap overhead, we found that it was hundreds of times
faster for some applications that used instructions requiring
emulation in hot loops. The trap-and-emulate approach pays a
big price each time emulation is needed. SDE pays that price
approximately once, when JIT'ing, but we must JIT all code. A
trap-and-patch solution would avoid the trap overhead on
subsequent executions, and it sometimes works. However, not all
instructions are patchable (some are too short to hold a branch),
and some clever binaries branch into the middle of instructions
or do other interesting things that make a simple trap-and-patch
approach less than fully robust.

Modifying binaries typically has legal implications and should
not be discussed here.

Thanks. Looks like the slow startup is due to the JIT action. Luckily, the applications run for extended periods of time, so a slow start should be acceptable. Looks like we may be able to get access to the source so that we can re-compile for our environment.
