Avoid short functions on Atom

One recurring theme we have seen in several software stacks running on Atom is that the architecture can take a significant performance hit from “short” functions.  By a “short” function I mean one that has very few instructions (roughly 10) between the call and the matching return.  The Atom architects came back with a nice explanation of this phenomenon.  The lesson has been to aggressively avoid “short” functions on the Atom architecture whenever possible.  The best ways to avoid them are listed below:

1)      The latest Intel compilers will take steps to avoid this situation

2)      Avoid position-independent code (-fpic/-fPIC) for 32-bit binaries. The -fpic compiler option is often unnecessary for shared objects, especially if only one process loads the shared object.

3)      Explicitly inline known “short” functions with inline/_inline

4)      If the function is written in assembly and cannot be inlined, no-ops can be inserted to separate the call from the ret
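
A minimal sketch of option 3, assuming a GNU-compatible compiler (GCC, Clang, or the Intel compiler in GNU mode). The function names are hypothetical and chosen only for illustration. The always_inline attribute is one way to force inlining where a plain inline hint might be ignored.

```c
#include <stdint.h>

/* Hypothetical helper names, for illustration only. */

/* Out-of-line version: if the compiler does not inline it, every use costs
   a call, a few instructions, and a ret that reads the return address
   almost immediately after the call pushed it. */
uint32_t clamp_u8_call(int32_t v)
{
    if (v < 0)   return 0;
    if (v > 255) return 255;
    return (uint32_t)v;
}

/* Inlined version: the body is expanded at each call site, so no call/ret
   pair is executed. always_inline is a GNU extension (assumption: the
   compiler supports GNU attributes). */
static inline __attribute__((always_inline)) uint32_t clamp_u8_inline(int32_t v)
{
    if (v < 0)   return 0;
    if (v > 255) return 255;
    return (uint32_t)v;
}

/* Hot loop that uses the inlined helper once per element. */
uint32_t sum_clamped(const int32_t *px, uint32_t n)
{
    uint32_t sum = 0;
    for (uint32_t i = 0; i < n; ++i)
        sum += clamp_u8_inline(px[i]);
    return sum;
}
```

Whether a given function was actually inlined can be checked by looking for a call to it in the disassembly of the hot loop.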

For those who are more curious about the reasons for the performance impact, I have put an explanation below with an example.

When executing a near function call, the call instruction pushes the instruction pointer onto the stack so that it can be used later as the return-instruction pointer.  Atom encounters a significant architectural bottleneck when the return address pushed onto the stack by a call instruction is read back off the stack within the next 4-5 cycles.  A read in that window results in a reissue of the load, which costs 9-10 cycles.  The most common case is a return that executes soon after the call instruction.  Functions with short dynamic paths of execution will hit this scenario often.  Another common case comes from position-independent code (the -fpic/-fPIC options), where a call/ret pair is executed in order to grab the instruction pointer.  The function which grabs the instruction pointer will be named __i686.get_pc_thunk.bx or something similar with the words "PC", "thunk" or "GOT" (Global Offset Table) in it.
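
To make the position-independent case concrete, here is a small sketch. The C names are hypothetical, and the assembly in the comment is an abbreviated approximation of what older 32-bit GCC builds typically emit with -fPIC; exact output varies by compiler and version.

```c
/* Hypothetical exported global in a shared object. */
int event_count = 0;

/* Touching a global from position-independent 32-bit code requires the
   address of the Global Offset Table, which is computed from the current
   instruction pointer. */
int record_event(void)
{
    return ++event_count;
}

/*
 * Approximate 32-bit x86 code for record_event() built with -fPIC
 * (AT&T syntax, abbreviated; an assumption, not captured compiler output):
 *
 *   record_event:
 *       call  __i686.get_pc_thunk.bx         # pushes the return address
 *       addl  $_GLOBAL_OFFSET_TABLE_, %ebx   # ebx now points at the GOT
 *       movl  event_count@GOT(%ebx), %eax    # address of event_count
 *       ...
 *
 *   __i686.get_pc_thunk.bx:
 *       movl  (%esp), %ebx   # load of the address the call just pushed
 *       ret                  # returns right after the call
 *
 * The thunk is only two instructions long, so the load from (%esp) lands
 * inside the 4-5 cycle window described above.
 */
```

Building the same source without -fPIC removes the thunk call from functions like this one.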

Below I show a hot instruction path in an application taking a significant performance hit from “short” functions that were inserted because position-independent code was enabled.  The graph shows the instructions in the order they were retired on the Atom core, with the percentage of clockticks attributed to each instruction on the y-axis.  Both “spikes” below mark a location where the Atom core is experiencing a bottleneck due to the reissue of the load (from [esp] to ebx) that is attempting to read the newly pushed instruction pointer from the stack.
