When executing a near function call, the call instruction pushes the instruction pointer onto the stack so that it can be used later as the return-instruction pointer. Atom (the architecture code-named Bonnell, Silverthorne and Saltwell) encounters a significant bottleneck when the return address pushed onto the stack by a call instruction is read back off the stack within the next 4-5 cycles. The read is usually due to a return, but it also occurs in zero-length calls. A read in that short window causes the load to be reissued, which costs 9-10 cycles. The most common case is a return that executes soon after the call instruction, so functions with short dynamic paths of execution hit this scenario often. Another common case comes from position independent code (the -fpic/-fPIC options), where a call/ret pair is executed in order to grab the instruction pointer. The function that grabs the instruction pointer is named __i686.get_pc_thunk.bx or something similar containing the words "PC", "thunk" or "GOT" (Global Offset Table).
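As a rough illustration of a function with a short dynamic path, the C sketch below keeps a one-line accessor out of line so that every loop iteration performs a real call followed almost immediately by a ret. The names counter_get and sum_counters are hypothetical, and the noinline attribute is a GCC/Clang extension used here only to force the call; whether the return lands inside the 4-5 cycle window depends on the code the compiler actually generates.

```c
/* Hypothetical example: a tiny accessor that is deliberately kept out of line. */
struct counter {
    int value;
};

/* noinline is a GCC/Clang attribute; it is used here only to force a real call. */
__attribute__((noinline))
int counter_get(const struct counter *c)
{
    /* One load and then ret: the ret reads the return address
       very soon after the call instruction pushed it. */
    return c->value;
}

int sum_counters(const struct counter *items, int n)
{
    int total = 0;
    for (int i = 0; i < n; i++) {
        /* Each iteration is a call that returns almost immediately. */
        total += counter_get(&items[i]);
    }
    return total;
}
```

A loop like this is the kind of hot path where the reissue cost would be expected to show up on the call to the accessor.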
The Atom architecture (code-named Bonnell, Silverthorne and Saltwell) encounters this issue. The cost is around 9-10 cycles.
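To get a feel for what a 9-10 cycle cost means in aggregate, the small C sketch below multiplies a count of affected calls by the two ends of that range. The helper name reissue_penalty and the struct are hypothetical, and the call count has to come from your own profile; this is back-of-the-envelope arithmetic, not a model of the pipeline.

```c
#include <stdint.h>

/* Penalty range taken from the text: roughly 9 to 10 cycles per reissued load. */
enum { REISSUE_MIN_CYCLES = 9, REISSUE_MAX_CYCLES = 10 };

struct cycle_range {
    uint64_t low;   /* affected calls * 9  */
    uint64_t high;  /* affected calls * 10 */
};

/* Hypothetical helper: estimated cycles lost for a given number of
   calls whose return address is read back within the short window. */
struct cycle_range reissue_penalty(uint64_t affected_calls)
{
    struct cycle_range r;
    r.low = affected_calls * REISSUE_MIN_CYCLES;
    r.high = affected_calls * REISSUE_MAX_CYCLES;
    return r;
}
```

For example, one million affected calls would correspond to roughly 9 to 10 million cycles spent on reissues.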
Below I show a hot instruction path in an application that takes a significant performance hit from “short” functions inserted because position independent code was enabled. The graph shows the instructions in the order they were retired on the Atom core, with the percentage of clockticks attributed to each instruction on the y-axis. Both “spikes” below mark a location where the Atom core hits a bottleneck due to the reissue of the load (from [esp] to ebx) that attempts to read the newly pushed instruction pointer from the stack.
1) Intel compilers targeting the Atom architecture will avoid this issue
2) Understand the use of position independent code (-fpic/-fPIC) for 32-bit binaries. The visibility of certain functions can be changed to ensure they are not subject to symbol preemption (see 1st related source)
3) Explicitly inline known “short” functions with inline/_inline
4) If the function is written in assembly and cannot be inlined, then no-ops can be inserted to separate the call and ret
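Fixes 2 and 3 can be sketched in C as follows. This is a minimal sketch assuming GCC or a compatible compiler: the visibility attribute is a GCC/Clang extension, the function names (scale_internal, clamp_byte, process_sample) are hypothetical, and inline remains a hint, so it is worth checking the generated assembly to confirm that the call and the PC thunk are actually gone.

```c
/* Fix 2 (sketch): hidden visibility tells the compiler this function
   cannot be preempted by another module, so in -fpic/-fPIC builds calls
   to it from the same shared object may be made directly. */
__attribute__((visibility("hidden")))
int scale_internal(int x)
{
    return x * 3;
}

/* Fix 3 (sketch): a known "short" function marked inline so that no
   call/ret pair is emitted when the compiler honors the hint. */
static inline int clamp_byte(int v)
{
    if (v < 0) {
        return 0;
    }
    if (v > 255) {
        return 255;
    }
    return v;
}

/* Exported entry point that uses both helpers. */
int process_sample(int x)
{
    return clamp_byte(scale_internal(x));
}
```

Hidden visibility only helps for functions that are not part of the library's public interface; anything other modules must call still needs default visibility.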