TITLE: Zero Length Calls
DESCRIPTION: A “zero length call” exploits the fact that a call instruction pushes the address of the next instruction onto the stack: the code calls the instruction immediately following the call and then pops that address off the stack into a register. This is accomplished without any return to match the call. This code construct is also commonly referred to as a zero displacement call.
RELEVANCE: Negatively impacts Atom cores (code-named Bonnell, Silverthorne and Saltwell) on all OSs, costing on the order of 20 cycles. The construct is more common on 32-bit Linux platforms, where shared objects are built as Position Independent Code. It does not impact the Core architecture (code-named Nehalem, Sandy Bridge or Ivy Bridge). On impacted architectures, a “zero length call” generates two performance issues:
1) The load from the “pop” instruction will be reissued, incurring a ~10 cycle performance hit, as described in the blog “Avoid Short Functions on Atom”.
2) Calls and returns will not match, which fools the branch prediction algorithm, so the next return instruction will likely mispredict.
call NextIP //Calling the next instruction, which in this case is at displacement 0 from the end of the call!
NextIP:
pop ebx //AHA! I now have the instruction pointer in ebx and can use it for the forces of evil…
//or just to produce position independent code. :)
//Notice there is no matching return here
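The second issue, calls and returns that no longer pair up, can be illustrated with a toy model. The sketch below assumes a simple return-address predictor that pushes on every call and pops on every return; the function name, the `depth` parameter and the addresses are hypothetical, and real Atom hardware may behave differently in detail.

```python
def count_return_mispredicts(trace, depth=16):
    """Count returns whose predicted target differs from the real one.

    trace is a list of (op, addr) pairs; op is "call", "pop" or "ret".
    For "call", addr is the return address the call pushes; it is ignored otherwise.
    """
    predictor = []  # hypothetical return-address predictor stack (toy model)
    stack = []      # return addresses actually on the program stack
    mispredicts = 0
    for op, addr in trace:
        if op == "call":
            stack.append(addr)
            predictor.append(addr)
            if len(predictor) > depth:
                predictor.pop(0)  # oldest entry falls off a full predictor
        elif op == "pop":
            stack.pop()  # software removes the return address; the predictor is unaware
        elif op == "ret":
            actual = stack.pop()
            predicted = predictor.pop() if predictor else None
            if predicted != actual:
                mispredicts += 1
    return mispredicts


# A function whose return address is 0x100 uses the zero length call idiom, then returns.
with_idiom = [("call", 0x100), ("call", 0x205), ("pop", None), ("ret", None)]
# The same function without the idiom.
without_idiom = [("call", 0x100), ("ret", None)]

print(count_return_mispredicts(with_idiom))     # 1
print(count_return_mispredicts(without_idiom))  # 0
```

In this model the stale entry left behind by the unmatched call is what the predictor offers for the next return, which is why that return is counted as a mispredict.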
The byte sequence to search for to identify a “zero length call” is E8 00 00 00 00 (the call opcode E8 followed by a 32-bit displacement of zero).
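A search for that byte sequence might look like the minimal sketch below. The function name and the sample bytes are hypothetical, and a raw byte search can report false positives (the bytes may be data or part of a longer instruction), so any hit should be confirmed with a disassembler.

```python
# Byte pattern of a call with a zero 32-bit displacement.
ZERO_LENGTH_CALL = bytes([0xE8, 0x00, 0x00, 0x00, 0x00])


def find_zero_length_calls(code):
    """Return every offset in `code` (a bytes object) where E8 00 00 00 00 starts."""
    offsets = []
    pos = code.find(ZERO_LENGTH_CALL)
    while pos != -1:
        offsets.append(pos)
        pos = code.find(ZERO_LENGTH_CALL, pos + 1)
    return offsets


# Hypothetical sample: push ebp; call +0; pop ebx; ret
sample = bytes([0x55, 0xE8, 0x00, 0x00, 0x00, 0x00, 0x5B, 0xC3])
print(find_zero_length_calls(sample))  # [1]
```

To scan a real binary, the same function could be applied to the bytes of its code section, for example after reading the file with `open(path, "rb").read()`.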
The diagram below shows what we call a “stream of instructions”: instructions in the order they were most commonly retired on the core (x-axis), graphed against the total clocks tagged to each instruction (in red, 1st y-axis) and branch mispredicts (in blue, 2nd y-axis). The “zero length call” that causes the reissued load shows up as “spike1” below; in the asm it is just a call to the next instruction. The “zero length call” also causes the next return to mispredict, which produces the large count of branch mispredicts in the stream. Our toolset estimates that branch mispredicts caused by the “zero length call” cost 21% of the total for this “stream of instructions”, while the “short function call” is estimated at 16%.
Zero length calls usually come from position independent code in 32-bit shared objects. The related resources below have good descriptions of workarounds.