Avoid short functions on Atom

One recurring theme we have seen in several software stacks running on Atom is that the architecture can take a significant performance hit from “short” functions.  By a “short” function I mean one that has very few instructions (roughly 10) separating the call and the matching return.  The Atom architects came back with a nice explanation of this phenomenon.  The lesson has been to aggressively avoid “short” functions on the Atom architecture whenever possible.  The best ways to avoid them are listed below:

1)      The latest Intel compilers will take steps to avoid this situation

2)      Avoid position independent code (-fpic/-fPIC) for 32-bit binaries. The -fpic compiler option is often unnecessary for shared objects, especially if only one process loads the shared object.

3)      Explicitly inline known “short” functions with inline/_inline

4)      If the function is written in assembly and it cannot be inlined then no-ops can be inserted to separate out the call and ret
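As a rough illustration of option 3, here is a minimal sketch in C. The names clamp_u8 and sum_clamped are hypothetical and chosen only for this example. Note that `inline` is a hint, and whether the call is actually removed depends on the compiler and optimization level; GCC additionally offers `__attribute__((always_inline))` if the hint alone is not honored.

```c
#include <stdint.h>

/* Hypothetical "short" helper: if emitted out of line, only a handful of
   instructions would separate the call from the matching ret. */
static inline uint32_t clamp_u8(uint32_t v)
{
    return v > 255u ? 255u : v;
}

/* Hot loop: once clamp_u8 is inlined, no call/ret pair is executed
   per element, so the pushed return address is never read back. */
uint32_t sum_clamped(const uint32_t *data, int n)
{
    uint32_t total = 0;
    for (int i = 0; i < n; i++)
        total += clamp_u8(data[i]);
    return total;
}
```

Checking the generated assembly (for example with the compiler's -S option) is the only reliable way to confirm that the call really disappeared.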

For those who are more curious about the reasons for the performance impact, I have put an explanation below with an example.

When executing a near function call, the call instruction pushes the instruction pointer onto the stack so that it can be used later as the return-instruction pointer.  Atom encounters a significant architectural bottleneck when the return address pushed onto the stack by a call instruction is read back off the stack within the next 4-5 cycles.  A read in that window results in a reissue of the load, which costs 9-10 cycles.  The most common case is a return that executes soon after the call instruction.  Functions with short dynamic paths of execution will hit this scenario often.  Another common case comes from position independent code (the -fpic/-fPIC options), where a call/ret pair is executed just to obtain the instruction pointer.  The function that obtains the instruction pointer will be named __i686.get_pc_thunk.bx or something similar containing "PC", "thunk" or "GOT" (Global Offset Table).
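To make the position independent code case concrete, here is a small sketch in C. The names hit_count and record_hit are hypothetical. The comments describe what GCC typically emits for 32-bit x86 when building with -fpic; the exact instructions and the thunk's name vary by compiler version, so treat this as an assumption to verify against your own assembly listing.

```c
/* Hypothetical global that lives in the shared object. */
int hit_count = 0;

/* Built as 32-bit PIC (for example: gcc -m32 -fpic -O2 -S), a function
   like this typically starts by calling a get_pc_thunk helper, then uses
   the returned instruction pointer to locate hit_count through the GOT.
   The helper body is essentially:
       movl (%esp), %ebx
       ret
   so the return address pushed by the call is loaded almost immediately,
   which is the pattern described above. */
int record_hit(void)
{
    hit_count += 1;
    return hit_count;
}
```

Building the same file without -fpic and comparing the two listings shows whether the thunk call is present in your toolchain's output.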

Below I show a hot instruction path in an application taking a significant performance hit from “short” functions that were inserted because position independent code was enabled.  The graph shows the instructions in the order they were retired on the Atom core, with the percentage of clockticks attributed to each instruction on the y-axis.  Both “spikes” mark a location where the Atom core hits a bottleneck due to the reissue of the load (from [esp] to ebx) that is attempting to read the newly pushed instruction pointer from the stack.

Comments

Very interesting, Michael, especially the thought about short-branch quick returns. How many times do we write functions like this?

void foo() {
    if (bar) {
        /* do something interesting */
    }
}

If what you are saying is true, our chip is going to be sitting there being as useful as a mud brick whenever bar is not true. In this case the fix seems simple enough: put the branch outside the function or inline it. Just goes to show that the cost of 'pretty' code is sometimes higher than we think.

Beautiful presentation, well done! George

What are the microarchitectural reasons for this? Since this applies to call/ret pairs only and not any stack access, does it have to do anything with the return-address-stack cache implementation?

Michael Chynoweth (Intel)'s picture

Don: Yes...this will get hit in quite a few unintentional cases.
George: Thanks for the compliments!
Svilen: This is due to a reissue on the load which is an issue mentioned in the public "Intel® 64 and IA-32 Architectures Optimization Reference Manual" under the Atom processor section. Returns do an implicit load which will be reissued in this case. If you access the stack directly to do a load of the recently pushed IP address it will also be reissued. Does that answer your question?

Michael Chynoweth (Intel)'s picture

Knud Kirkegaard has published a good blog addressing how to minimize the performance impact of symbol preemption on Linux using the Intel compiler.
http://software.intel.com/en-us/blogs/2010/11/10/limit-performance-impact-of-global-symbols-on-linux/

Michael Chynoweth (Intel)'s picture

I was asked a question on this blog and decided to put my answer in the comments. When the Linux compiler option -fpic is not used for a 32-bit shared object, a separate copy of the shared object is loaded for each process. This can become a performance issue if many processes load the shared object, because memory usage and the code footprint both increase. If a developer does not know how many processes load a shared object, the best option is to use the visibility options from gcc or Knud's methodology for the Intel compiler, posted in the previous comment.
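For readers unfamiliar with the gcc visibility options, here is a minimal sketch. The function names scale_internal and lib_scale_and_offset are hypothetical. The idea is that a symbol marked hidden cannot be preempted by another module, so calls to it from inside the same shared object can usually bind directly instead of going through the PLT; how much PIC overhead this removes depends on the compiler and on what else the function touches, so check the generated code.

```c
/* Internal helper: hidden visibility keeps it out of the shared object's
   exported symbols, so it cannot be preempted and intra-library calls
   to it can bind directly. */
__attribute__((visibility("hidden")))
int scale_internal(int x)
{
    return x * 3;
}

/* Public entry point: default visibility keeps it exported. */
__attribute__((visibility("default")))
int lib_scale_and_offset(int x)
{
    return scale_internal(x) + 1;
}
```

An alternative is to compile the whole library with -fvisibility=hidden and mark only the exported entry points with default visibility.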

Michael Chynoweth (Intel)'s picture

The toolset we used for the analysis above is now live:
http://software.intel.com/en-us/articles/intel-performance-bottleneck-analyzer/

Please download and tell us what you think.

Thanks,

Mike