GCC x86 Performance Hints

By Evgeny V Stupachenko

Published: 09/26/2012   Last Updated: 09/26/2012

      People say that GCC (GNU Compiler Collection) cannot generate code as efficient as that of proprietary compilers. Is this a myth or reality? We will try to figure out how things really stand with GCC, and how to make it produce more efficient code. We describe some optional hints for compiling "C", "C++" and "Fortran" on the x86 Linux platform that help you get more performance from GCC. This should be useful for customers and developers who need higher performance but, for various reasons, do not use proprietary compilers.

What is default GCC?

      (1) The default GCC optimization level is "-O0". The GCC manual reads: "Without any optimization option, the compiler's goal is to reduce the cost of compilation and to make debugging produce the expected results". This default gives very low performance, so running a bare "gcc" command is not recommended for release builds.
GCC does not target the architecture of the build machine unless you add the "-march=native" option. By default GCC passes the options that were set when it was configured. Let’s see how gcc was configured:

   gcc -v
   "Configured with: [PATH]/configure … --with-arch=corei7 --with-cpu=corei7…"

This means that GCC will add "-march=corei7" to the command line options.
Most GCC builds for x86 (the default on 64-bit Linux) were configured for a generic target and therefore add "-mtune=generic -march=x86-64" to the command line options. You can always check the options passed by the GCC driver, and the internal options, by running the following:

   echo "int main() {return 0;}" | gcc [OPTIONS] -x c -v -Q -

For example, here is one of the commonly used commands:

   gcc -O2 test.c

This command builds "test.c" without any architecture-specific optimizations, which can lead to significant performance losses compared with code tuned for the target. Reduced (or disabled) vectorization and inefficient instruction scheduling are the most frequent causes of these losses.
To get higher performance, use the following command:

   gcc -O2 test.c -march=native

Specifying your architecture is important! The only exception is when your program spends almost all of its execution time in GLIBC functions, since most of them select code optimized for the current architecture at run time. Note that some frequently used GLIBC functions have no architecture specialization when linked statically. Dynamic linking is better if you want faster GLIBC.
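
To see what an architecture option changes on a given machine, one approach is to check the macros that GCC predefines for each instruction set extension it is allowed to use. The sketch below is only an illustration: the function name "enabled_isa" is hypothetical (it is not part of GCC or GLIBC), and the macro list covers just a few common extensions.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: fills buf with the names of the x86 extensions
 * this translation unit was compiled for. It relies on macros such as
 * __SSE2__ and __AVX__, which GCC predefines when the corresponding
 * instruction set is enabled at compile time (for example by -march). */
const char *enabled_isa(char *buf, size_t size)
{
    if (size == 0)
        return buf;
    buf[0] = '\0';
#ifdef __SSE2__
    strncat(buf, "sse2 ", size - strlen(buf) - 1);
#endif
#ifdef __SSE3__
    strncat(buf, "sse3 ", size - strlen(buf) - 1);
#endif
#ifdef __SSSE3__
    strncat(buf, "ssse3 ", size - strlen(buf) - 1);
#endif
#ifdef __SSE4_1__
    strncat(buf, "sse4.1 ", size - strlen(buf) - 1);
#endif
#ifdef __SSE4_2__
    strncat(buf, "sse4.2 ", size - strlen(buf) - 1);
#endif
#ifdef __AVX__
    strncat(buf, "avx ", size - strlen(buf) - 1);
#endif
#ifdef __AVX2__
    strncat(buf, "avx2 ", size - strlen(buf) - 1);
#endif
    return buf;
}
```

Printing the result once for a build made with "gcc -O2" and once for a build made with "gcc -O2 -march=native" shows which extensions the second build may use. The difference depends on the processor of the build machine.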

      (2) By default, most GCC builds for x86 use the x87 floating point model in 32-bit mode, because they were configured without "--with-fpmath=sse". Only if GCC was configured with "--with-fpmath=sse", like:

   gcc -v
   "Configured with: [PATH]/configure … --with-fpmath=sse…"

does it use the SSE floating point model by default. In all other cases you should specify "-mfpmath=sse" to improve floating point performance.
This frequently used command can lead to significant losses in floating point code performance:

   gcc -O2 -m32 test.c

To improve floating point performance, compile with:

   gcc -O2 -m32 test.c -mfpmath=sse

Adding "-mfpmath=sse" is important in 32-bit mode! The only exception is when your compiler was configured with "--with-fpmath=sse".
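
One way to confirm which floating point model a build actually uses is to ask the compiler what it reports about itself. The sketch below assumes GCC's predefined macros "__SSE_MATH__" and "__SSE2_MATH__" and the C99 macro "FLT_EVAL_METHOD" from <float.h>; the two function names are hypothetical and exist only for this example.

```c
#include <float.h>

/* Hypothetical helper: returns 1 if this translation unit was compiled
 * with scalar floating point math in SSE registers, 0 otherwise.
 * GCC predefines __SSE_MATH__ / __SSE2_MATH__ when -mfpmath=sse is in
 * effect for float / double arithmetic. */
int uses_sse_math(void)
{
#if defined(__SSE2_MATH__) || defined(__SSE_MATH__)
    return 1;
#else
    return 0;
#endif
}

/* Hypothetical helper: returns the C99 evaluation method.
 *   0  operations are evaluated in the precision of their own type
 *   2  operations are evaluated in long double precision
 *  -1  indeterminable (also returned here if the macro is missing,
 *      for example in a pre-C99 language mode) */
int float_eval_method(void)
{
#ifdef FLT_EVAL_METHOD
    return FLT_EVAL_METHOD;
#else
    return -1;
#endif
}
```

With GCC on x86, a build that uses the x87 model typically reports an evaluation method of 2, while a build with "-mfpmath=sse" and SSE2 typically reports 0.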


32 bits or 64 bits?

      32-bit mode reduces memory usage and, as a result, memory access time (because more data fits in the caches).
      In 64-bit mode the number of available registers increases from 6 to 14 general-purpose registers and from 8 to 16 XMM registers. Also, all 64-bit x86 architectures include the SSE2 extension, and SSE floating point is the default there (so you don’t need to add "-mfpmath=sse").
It is recommended to use 64-bit mode for HPC applications and 32-bit mode for phone and tablet applications.
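
The memory saving of 32-bit mode comes largely from smaller pointers. As a rough illustration, the hypothetical list node below holds two pointers and an int; on x86 Linux it typically occupies 12 bytes with "-m32" and 24 bytes with "-m64", so about twice as many nodes fit in the same cache in 32-bit mode. The type and function names are made up for this example.

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical pointer-heavy data structure, as found in many
 * linked lists and trees. */
struct list_node {
    struct list_node *next;
    struct list_node *prev;
    int value;
};

/* Size of one node in bytes, including any padding the ABI adds. */
size_t node_size(void)
{
    return sizeof(struct list_node);
}

/* Width of a data pointer in bits: 32 with -m32, 64 with -m64 on x86. */
size_t pointer_bits(void)
{
    return sizeof(void *) * CHAR_BIT;
}
```

Compiling the same file with "-m32" and with "-m64" and comparing the two values of node_size gives a quick estimate of how much a pointer-heavy data set grows in 64-bit mode.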


How to achieve maximum performance?

      GCC provides many options to try in order to achieve higher performance. Below is a summary of recommendations and forecasts for Intel® Atom™ and 2nd Generation Intel® Core™ i7 processors, compared with plain "-O2". The forecasts are based on GCC 4.7 results and assume that GCC was configured for x86-64 generic.

     Performance improvement forecast, relative to "-O2", for applications commonly used on tablets and phones (32-bit mode only, as is common for such applications):

-m32 -mfpmath=sse                                            ~5%
-m32 -mfpmath=sse -Ofast -flto                               ~36%
-m32 -mfpmath=sse -Ofast -flto -march=native                 ~40%
-m32 -mfpmath=sse -Ofast -flto -march=native -funroll-loops  ~43%

     Performance improvement forecast, relative to "-O2", for HPC applications (32-bit mode):

-m32 -mfpmath=sse                                            ~4%
-m32 -mfpmath=sse -Ofast -flto                               ~21%
-m32 -mfpmath=sse -Ofast -flto -march=native                 ~25%
-m32 -mfpmath=sse -Ofast -flto -march=native -funroll-loops  ~24%

     Performance improvement forecast, relative to "-O2", for HPC applications (64-bit mode):

-m64 -Ofast -flto                                            ~17%
-m64 -Ofast -flto -march=native                              ~21%
-m64 -Ofast -flto -march=native -funroll-loops               ~22%

At "-O2 -mfpmath=sse", 64-bit mode is about 5% faster than 32-bit mode on HPC applications.
Please note that all numbers in this article are forecasts based on a certain set of benchmarks.

Below is a short summary of the options used. You can find the full list of options and their descriptions at http://gcc.gnu.org/onlinedocs/gcc-4.7.1/gcc/Optimize-Options.html

  • "-Ofast" is the same as "-O3 -ffast-math": it enables high-level optimizations plus aggressive optimizations of arithmetic calculations (such as floating point reassociation), which may change floating point results
  • "-flto" enables link time optimizations
  • "-m32" switches to 32-bit mode
  • "-mfpmath=sse" enables the use of XMM registers in floating point instructions (instead of the x87 register stack)
  • "-funroll-loops" enables loop unrolling
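
To make "floating point reassociation" and loop unrolling more concrete, the sketch below compares a strict in-order sum with a hand-written version that keeps four independent partial sums. The second function only approximates, by hand, the kind of transformation that "-Ofast" (through "-ffast-math") and "-funroll-loops" allow the compiler to make. It is not what GCC literally emits, and both function names are made up for this example.

```c
#include <stddef.h>

/* Strict left-to-right sum. Without -ffast-math the compiler must keep
 * this order, so every addition waits for the previous one. */
float sum_in_order(const float *a, size_t n)
{
    float s = 0.0f;
    size_t i;

    for (i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Hand-unrolled sum with four independent accumulators. The additions
 * are reassociated: elements are grouped by (index mod 4) and the four
 * partial sums are combined at the end. This form maps naturally onto
 * a 4-wide SSE register. */
float sum_four_lanes(const float *a, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    float s;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    s = (s0 + s1) + (s2 + s3);

    /* Remainder loop for the last n % 4 elements. */
    for (; i < n; i++)
        s += a[i];
    return s;
}
```

For inputs such as small integers both functions return the same value. With values of very different magnitude the results may differ in the last bits, which is why GCC does not reorder floating point sums unless "-ffast-math" is given.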
