New Optimizations for X86 in Upcoming GCC 5.0: PIC in 32-Bit Mode

By Evgeny V Stupachenko, Published: 12/26/2014, Last Updated: 12/26/2014

Part2. Position independent code (PIC) improvement in 32 bit mode

PIC in 32 bit mode is used to build Android applications, Linux libraries and many other products. Thus GCC performance in that case is very important.

GCC 5.0 should give significant performance gain (up to 30%) for applications with hot integer loops like cryptography, data protection, data compression, hash… where vectorization is not applicable.

In GCC 4.9 EBX register is reserved for global offset table (GOT) address and thus not available for an allocation. That way PIC in 32 bit mode has only 6 available registers (instead of regular 7): EAX, ECX, EDX, ESI, EDI and EBP. This lead to performance losses when there are not enough registers for an allocation. Moreover for some cases in which EBP is also reserved performance could be even worse.

In GCC 5.0 EBX is returned to an allocation making all 7 registers available. This boosts applications with hot integer loops suffering from register pressure. Below are results for a stress test with high register pressure in integer loop.

Test source:

  int i, j, k; 
  uint32 *in = a, *out = b;
  for (i = 0; i < 1024; i++)
    {
      for (k = 0; k < ST; k++)
        {
          uint32 s = 0;
          for (j = 0; j < LD; j++)
            s += (in[j] * c[j][k] + 1) >> j + 1;
          out[k] = s;
        }
      in += LD;
      out += ST;
    }

Where

  • “c” is constant matrix:
const byte c[8][8] = {1, -1, 1, -1, 1, -1, 1, -1,
                      1, 1, -1, -1, 1, 1, -1, -1,
                      1, 1, 1, 1, -1, -1, -1, -1,
                      -1, 1, -1, 1, -1, 1, -1, 1,
                      -1, -1, 1, 1, -1, -1, 1, 1,
                      -1, -1, -1, -1, 1, 1, 1, 1,
                      -1, -1, -1, 1, 1, 1, -1, 1,
                      1, -1, 1, 1, 1, -1, -1, -1};

 

  • “in“ and “out” pointers to global arrays "a[1024 * LD]" and "b[1024 * ST]"
  • uint32 is unsigned int
  • LD and ST – macros defining number of loads and stores in the outer loop

Compilation options "-Ofast -funroll-loops -fno-tree-vectorize --param max-completely-peeled-insns=200" plus "-march=slm" for Silvermont, "-march=core-avx2" for Haswell, "-fPIC" for PIC mode and "-DST=7 -DLD={4, 5, 6, 7, 8}"

"-fno-tree-vectorize" - used to avoid vectorization and therefore xmm registers use

"--param max-completely-peeled-insns=200" - used to make 5.0 and 4.9 behavior similar as 4.9 has the param equal to 100

First let's see by how much GCC 5.0 is faster than GCC 4.9 on the test (in times, higher is better).

The horizontal axis shows number of loads in the loop: LD. Bigger "LD" leads to higher register pressure.

GCC 5.0 gain

Here we can see that both Silvermont and Haswell get a good gain on the loop. To be sure that the gain is caused by returned EBX register to the allocation let's look at 2 charts below.

The charts below show a slowdown (or gain) from enabling PIC mode on Haswell and Silvermont for GCC 5.0 and GCC 4.9 (higher is better).

Slowdown from PIC mode on Haswell

Slowdown from PIC mode on Silvermont

Here we can see that GCC 5.0 does not loose a lot from PIC mode, but GCC 4.9 does on both Haswell and Silvermont. That means that GCC 5.0 should boost some integer loops performance. Moreover now developers can try more aggressive (in terms of register pressure) optimizations like unroll, inline, aggressive invariant motion, copy propagation...

You can find compilers used for the measurements at:

GCC 4.9: https://gcc.gnu.org/gcc-4.9

GCC 5.0 trunk built at revision 217914: https://gcc.gnu.org/viewcvs/gcc/trunk/?pathrev=218160

Test source: Download

 

Product and Performance Information

1

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804