Take the small test case 1.c (attached, along with 1.s) and, on a Sandy Bridge machine, compile it using different text start addresses. (I use llvm rather than gcc to generate 1.s because llvm does not do tail duplication for the if statement in the kernel loop, so it generates clearer code that reflects the problem.)

clang-3.4 -O2 -m64 1.c -S
gcc 1.s -o a.1.out -static -Wl,-Ttext=0x1400010
gcc 1.s -o a.2.out -static -Wl,-Ttext=0x1400020

a.2.out is 20% faster than a.1.out. The only difference is that the kernel loop in a.2.out is 32-byte aligned while the kernel loop in a.1.out is only 16-byte aligned. I set the param commands_real_size of foo to 30 to keep the LSD from kicking in (the LSD needs >32 loop iterations to take effect).

perf stat -e r01ab (the raw encoding of the DSB2MITE_SWITCHES.COUNT event) shows that for a.1.out the switching between the uop cache and the legacy decoder is much more frequent than for a.2.out. The IDQ.MITE_UOPS event likewise confirms that a.1.out gets far more uops from the legacy decoder than a.2.out does.

I also tried the -falign-loops=32 option on a series of benchmarks; for most of them it reduced the DSB2MITE_SWITCHES.COUNT event. So I am wondering: how does a 32-byte-aligned loop start address help reduce DSB2MITE switches?

Another thing that confuses me: if I run the same binaries a.1.out and a.2.out on a Westmere machine, there is a similar performance difference. But Westmere has no uop cache, so where could the performance difference come from? Does some other performance factor related to 32-byte alignment matter here?

Thanks,
Wei
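P.S. The attachments do not appear inline in this post, so below is a minimal sketch of the shape of the kernel I am describing. It is illustrative only, not the actual 1.c: foo and commands_real_size follow the description above, while the array contents and the outer driver loop are made-up assumptions.

#include <stdio.h>

#define COMMANDS_REAL_SIZE 30   /* 30 < 32 iterations, so the LSD should not capture the loop */

static int a[COMMANDS_REAL_SIZE];

/* The kernel: a short loop containing an if statement, as described above. */
__attribute__((noinline))
static int foo(void)
{
    int sum = 0;
    for (int i = 0; i < COMMANDS_REAL_SIZE; i++) {
        if (a[i] & 1)           /* the if inside the kernel loop */
            sum += a[i];
        else
            sum -= a[i];
    }
    return sum;
}

int main(void)
{
    /* Fill the array so the branch direction is data dependent. */
    for (int i = 0; i < COMMANDS_REAL_SIZE; i++)
        a[i] = i;

    int total = 0;
    for (long n = 0; n < 100000000L; n++)   /* drive the kernel hot */
        total += foo();
    printf("%d\n", total);
    return 0;
}

Whether foo's loop lands on a 16-byte or a 32-byte boundary then depends only on the -Wl,-Ttext address chosen at link time, which is the effect measured above.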