How does 32-byte alignment affect the uop cache?



Wei M. wrote:

Sorry, I disabled rich text, and the line breaks disappeared after I submitted the post. I am posting it again here in rich text.

For the small testcase 1.c attached, compile 1.c on a Sandy Bridge machine using different text start addresses. (I use llvm to generate 1.s instead of gcc because llvm doesn't do tail duplication for the if statement in the kernel loop, so it generates clearer code that reflects the problem.)

clang-3.4 -O2 -m64 1.c -S

gcc 1.s -o a.1.out -static -Wl,-Ttext=0x1400010

gcc 1.s -o a.2.out -static -Wl,-Ttext=0x1400020

a.2.out is 20% faster than a.1.out. The only difference is that the kernel loop in a.2.out is 32-byte aligned while the kernel loop in a.1.out is only 16-byte aligned. I set the parameter commands_real_size of foo to 30 in order to disable the LSD, which needs more than 32 loop iterations to take effect.

perf stat -e 01ab (the DSB2MITE_SWITCHES.COUNT event) shows that for a.1.out the switching between the uop cache and the legacy decoder is much more frequent than for a.2.out. The IDQ.MITE_UOPS event also verified that a.1.out gets many more uops from the legacy decoder than a.2.out does. I tried the -falign-loops=32 option on a series of benchmarks; for most of them it reduced the DSB2MITE_SWITCHES.COUNT event. So I am wondering: how does a 32-byte-aligned loop start address help to reduce DSB-to-MITE switches?

Another thing that confuses me: if I run the same binaries a.1.out and a.2.out on a Westmere machine, there is a similar performance difference. But there is no uop cache on Westmere, so where could the performance difference come from? Does any other performance factor related to 32-byte alignment matter?

Thanks,

Wei.

Attachment: 1.tar (10 KB)
iliyapolak wrote:

>>> I set the param commands_real_size of foo to 30 in order to disable LSD <<<

As I understand it, LSD loop caching does not depend on the loop iteration count. It depends on the number of decoded uops (~28).

perfwise wrote:

On Ivy Bridge the uop queue can hold 56 uops; on Sandy Bridge it holds 28.  I just ran your assembly file on my IB, and here's what I see in the loop as you provided:

0000000000000000 <foo>:
0: 85 ff test %edi,%edi
2: 7e 20 jle 24 <foo+0x24>
4: 44 8d 46 07 lea 0x7(%rsi),%r8d
8: 8d 04 76 lea (%rsi,%rsi,2),%eax
b: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
10: 83 fe 02 cmp $0x2,%esi
13: 89 c1 mov %eax,%ecx
15: 7c 03 jl 1a <foo+0x1a>
17: 44 89 c1 mov %r8d,%ecx
1a: 89 0a mov %ecx,(%rdx)
1c: 48 83 c2 04 add $0x4,%rdx
20: ff cf dec %edi
22: 75 ec jne 10 <foo+0x10>
24: c3 retq
25: 66 66 2e 0f 1f 84 00 data32 nopw %cs:0x0(%rax,%rax,1)
2c: 00 00 00 00

and your loop is:

10: 83 fe 02 cmp $0x2,%esi
13: 89 c1 mov %eax,%ecx
15: 7c 03 jl 1a <foo+0x1a>
17: 44 89 c1 mov %r8d,%ecx
1a: 89 0a mov %ecx,(%rdx)
1c: 48 83 c2 04 add $0x4,%rdx
20: ff cf dec %edi
22: 75 ec jne 10 <foo+0x10>

The IPC of this code is 2.67 and it has a 95% hit rate in the uop$, but it generates 6.7 DSB<->MITE transitions per 1000 instructions.  The distribution of uops delivered by the uop$: 49% of the time 4 uops, and 46% of the time 3 uops.  There are no dynamic token stalls.

Then I put:

.byte 0x3e, 0x3e, 0x3e, 0x90

a 4-byte NOP (0x90 with three 0x3E prefixes), in front of foo in your 1.s file and replicated it 4x, so as to align the loop not to 0x10 but to 0x20.  The performance is now 3.87 IPC, there are 0 DSB<->MITE transitions, and 51% of the time the uop$ delivers 4 uops and 46% of the time 3 uops.  So uop delivery rate isn't the problem; the uop$ delivers at about the same rate when active.  You're just not losing cycles in transitions.

Looking at the aligned test: it takes 258 clks per 1000 instructions, and from the uop$ delivery rate we know the difference between these tests is not associated with that, but with cycles lost in transitions from DSB to MITE.  The misaligned test takes 374 clks per 1000 instructions, with 6.7 transitions per 1000 instructions.  That implies each transition costs approximately 17 clks.  That's quite a penalty.  Maybe it is 1/2 that, because you transition from DSB to MITE and then back from MITE to DSB.  Intel's counter for this penalty doesn't appear to work.  So a transition from one to the other takes ~8 clks, or a couple less, depending upon how much microcode is required to fix the loop below:

10: 83 fe 02 cmp $0x2,%esi
13: 89 c1 mov %eax,%ecx
15: 7c 03 jl 1a <foo+0x1a>
17: 44 89 c1 mov %r8d,%ecx
1a: 89 0a mov %ecx,(%rdx)
1c: 48 83 c2 04 add $0x4,%rdx
20: ff cf dec %edi
22: 75 ec jne 10 <foo+0x10>

In the first test one would expect:

400500: 85 ff test %edi,%edi
400502: 7e 20 jle 400524 <foo+0x24>
400504: 44 8d 46 07 lea 0x7(%rsi),%r8d
400508: 8d 04 76 lea (%rsi,%rsi,2),%eax
40050b: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
400510: 83 fe 02 cmp $0x2,%esi

to go into 1 way of the index associated with the code in the loop, starting at the front of foo, and:

400513: 89 c1 mov %eax,%ecx
400515: 7c 03 jl 40051a <foo+0x1a>
400517: 44 89 c1 mov %r8d,%ecx
40051a: 89 0a mov %ecx,(%rdx)
40051c: 48 83 c2 04 add $0x4,%rdx

would go into the 2nd way.. so I don't see any issue there.  This:

400520: ff cf dec %edi
400522: 75 ec jne 400510 <foo+0x10>
400524: c3 retq

would go into the next index in the uop$.  I don't see why MITE is being activated for any of this, other than working around some undocumented issue.

Anybody have a comment from Intel?  Interesting..

perfwise

iliyapolak wrote:

It seems that your NOP "padding" enabled alignment of the loop on a 32-byte boundary.

Wei M. wrote:

LSD loop caching does depend on the iteration count. See the following paragraph from the Intel optimization reference manual. Experiments show that on Westmere, the iteration count should be larger than 64 for the LSD to take effect; on Sandy Bridge, larger than 32.

3.4.2.4 Optimizing the Loop Stream Detector (LSD)
Loops that fit the following criteria are detected by the LSD and replayed from the
instruction queue to feed the decoder in Intel Core microarchitecture:

  • Must be less than or equal to four 16-byte fetches.
  • Must be less than or equal to 18 instructions.
  • Can contain no more than four taken branches and none of them can be a RET.
  • Should usually have more than 64 iterations.
iliyapolak wrote:

I looked at page 2-7, section 2.1.2.4, and did not see any mention of an iteration count. Anyway, thanks for correcting me.
