I have (yet another) question about performance counters. I am testing the following simple code on Sandy Bridge:

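    {
        int i;
        double a = 1.0, b = 2.0, c = 1.0, d = 3.0, e = 1.0, f = 4.0;

        /* three independent dependency chains */
        for (i = 0; i < 1000000; i++) {
            a += b;
            c += d;
            e += f;
        }
        /* print the results so the compiler cannot discard the loop */
        printf("%f %f %f %f %f %f\n", a, b, c, d, e, f);
    }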

The assembly looks as follows:

    .L77:
        vaddsd  %xmm8, %xmm0, %xmm0
        vaddsd  %xmm7, %xmm2, %xmm2
        vaddsd  %xmm6, %xmm4, %xmm4
        subl    $1, %edi
        jne     .L77

To the best of my knowledge, the above loop takes 3 cycles per iteration: the 3 independent adds are all issued to port 1, so the 3-clock latency of each add is hidden. Loop management goes to other ports. Execution is hence bound by port 1, and on average one loop iteration takes 3 clocks. I have verified this by timing the code. On my computer (i7-2620M fixed at 2.7 GHz), executing the above loop takes 0.001128 s, which is almost exactly 3 clocks per iteration. The loop has 5 instructions in total, but only 4 uops due to fusion of the subl/jne pair, so in every clock cycle there should be on average 4/3 uops issued and retired, with **at least 1 uop issued/retired.**
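For reference, the arithmetic behind the 3-clock figure can be sketched as below. The function names are made up for illustration; the inputs are just the numbers quoted above (0.001128 s, 2.7 GHz, 1000000 iterations, 4 uops per iteration).

```c
/* Cycles spent per loop iteration, from wall-clock time, core clock
   frequency and trip count (hypothetical helper, illustration only). */
double cycles_per_iteration(double seconds, double hz, double iterations)
{
    return seconds * hz / iterations;
}

/* Average number of uops per cycle, assuming a steady state. */
double uops_per_cycle(double uops_per_iter, double cycles_per_iter)
{
    return uops_per_iter / cycles_per_iter;
}
```

Calling cycles_per_iteration(0.001128, 2.7e9, 1e6) gives about 3.05 clocks per iteration, and uops_per_cycle(4.0, 3.0) gives the 4/3 average mentioned above. Note that this is only an average and says nothing about how the uops are distributed over individual cycles.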

Now I expected that analyzing the performance counters for such a simple case would be easy, but I must clearly be missing something! Below are some counter values (per iteration) I obtained:

    evt 0x0e, mask 0x01, cmask 0, inv 0   UOPS_ISSUED.ANY              4
    evt 0x0e, mask 0x01, cmask 1, inv 1   UOPS_ISSUED.ANY, stalls      2
    evt 0xa2, mask 0x01, cmask 0, inv 0   RESOURCE_STALLS.ANY          2
    evt 0xc2, mask 0x01, cmask 1, inv 1   UOPS_RETIRED.ALL, stalls     0.4
    evt 0xc2, mask 0x02, cmask 0, inv 0   UOPS_RETIRED.RETIRE_SLOTS    4
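To make the evt/mask/cmask/inv notation concrete, here is a minimal sketch of how those four fields pack into the event-select encoding (event in bits 0-7, unit mask in bits 8-15, invert in bit 23, counter mask in bits 24-31), together with a simplified model of what cmask and inv do to a per-cycle count. The function names are hypothetical, the enable/USR/OS bits are deliberately left out, and the model is not meant as a description of what the hardware actually does in this loop.

```c
#include <stdint.h>

/* Pack event, umask, cmask and inv into a raw event-select value.
   Enable and privilege-level bits are omitted on purpose. */
uint64_t raw_event_config(uint8_t event, uint8_t umask, uint8_t cmask, int inv)
{
    return (uint64_t)event
         | ((uint64_t)umask << 8)
         | ((uint64_t)(inv ? 1 : 0) << 23)
         | ((uint64_t)cmask << 24);
}

/* Simplified model of the counter: with cmask == 0 the raw per-cycle
   count is accumulated; with cmask != 0 the counter adds 1 in every
   cycle where count >= cmask, or where count < cmask if inv is set. */
uint64_t count_cycles(const unsigned *per_cycle, int n, unsigned cmask, int inv)
{
    uint64_t total = 0;
    for (int i = 0; i < n; i++) {
        if (cmask == 0)
            total += per_cycle[i];
        else if (inv ? (per_cycle[i] < cmask) : (per_cycle[i] >= cmask))
            total += 1;
    }
    return total;
}
```

With these definitions, raw_event_config(0x0e, 0x01, 1, 1) yields 0x0180010e. Under the model, cmask 1 with inv 1 counts the cycles in which zero uops were seen: a purely illustrative per-cycle pattern of {2, 1, 1} would give 0 such cycles, while {4, 0, 0} would give 2, even though both sum to 4 uops.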

I don't understand why I see stalls reported. In every cycle there is at least one instruction issued/retired, which would be my understanding of no stalls. However, I get non-zero values for UOPS_ISSUED stall cycles, UOPS_RETIRED stall cycles (a different value) and RESOURCE_STALLS.ANY. Now I do not know exactly what RESOURCE_STALLS.ANY means, but surely the two other counters should read 0? Or are those counters counting unused resources per clock cycle? But then it also does not add up: there are 4 retirement slots, but only 4/3 are used in every clock cycle, so UOPS_RETIRED.ALL, stalls should then show a value of 2.666, not 0.4.
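The "unused resources" reading can be written out as a small calculation. This is only a sketch of that hypothesis, with made-up function names, assuming 4 retirement slots per cycle, 4 uops per iteration and 3 cycles per iteration.

```c
/* Unused slots per cycle if `width` slots are available each cycle
   and the uops are spread evenly over the iteration. */
double unused_slots_per_cycle(double width, double uops_per_iter,
                              double cycles_per_iter)
{
    return width - uops_per_iter / cycles_per_iter;
}

/* The same quantity summed over one whole loop iteration. */
double unused_slots_per_iteration(double width, double uops_per_iter,
                                  double cycles_per_iter)
{
    return width * cycles_per_iter - uops_per_iter;
}
```

For width 4, 4 uops and 3 cycles this gives about 2.67 unused slots per cycle, or 8 per iteration. Neither value matches the measured 0.4.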

Could anyone explain where I am going wrong here?

Thanks!