I am more concerned about the loop stream detector than the instruction cache. Furthermore, I wonder why ICC puts all these loads next to each other. This should put some stress on the OOO engine.
Have you tried experimenting with "#pragma unroll"? Reducing the unroll factor might improve the situation.
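To make that concrete, here is a minimal sketch of limiting the unroll factor on one loop. It is not the code from this thread: the function `scale_add`, the macro `UNROLL_BY_2`, and the factor 2 are placeholders I made up for illustration. ICC and clang accept `#pragma unroll(n)`, and GCC 8 and later use `#pragma GCC unroll n`, so the macro picks whichever the compiler understands.

```c
#include <stddef.h>

/* Hypothetical helper macro: request an unroll factor of 2 in whatever
 * spelling the compiler accepts, or nothing if it accepts neither. */
#if defined(__INTEL_COMPILER) || defined(__clang__)
#  define UNROLL_BY_2 _Pragma("unroll(2)")
#elif defined(__GNUC__) && __GNUC__ >= 8
#  define UNROLL_BY_2 _Pragma("GCC unroll 2")
#else
#  define UNROLL_BY_2
#endif

/* Illustrative loop (an assumption, not the benchmark from this thread):
 * applies one multiply-add to each element of x. */
void scale_add(double *x, size_t n, double m, double a)
{
    UNROLL_BY_2
    for (size_t i = 0; i < n; ++i)
        x[i] = x[i] * m + a;
}
```

Comparing the generated assembly for a few different factors should show whether the amount of unrolling is what hurts.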
The LSD is what I meant. I just didn't recall the right name...
No, I didn't try #pragma unroll. Do you mean I should let icc unroll the whole 1000000-iteration loop? I guess you meant something else.
The "manual unrolling" in the code is there so that it can make use of instruction-level parallelism. If you loop only over the 0s, then the 1s, and so on, you have a dependency chain between all the instructions in the loop and there's nothing left to execute in parallel. The four independent multiply-adds let gcc reach peak performance.
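Roughly, the difference looks like the sketch below. This is a simplified illustration rather than the actual benchmark code; the function names `chain_single` and `chain_four` and their parameters are made up here. In the first loop every multiply-add has to wait for the previous one. In the second, the four chains do not depend on each other, so the CPU can keep several multiply-adds in flight at once.

```c
#include <stddef.h>

/* One accumulator: each multiply-add needs the result of the previous
 * one, so the loop is a single serial dependency chain. */
double chain_single(double x, double m, double a, size_t iters)
{
    for (size_t i = 0; i < iters; ++i)
        x = x * m + a;
    return x;
}

/* Four accumulators: the four multiply-adds in each iteration are
 * independent of one another, so they can overlap in the pipeline. */
void chain_four(double x[4], double m, double a, size_t iters)
{
    double x0 = x[0], x1 = x[1], x2 = x[2], x3 = x[3];
    for (size_t i = 0; i < iters; ++i) {
        x0 = x0 * m + a;
        x1 = x1 * m + a;
        x2 = x2 * m + a;
        x3 = x3 * m + a;
    }
    x[0] = x0;
    x[1] = x1;
    x[2] = x2;
    x[3] = x3;
}
```

Both versions do the same arithmetic per chain; only the amount of independent work available per iteration changes.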