I'm wondering why the performance of the following loop is not improved by interleaving the last 6 instructions of the loop with the first 10:
mov rcx,0x80000000 ; 2^31
I thought I could keep the processor (an i7 920) busier by putting some of the adds into the dependency chain, but every rearrangement I tried resulted in slower execution times. Can anyone find a reason for this, or possibly get it to go even faster? Are the adds already being executed at the same time as instructions towards the beginning of the loop? It's quite a big leap...
I am a little surprised. This was the order I put the instructions in at first glance, with the intention of rearranging them later for more speed. Little did I know!
If you're wondering what the code does, it is the sum of (parity(i^2) mod 2) over i, where i^2 can be a 128-bit integer.
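
For reference, here is a minimal C sketch of that computation, not the assembly loop itself. It assumes "parity" means the number of set bits mod 2, and it relies on the GCC/Clang extensions `unsigned __int128` and `__builtin_parityll`. The names `parity128` and `parity_square_sum`, and the half-open range [0, n), are made up for illustration.

```c
#include <stdint.h>

/* Parity (number of set bits mod 2) of a 128-bit value.
   XOR-folding the high half into the low half preserves parity,
   so a single 64-bit parity operation is enough. */
static unsigned parity128(unsigned __int128 x) {
    uint64_t folded = (uint64_t)x ^ (uint64_t)(x >> 64);
    return (unsigned)__builtin_parityll(folded);
}

/* Sum of parity(i*i) for i in [0, n).
   The square is formed in 128 bits so it cannot overflow
   for any 64-bit i. */
static uint64_t parity_square_sum(uint64_t n) {
    uint64_t sum = 0;
    for (uint64_t i = 0; i < n; ++i) {
        unsigned __int128 sq = (unsigned __int128)i * i;
        sum += parity128(sq);
    }
    return sum;
}
```

A compiler's output for this is unlikely to match a hand-scheduled loop, but it can be useful for checking that a rearranged assembly version still produces the same sum.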