I have implemented some code using two approaches. Looking at the results (attached), I can tell that the "faster" version had fewer branch mispredictions, fewer L1 instruction-cache misses, and fewer TLB misses, but I cannot work out how many CPU cycles were consumed. The total difference between the two designs is several billion instructions.
Could somebody please glance at my results and help me determine where the "additional" CPU cycles were consumed?
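For context, this is the kind of back-of-envelope model I was hoping to apply: multiply each counter delta by an assumed per-event penalty and add them up. The penalty values and counter deltas below are placeholders/guesses, not values measured on my CPU, so please correct the approach if it is wrong.

```c
/* Rough estimate of extra cycles from counter deltas between the two runs.
 * All numeric values here are assumptions/placeholders, not measurements. */
#include <stdio.h>

int main(void) {
    /* Deltas between the two designs (to be filled in from my results). */
    long long branch_mispredicts = 0;
    long long l1i_misses         = 0;
    long long tlb_misses         = 0;
    long long extra_instructions = 0;

    /* Assumed per-event penalties in cycles -- ballpark guesses only. */
    const double branch_penalty = 15.0;  /* typical mispredict cost        */
    const double l1i_penalty    = 10.0;  /* refill from L2                 */
    const double tlb_penalty    = 30.0;  /* page-walk cost                 */
    const double cpi            = 1.0;   /* assumed cycles per instruction */

    double estimated_cycles =
          branch_mispredicts * branch_penalty
        + l1i_misses         * l1i_penalty
        + tlb_misses         * tlb_penalty
        + extra_instructions * cpi;

    printf("estimated extra cycles: %.0f\n", estimated_cycles);
    return 0;
}
```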
These are the memory access costs I have found: