I have been encouraged by some people to measure the effect of the L1 instruction cache
(known to be 32 KB on a Core 2 Duo machine).
The methodology I chose:
1) Generate 100 KB of uniform code, like:
     jmp ebx (ebx = $ + 100 KB - codesize)
     dec eax (this instruction is known to have a 1-byte machine encoding)
     ... (dec eax repeated until the block is 100 KB long)
     cmp eax, 0
2) Load 2*10^9 into eax.
3) Compute the jump address to place in ebx, so that the amount of code actually executed is, e.g., 10 KB.
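The steps above can be sketched as a small generator for the assembly source. This is my own hedged sketch, not the original test harness: the NASM-style syntax, the label names `top`/`body`, and the trailing `jg top` back-branch are all my assumptions about how the loop would be closed.

```python
# Sketch of a generator for the test source described above.
# Assumptions (mine): 32-bit NASM-style syntax, labels `top`/`body`,
# and a `jg top` back-branch to close the loop.
TOTAL_BYTES = 100 * 1024  # step 1: 100 KB of uniform `dec eax` code


def generate_asm(codesize_bytes):
    """Emit source where only the last `codesize_bytes` of the
    100 KB run of 1-byte `dec eax` instructions get executed."""
    skip = TOTAL_BYTES - codesize_bytes
    lines = [
        "mov eax, 2000000000",        # step 2: 2*10^9 decrements in total
        "mov ebx, body + %d" % skip,  # step 3: computed jump target
        "top:",
        "jmp ebx",                    # skip the never-executed prefix
        "body:",
    ]
    lines += ["dec eax"] * TOTAL_BYTES  # each `dec eax` is 1 byte of code
    lines += [
        "cmp eax, 0",
        "jg top",                     # repeat until eax counts down to 0
    ]
    return "\n".join(lines)


src = generate_asm(10 * 1024)  # executed block: the last 10 KB
```

Varying the argument of `generate_asm` then changes only how much of the 100 KB block is hot, which is the knob the experiment turns.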
In my unqualified opinion, this should mean that every run executes
(roughly) the same number of instructions (2*10^9),
localized in a variable-sized code block.
Therefore, I expect an increase in execution time once codesize exceeds 32 KB -- the size of the L1 instruction cache.
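To make that expectation concrete, here is the back-of-the-envelope arithmetic (my own sketch): the number of passes through the block scales inversely with the block size, so the total executed instruction count stays pinned near 2*10^9 while only the code footprint varies.

```python
# Total decrements is fixed by the initial eax value,
# one `dec eax` (1 byte) per executed instruction.
TOTAL_ITERS = 2 * 10**9

for codesize_kb in (1, 10, 32, 64):
    instrs_per_pass = codesize_kb * 1024       # 1-byte instructions per pass
    passes = TOTAL_ITERS // instrs_per_pass    # loop trips until eax hits 0
    total_executed = passes * instrs_per_pass  # stays close to 2*10^9
    print(codesize_kb, passes, total_executed)
```

So any timing difference between the runs should come from where the code sits relative to the 32 KB L1I, not from the amount of work done.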
Sadly, this methodology doesn't work :(
The times are the same across the whole codesize range from 1 to 64 KB.
Perhaps I am totally wrong and didn't grasp what the L1I cache is about...
Or this may be because of some processor-internal optimization that effectively shrinks the code footprint.
(Another guess: since the block executes strictly sequentially, the hardware prefetcher may be streaming the code lines from L2 fast enough to hide any L1I misses.)