Cache Optimization

Hi guys,

I'm currently working on the optimization of some algorithms.

During my work I found some odd behavior.

First, some background:

1) I'm working in kernel mode using some Windows real-time extension

2) I disabled interrupts and there are no context switches

3) I write back and invalidate the cache each time I run the algorithm

4) I am using an Intel Core architecture

5) The algorithm mainly reads, modifies and writes back memory in a loop

6) The memory area I use is not being paged

Now look at the image below. What I don't understand is the behavior at the beginning. Why are there these peaks in execution time that settle after a few executions? Any idea?

Thanks in advance!





Just out of curiosity, how are you invalidating the cache?

0.28 ms is a fairly short amount of time. What timer are you using to do the measurements? Maybe it's the measurement and not the routine that causes this variation. I would use rdtsc in this case.

Kind regards


Hey Thomas,

I'm invalidating the caches using the wbinvd instruction.
For the measurements I am using rdtsc (and performance counters to measure cache misses etc.).
I'm using C with inline assembly. The measurement routines should not make any difference...
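For readers following along, an rdtsc-based measurement of the kind described here might look roughly like the sketch below. It is not the poster's code. It assumes GCC or Clang on x86-64 (`x86intrin.h`), uses `lfence` around `rdtsc` to keep the timestamp from drifting into the measured region, and falls back to `clock()` on other targets. The names `read_cycles`, `time_runs`, `rmw_kernel` and `struct buffer` are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__)
#include <x86intrin.h> /* __rdtsc, _mm_lfence (GCC/Clang; assumption) */
static inline uint64_t read_cycles(void)
{
    _mm_lfence();            /* let earlier instructions finish first */
    uint64_t t = __rdtsc();  /* read the time-stamp counter */
    _mm_lfence();            /* keep later instructions from starting early */
    return t;
}
#else
#include <time.h>
/* Fallback for non-x86 builds: coarse and not cycle-accurate. */
static inline uint64_t read_cycles(void) { return (uint64_t)clock(); }
#endif

/* Hypothetical stand-in for the algorithm: read, modify, write back in a loop. */
struct buffer { uint32_t *data; size_t len; };

static void rmw_kernel(void *ctx)
{
    struct buffer *b = (struct buffer *)ctx;
    for (size_t i = 0; i < b->len; ++i)
        b->data[i] += 1u;
}

/* Time `runs` executions of fn and store one tick count per run. */
static void time_runs(void (*fn)(void *), void *ctx,
                      uint64_t *cycles, size_t runs)
{
    for (size_t i = 0; i < runs; ++i) {
        /* In the setup described in the thread, wbinvd would be issued
           here. It is privileged (ring 0), so this sketch omits it. */
        uint64_t start = read_cycles();
        fn(ctx);
        uint64_t stop = read_cycles();
        cycles[i] = stop - start;
    }
}
```

Keeping one sample per run, instead of an average, is what makes a start-up transient like the one in the plot visible at all.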

Another interesting thing is:
At these peaks in execution time the L1 cache misses decrease while the L2 cache misses increase.

The timer would have been too easy (but I had to ask because I've seen this too many times).

I assume that you have also verified what the hardware prefetchers are doing, e.g. measure the number of prefetched cache lines or disable them to cross-check?

I measured the L1/L2 Requests...
The graphs show exactly the same peaks.
I also added some loops and serializing instructions before running the algorithm to make sure that the context switch to kernel mode is done and everything is settled. No changes...

Disable the caches? I didn't try that. But I don't know if the Windows platform would like it...
What do you suggest? I can't find anything that points to what influences the behavior.

Is there some instance apart from the caches that prefetches data?


I was referring to the hardware logic that automatically prefetches data into the caches before it is requested. These hardware prefetchers can often be disabled in the BIOS. (The cache itself cannot be disabled.) Windows won't notice that the prefetchers are disabled other than that the system runs slower.

The performance counters provide events to monitor how many cache lines are fetched by misses and by the prefetchers. On the latest Intel architecture, such events are L1D_PREFETCH.* or L2_DATA_RQSTS.PREFETCH.* and some more.

Kind regards

Mhh, these events would be interesting. Too bad my PC is too old! These events are available on i7 and Xeon 5500 only. I've never seen BIOS settings for prefetching (never looked for them either). I'll try that tomorrow. Afaik caching can be prevented (same as disabled for me) using the control registers.

Best regards,
Michael
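As a rough illustration of the control-register remark, assuming it refers to CR0: on x86, bit 30 (CD, cache disable) and bit 29 (NW, not write-through) of CR0 control caching, and CD=1 with NW=0 selects no-fill mode. The helpers below only compute the new register value. Reading and writing CR0 needs ring 0, and a full procedure would also flush with wbinvd and deal with the MTRRs, none of which is shown. The function names are hypothetical.

```c
#include <stdint.h>

#define CR0_NW (UINT64_C(1) << 29) /* Not Write-through */
#define CR0_CD (UINT64_C(1) << 30) /* Cache Disable */

/* CR0 value for no-fill mode (CD = 1, NW = 0); all other bits unchanged. */
static uint64_t cr0_no_fill(uint64_t cr0)
{
    return (cr0 | CR0_CD) & ~CR0_NW;
}

/* CR0 value for normal caching (CD = 0, NW = 0); all other bits unchanged. */
static uint64_t cr0_normal_caching(uint64_t cr0)
{
    return cr0 & ~(CR0_CD | CR0_NW);
}
```

Splitting the bit arithmetic from the privileged register access keeps the part that is easy to get wrong checkable in user mode.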

I turned off hardware prefetching now.
And then added a serializing instruction after wbinvd just to make sure everything is settled.

Without software prefetching, L1 cache misses stay stable after the first run (which has some more misses).
Nevertheless, L2 misses take about 900 runs to settle...

Strange thing!
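To put a number on "runs to settle" instead of reading it off a plot, one option is a small helper like the sketch below. It is only one possible definition, not something from the thread: the steady state is taken as the mean of the last `tail` samples, and the settle point is the first run from which every later sample stays within `tol` of that mean. The name `settle_index` and both parameters are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns the index of the first run from which all later samples stay
   within `tol` of the steady-state estimate, or n if even the last
   sample is outside that band. */
static size_t settle_index(const uint64_t *samples, size_t n,
                           size_t tail, uint64_t tol)
{
    assert(samples != NULL && n > 0 && tail > 0 && tail <= n);

    /* Steady-state estimate: mean of the last `tail` samples. */
    uint64_t sum = 0;
    for (size_t i = n - tail; i < n; ++i)
        sum += samples[i];
    uint64_t steady = sum / tail;

    /* Walk backwards until a sample leaves the tolerance band. */
    size_t first = n;
    for (size_t i = n; i-- > 0; ) {
        uint64_t diff = samples[i] > steady ? samples[i] - steady
                                            : steady - samples[i];
        if (diff > tol)
            break;
        first = i;
    }
    return first;
}
```

Fed with the per-run L2 miss counts, this would give a single figure that can be compared between the configurations with and without hardware prefetching.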


On Intel Core architecture, you can use the event L2_LD.SELF.PREFETCH.* to monitor the traffic by the prefetchers and compare it to the L2_LD.SELF.DEMAND.*.

Kind regards
