Does operating frequency influence cache misses?

kopcarl:

I ran 462.libquantum on my i5-2400 at 1.6 GHz, 2.1 GHz, 2.7 GHz and 3.1 GHz respectively, and I find that LLC misses increase at higher frequencies. The details are as follows:

LLC misses: 5E+09, 6.9E+09, 9E+09, 1E+10 at (1.6 GHz, 2.1 GHz, 2.7 GHz, 3.1 GHz).

I am wondering why changing frequency can influence cache misses?

Patrick Fay (Intel):

Hello kopcarl,

I don't really know why you are seeing the numbers you are seeing. How are you measuring your LLC cache misses (which utility are you using)? Is the tool reporting total cache misses over the run or misses/sec? How are you running libquantum? Just 1 thread, or multiple threads? Is libquantum the only thing (more or less) running?

There is a PDF, "Analyzing Libquantum" from Rogue Wave Software, that indicates libquantum fetches data that doesn't get used. I don't know whether the issues described are accurate or still true. It is possible, if more than 1 thread is running, that 1 thread is kicking out the data needed by another thread. Or maybe the tool you are using to measure bandwidth runs more frequently (more samples/second) as the frequency increases. I don't know how long libquantum runs, so I can't really tell whether the 'tool has an issue' possibility is realistic.

Pat

kopcarl:

Hi Pat,
thank you for your quick reply :)

I wrote my own code to monitor LLC misses. Specifically, I write 0x53412e (LLC Misses) into MSR 0x186 and read MSR 0xc1 every 10 million cycles. To handle overflow, I reset 0xc1 to 0 after each read, and I keep doing this until the process ends. The total cache miss count is over the whole run: I sum the value read at every sample.
Libquantum is a single-threaded program from SPEC CPU2006, and it runs on my i5-2400 under Linux 3.6.0 together with my monitor process.
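
For illustration, here is a minimal sketch of that kind of MSR-based monitor (not the exact code; it assumes the msr kernel driver is loaded so /dev/cpu/0/msr exists, that libquantum is pinned to core 0, and that the workload PID is passed on the command line just so the loop knows when to stop):

[c]
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>
#include <signal.h>
#include <sys/types.h>

#define IA32_PERFEVTSEL0 0x186
#define IA32_PMC0        0xc1

int main(int argc, char **argv)
{
    if (argc < 2) { fprintf(stderr, "usage: %s <workload-pid>\n", argv[0]); return 1; }
    pid_t workload = (pid_t)atoi(argv[1]);          /* PID of the libquantum process */

    uint64_t evtsel = 0x53412eULL;                  /* LLC Misses: event 0x2E, umask 0x41, plus enable bits */
    uint64_t total = 0, val = 0, zero = 0;

    int fd = open("/dev/cpu/0/msr", O_RDWR);        /* the core that libquantum is pinned to */
    if (fd < 0) { perror("open /dev/cpu/0/msr"); return 1; }

    /* program PMC0 to count LLC misses; depending on the setup,
       bit 0 of IA32_PERF_GLOBAL_CTRL (MSR 0x38F) may also have to be set */
    pwrite(fd, &evtsel, sizeof(evtsel), IA32_PERFEVTSEL0);

    while (kill(workload, 0) == 0) {                /* loop until the workload exits */
        usleep(3000);                               /* roughly 10 million cycles at ~3 GHz */
        pread(fd, &val, sizeof(val), IA32_PMC0);
        total += val;                               /* accumulate this sample ...                  */
        pwrite(fd, &zero, sizeof(zero), IA32_PMC0); /* ... then clear the counter to avoid overflow */
    }

    printf("total LLC misses: %llu\n", (unsigned long long)total);
    close(fd);
    return 0;
}
[/c]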

I have some level of confidence in my code because I also monitored other events when running libquantum at 1.6 GHz, 2.1 GHz, 2.7 GHz and 3.1 GHz respectively:
2.74E+12, 2.16E+12, 1.77E+12, 1.64E+12 (UnHalted Reference Cycles) (885.33 s, 697.17 s, 571.37 s, 527.49 s)
1.42E+12, 1.46E+12, 1.54E+12, 1.64E+12 (UnHalted Core Cycles)
2.86E+12, 2.86E+12, 2.86E+12, 2.87E+12 (Instructions Retired)
These numbers look plausible.

I admit that the monitor process needs another core to run while libquantum is running, but I believe the monitor process does not interfere much with the LLC.

carl

John D. McCalpin:

Event 2Eh, Mask 41h is the "architectural" performance counter event for LLC misses.  The Intel Architecture SW Developer's Manual, Volume 3, Chapter 18, section 18.2.3 describes these predefined architectural events.  For this event, the document says:

Last Level Cache References — Event select 2EH, Umask 4FH. This event counts requests originating from the core that reference a cache line in the last level cache. The event count includes speculation and cache line fills due to the first-level cache hardware prefetcher, but may exclude cache line fills due to other hardware prefetchers. Because of cache hierarchy, cache sizes and other implementation-specific characteristics, value comparison to estimate performance differences is not recommended.

The most important item here is that the count "may" exclude cache line fills due to the L2 hardware prefetchers.

As you slow down the core frequency, you decrease the rate at which it "consumes" data.  This provides the L2 prefetchers more time to get the requested data into the L3 cache, which in turn decreases the L3 cache miss rate.

Sometimes you want to count the total amount of data moved (in which case this counter is not helpful), and sometimes you are more interested in how many memory accesses experience stalls due to missing in the caches (either because the target was not prefetched or because the target was not prefetched early enough to get the data in the cache before the demand request).  This counter event is more appropriate for the latter case.

If you want to know the total amount of data moved to/from the L3 cache for this processor, the best place to look is the memory controller counters.  These are available using VTune Amplifier XE 2013 Update 5, or Intel PCM version 2.4 or later, or you can roll your own analysis tools using the documentation that Intel released on 2013-03-15 (the article is titled "Monitoring Integrated Memory Controller Requests in the 2nd, 3rd and 4th generation Intel® Core™ processors").  If you roll your own tools, you should note that the counters are 32 bits, so they can roll over in about 13 seconds when the system is running at its maximum bandwidth.
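
(As a rough sanity check on that rollover estimate, assuming the counters count 64-byte cache lines and taking roughly 21 GB/s as the peak for dual-channel DDR3-1333: 2^32 counts x 64 bytes is about 275 GB, and 275 GB / 21 GB/s is about 13 seconds.)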

John D. McCalpin, PhD "Dr. Bandwidth"
Sergey Kostrov:

>>...LLC misses: 5E+09, 6.9E+09, 9E+09, 1E+10 at (1.6 GHz, 2.1 GHz, 2.7 GHz, 3.1 GHz).

These numbers need to be normalized to some base frequency.

If the pattern of processing is always the same, the number of cache misses must also be consistent. Don't forget that your tests cannot be considered deterministic on a non-deterministic operating system, because you cannot simply stop all the other system threads in order to get numbers that are as accurate as possible. Even a priority boost for the thread doing your test processing doesn't resolve that problem completely.

>>...I have some level of confidence in my code because I also monitored other events when running libquantum
>>at 1.6 GHz, 2.1 GHz, 2.7 GHz and 3.1 GHz respectively...

Did you compare your numbers with VTune numbers for the same test case?

iliyapolak:

>>>I am wondering why changing frequency can influence cache misses?>>>

I do not have a direct answer, but I suppose that when the frequency increases, more work is done per unit of time, so the running program could generate more cache misses (just my uneducated guess).

I would also check whether your results are repeatable each time you measure the cache miss rate. Wildly varying values can indicate transient effects that lead to different results. As was pointed out, your testing environment is non-deterministic, and there is even a possibility related to context switching: your monitoring code could be scheduled onto the same core where only the libquantum thread had been running, thus polluting the results. Not to mention system-thread activity during the same time window.

kopcarl:

Quote:

John D. McCalpin wrote:

The most important item here is that the count "may" exclude cache line fills due to the L2 hardware prefetchers.

As you slow down the core frequency, you decrease the rate at which it "consumes" data.  This provides the L2 prefetchers more time to get the requested data into the L3 cache, which in turn decreases the L3 cache miss rate.

Thank you, John D. McCalpin.

Are you implying that the operating frequency of the L2/LLC does not change even when the core frequency increases or decreases?

Quote:

John D. McCalpin wrote:

These are available using VTune Amplifier XE 2013 Update 5, or Intel PCM version 2.4 or later, or you can roll your own analysis tools using the documentation that Intel released on 2013-03-15 (the article is titled "Monitoring Integrated Memory Controller Requests in the 2nd, 3rd and 4th generation Intel® Core™ processors").  

This is very helpful! But when I run pcm-memory.x (from PCM 2.5) on the i5-2400, it does not work well. It needs Jaketown.

kopcarl:

Quote:

Sergey Kostrov wrote:

>>...LLC misses: 5E+09, 6.9E+09, 9E+09, 1E+10 at (1.6 GHz, 2.1 GHz, 2.7 GHz, 3.1 GHz).

These numbers need to be normalized to some base frequency.

Thank you for your help, Sergey. But why do these numbers need to be normalized? I don't get it.

Quote:

Sergey Kostrov wrote:

If the pattern of processing is always the same, the number of cache misses must also be consistent. Don't forget that your tests cannot be considered deterministic on a non-deterministic operating system, because you cannot simply stop all the other system threads in order to get numbers that are as accurate as possible. Even a priority boost for the thread doing your test processing doesn't resolve that problem completely.

Actually I ran this test several times, and the results are very close.

Quote:

Sergey Kostrov wrote:

Did you compare your numbers with VTune numbers for the same test case?

Frankly speaking, I did not compare the numbers with VTune. I will give it a try.

Sergey Kostrov:

>>...This is very helpful! But when I run pcm-memory.x (from PCM 2.5) on i5-2400, it does not work well...

Could you post more technical details?

Roman Dementiev (Intel):

Quote:

This is very helpful! But when I run pcm-memory.x (from PCM 2.5) on the i5-2400, it does not work well. It needs Jaketown.

For Intel Core i5-2400 you should run pcm.x (Linux). It has the memory read and write traffic in GBytes in the READ and WRITE columns.

kopcarl:

Quote:

Sergey Kostrov wrote:

>>...This is very helpful! But when I run pcm-memory.x (from PCM 2.5) on i5-2400, it does not work well...

Could you post more technical details?

Sure!

root@***:~/pmc/IntelPerformanceCounterMonitorV2.5# ./pcm-memory.x
Intel(r) Performance Counter Monitor: Memory Bandwidth Monitoring Utility
Copyright (c) 2009-2012 Intel Corporation
 This utility measures memory bandwidth per channel in real-time
Num logical cores: 4
Num sockets: 1
Threads per core: 1
Core PMU (perfmon) version: 3
Number of core PMU generic (programmable) counters: 8
Width of generic (programmable) counters: 48 bits
Number of core PMU fixed counters: 3
Width of fixed counters: 48 bits
Nominal core frequency: 3100000000 Hz
Package thermal spec power: 95 Watt; Package minimum power: 60 Watt; Package maximum power: 120 Watt;
Detected Intel(R) Core(TM) i5-2400 CPU @ 3.10GHz "Intel(r) microarchitecture codename Sandy Bridge"
Jaketown CPU is required for this tool! Program aborted
Cleaning up

and if I comment out these lines (I just wanted to experiment :) ),

[c]

if (cpu_model != m->JAKETOWN)
{
    cout << "Jaketown CPU is required for this tool! Program aborted" << endl;
    m->cleanup();
    return -1;
}

[/c]

it outputs the following:

Detected Intel(R) Core(TM) i5-2400 CPU @ 3.10GHz "Intel(r) microarchitecture codename Sandy Bridge"
Update every 1 seconds
Time elapsed: 998 ms
Called sleep function for 1000 ms
---------------------------------------|
-- Socket 0 --|
---------------------------------------|
---------------------------------------|
---------------------------------------|
-- Memory Performance Monitoring --|
---------------------------------------|
-- Mem Ch 0: Reads (MB/s): 0.00 --|
-- Writes(MB/s): 0.00 --|
-- Mem Ch 1: Reads (MB/s): 0.00 --|
-- Writes(MB/s): 0.00 --|
-- Mem Ch 2: Reads (MB/s): 0.00 --|
-- Writes(MB/s): 0.00 --|
-- Mem Ch 3: Reads (MB/s): 0.00 --|
-- Writes(MB/s): 0.00 --|
-- ND0 Mem Read (MB/s): 0.00 --|
-- ND0 Mem Write (MB/s) : 0.00 --|
-- ND0 P. Write (T/s) : 0 --|
-- ND0 Memory (MB/s): 0.00 --|
---------------------------------------||---------------------------------------
-- System Read Throughput(MB/s): 0.00 --
-- System Write Throughput(MB/s): 0.00 --
-- System Memory Throughput(MB/s): 0.00 --
---------------------------------------||---------------------------------------

Roman Dementiev (Intel):

does pcm.x work for you?

kopcarl:

Quote:

Roman Dementiev (Intel) wrote:

Quote:

This is very helpful! But when I run pcm-memory.x (from PCM 2.5) on the i5-2400, it does not work well. It needs Jaketown.

For Intel Core i5-2400 you should run pcm.x (Linux). It has the memory read and write traffic in GBytes in the READ and WRITE columns.

Thank you! I am stupid. :)

Roman Dementiev (Intel):

Quote:

Thank you! I am stupid. :)

Not at all. Perhaps pcm-memory should be extended with the pcm.x client memory-controller info. Currently pcm-memory supports only server processors.

kopcarl:

Quote:

Roman Dementiev (Intel) wrote:

does pcm.x work for you?

nope.

Core (SKT) | EXEC | IPC | FREQ | AFREQ | L3MISS | L2MISS | L3HIT | L2HIT | L3CLK | L2CLK | READ | WRITE | TEMP

0 0 0.00 0.32 0.00 0.63 9002 22 K 0.60 0.20 0.15 0.05 N/A N/A 67
1 0 0.00 0.36 0.00 0.57 560 1142 0.51 0.00 0.44 0.09 N/A N/A 67
2 0 0.00 0.81 0.00 0.66 1512 7087 0.79 0.45 0.08 0.06 N/A N/A 67
3 0 0.00 0.85 0.00 0.66 283 1596 0.82 0.39 0.04 0.06 N/A N/A 67
-------------------------------------------------------------------------------------------------------------------
SKT 0 0.00 0.46 0.00 0.64 11 K 32 K 0.65 0.28 0.13 0.05 0.00 0.00 67
-------------------------------------------------------------------------------------------------------------------
TOTAL * 0.00 0.46 0.00 0.64 11 K 32 K 0.65 0.28 0.13 0.05 0.00 0.00 N/A

Roman Dementiev (Intel):

can you run "./memoptest 0" in parallel and post pcm.x output? This is a memory test from PCM: build it with "make memoptest".

kopcarl:

Quote:

Roman Dementiev (Intel) wrote:

can you run "./memoptest 0" in parallel and post pcm.x output? This is a memory test from PCM: build it with "make memoptest".

sure!

EXEC : instructions per nominal CPU cycle
IPC : instructions per CPU cycle
FREQ : relation to nominal CPU frequency='unhalted clock ticks'/'invariant timer ticks' (includes Intel Turbo Boost)
AFREQ : relation to nominal CPU frequency while in active state (not in power-saving C state)='unhalted clock ticks'/'invariant timer ticks while in C0-state' (includes Intel Turbo Boost)
L3MISS: L3 cache misses
L2MISS: L2 cache misses (including other core's L2 cache *hits*)
L3HIT : L3 cache hit ratio (0.00-1.00)
L2HIT : L2 cache hit ratio (0.00-1.00)
L3CLK : ratio of CPU cycles lost due to L3 cache misses (0.00-1.00), in some cases could be >1.0 due to a higher memory latency
L2CLK : ratio of CPU cycles lost due to missing L2 cache but still hitting L3 cache (0.00-1.00)
READ : bytes read from memory controller (in GBytes)
WRITE : bytes written to memory controller (in GBytes)
TEMP : Temperature reading in 1 degree Celsius relative to the TjMax temperature (thermal headroom): 0 corresponds to the max temperature

Core (SKT) | EXEC | IPC | FREQ | AFREQ | L3MISS | L2MISS | L3HIT | L2HIT | L3CLK | L2CLK | READ | WRITE | TEMP

0 0 1.53 1.44 1.06 1.06 2289 K 2300 K 0.00 0.28 0.13 0.00 N/A N/A 39
1 0 0.00 0.16 0.01 1.06 121 K 134 K 0.10 0.02 0.99 0.03 N/A N/A 42
2 0 2.07 1.94 1.06 1.06 26 M 28 M 0.08 0.09 1.43 0.02 N/A N/A 35
3 0 0.00 0.46 0.00 1.06 20 K 22 K 0.10 0.07 0.78 0.02 N/A N/A 45
-------------------------------------------------------------------------------------------------------------------
SKT 0 0.90 1.68 0.53 1.06 28 M 30 M 0.07 0.11 0.78 0.01 9.67 2.84 35
-------------------------------------------------------------------------------------------------------------------
TOTAL * 0.90 1.68 0.53 1.06 28 M 30 M 0.07 0.11 0.78 0.01 9.67 2.84 N/A

Instructions retired: 11 G ; Active cycles: 6614 M ; Time (TSC): 3094 Mticks ; C0 (active,non-halted) core residency: 50.20 %

C1 core residency: 0.15 %; C3 core residency: 0.00 %; C6 core residency: 49.65 %; C7 core residency: 0.00 %
C2 package residency: 0.00 %; C3 package residency: 0.00 %; C6 package residency: 0.00 %; C7 package residency: 0.00 %

PHYSICAL CORE IPC : 1.68 => corresponds to 42.12 % utilization for cores in active state
Instructions per nominal CPU cycle: 0.90 => corresponds to 22.51 % core utilization over time interval
----------------------------------------------------------------------------------------------

----------------------------------------------------------------------------------------------
SKT 0 package consumed 39.01 Joules
----------------------------------------------------------------------------------------------
TOTAL: 39.01 Joules

Roman Dementiev (Intel):

Quote:

Core (SKT) | EXEC | IPC | FREQ | AFREQ | L3MISS | L2MISS | L3HIT | L2HIT | L3CLK | L2CLK | READ | WRITE | TEMP

SKT 0 0.90 1.68 0.53 1.06 28 M 30 M 0.07 0.11 0.78 0.01 9.67 2.84 35
-------------------------------------------------------------------------------------------------------------------
TOTAL * 0.90 1.68 0.53 1.06 28 M 30 M 0.07 0.11 0.78 0.01 9.67 2.84 N/A

I highlighted the read and write traffic in the pcm output above.

kopcarl:

Quote:

Roman Dementiev (Intel) wrote:

Quote:

Core (SKT) | EXEC | IPC | FREQ | AFREQ | L3MISS | L2MISS | L3HIT | L2HIT | L3CLK | L2CLK | READ | WRITE | TEMP

SKT 0 0.90 1.68 0.53 1.06 28 M 30 M 0.07 0.11 0.78 0.01 9.67 2.84 35
-------------------------------------------------------------------------------------------------------------------
TOTAL * 0.90 1.68 0.53 1.06 28 M 30 M 0.07 0.11 0.78 0.01 9.67 2.84 N/A

I highlighted the read and write traffic in the pcm output above.

Thanks a lot. I see.

Roman Dementiev (Intel):

You are welcome. It seems your previous pcm measurement was on an idle system.

kopcarl:

Quote:

Roman Dementiev (Intel) wrote:

You are welcome. It seems your previous pcm measurement was on an idle system.

Yes, you are absolutely right! And could you help me with my confusion about the LLC misses varying as the frequency changes?

John D. McCalpin:

The explanation I provided yesterday appears to be correct --- you are seeing fewer "L3 misses" at lower frequencies because the L2 hardware prefetchers have more time to prefetch the data into the L3 cache before the demand miss (or prefetch) reaches the L3 cache.

To test this I took a simple code that repeatedly sums a long vector (250 MB) and measures the "L3 cache misses" using exactly the same event that you used:
        perf stat -e r53412e ./ReadOnly_withStalls 0

The argument to the code is the number of times to execute a delay loop (with no memory references) between memory loads. I use the "rdtsc" instruction to create the delay loop, and add the low-order 32 bits of the TSC to a dummy variable that is printed at the end (to prevent the optimizer from removing the delay code).
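
A minimal sketch of that kind of test kernel (not the actual ReadOnly_withStalls source; the 250 MiB array and the 80 repetitions come from the numbers below, the rest is illustrative, and the timing/bandwidth reporting is omitted):

[c]
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <stddef.h>
#include <x86intrin.h>                                /* __rdtsc() */

#define NELEM (250UL * 1024 * 1024 / sizeof(double))  /* ~250 MiB of doubles */
#define REPS  80                                      /* read the array 80 times */

int main(int argc, char **argv)
{
    long delay = (argc > 1) ? atol(argv[1]) : 0;      /* delay-loop iterations per load */
    double *a = (double *)malloc(NELEM * sizeof(double));
    double sum = 0.0;
    uint64_t dummy = 0;

    for (size_t i = 0; i < NELEM; i++) a[i] = 1.0;    /* touch every element once */

    for (int r = 0; r < REPS; r++) {
        for (size_t i = 0; i < NELEM; i++) {
            sum += a[i];                              /* the memory reference being measured */
            for (long d = 0; d < delay; d++)
                dummy += (uint32_t)__rdtsc();         /* delay loop with no memory references */
        }
    }

    /* printing dummy (and sum) keeps the optimizer from deleting the delay loop */
    printf("sum = %f, dummy = %llu\n", sum, (unsigned long long)dummy);
    free(a);
    return 0;
}
[/c]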

With an argument of 0 (no added delays), the code reported that it was reading from memory at an average bandwidth of 13.47 GB/s while "perf stat" reported an average frequency of 3.1 GHz.
The reported number of L3 cache misses was 89578602, compared to an expected value of 327680000 (250 MiB read 80 times), so 27.3% of the loads were reported as cache misses.

With an argument of 1, the code reported an average memory bandwidth of 1813 MB/s, showing that the delay loop worked.  The average frequency reported by "perf stat" remained 3.1 GHz.
This time the reported number of L3 cache misses was 29396814, or 8.9% of the actual cache lines transferred.

So spreading out the loads reduced the reported L3 cache "misses" by a factor of more than three, with the CPU frequency fixed.

The lesson is that the "L3 cache miss" event increments when a load (or L1 prefetch) arrives at the L3 cache before the data arrives at the L3 cache.  This is a *subset* of the loads that got their data from beyond the L3.  The other part of the subset (which is not measured by this counter) consists of the loads whose data was prefetched (by one of the L2 prefetchers) into the L3 cache before the load (or L1 prefetch) arrived at the L3.

To finally put the nail in the coffin on this issue, I disabled the L2 prefetchers and re-ran the test cases.  In each case the number of reported "L3 misses" was 101.54% of the expected value.  The "extra" 1.54% corresponds almost exactly to the 1/64 increase in traffic required to load the TLB entries (one 64 Byte cache line read for every 4 KiB page).
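
(As a quick check: 64 bytes per page-table cache line divided by 4096 bytes per page is 1.5625%, which lines up almost exactly with the observed extra 1.54%.)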

Case closed.

John D. McCalpin, PhD "Dr. Bandwidth"
iliyapolak:

>>>Actually I ran this test several times, and the results are very close>>>

I am not sure whether you can pin a counter to the specific address space of the executing thread.

kopcarl:

Quote:

John D. McCalpin wrote:

To finally put the nail in the coffin on this issue, I disabled the L2 prefetchers and re-ran the test cases.  In each case the number of reported "L3 misses" was 101.54% of the expected value.  The "extra" 1.54% corresponds almost exactly to the 1/64 increase in traffic required to load the TLB entries (one 64 Byte cache line read for every 4 KiB page).

Thank you for your contributions to my question. Your tests sound very convincing. Do you know how to disable the L2 prefetchers on the i5-2400?

carl

John D. McCalpin:

I don't know of any public references to the configuration bits used to disable prefetchers on the various Intel processors. 

Enable/disable hardware prefetch functionality is available via BIOS options on many systems, so it must be documented for the BIOS writers. 

This may be a case of simple caution -- although disabling and re-enabling hardware prefetchers on a "live" system is typically safe, it is quite possible that there are corner cases in which such changes could cause the system to hang or generate incorrect results.  (That other vendor of x86_64 processors documents the MSRs required to control both the "core" and "memory controller" prefetchers.  The documentation does not address the issue of whether these are safe to modify on a "live" system.)  Enabling/disabling hardware prefetch is not a feature that could easily be considered "necessary" for customers (especially since the BIOS-based alternative exists), so the expense of exhaustive testing would have to be considered a very low priority in Intel's engineering budget.

John D. McCalpin, PhD "Dr. Bandwidth"
kopcarl:

Quote:

John D. McCalpin wrote:

I don't know of any public references to the configuration bits used to disable prefetchers on the various Intel processors. 

Enable/disable hardware prefetch functionality is available via BIOS options on many systems, so it must be documented for the BIOS writers. 

This may be a case of simple caution -- although disabling and re-enabling hardware prefetchers on a "live" system is typically safe, it is quite possible that there are corner cases in which such changes could cause the system to hang or generate incorrect results.  (That other vendor of x86_64 processors documents the MSRs required to control both the "core" and "memory controller" prefetchers.  The documentation does not address the issue of whether these are safe to modify on a "live" system.)  Enabling/disabling hardware prefetch is not a feature that could easily be considered "necessary" for customers (especially since the BIOS-based alternative exists), so the expense of exhaustive testing would have to be considered a very low priority in Intel's engineering budget.

Thanks. One more question: section 2.2.5.4 of the Optimization Reference Manual says that data prefetched to the L2 and last level cache are fetched from memory into the L2 cache and last level cache. Does this imply that the Streamer and Spatial Prefetcher (the MLC prefetchers) fetch data from memory directly?

John D. McCalpin:

Most recent Intel processors have two "L1 prefetchers" and two "L2 prefetchers".  See, for example, the discussion of the Sandy Bridge core in the Intel Software optimization guide.  (The Nehalem/Westmere cores are similar.)   The "L1 prefetchers" bring data into the L1 cache, while (if I recall the wording correctly) the "L2 prefetchers" bring data into either the L3 or L2 cache, depending on how busy the system happens to be.

If an L2 prefetch finds data in the L3 cache, then it won't go all the way to DRAM, but any of the L1 or L2 prefetches will propagate all the way out to memory if necessary to find the desired cache line.

The "Last Level Cache Miss" event discussed in this thread appears to be incremented when demand misses or L1 prefetches miss in the L3 cache, but is not incremented when L2 prefetches miss in the L3 cache.  

John D. McCalpin, PhD "Dr. Bandwidth"
iliyapolak:

I wonder whether the prefetcher implementation maintains some kind of prefetch-distance history table that could be based on application performance (the count of cache misses).
