Very low L3 cache hits on Sandy Bridge according to PCM

matt_garman

We develop ultra low latency proprietary software.  We are struggling with the transition from Westmere to Sandy Bridge, as our programs seem to run 10-30% slower on an SNB E5-2690 versus a Westmere X5690.  To investigate this, we integrated PCM v2.2a into our code (downloaded from [1], although as of now it looks like v2.3 is available).  We ran two parallel instances of our program suite, one on Westmere and one on SNB.

What we saw was an average L3 hit ratio of about 73% for Westmere, but only about 17% for Sandy Bridge.  If anything, we expected that the L3 hit ratio should be much higher on SNB, because the cache is significantly larger (20MB versus 12MB).  We suspect that this is why we are seeing such a big performance discrepancy between Westmere and SNB.

So it appears that either (1) something is churning through the CPU, killing our cache, or (2) we have missed disabling some power-saving feature.

We are running CentOS (Red Hat) 5.7.  Our kernel command line looks like this: "intel_idle.max_cstate=0 selinux=0 idle=poll nox2apic intremap=off processor.max_cstate=0 nohalt isolcpus=2-15".  (We isolate CPUs from the scheduler so that we can programmatically pin the most latency-sensitive threads to individual cores.)  We have disabled just about all system services (at minimum, the set of running services is the same on both the Westmere and SNB systems).  At the BIOS level, we have disabled all the power-saving features we possibly can (including C-states and C1E).  (The SNB machine is a Dell R620, and we followed Dell's low-latency tuning guide [2].)

Anyone have any thoughts on what might be causing such low L3 cache utilization?

[1] http://software.intel.com/en-us/articles/intel-performance-counter-monitor-a-better-way-to-measure-cpu-utilization

[2] http://www.dell.com/us/enterprise/p/d/shared-content~data-sheets~en/Documents~configuring-low-latency-environments-on-dell-poweredge-12g-servers.pdf.aspx

Roman Dementiev (Intel)

Hi,

What about the absolute number of L3 cache misses on Westmere versus Sandy Bridge? Is your application single-threaded or multithreaded? It would help if you could post the output of pcm.x from both systems here.

Roman

matt_garman

Our program is multithreaded. However, only one thread is performance critical. So what we do is use the "isolcpus" kernel parameter to isolate cores from the kernel process scheduler. We launch the application as normal, but programmatically pin the performance-critical thread to one of the isolated cores (pthread_setaffinity_np()).

Because we are only interested in the performance of that one thread (and not the program as a whole), we are not using pcm.x. Instead, we modified our code to collect PCM stats on just that one thread. We didn't collect absolute cache misses, only hit ratios and the ratios of CPU cycles lost.

The way we are looking at it, we assumed that a low hit ratio would imply a high number of absolute misses. The programs and OS config (save the kernel command line) are identical. Is it incorrect to assume the caching behavior should be similar on the two machines? If so, why?

I'll add that, along with the lousy 17% L3 hit ratio, we also see nearly 40% of CPU cycles lost due to L3 cache misses.

Versus Westmere, where we had 73% L3 hit ratio, and only 8% of CPU cycles lost due to L3 misses.

We can further modify our code to include the absolute number of cache misses. But in the meantime, I'm interested in knowing what additional information that will provide. (Not trying to be argumentative, these are honest questions!)

Thanks!

matt_garman

Hi, we modified our program to collect additional PCM stats. We don't use pcm.x because we are only concerned with the performance of one thread (which is running on an isolated core).

Here is one run of SNB versus WSM:

Sandy Bridge:
EXEC: 0.7332
IPC: 0.6451
FREQ: 1.1366
AFREQ: 1.1379
L3MISS: 89948
L2MISS: 164103
L3HIT: 0.4800
L2HIT: 0.6134
L3CLK: 0.1474
L2CLK: 0.0327

Westmere:
EXEC: 0.5902
IPC: 0.5594
FREQ: 1.0550
AFREQ: 1.0555
L3MISS: 54319
L2MISS: 253901
L3HIT: 0.7625
L2HIT: 0.5025
L3CLK: 0.0755
L2CLK: 0.0591

Note that this is for 1020 iterations of this thread: the exact same binary, same operating system, same CPU-isolation scheme, same input data. The only difference is the CPU architecture.

Sandy Bridge had 48% L3 hit ratio, 15% of CPU cycles lost due to L3 misses, and 90k L3 misses.
Westmere had 75% L3 hit ratio, 8% of CPU cycles lost due to L3 misses, and only 54k L3 misses.

Thanks!
Matt

matt_garman

By the way, one idea was that perhaps SNB's fancy new pre-fetching was simply not working for our particular code. The BIOS for my Sandy Bridge machine (Dell R620) has four options for pre-fetching:

Adjacent Cache Line Prefetch
Hardware Prefetcher
DCU Streamer Prefetcher
DCU IP Prefetcher

I repeated my test for every possible combination of these settings enabled/disabled (16 combinations in total). Disabling any one of these options actually made the L3 hit ratios even worse.

Roman Dementiev (Intel)

Matt,

thank you for sharing the metrics. Judging by the absolute numbers of cache misses, the measured intervals seem very short (10-100 ms?), so the figures can include some noise and overhead. Nevertheless, let's do some analysis:

We can compute the instruction throughput for both systems by multiplying the nominal frequency by EXEC:

Westmere: 3.46 x 0.59 = 2.04 G instructions/second
SNB: 2.9 x 0.73 = 2.12 G instructions/second.

This means SNB executes more instructions per second than Westmere, yet your program runs slower on SNB! This makes me think that on SNB you execute a (slightly) different code path than on Westmere. You really need a profiler to see which code is executed, and which is hot, on both Westmere and SNB, and compare.

Thanks,
Roman

matt_garman

Quote:

Roman Dementiev (Intel) wrote:

This means SNB executes more instructions per second than Westmere, yet your program runs slower on SNB! This makes me think that on SNB you execute a (slightly) different code path than on Westmere. You really need a profiler to see which code is executed, and which is hot, on both Westmere and SNB, and compare.

This is a relatively short code path that we are studying: maybe a dozen or so screens worth of code. I really don't think an alternate code path is possible, as the program wouldn't be correct.

We have done profiling as you suggest. We put about a dozen rdtscp calls in this code path, stored them in a table, and output them at the end of execution. We were hoping to isolate one section of code that was particularly slow on SNB versus WSM. But, it was just a general overall slowness. All of our "stopwatch" points were just a little bit higher on SNB. In other words, the added execution time was for the most part evenly distributed.

Any other ideas?

Thanks,
Matt

Roman Dementiev (Intel)

Quote:

matt_garman wrote:

This is a relatively short code path that we are studying: maybe a dozen or so screens worth of code. I really don't think an alternate code path is possible, as the program wouldn't be correct.

does your code execute any spin locks or other thread synchronization primitives that may have different number of executed iterations in spin loops?

Best regards,
Roman

matt_garman

Quote:

Roman Dementiev (Intel) wrote:

This is a relatively short code path that we are studying: maybe a dozen or so screens worth of code. I really don't think an alternate code path is possible, as the program wouldn't be correct.

does your code execute any spin locks or other thread synchronization primitives that may have different number of executed iterations in spin loops?

Yes, that is a good point that I didn't originally consider. We use Linux pthreads, which IIRC do use some spin locks "under the covers".

Also: this is an event-driven thread; it spins on a SysV message queue waiting for the next event. Note that I also tested without the busy event-waiting: that has virtually no impact on the L3 hit ratio on Westmere, but drops the SNB L3 hit ratio even further.

Thank you, I appreciate your feedback.
-Matt

Dmitri

We seem to be experiencing similar issues. I'm wondering if there's any progress on this subject.

Roman Dementiev (Intel)

Hi Matt,

the just-released Intel PCM 2.35 fixes occasionally wrong cache statistics by applying a special workaround for the Intel Xeon E5 (based on the Intel microarchitecture codenamed Sandy Bridge-EP and Sandy Bridge-E). It makes sense to remeasure the cache statistics with this version.

Best regards,

Roman

Pavel Kogan

I think we have the same problem :(

Our old dual Xeon E5645 2.4 GHz (Westmere) performs the same as a new dual Xeon E5-2620 2.0 GHz (Sandy Bridge) when running in a single main thread, but outperforms the E5-2620 by almost a factor of 2 when running multiple threads. The executable is the same.

Regards, Pavel

James D.

I observed a similar degradation (cache misses, running time) when running frequent lock-prefixed primitives with multiple threads. Any progress in this thread, or any suggestions?

Regards,
James

Roman Dementiev (Intel)

James,

too-frequent locked/atomic operations might be the reason for the non-scaling. There is a new study discussing such trade-offs on the new processor architectures: "Lock Scaling Analysis on Intel® Xeon® Processors".

Roman

matt_garman

I don't know if it's related to locking or not, but for us, the "magic bullet" was a Linux kernel version change. In particular, see this changelog:

    http://rpm.pbone.net/index.php3/stat/22/idpl/16999719/com/changelog.html

I believe the change that helped us is this one:

    [x86_64] Revert ACPI APIC mode test (Prarit Bhargava) [728163 721361]

I actually contacted Prarit and asked him about this. He said that they were trying to use system ACPI tables to program interrupts (APIC). However, many systems have their ACPI tables wrong for whatever reason, which resulted in incorrect APIC settings. Having the interrupts programmed incorrectly can result in very poor system performance.

Note that this patch improved the performance of both Westmere and Sandy Bridge, but the improvement was much more dramatic on SNB.

-Matt
