We develop ultra-low-latency proprietary software. We are struggling with the transition from Westmere to Sandy Bridge (SNB): our programs run 10-30% slower on an SNB E5-2690 than on a Westmere X5690. To investigate, we integrated PCM v2.2a into our code (downloaded from [1]; v2.3 now appears to be available). We then ran two parallel instances of our program suite, one on Westmere and one on SNB.
We saw an average L3 hit ratio of about 73% on Westmere, but only about 17% on Sandy Bridge. If anything, we expected the L3 hit ratio to be higher on SNB, because its L3 cache is significantly larger (20 MB versus 12 MB). We suspect this is why we are seeing such a large performance discrepancy between Westmere and SNB.
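To put the two hit ratios in perspective, here is a quick back-of-the-envelope calculation using only the averages quoted above. It assumes the miss ratio is simply one minus the hit ratio, and it compares misses per L3 lookup only; it says nothing about how many lookups each machine performs. The function name is just for illustration.

```python
def miss_ratio(hit_ratio):
    """Fraction of L3 lookups that miss, assuming miss = 1 - hit."""
    return 1.0 - hit_ratio

# Average L3 hit ratios reported by PCM in our runs.
westmere_hit = 0.73
snb_hit = 0.17

# How many times more often an L3 lookup misses on SNB than on Westmere.
relative = miss_ratio(snb_hit) / miss_ratio(westmere_hit)

print(f"Westmere L3 miss ratio: {miss_ratio(westmere_hit):.2f}")
print(f"SNB L3 miss ratio:      {miss_ratio(snb_hit):.2f}")
print(f"SNB misses per lookup are about {relative:.1f}x Westmere's")
```

So per L3 lookup, SNB goes out to memory roughly three times as often as Westmere in our measurements.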
So it appears that either (1) something else is running on the CPUs and evicting our data from the cache, or (2) we have missed disabling some power-saving feature.
We are running CentOS (RedHat) v5.7. Our kernel command line looks like this: "intel_idle.max_cstate=0 selinux=0 idle=poll nox2apic intremap=off processor.max_cstate=0 nohalt isolcpus=2-15". (We isolate CPUs from the scheduler so that we can programmatically pin the most latency-sensitive threads to individual cores.) We have disabled just about all system services, and the set of running services is the same on the Westmere and SNB systems. At the BIOS level, we have disabled every power-saving feature we can, including C-states and C1E. (The SNB machine is a Dell R620, and we followed Dell's low-latency tuning guide [2].)
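For anyone unfamiliar with the isolcpus-plus-pinning setup, here is a minimal sketch of the idea in Python. It is not our production code: the helper names `parse_isolcpus` and `pin_current_thread` are made up for illustration, and the parser only understands plain CPU numbers and ranges (it skips anything else in the list). `os.sched_setaffinity` is Linux-only.

```python
import os

def parse_isolcpus(cmdline):
    """Return the set of CPU ids named by an isolcpus= option (empty if absent)."""
    cpus = set()
    for token in cmdline.split():
        if not token.startswith("isolcpus="):
            continue
        for part in token[len("isolcpus="):].split(","):
            if "-" in part:
                lo, hi = part.split("-", 1)
                if lo.isdigit() and hi.isdigit():
                    cpus.update(range(int(lo), int(hi) + 1))
            elif part.isdigit():
                cpus.add(int(part))
            # Anything else (for example flag words) is skipped in this sketch.
    return cpus

def pin_current_thread(cpu):
    """Restrict the calling thread to a single CPU (Linux only)."""
    os.sched_setaffinity(0, {cpu})

# The command line from this post.
CMDLINE = ("intel_idle.max_cstate=0 selinux=0 idle=poll nox2apic intremap=off "
           "processor.max_cstate=0 nohalt isolcpus=2-15")

isolated = parse_isolcpus(CMDLINE)
print(sorted(isolated))
```

On a live system the command line could be read from /proc/cmdline instead, and a latency-sensitive thread would call something like `pin_current_thread(2)` at startup. The pin call is left out of the snippet above because it fails on machines that do not have that CPU.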
Does anyone have any thoughts on what might be causing such a low L3 hit ratio?



