Performance monitoring issues

I am building a research bare-metal hypervisor for testing shared cache partitions. The platform is an Intel i7 processor, family/model 06_1Eh. I am trying to force a miss in the level 3 cache and observe it in the general-purpose performance monitor on the first core. The other cores have not been initialized. I have set the performance monitors to monitor events 0F.20, 2E.41, and CB.10, plus an uncore monitor event 09.03. Each of these events should report a miss in L3. I read one of the counters twice in a row and the values are different. I have verified that the memory types are WB and that caching is enabled, CD,NW = (0,0). Why am I reading multiple L3 cache misses when performing a back-to-back readback of the performance counter? Another researcher recommended disabling the prefetcher, but I cannot see how to do this on the Nehalem family processors. Is there a way to disable the prefetcher? I can't find it in the BIOS. Overall, I guess my question is: how do I read the L3 performance counters back to back showing no cache misses, and then read > 8MB of memory and show a cache miss?
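
For reference, an event written as 0F.20 means event select 0x0F with unit mask 0x20. The sketch below shows how such a pair could be packed into an IA32_PERFEVTSELx value using the architectural bit layout (event in bits 0-7, unit mask in bits 8-15, USR bit 16, OS bit 17, EN bit 22). The helper name `evtsel_encode` and the `EVTSEL_*` macros are hypothetical and are not taken from the code in this post.

```c
#include <assert.h>
#include <stdint.h>

/* Bit positions in IA32_PERFEVTSELx (architectural performance monitoring). */
#define EVTSEL_UMASK_SHIFT 8
#define EVTSEL_USR (1ULL << 16) /* count while CPL > 0 */
#define EVTSEL_OS  (1ULL << 17) /* count while CPL == 0 */
#define EVTSEL_EN  (1ULL << 22) /* enable the counter */

/* Hypothetical helper: pack an "EE.UU" event/umask pair plus flag bits. */
static uint64_t evtsel_encode(uint8_t event, uint8_t umask, uint64_t flags)
{
    /* Flags must not overlap the event select and unit mask fields. */
    assert((flags & 0xFFFFULL) == 0);
    return (uint64_t)event | ((uint64_t)umask << EVTSEL_UMASK_SHIFT) | flags;
}
```

With all three flags set, `evtsel_encode(0x0F, 0x20, EVTSEL_USR | EVTSEL_OS | EVTSEL_EN)` gives 0x43200F, which is the kind of value that would be written to IA32_PERFEVTSEL0.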

OK, as an update: I have downloaded the PCM code and compared it to how I am implementing my performance monitoring. There were some slight differences, which I modified to match PCM. I also modified my code so that my rdmsr macros perform only a rdmsr assembly instruction. This has allowed me to perform two back-to-back reads of the performance counter 0 MSR and detect no L3 cache miss. Again, I am using the 0F.20 performance event. However, every instruction that I execute between the performance checks causes an L3 cache miss. The performance event should only count the number of L3 misses on retired loads. Why, then, does every instruction cause an L3 cache miss? I implemented a large loop and I still get cache misses. The only related instructions that do not cause cache misses are rdmsr, wrmsr, and rdpmc.
In the VMM I have verified that the cache is enabled, paging is enabled, the PAT indicates WB memory, and the MTRRs indicate WB memory for all user space. My performance counting code is as follows:

(In the VMM before VMLAUNCH)
#define wrmsrl(MSR, val) \
do { \
unsigned long eax, edx; \
eax = (u32)(0x00000000FFFFFFFF & (val)); \
edx = (u32)((u64)(val) >> 32); \
__asm__ __volatile__ ("wrmsr" : : "c" (MSR), "a" (eax), "d" (edx)); \
} while(0)
#define rdmsr(MSR, eax, edx) \
__asm__ __volatile__ ("rdmsr" : "=a" (eax), "=d" (edx) : "c" (MSR))

#define IA32_PERF_GLOBAL_CTRL 0x38F
#define PMC0_EN 1UL
#define IA32_PERFEVTSEL0 0x186

#define PMC_UMASK 8
#define PMC_EN (1UL<<22)
#define PMC_USR (1UL <<16)
#define PMC_OS (1UL <<17)
#define IA32_PMC0 0xC1

unsigned long MSR_val1, MSR_val2;
//disable counters while programming
wrmsrl(IA32_PERF_GLOBAL_CTRL, (u64)0);
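//set up the performance event selector for performance counter 0 to count the number of retired loads that miss L3
wrmsrl(IA32_PMC0, (u64)0);
//event 0F.20 (event select 0x0F, umask 0x20); counting in both user and OS mode is an assumption
MSR_val1 = 0x0FUL | (0x20UL << PMC_UMASK) | PMC_USR | PMC_OS | PMC_EN;
wrmsrl(IA32_PERFEVTSEL0, MSR_val1);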
//enable the performance counter 0
MSR_val1 = PMC0_EN;
wrmsrl(IA32_PERF_GLOBAL_CTRL, MSR_val1);

(In the VM after VMLAUNCH)

#define READ_SIZE 0x400000
unsigned long cur_eax, cur_edx, pre_eax, pre_edx;
unsigned long MSR_val1;
int n = 0;
unsigned long temp = 0;
int i = 0;
unsigned long *pt;
//memory region to read
pt = (unsigned long *) 0x26100000;

//The two rdmsr reads will indicate that no instruction cache miss has occurred.
rdmsr(IA32_PMC0, pre_eax, pre_edx);
rdmsr(IA32_PMC0, cur_eax, cur_edx);

//The two rdmsr reads will indicate that one instruction cache miss has occurred. Should this be the case? Will each instruction between the reads cause an L3 cache miss?
rdmsr(IA32_PMC0, pre_eax, pre_edx);
rdmsr(IA32_PMC0, cur_eax, cur_edx);

//The two rdmsr reads will indicate that 25165825 instruction cache misses have occurred. But I am only reading half the amount of memory available in the L3 cache. I should not be seeing so many cache misses. It seems like no caching is done.
for (n = 0; n < 10; n++)
    rdmsr(IA32_PMC0, pre_eax, pre_edx);
for (i = 0; i < READ_SIZE; i++)
    temp += pt[i];
rdmsr(IA32_PMC0, cur_eax, cur_edx);
printf("Cache misses: %llu\n", (unsigned long long)((cur_eax | (u64)cur_edx<<32) - (pre_eax | (u64)pre_edx<<32)));
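
The subtraction in the printf above can be factored into two small helpers. This is only a sketch: `pmc_combine` and `pmc_delta` are hypothetical names, and the 48-bit counter width is an assumption (the actual width is reported by CPUID leaf 0AH and should be checked on the target part).

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: general-purpose counters are 48 bits wide on this processor. */
#define PMC_WIDTH_BITS 48
#define PMC_MASK ((1ULL << PMC_WIDTH_BITS) - 1)

/* Join the EDX:EAX halves returned by rdmsr/rdpmc into one 64-bit value. */
static uint64_t pmc_combine(uint32_t eax, uint32_t edx)
{
    return ((uint64_t)edx << 32) | eax;
}

/* Difference between two counter samples, tolerating one wraparound. */
static uint64_t pmc_delta(uint64_t before, uint64_t after)
{
    assert(before <= PMC_MASK && after <= PMC_MASK);
    return (after - before) & PMC_MASK;
}
```

A call such as `pmc_delta(pmc_combine(pre_eax, pre_edx), pmc_combine(cur_eax, cur_edx))` yields the event count between the two reads and should be printed with a 64-bit format specifier.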

When I set the CD bit in the CR0 register in the VMM I get the same results for the loop but larger cache miss values for back to back reads.
Please, any suggestions would be helpful.

Hello Heinerj,
Is this still an issue?
If I'm understanding you correctly, you are saying that the above loop gives you:
L3_misses = number_of_loops * READ_SIZE * 8 bytes = 10 * 0x400000 * 8 = 335,544,320 misses
10 occurrences of 33,554,432 misses
Is this correct?
I would expect each of the READ_SIZE loops to fetch about 33 MB of memory and generate 33 MB / 64 bytes per cache line, or about 524,288 misses.
And if you reduce READ_SIZE to something that fits into, say, half of your L3, then I'd expect your count to go to zero.
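
The arithmetic behind that estimate can be sketched as follows, assuming an 8-byte unsigned long, 64-byte cache lines, and an 8 MB L3 (the size implied by the original question). The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 64u /* assumption: 64-byte cache lines */

/* Number of distinct cache lines touched by a sequential read of the array. */
static uint64_t expected_line_fetches(uint64_t elements, size_t element_size)
{
    assert(element_size > 0);
    uint64_t bytes = elements * element_size;
    return (bytes + CACHE_LINE_BYTES - 1) / CACHE_LINE_BYTES;
}

/* Nonzero when the whole array fits into a cache of cache_bytes bytes. */
static int fits_in_cache(uint64_t elements, size_t element_size,
                         uint64_t cache_bytes)
{
    assert(element_size > 0);
    return elements * element_size <= cache_bytes;
}
```

With READ_SIZE = 0x400000 and 8-byte elements, the array spans 33,554,432 bytes, or 524,288 lines, and does not fit in 8 MB. A READ_SIZE of 0x80000 would span 4 MB, half of the assumed L3.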

But doing things inside a VM throws a whole new wrinkle into the mix.
You could (sort of) easily check your loop on standard Linux. You might have to use the /dev/cpu/*/msr rdmsr/wrmsr interface. But this would tell you whether the issue is something with the test program or something with how VMs do counters (or memory accesses).
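
A minimal sketch of reading a counter through that interface is shown below. It assumes the Linux msr driver is loaded and the program runs with sufficient privileges. The device file is read 8 bytes at a time, with the MSR number used as the file offset. The helper name `read_msr` is hypothetical.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

/*
 * Read one 64-bit MSR through an msr-style device file such as
 * /dev/cpu/0/msr. Returns 0 on success and -1 on any failure.
 */
static int read_msr(const char *dev_path, uint32_t reg, uint64_t *value)
{
    assert(dev_path != NULL && value != NULL);

    int fd = open(dev_path, O_RDONLY);
    if (fd < 0)
        return -1;

    /* The register number selects the MSR; exactly 8 bytes come back. */
    ssize_t n = pread(fd, value, sizeof *value, (off_t)reg);
    close(fd);

    return n == (ssize_t)sizeof *value ? 0 : -1;
}
```

For example, `read_msr("/dev/cpu/0/msr", 0xC1, &count)` would sample IA32_PMC0 on core 0. Writing the event select and global control registers would need a matching pwrite-based helper on a descriptor opened for writing.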
