I used VTune to profile L2 cache misses of a Java application on a Xeon E7 (Westmere-EX A2). The counter I used is L2_RQSTS.LD_MISS.
To find which memory accesses cause the cache misses, I dug into the assembly view provided by VTune.
But VTune shows that many of the cache misses happened at instructions that only operate on registers.
For example, the following is part of the VTune result:
Assembly L2_RQSTS.LD_MISS L2_RQSTS.LOADS L2_RQSTS.MISS L2_RQSTS.REFERENCES
mov r11d, dword ptr [r12+r10*8+0x34] 400,000 400,000
mov edi, dword ptr [r12+r11*8+0xc] 1,600,000 400,000 2,400,000 2,000,000
test edi, edi 17,200,000 14,800,000 26,000,000 33,600,000
jz 0x7f6fb2b98a6d <Block 103>