I am running a single-threaded copy/read program that is pinned to one core. The program is compiled with the -O0 -no-vec -no-opt-prefetch options.


static int a[STREAM_ARRAY_SIZE];
int j, k;

for (j = 0; j < STREAM_ARRAY_SIZE; j++)
    k = a[j];


I use VTune to read the performance counters. When STREAM_ARRAY_SIZE is 1*10^6 or 2*10^6, both L2_DATA_READ/WRITE_MISS_CACHE_FILL counters are 0; with 4*10^6 I see a value of 10000.

On Xeon Phi we have a 32 KB private L1 and 60 x 512 KB of shared L2 cache, i.e. a total of 30 MB of L2 cache. If I read a static array that is bigger than 512 KB, will the extra data be filled into other cores' L2 space? Is this the reason for the L2_DATA_READ/WRITE_MISS_CACHE_FILL events on an L2 miss in the currently pinned core?


The cores on the Xeon Phi do not "spill" data into other L2 caches -- they can only evict into their own private L2 cache.  Victims from the private L2 caches go to memory.

There is a fair chance that the L2 to L2 transfers you are seeing are just "noise". I don't think that VTune can completely isolate the counts for your application from the counts for independent OS processes, but even if it could, you would still have OS operations on behalf of your process that might be counted (such as instantiating pages), as well as other activities such as page table walks that could find data in another process's private cache.

If I am correct about this, the number of L2 to L2 transfers you see will not be a function of the array size, but may be a function of the run time.  I often see a few thousand or a few tens of thousands of unrelated transactions in many of the counters during benchmark runs, so I try to size the problems so that the events I am interested in happen at least a million times.
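As a rough way to size a run along those lines, the sketch below estimates how many cache lines one streaming pass touches and how many passes are needed to reach a target event count. It assumes 64-byte cache lines and an array that starts on a line boundary and is much larger than the 512 KB L2; the helper names `lines_touched` and `reps_needed` are made up for illustration, and hardware prefetching may move some of these events to other counters.

```c
#include <stddef.h>

/* Assumption: 64-byte cache lines. */
#define CACHE_LINE_BYTES 64

/* Number of cache lines covered by n_elems elements of elem_size bytes,
   assuming the array starts on a cache-line boundary. */
size_t lines_touched(size_t n_elems, size_t elem_size)
{
    size_t bytes = n_elems * elem_size;
    return (bytes + CACHE_LINE_BYTES - 1) / CACHE_LINE_BYTES;
}

/* Smallest number of passes over the array for which the total number of
   lines touched reaches at least target_events. */
size_t reps_needed(size_t n_elems, size_t elem_size, size_t target_events)
{
    size_t per_pass = lines_touched(n_elems, elem_size);
    return (target_events + per_pass - 1) / per_pass;
}
```

For the 4*10^6-element array of 4-byte ints discussed here, one pass covers 16,000,000 bytes, or 250,000 lines, so four passes would be needed to reach a million line fetches under these assumptions.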

John D. McCalpin, PhD
"Dr. Bandwidth"

If less than 2% of the L2 misses are reported as hitting in other caches, I'm with John in considering it negligible.

The effective way to use multiple caches to cache more data is to thread the application with thread data locality.

If you're running a thread without affinity, so that it wanders across cores, that could explain a greater rate of cache misses hitting in other caches (where that thread was recently running), but I wouldn't be proud of it. It would be a lot of trouble to attempt to verify the details.

Continuing the previous discussion, I find ambiguity in the cache hierarchy counts when the performance counters are analyzed with two different tools, VTune and Likwid. You can see this in the attached "data.xls" file. This is the same program as mentioned above, with a[i]=1 for the write and k=a[i] for the read inside the loop. All values were collected in a separate run for each event, with the thread/process pinned to core 1 on the MIC, and with the same compiler options as mentioned above.

In Likwid, I measure all activity on core 1 while the process runs. In VTune, I filter the results by the copy process. So in both cases I am ensuring that the values belong only to the part of interest.

INSTRUCTIONS_EXECUTED, DATA_READ_OR_WRITE, DATA_READ and DATA_WRITE look fine and almost equal, but the problem appears in the events after that: VTune shows twice the number of data misses compared to Likwid. Still, each tool seems internally consistent, since DATA_READ_MISS_OR_WRITE_MISS = DATA_READ_MISS + DATA_WRITE_MISS.


Moreover, I am not able to correlate the numbers with the array size, i.e. how the misses are accounted for by the real program. Again, the array size here is 4*10^6.






Attachment: data.xls (application/vnd.ms-excel, 30 KB)

The code fragment above does not initialize the array a[], so the behavior of the loop depends on how the OS handles reads to uninitialized pages.  If I understand correctly, when you read an uninitialized page in Linux, it maps that page to the "zero page", and does not actually instantiate a separate copy of that page in physical memory.

I recommend that you add an explicit loop to initialize the a[] array to some value, then add a loop to repeat the read kernel a variable number of times.  You can either repeat the kernel enough times that the initial overhead is negligible or you can subtract the results from two different iteration counts (with the same array size) to get both the per-iteration counts and the common amount of overhead.  I typically do both -- run 1000 iterations just to make sure I am seeing asymptotic behavior and also run 1 iteration and 11 iterations so that the math on the differences is easier.
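A minimal sketch of that structure might look like the following. It is only an illustration: the function names `init_array` and `read_kernel` are made up here, the default STREAM_ARRAY_SIZE of 4*10^6 is taken from the thread, and the `volatile` on k is an added precaution so the loads survive if the code is ever built above -O0.

```c
#include <stddef.h>

#ifndef STREAM_ARRAY_SIZE
#define STREAM_ARRAY_SIZE 4000000   /* 4*10^6, the size used in the thread */
#endif

static int a[STREAM_ARRAY_SIZE];

/* Write every element so each page of a[] is instantiated
   before any counters are read. */
void init_array(int value)
{
    for (size_t j = 0; j < STREAM_ARRAY_SIZE; j++)
        a[j] = value;
}

/* Repeat the read kernel `reps` times. k is volatile so that every
   load of a[j] is actually performed. Returns the sum of the last
   element read in each pass, purely as a sanity check. */
long read_kernel(int reps)
{
    volatile int k = 0;
    long sum = 0;
    for (int r = 0; r < reps; r++) {
        for (size_t j = 0; j < STREAM_ARRAY_SIZE; j++)
            k = a[j];
        sum += k;
    }
    return sum;
}
```

Calling `init_array(1)` once and then measuring `read_kernel(1)`, `read_kernel(11)` and `read_kernel(1000)` in separate runs would give the three data points described above.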

I have looked at the performance counter counts for the page instantiation process on my Xeon E5-2680 and the numbers don't fit any simple model.  (Given the complexity of the code path involved in page instantiation in the Linux kernel, this is not particularly surprising.)   Most applications are run for enough iterations that the overhead of instantiating the pages is not large, so I recommend focusing on either asymptotic results for large iteration counts or on the per-iteration counts obtained from the differences of two runs.
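The difference method can be written down as a small helper. This is a sketch under the assumption that the counter value is a fixed overhead plus a constant cost per iteration; `split_t` and `split_counts` are hypothetical names, not part of VTune or Likwid.

```c
/* Result of separating a counter reading into its two components. */
typedef struct {
    double per_iter;   /* events per kernel iteration */
    double overhead;   /* events common to both runs (startup, page instantiation, ...) */
} split_t;

/* Given one counter measured at two iteration counts (same array size),
   assume count = overhead + per_iter * iters and solve for both terms.
   iters_hi must differ from iters_lo. */
split_t split_counts(double count_lo, int iters_lo,
                     double count_hi, int iters_hi)
{
    split_t s;
    s.per_iter = (count_hi - count_lo) / (double)(iters_hi - iters_lo);
    s.overhead = count_lo - s.per_iter * (double)iters_lo;
    return s;
}
```

With runs of 1 and 11 iterations the divisor is 10, which is what makes the arithmetic easy to do by hand.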

John D. McCalpin, PhD
"Dr. Bandwidth"
