Questions about L2 write policy


Hi all,

I notice that Xeon Phi has a large coherent L2 cache, and I'd like to understand it in more detail. My question is about where the replacement policy is applied.

Suppose thread0 on core0 wants to read some data, and the data is neither in the local L2 nor in any other core's L2. It will then access main memory and bring the data into the L2 cache. My question is: if the local L2 is full, will Xeon Phi apply the cache replacement policy to the local L2, or possibly to another core's L2? (For example, if another core's L2 is not full, can it use that L2's cache lines directly?)

I cannot find this detail in the documentation. I hope you can help. Thanks very much.


Each core has its own L1 and L2 caches, and makes accesses from them. So in the case you discuss, some line will be evicted from the local cache to make room for the newly required data.

Similarly, if another core later requires the same line, it will also make space for it in its local cache and then bring a copy there (from one of the other L2s if it's already present on chip). So the same, unmodified, data can be present in each of the L2 caches on the chip.

The way to think of the machine is that each core has its own cache, and that all those caches are kept coherent, not that there is one large, shared L2 cache. (People familiar with Xeon, which does have a large, shared L3 cache, sometimes say that the Intel(r) Xeon Phi(tm) coprocessor "doesn't have a last-level cache", which is clearly wrong by definition :-), but they are right that there is no shared last-level cache.)
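To make the "evict locally, never from a neighbour" point concrete, here is a toy model in C. It is only a sketch under loud assumptions: each core gets a tiny two-line, fully associative cache with true LRU replacement, whereas the real hardware uses much larger set-associative caches with pseudo-LRU. All names (`cache_access`, `cache_holds`, `caches_init`) are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <string.h>

#define NUM_CORES 2
#define WAYS 2            /* toy capacity: 2 lines per core (assumption) */
#define EMPTY (-1L)       /* marker for an unused slot */

/* One private cache per core; slot 0 is the most recently used line. */
typedef struct { long line[WAYS]; } cache_t;

static cache_t l2[NUM_CORES];

/* Mark every slot of every private cache as unused. */
static void caches_init(void) {
    for (int c = 0; c < NUM_CORES; c++)
        for (int w = 0; w < WAYS; w++)
            l2[c].line[w] = EMPTY;
}

/* Return 1 if `core` currently holds line `addr`, else 0. */
static int cache_holds(int core, long addr) {
    assert(core >= 0 && core < NUM_CORES);
    for (int w = 0; w < WAYS; w++)
        if (l2[core].line[w] == addr) return 1;
    return 0;
}

/* Access line `addr` from `core`. Returns the line evicted to make room,
   or EMPTY if nothing was evicted. Only the requesting core's cache is
   modified; other cores' caches are never used as overflow space. */
static long cache_access(int core, long addr) {
    assert(core >= 0 && core < NUM_CORES);
    cache_t *c = &l2[core];
    int pos = WAYS - 1;           /* on a miss, replace the LRU slot */
    long victim = EMPTY;

    for (int w = 0; w < WAYS; w++)
        if (c->line[w] == addr) { pos = w; break; }   /* hit */

    if (c->line[pos] != addr)
        victim = c->line[pos];    /* miss: the local LRU line is evicted */

    /* Shift the more recently used lines down one slot, then place
       `addr` in slot 0 as the new most recently used line. */
    memmove(&c->line[1], &c->line[0], (size_t)pos * sizeof(long));
    c->line[0] = addr;
    return victim;
}
```

In this model, after `caches_init()`, accessing lines 100, 200 and then 300 from core 0 makes the third call return 100: the victim comes from core 0's own cache even though core 1 is completely empty. A later `cache_access(1, 300)` gives core 1 its own copy, so both caches hold line 300 at once.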

Thanks James! Another question: can a user change the L2 cache policy? I see that a table in the System Software Developer's Guide lists pseudo-LRU as the replacement policy for both L1 and L2. Is it possible to use other policies? Thanks a lot!

Quote:

James Cownie (Intel) wrote:

Each core has its own L1 and L2 caches, and makes accesses from them. [...]

I know of no way to change the hardware cache policies.

However, you can use the compiler's nontemporal hint to ask it to try to avoid placing some variables in the cache or, if you're writing assembler code, use the streaming store operations yourself. The linked article discusses when the compiler chooses to use those itself, and shows you the intrinsics if you want to go to that level.

The nontemporal hint is much more frequently useful on MIC than on the host, so you may want something like:

#if defined __MIC__
#pragma vector [aligned] nontemporal
#endif

This instructs the compiler to apply the hint for MIC only. The compiler should be smart enough to apply nontemporal only to eligible arrays in the designated loop, but it's the programmer's responsibility to find out whether immediate cache eviction is wanted. We've seen cases where this speeds up code on MIC (KNC) by 30%, but slows the host down by as much as 10x if applied there.
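As a minimal sketch of where that guarded pragma sits, here it is in front of a complete loop. The function name `scale_stream` and its parameters are hypothetical; the optional `aligned` clause is left out because this sketch makes no alignment promise about the arrays. `#pragma vector nontemporal` is specific to the Intel compiler, and since the guard is on `__MIC__`, other compilers and host builds never see it.

```c
#include <assert.h>
#include <stddef.h>

/* Scale src into dst. dst is written once and not read back soon, which
   makes it a candidate for streaming (nontemporal) stores. */
void scale_stream(float *restrict dst, const float *restrict src,
                  float a, size_t n) {
    assert(dst != NULL && src != NULL);
#if defined(__MIC__)
#pragma vector nontemporal
#endif
    for (size_t i = 0; i < n; i++)
        dst[i] = a * src[i];
}
```

Whether this helps depends on the access pattern: if `dst` is read again shortly after the loop, bypassing the cache is likely to hurt rather than help, so it is worth measuring both variants.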

I suspect pragma nontemporal doesn't work with CEAN array assignments.

One more point is that you can also use the _mm_clevict intrinsic to force data out of the cache.
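A rough sketch of how that might look for a whole buffer follows. It assumes the intrinsic takes a pointer plus a hint, with `_MM_HINT_T1` selecting the L2 cache, and that the buffer starts on a 64-byte line boundary; please check both against the intrinsics reference for your compiler version. The helper name `evict_buffer` is made up for this example, and outside a MIC build the loop only counts lines.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__MIC__)
#include <immintrin.h>
#endif

#define CACHE_LINE 64   /* cache line size in bytes on KNC */

/* Ask the hardware to evict every cache line of `buf` from L2.
   Returns the number of lines visited. Assumes `buf` is line-aligned. */
size_t evict_buffer(const void *buf, size_t bytes) {
    assert(buf != NULL || bytes == 0);
    const char *p = (const char *)buf;
    size_t lines = 0;
    for (size_t off = 0; off < bytes; off += CACHE_LINE) {
#if defined(__MIC__)
        _mm_clevict(p + off, _MM_HINT_T1);   /* evict this line from L2 */
#else
        (void)p;                             /* no-op on other targets */
#endif
        lines++;
    }
    return lines;
}
```

For a 256-byte aligned buffer this visits 4 lines. Like the nontemporal hint, explicit eviction only pays off when the data really will not be reused soon.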
