Hello, it seems I have misunderstood something. I expected PREFETCHNTA to prefetch data into the 2nd level cache without evicting anything from L1D. But in VTune I can clearly see that, in a function that contains only prefetchnta (as a microbenchmark), many L1D.REPLACEMENT events are attributed to every non-temporal prefetch instruction. So that means the prefetched data actually reaches the L1D cache, right?
What is wrong in my understanding, or what did I miss? My intention is to process a block of data where every piece is needed only once, which is why it would be better to avoid bringing it into L1D and to use non-temporal operations.
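To make the intent concrete, here is a minimal sketch of the kind of loop I have in mind. The function name `sum_once`, the 64-byte line size, and the prefetch distance are assumptions for illustration only, not measured or recommended values, and the code makes no claim about which cache level the hint actually fills.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(_M_X64)
#include <xmmintrin.h>  /* _mm_prefetch, _MM_HINT_NTA */
#define PREFETCH_NTA(p) _mm_prefetch((const char *)(p), _MM_HINT_NTA)
#else
#define PREFETCH_NTA(p) ((void)(p))  /* no-op on other targets */
#endif

/* Hypothetical tuning constants (assumptions, not measured values). */
#define CACHE_LINE 64
#define PREFETCH_AHEAD (8 * CACHE_LINE)

/* Sums a buffer that is read exactly once. Roughly once per 64 bytes it
 * issues a non-temporal prefetch hint a fixed distance ahead of the
 * current position. The hint never changes the computed result. */
uint64_t sum_once(const uint8_t *data, size_t n)
{
    assert(data != NULL || n == 0);

    uint64_t total = 0;
    for (size_t i = 0; i < n; ++i) {
        /* One hint per CACHE_LINE bytes; the buffer does not have to be
         * aligned for correctness, and we never hint past the end. */
        if (i % CACHE_LINE == 0 && i + PREFETCH_AHEAD < n) {
            PREFETCH_NTA(data + i + PREFETCH_AHEAD);
        }
        total += data[i];
    }
    return total;
}
```

Calling `sum_once(buf, len)` on a large buffer and profiling that loop is essentially the experiment described above, with the per-byte addition standing in for the real processing.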
Any recommendations for Sandy Bridge and newer Intel platforms?
BTW, is a non-temporal load to an AVX register available on SB (something like MOVNTDQA)?
Thanks in advance.
"The non-temporal instruction is: PREFETCHNTA - Fetch the data into the second-level cache, minimizing cache pollution."
L1D.REPLACEMENT - Replacements in the 1st level data cache.