Preventing double-precision numbers from being written to memory

In a scientific application, I need to avoid the cost of writing data to memory. I want to prevent an array of double-precision numbers from being written to memory; the array should reside in the L2 cache as long as possible. The size of the array is about 64 kilobytes, and it may be read or written by other threads. At the end of execution the array can be written to memory. Is this achievable? Are there any pragmas or functions to enforce this constraint?


If the compiler generates streaming stores for the array (check the -opt-report output), that prevents the data from remaining in cache.  You can override this by compiling with "-opt-streaming-stores never", and then place #pragma vector nontemporal (and, where appropriate, #pragma vector aligned) only on the loops where you do want streaming stores.  The docs are difficult enough to understand that even the compiler writers may not have interpreted them the same way for all situations.

If you set options such as those suggested, which avoid generation of CLEVICT instructions, the data will remain in cache as long as possible.  You could set #pragma vector nontemporal for arrays which you don't want remaining in cache, so that they don't cause capacity evictions of the data you want to retain; a sketch follows below.
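A minimal sketch of how the two pieces might fit together, assuming the Intel compiler's "-opt-streaming-stores never" option (spelled -qopt-streaming-stores=never in newer releases) and its #pragma vector nontemporal directive; the array names and sizes are illustrative, not from the original post:

#define N_RESIDENT 8192         /* 8192 doubles = 64 KB, intended to stay in L2 */
#define N_SCRATCH  (1 << 20)    /* large array we do NOT want kept in cache     */

static double resident[N_RESIDENT] __attribute__((aligned(64)));
static double scratch[N_SCRATCH]   __attribute__((aligned(64)));

void update(const double *in)
{
    /* With -opt-streaming-stores never, these stores stay cacheable,
       so "resident" is not forced out to memory by streaming stores. */
    for (int i = 0; i < N_RESIDENT; i++)
        resident[i] += in[i];

    /* Explicitly request streaming (non-temporal) stores for the large
       array so that writing it does not evict "resident" from L2.     */
    #pragma vector nontemporal
    for (int i = 0; i < N_SCRATCH; i++)
        scratch[i] = in[i % N_RESIDENT] * 2.0;
}

Whether keeping the small array resident actually helps depends on how the other arrays are accessed, which is why the measurement step below matters.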

You would likely need VTune analysis to check the effect of such changes.  I don't think you can guess a priori whether it is better to encourage threads to read data updated by other threads from cache or from memory, but VTune general-exploration analysis should show whether the associated events change significantly.  You would need a consistently effective affinity strategy, such as

KMP_PLACE_THREADS=60c,4t

OMP_PROC_BIND=close

in order to be able to compare VTune results between runs.

MIC memory is significantly faster than host memory, and the caches are smaller relative to the number of threads, so the strategies you needed on other platforms may not be applicable.

 

There is no way to enforce the behavior you want on any cached system, though there are sometimes approaches you can use to help make the desired behavior more likely.

One approach that might work (though the implementation on Xeon Phi is tricky) would be to perform "ordinary" loads to get the array that you want to retain into the cache(s), then use the "EH hint" on the loads of other data to indicate that this other data is "non-temporal" and should be given priority for eviction.

You can read about the Xeon Phi implementation of the "EH hint" in section 3.7 of the Xeon Phi Coprocessor Instruction Set Architecture Reference Manual (document 327364-001, September 7, 2012).   I can't keep it straight in my head, but it may be useful.
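As a rough illustration (a sketch only, not tested on hardware), something along these lines might be written with the KNC intrinsics exposed by the Intel compiler when building with -mmic; I am assuming here that the _MM_HINT_NT argument to _mm512_extload_pd requests the EH/non-temporal behavior, and the function and array names are made up for the example:

#include <immintrin.h>

/* "resident" is the 64 KB array to keep in L2 (8192 doubles, 64-byte aligned);
   "streamed" is other data touched once that we would rather see evicted first. */
double dot_keep_resident(const double *resident, const double *streamed, long n)
{
    __m512d acc = _mm512_setzero_pd();
    double tmp[8] __attribute__((aligned(64)));

    for (long i = 0; i < n; i += 8) {
        /* ordinary load: normal replacement policy, data stays cached */
        __m512d r = _mm512_load_pd(&resident[i % 8192]);
        /* load with the non-temporal (EH) hint: marked for early eviction */
        __m512d s = _mm512_extload_pd(&streamed[i], _MM_UPCONV_PD_NONE,
                                      _MM_BROADCAST64_NONE, _MM_HINT_NT);
        acc = _mm512_add_pd(acc, _mm512_mul_pd(r, s));
    }

    _mm512_store_pd(tmp, acc);
    return tmp[0] + tmp[1] + tmp[2] + tmp[3] + tmp[4] + tmp[5] + tmp[6] + tmp[7];
}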

If you know that you are finished using the array, Xeon Phi provides explicit instructions to evict the data from the L1 or L2 caches.  In the same manual, the CLEVICT0 and CLEVICT1 instructions are described.  CLEVICT0 evicts data from the local core's L1 cache (writing it out to the next level of cache if the data is modified), while CLEVICT1 evicts data from the local core's L2 cache (writing it out to memory if the data is modified).  Unlike the CLFLUSH instruction on other Intel processors, these "EVICT" instructions only apply to the local caches of the core executing the instruction and are not broadcast across the cache coherence domain.  (This is the same principle covered by my U.S. Patent 7,194,587, but split into separate instructions for the local L1 cache and local L2 cache.)
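The compiler exposes these instructions through an intrinsic.  A minimal sketch, assuming the KNC _mm_clevict() intrinsic with _MM_HINT_T0 / _MM_HINT_T1 selecting CLEVICT0 and CLEVICT1, and a 64-byte cache line:

#include <immintrin.h>

/* Once the array is no longer needed, walk it one cache line at a time and
   push each line out of this core's L1 and L2; modified lines are written
   back to the next level (or to memory) as described above.               */
void evict_array(const double *a, long n_doubles)
{
    const char *p   = (const char *)a;
    const char *end = (const char *)(a + n_doubles);

    for (; p < end; p += 64) {
        _mm_clevict(p, _MM_HINT_T0);   /* CLEVICT0: evict from local L1 */
        _mm_clevict(p, _MM_HINT_T1);   /* CLEVICT1: evict from local L2 */
    }
}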

 

John D. McCalpin, PhD
"Dr. Bandwidth"
