Is there any performance improvement with non-temporal stores w.r.t regular stores on Xeon Phi? also vectored stores w.r.t. regular store ? I did some tests on this and results showed otherwise, hope you guys could shed some light on this.
I created a simple benchmark (attached) which transfers a large array (i used 8MB and 2GB arrays) to a destination memory address using OMP threads.- 60 threads which were pinned to each core of the Xeon Phi and the transfer time was measured. Here 3 different store instructions were used - Regular store/ vector(SIMD) store/ vector non-temporal store .Following are the results I observed.
Test that read from 'source' and write on ''dest' - read/write BW
time to transfer (us)
Vector NT store
It looks like vectored stores including Non temporal (NT) is slower and have less throughput than the regular 'store'. It is difficult to explain this result since at least Vector NT store instructions should ideally save bandwidth and produce a high throughput when message size is sufficiently larger than the cache. Is there any reason for this behavior ? Appreciate your feedback on this