the performance of scatter and gather

I have tested the new scatter and gather intrinsics for vectorization. I expected them to perform better than scalar scatter and gather. However, in my evaluation the performance is almost the same as the scalar scatter/gather, whether I enable or disable auto-vectorization (-no-vec). Does anyone have experience using the new scatter/gather? I would appreciate it if you could share your results.

Thanks in advance!

Have you read this article about organizing access in the source to help the compiler figure out opportunities for using vector scatter-gather?

http://software.intel.com/en-us/articles/bkm-coaxing-the-compiler-to-vec... 

Even with vector scatter-gather, though, data distribution and object size can limit ultimate performance insofar as they affect cache reuse.  If every scatter index requires a different cache line, data fetch will dominate the execution cycles.

This subject is too broad to answer in the forum context.  Maybe you could give a concrete example of what you are trying to do.   In connection with Robert's reply, you should at least look at the vec-report to see whether the compiler reports an effort to make use of scatter-gather, and what it has done with software prefetch.

In the reference Robert gave, code is shown that repeats the gather instruction for each cache line touched by the instruction.  If your stride is larger than the cache line size, you may do no better than scalar loads, even in the unusual case where software prefetch is fully effective.

The Transparent Huge Page facility of the last 2 releases of MPSS is a significant booster of scatter-gather performance, but you still don't get full performance unless you are using all the elements of each cache line without incurring intermediate cache misses.

Scatter is likely to produce false sharing in the usual case where more than one thread updates elements of a cache line.  In a read-only case, threaded scaling could be satisfactory with gather, with the threads cooperating to share cache across cores.

Scatter doesn't offer the opportunity for nontemporal or streaming stores in the case where you would like to avoid reading each cache line before storing to it. Such cases are far more frequent on Intel(c) Xeon Phi(tm) than on server CPUs.

Thanks for both your answers. In my case, when testing vectorized scatter/gather, I compare two versions:

(1) I do not rely on auto-vectorization; instead, I implement the operations manually using _mm512_i32scatter_epi32/_mm512_i32gather_epi32 for int32 data. This is the vectorized scatter/gather version.

(2) I use scalar scatter/gather, and pass -no-vec to the compiler to keep these operations from being auto-vectorized.

I think your answers make sense: the stride is larger than the cache line size, so the performance is essentially limited by memory latency.
