Usage of _mm512_mask_prefetch_i32gather_ps for doubles

Alastair M.

Dear all,

I want to implement prefetching for sparse complex double precision data using Intrinsics.

A linear array contains the indices of the sparse complex double elements, like so: {1, 2, 3, 4, 150, 151, 7000, 7001, 10000, 10001}

As each of these elements occupies 16 contiguous bytes in memory, how should I correctly use the prefetch intrinsic meant for single-precision floats?

Should I use _mm512_mask_prefetch_i32gather_ps() and explicitly prefetch each 4 byte piece of the 16 bytes?

Or can I expect that each element in the index register will cause 64 bytes to be prefetched into cache?  In that case I could perform some modular arithmetic on the index values to only prefetch individual unique cache lines. (I have actually tried this approach with disappointing results)
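A minimal sketch of that modular-arithmetic reduction (plain C; assumes the index array is sorted ascending, as in the example above). Where the prefetch itself would be issued is left as a comment, since that is exactly the open question:

```c
#include <stddef.h>

/* Collapse element indices of 16-byte complex doubles into unique
 * 64-byte cache-line indices (4 elements per line).  The resulting
 * line indices could then feed a prefetch -- either one _mm_prefetch
 * per line, or packed into the index register of
 * _mm512_mask_prefetch_i32gather_ps (hypothetical usage). */
size_t unique_cache_lines(const int *elem_idx, size_t n,
                          int *line_idx /* out, capacity n */)
{
    size_t m = 0;
    for (size_t i = 0; i < n; ++i) {
        int line = elem_idx[i] / 4;  /* 16 B * idx / 64 B per line */
        /* adjacent-duplicate removal suffices because input is sorted */
        if (m == 0 || line_idx[m - 1] != line)
            line_idx[m++] = line;
    }
    return m;
}
```

For the example index array this yields 5 unique lines (0, 1, 37, 1750, 2500).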

Best regards,

Alastair

Alastair M.

FYI, I also posted an updated version of this question on Stackoverflow here: 

http://stackoverflow.com/questions/24627877/will-mm512-mask-prefetch-i32gather-ps-prefetch-an-entire-cache-line-for-each

Best regards,

Alastair

I sent a reply yesterday at Taylor's request, but I see it hasn't been posted.

I'd suggest trying methods for indirect prefetch shown in

https://software.intel.com/sites/default/files/article/326703/5.3-prefet...

or, now that you made it definite that you need just 2 cache lines, issue a simple prefetch to one array element in each cache line (both L2 prefetch, and L1 prefetch at a shorter distance).

Gather prefetch intrinsics haven't often proved useful on KNC; I don't know whether your case is more favorable, since it appears that only 2 repetitions are needed.
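A sketch of the simple per-cache-line prefetch suggested above, using `_mm_prefetch` on the indexed elements with separate L2 and L1 distances. The distances here are hypothetical placeholders; as discussed later in the thread, the real values have to come from profiling. Note that this scalar approach does read each offset through L1 first, which is the trade-off Alastair raises below:

```c
#include <xmmintrin.h>   /* _mm_prefetch; maps to vprefetch0/vprefetch1 on KNC */
#include <complex.h>

/* Hypothetical distances, in loop iterations -- tune by profiling. */
#define L2_DIST 16   /* far-ahead prefetch into L2 */
#define L1_DIST 4    /* near prefetch into L1 */

double complex sum_sparse(const double complex *base,
                          const int *idx, int n)
{
    double complex acc = 0;
    for (int i = 0; i < n; ++i) {
        /* one prefetch per element; since an element is 16 B, its whole
         * 64-byte cache line is brought in by a single prefetch */
        if (i + L2_DIST < n)
            _mm_prefetch((const char *)&base[idx[i + L2_DIST]], _MM_HINT_T1);
        if (i + L1_DIST < n)
            _mm_prefetch((const char *)&base[idx[i + L1_DIST]], _MM_HINT_T0);
        acc += base[idx[i]];
    }
    return acc;
}
```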

Alastair M.

Quote:

Tim Prince wrote:

I sent a reply yesterday at Taylor's request, but I see it hasn't been posted.

I'd suggest trying methods for indirect prefetch shown in

https://software.intel.com/sites/default/files/article/326703/5.3-prefetching-on-mic-5.pdf

or, now that you made it definite that you need just 2 cache lines, issue a simple prefetch to one array element in each cache line (both L2 prefetch, and L1 prefetch at a shorter distance).

Gather prefetch intrinsics haven't often proved useful on KNC; I don't know whether your case is more favorable, since it appears that only 2 repetitions are needed.

Thanks for your reply Tim.  Hearing about your experience is very useful.

The number of unique cache lines that I need in this case would be a maximum of 3, but more often 0, 1 or 2.  By simple prefetch do you mean _mm_prefetch()?  I thought that this would negate the benefit of having the gather prefetch instruction, as it would require going from xmm registers via L1 to get each offset value I need.

Best regards and thanks again,

Alastair

Yes, I'm having connectivity problems today (spam attacks); I should have got the right URL pasted in there now.

https://software.intel.com/sites/default/files/article/326703/5.3-prefet...

I haven't seen a detailed explanation of why gather prefetch doesn't prove useful on KNC, but I'd guess it may be like gather load in that it requires iteration over the number of distinct cache lines involved, thus no benefit over individual _mm_prefetch.  If you can use the methods in that reference it may give a starting point to deal with prefetch distances.

 

Alastair M.

Quote:

Tim Prince wrote:

Yes, I'm having connectivity problems today (spam attacks); I should have got the right URL pasted in there now.

https://software.intel.com/sites/default/files/article/326703/5.3-prefet...

I haven't seen a detailed explanation of why gather prefetch doesn't prove useful on KNC, but I'd guess it may be like gather load in that it requires iteration over the number of distinct cache lines involved, thus no benefit over individual _mm_prefetch.  If you can use the methods in that reference it may give a starting point to deal with prefetch distances.

Hi Tim, thanks again for your response, I fixed up the link in my last reply to you.

I actually have read this document before and implemented prefetching for the dense data in my application.  I saw very big performance improvements by explicitly prefetching the linear accesses of 2 dense data arrays into L2 and L1 respectively and profiled a wide range of distances to find the optimum values.
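The dense-array prefetching described above might look roughly like this, a plain C sketch with hypothetical distances (the real values in the application came from profiling a range of distances), issuing one prefetch per 64-byte line rather than per element:

```c
#include <xmmintrin.h>   /* _mm_prefetch; vprefetch0/vprefetch1 on KNC */

/* Hypothetical distances in elements -- must be tuned by profiling. */
#define DIST_L2 128   /* far-ahead prefetch into L2 */
#define DIST_L1 16    /* near prefetch into L1 */

double dot(const double *a, const double *b, int n)
{
    double acc = 0.0;
    for (int i = 0; i < n; ++i) {
        if ((i & 7) == 0) {          /* one prefetch per line of 8 doubles */
            if (i + DIST_L2 < n) {
                _mm_prefetch((const char *)&a[i + DIST_L2], _MM_HINT_T1);
                _mm_prefetch((const char *)&b[i + DIST_L2], _MM_HINT_T1);
            }
            if (i + DIST_L1 < n) {
                _mm_prefetch((const char *)&a[i + DIST_L1], _MM_HINT_T0);
                _mm_prefetch((const char *)&b[i + DIST_L1], _MM_HINT_T0);
            }
        }
        acc += a[i] * b[i];
    }
    return acc;
}
```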

I was really surprised by this because I thought that the hardware prefetchers would handle prefetching of the dense arrays easily.  I was then surprised again to see that using gather prefetch for the sparse data actually slowed the application down.  I am starting to wonder if the gather loads are actually effectively prefetching the sparse data for me as my sparse data has three distinct but almost linear address streams.

Getting slightly off topic now, I then found that explicitly evicting the cache lines for the two dense arrays actually improved the performance again quite significantly.  

In summary, this is what I found for my application, which linearly accesses two dense arrays (and a third every four iterations) and a single sparse matrix array:

 - Explicit prefetching of dense arrays = big speedup

 - Explicit eviction of dense data = big speedup

 - Explicit prefetching of sparse data = slowdown

 - Overall, by implementing these optimisations (and other computational ones) I achieved just over 3x speedup over the original Fortran implementation.

Does any of that match up with your experience on KNC?

Best regards,

Alastair

Hardware prefetch may have assisted your sparse streams, particularly if Transparent Huge Pages came into play.  Hardware prefetch probably requires several more cache misses before it becomes active. I believe it can detect only 8 streams on KNC, brings cache lines only into L2, and would not match your optimized prefetch distances.  I haven't seen any discussion of the role of hardware prefetch or gather prefetch on future KNC models.  There certainly is room for improving the number of cache lines which can be accessed in parallel by gather and scatter instructions on KNC.

In an application where there is cache capacity limitation, eviction of cache lines which are no longer wanted definitely can help; it's a way of improving on the least recently used eviction.  The opt-streaming-store options of the compilers imply use of clevict.   In one application, re-organizing for better cache locality made more of a difference when clevicts were removed; apparently, the cache lines which we originally wanted to evict immediately became useful on later time steps.
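A sketch of hand-placed eviction in that style. The `_mm_clevict` intrinsic is KNC-only, so it is guarded here behind `__MIC__` with a no-op fallback; the `EVICT_*` macro names are hypothetical:

```c
#ifdef __MIC__
#include <immintrin.h>
#define EVICT_L1(p) _mm_clevict((p), _MM_HINT_T0)  /* KNC clevict from L1 */
#define EVICT_L2(p) _mm_clevict((p), _MM_HINT_T1)  /* KNC clevict from L2 */
#else
#define EVICT_L1(p) ((void)(p))   /* no-op off KNC */
#define EVICT_L2(p) ((void)(p))
#endif

/* Stream through a dense array once; after consuming each 64-byte line,
 * evict it so it does not displace data we still want in cache. */
double consume_and_evict(const double *a, int n)
{
    double acc = 0.0;
    for (int i = 0; i < n; ++i) {
        acc += a[i];
        if ((i & 7) == 7) {          /* finished a full line of 8 doubles */
            EVICT_L1(&a[i]);
            EVICT_L2(&a[i]);
        }
    }
    return acc;
}
```

As noted above, this only helps when the evicted lines really are dead; if later time steps reuse them, the clevicts can hurt.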

John D. McCalpin

The hardware prefetchers on Xeon Phi are less sophisticated and aggressive than those on the mainline processors.

It is important to note that there are no L1 hardware prefetchers on Xeon Phi.   When this is combined with an in-order core, it quickly becomes clear that software prefetching is required even for L2-resident data.  L2 load latency is about 23-25 cycles, and the core appears to be able to handle 8 outstanding L1 Data Cache misses.  Without software prefetches to L1 (vprefetch0), the core will typically stall on each L1 cache miss, giving a throughput of 1 cache line load every 25 cycles.  Since the threads don't stall each other, 4 threads will give you 4 cache lines every ~24 cycles -- about 1/2 of the maximum throughput.    With software prefetches to L1, a single thread can generate 8 outstanding L1 Data Cache misses, using 8 of the ~12 issue slots available during each ~24 cycle L2 latency.  (Note that despite the "v" at the beginning of "vprefetch0", this is not a vector instruction -- it can issue in the scalar pipe in parallel with a vector instruction.)

My experiments with the L2 hardware prefetchers suggest that for each core, the L2 hardware prefetcher tracks accesses to the 16 most recently accessed 4KiB pages, and can prefetch at least 4-5 cache-line-pairs in each of those pages.   For example, a pointer-chasing code using a fixed stride of 256 Bytes or more does not appear to activate the L2 hardware prefetcher (showing the "typical" average 277 ns memory latency), while a 128-Byte stride sees a 4.2x-4.3x speedup (64 ns/load) and a 64-Byte stride sees an 8.5x-8.7x speedup (32 ns/load).   This does not represent the maximum amount of concurrency available, since there are no VPREFETCH0 instructions to fetch the data from L2 to L1 -- i.e., with "perfect" prefetching it would still take the same ~21.3 ns/load that I observed with L2-resident data.  The 31.8 ns average latency that I observed is only 10.5 ns higher than the L2 latency, so the hardware prefetcher was able to hide about 96% of the 277 ns average memory latency.   I can account for most of this 10.5 ns overhead by assuming that the hardware prefetcher restarts at the beginning of each 4 KiB page and takes 2 full memory latencies to ramp up.   But this is just a guess -- the Xeon Phi Time-Stamp-Counter can be read in ~6 cycles, so it is possible to measure the latency of each load independently and watch the hardware prefetch ramp up.
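The fixed-stride pointer-chasing setup described above can be sketched portably as follows (`build_chain`/`chase` are hypothetical helper names; the timing instrumentation is omitted). The data dependence through the chain is what serializes the loads and exposes latency:

```c
#include <stddef.h>

/* Build a pointer chain through a buffer with a fixed stride in bytes,
 * wrapping back to the start.  Per the observations above, strides of
 * 256 B or more do not appear to activate the KNC L2 hardware
 * prefetcher, while 64 B and 128 B strides do. */
void **build_chain(void *buf, size_t bytes, size_t stride)
{
    size_t hops = bytes / stride;
    char *p = buf;
    for (size_t i = 0; i < hops; ++i) {
        size_t next = (i + 1) % hops;
        *(void **)(p + i * stride) = p + next * stride;
    }
    return (void **)p;
}

/* Traverse n hops; each load depends on the previous one. */
void *chase(void **start, size_t n)
{
    void **p = start;
    while (n--)
        p = (void **)*p;
    return p;
}
```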

So the latency test does not expose the maximum concurrency that the L2 hardware prefetchers can deliver (even for a single load stream).   I can get a lot more concurrency for a simple read-only code accessing a single data stream (sum reduction of an array).   This case delivers about 4.8 GB/s using one thread, which corresponds to an effective concurrency of just under 21 cache lines at a nominal average latency of 277 ns.  (277 ns * 4.8 GB/s = 1330 Bytes = 20.8 cache lines).
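The concurrency arithmetic in that paragraph is an application of Little's law, bytes in flight = bandwidth x latency. A one-line check of the quoted figures:

```c
/* Little's law: concurrency (in cache lines) = bandwidth * latency / line.
 * With 4.8 GB/s at 277 ns average latency and 64-byte lines this gives
 * roughly 20.8 lines in flight, matching the figure quoted above. */
double lines_in_flight(double bw_bytes_per_s, double latency_s,
                       double line_bytes)
{
    return bw_bytes_per_s * latency_s / line_bytes;
}
```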

Using multiple threads on the same core will increase the number of independent 4KiB pages being accessed, so it should allow more L2 hardware prefetches.  It does, but only by about 5% in my tests.   (When I ran these tests, I did not know anything about how addresses were mapped to memory controllers, so I was not able to tell whether my four threads were "spreading out" their accesses across more controllers or "clumping" them to a subset of the 16 DRAM channels.  I should go back and look at this again.)  The 5% performance improvement that I get when running 4 threads on one core is about the same as the performance improvement that I get using large pages (which allow software prefetches to cross 4KiB boundaries).

John D. McCalpin, PhD "Dr. Bandwidth"

Very interesting contribution by John.  From that, I might infer that L2 hardware prefetch may contribute when at most 1 cache line is skipped in the access pattern.

In my own tests, Transparent Huge Pages could nearly double the performance obtained with compiler generated prefetch defaults when typically 1 cache line per 4KB page is used (and only 1 array element per cache line).  I don't see any easy way to estimate how much of the penalty incurred by entering a new page is from occasional TLB miss and how much from restarting hardware prefetch in the manner John described.

Alastair M.

Dear Tim and John,

Thank you both very much for your insightful comments and help.  I have some new ideas to try based on your input.  I will test these and report the results back here.

Best regards,

Alastair
