hw prefetching

stdweird

hi all,

i'm playing a bit with hardware prefetching on our dual L5420 nodes.
most info indicates that the performance impact should vary from application to application (e.g.
http://software.intel.com/en-us/articles/optimizing-application-performa... )

i have disabled one, and even both, of the hardware prefetcher and the adjacent cache-line prefetch options (through both the bios and the msr), but much to my surprise, i don't see any difference at all.
i've tried both real applications and synthetic ones, and now i'm starting to suspect that something else is wrong.

my question is the following: what simple executable/benchmark should clearly demonstrate a difference? i have already tried the whole HPCC suite, and things like STREAM, RandomAccess or HPL don't differ much (< 2%) between on and off, where i would expect otherwise.

thanks for any suggestions,

stijn

Dmitry Kuzmin (Intel)

Hi Stijn,

I'm not sure that this forum is the right place for this question. It is dedicated to questions about the Intel MPI Library, Cluster Tools and High Performance Computing.

Regards!
Dmitry

Tim Prince

RandomAccess is presumably designed to defeat cache optimizations, including prefetch.
Adjacent cache-line (adjacent sector) prefetch is quite likely to have little effect on normal HPC applications.
I haven't looked into details of HPL. If you want to start from scratch, rather than consult those who may have analyzed it back when 5420 was a current product, you would look into details of cache behavior with hardware prefetch on and off by using a tool such as VTune. Of course, you should check that you are using affinity correctly, particularly if running OpenMP versions of STREAM, or an MPI which doesn't set affinity by default.
In applications with which I'm familiar, hardware prefetch is most effective when the application's bandwidth demand is lower than the hardware maximum, as prefetch generally increases bandwidth demand.
This forum is primarily dedicated to Intel MPI related questions, although it may not be clear from the title, so you should be specific if you are looking for help outside that area.
