a) If we use gcc's __builtin_prefetch(addr, 1); to prefetch a cache line for write, what factors determine whether this line remains in cache? That is, as long as the code accesses it often enough, is it guaranteed to stay in L1 cache?

    b) If we use gcc's __builtin_prefetch(addr, 1, 1); to prefetch a cache line for read, what factors determine whether this line remains in cache until it is accessed once? Also, if a writer writes to this line after the reader executes the prefetch, will the new data be prefetched automatically by the hardware?

    c) I am assuming __builtin_prefetch uses the PREFETCHTx instructions. Can someone confirm please?

Thanks for your help.



Hello Madhav,

I don't know what gcc __builtin_prefetch() uses. I would have to use it in some code and then disassemble the code to be sure what it uses.
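A minimal sketch of such a test, assuming GCC or a compatible compiler. The GCC documentation describes the second argument as rw (0 for read, the default; 1 for write) and the third as a temporal locality hint from 0 to 3 (default 3). The function names and the file name in the comment are hypothetical, and which instruction each form becomes depends on the compiler version and target.

```c
/* Compile with, for example:  gcc -O2 -S prefetch_forms.c
 * (hypothetical file name) and look for prefetch instructions in the
 * generated assembly. The function names below are illustrative only. */

/* Read prefetch with locality 1 (rw = 0, locality = 1). */
void prefetch_read_low(const void *addr)
{
    __builtin_prefetch(addr, 0, 1);
}

/* Write prefetch with the default locality (rw = 1, locality defaults to 3). */
void prefetch_write_default(const void *addr)
{
    __builtin_prefetch(addr, 1);
}

/* Write prefetch with locality 1: this is what (addr, 1, 1) requests. */
void prefetch_write_low(const void *addr)
{
    __builtin_prefetch(addr, 1, 1);
}

/* Read prefetch with no temporal locality (rw = 0, locality = 0). */
void prefetch_read_nta(const void *addr)
{
    __builtin_prefetch(addr, 0, 0);
}
```

Keeping each form in its own small function makes it easy to match a source line to the instruction in the disassembly. A prefetch is only a hint, so none of these calls changes the contents of memory.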


There are no "guarantees" that a line will stay in cache for any length of time, whether the line is brought into the cache by a hardware prefetch, a software prefetch, or a normal load or store.  This is the case for virtually any general-purpose processor designed in the last decade or two.

Software prefetch instructions typically do move data into some level of the cache hierarchy, and sometimes provide special behavior depending on some combination of the "temporal" hint(s) and the actual location and cache state of the cache line requested.   Unfortunately the behavior is strongly implementation-dependent and does not appear to be documented for recent Intel processors.

The Intel Optimization Reference Manual (document 248966-028, July 2013) dedicates much of Chapter 7 to a discussion of optimizing cache usage with software prefetch instructions, but the details are only provided for the Pentium 4 processor!    Similarly, the Intel Architecture SW Developer's Guide, Volume 2 (document 325383-047, June 2013) describes the behavior of the PREFETCH instructions only for the Pentium III and Pentium 4 processors.   (There is a bit more information about the implementation of software prefetch and temporal hints on Xeon Phi, but that information is quite unlikely to tell us how software prefetch is implemented on more modern cores.)

It would take a strong knowledge of microarchitecture and validated hardware performance counters to design a set of microbenchmarks that could be used to test various hypotheses about the exact operation of the prefetch instructions.  I am not aware of any detailed analyses of how these are implemented in recent Intel processors -- but I would be happy to be corrected!

I don't know if it is a problem on all Ivy Bridge processors, but Agner Fog reports that while Sandy Bridge can execute two software prefetch instructions per cycle, Ivy Bridge can only execute one software prefetch every 43 cycles!  This should be relatively easy to test, in case you are running on an Ivy Bridge processor.
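A rough sketch of such a test, assuming GCC or a compatible compiler and a 64-byte cache line. The function name ns_per_prefetch is hypothetical. It reports average nanoseconds per loop iteration, which includes loop overhead, so it can only separate a rate near one prefetch per cycle from a rate near one per 43 cycles; it is not a precise measurement.

```c
#include <stddef.h>
#include <time.h>

/* Issue `count` read prefetches over a buffer and return the average time
 * per iteration in nanoseconds (loop overhead included).
 * The 64-byte line size is an assumption, not something queried from the CPU. */
double ns_per_prefetch(const char *buf, size_t buf_bytes, size_t count)
{
    const size_t line = 64;   /* assumed cache-line size */
    size_t offset = 0;

    if (count == 0) {
        return 0.0;           /* nothing to measure */
    }

    clock_t start = clock();
    for (size_t i = 0; i < count; i++) {
        __builtin_prefetch(buf + offset, 0, 3);
        offset += line;
        if (offset >= buf_bytes) {
            offset = 0;       /* wrap around and stay inside the buffer */
        }
    }
    clock_t end = clock();

    return (double)(end - start) * 1e9 / (double)CLOCKS_PER_SEC / (double)count;
}
```

To compare against a figure quoted in cycles, multiply the result by the core clock frequency in GHz. A small buffer that fits in L1 mostly measures the issue rate of prefetches that hit in cache; a buffer much larger than the last-level cache measures something different, so it is worth trying both.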

"Dr. Bandwidth"

The question about gcc builtin_prefetch seems better suited to the gcc-help mailing list, once you have looked over the gcc documentation and source code for the gcc version of interest and can ask a more specific question, if you still have one.  It looks like prefetcht0 corresponds to a different 3rd argument from the one you wanted to use, assuming your target architecture is one where the question is relevant.   The cache level hints are interpreted differently by various CPU models, so there's a good chance it won't make a difference on a CPU you may be interested in.

As John indicated, the interaction between software and hardware prefetch has been changed several times with new CPU introductions.

When data are written to a cache line, other copies of that cache line are invalidated.  Are you asking at what point after a cache line is flushed would other software prefetched copies of it be replaced?  I'm certainly not qualified to answer that, but I'd guess maybe not until accessed, on recent CPUs.

Thanks a lot gentlemen. I will follow up. 

The gcc builtin_prefetch translates to:

  4008c0:       48 83 ec 08             sub    $0x8,%rsp
  4008c4:       bf 00 04 00 00          mov    $0x400,%edi
  4008c9:       e8 2a fe ff ff          callq  4006f8 <_Znam@plt>
  4008ce:       0f 18 08                prefetcht0 (%rax)
  4008d1:       b8 00 00 00 00          mov    $0x0,%eax
  4008d6:       48 83 c4 08             add    $0x8,%rsp
  4008da:       c3                      retq
  4008db:       90                      nop


The processor we are on is Intel E5-2690.


Hi Patrick,

>>...I don't know what gcc __builtin_prefetch() uses...

It will be compiled to one of the prefetch instructions, and a third argument of 1 corresponds to _MM_HINT_T2. In the well-known intrinsic form it looks like:
_mm_prefetch( ( RTchar * )ptAddress, _MM_HINT_T2 );

>>...I am assuming __builtin_prefetch uses the PREFETCHTx instructions. Can someone confirm please?

Yes, I confirm it.

However, I'm using more portable forms of prefetch-like instructions. In my code all of them are wrapped in simple macros that are based on the _mm_prefetch intrinsic function.
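The actual macros are not shown in this thread, so the following is only a sketch of what such wrappers might look like. The macro names (PREFETCH_T0 and so on) and the helper sum_with_prefetch are hypothetical. The fallback branch assumes that GCC locality values 3, 2, 1 and 0 are the closest analogues of the T0, T1, T2 and NTA hints.

```c
#include <stddef.h>

#if defined(__SSE__) || defined(_M_X64)
  /* x86 with SSE: use the _mm_prefetch intrinsic from <xmmintrin.h>. */
  #include <xmmintrin.h>
  #define PREFETCH_T0(p)  _mm_prefetch((const char *)(p), _MM_HINT_T0)
  #define PREFETCH_T1(p)  _mm_prefetch((const char *)(p), _MM_HINT_T1)
  #define PREFETCH_T2(p)  _mm_prefetch((const char *)(p), _MM_HINT_T2)
  #define PREFETCH_NTA(p) _mm_prefetch((const char *)(p), _MM_HINT_NTA)
#elif defined(__GNUC__)
  /* Other targets with GCC or Clang: fall back to the builtin.
   * The locality values are an assumed analogue of the x86 hints. */
  #define PREFETCH_T0(p)  __builtin_prefetch((p), 0, 3)
  #define PREFETCH_T1(p)  __builtin_prefetch((p), 0, 2)
  #define PREFETCH_T2(p)  __builtin_prefetch((p), 0, 1)
  #define PREFETCH_NTA(p) __builtin_prefetch((p), 0, 0)
#else
  /* Unknown compiler: make the macros no-ops. */
  #define PREFETCH_T0(p)  ((void)0)
  #define PREFETCH_T1(p)  ((void)0)
  #define PREFETCH_T2(p)  ((void)0)
  #define PREFETCH_NTA(p) ((void)0)
#endif

/* Example use: sum an array while prefetching a fixed distance ahead.
 * The distance of 16 elements is an arbitrary illustration, not a tuned value. */
long sum_with_prefetch(const long *data, size_t n)
{
    const size_t ahead = 16;
    long total = 0;

    for (size_t i = 0; i < n; i++) {
        if (i + ahead < n) {
            PREFETCH_T0(&data[i + ahead]);   /* stay inside the array */
        }
        total += data[i];
    }
    return total;
}
```

Hiding the intrinsic behind a macro keeps the choice of hint in one place, which helps because the hints are interpreted differently across CPU models. For a simple sequential loop like this one the hardware prefetcher may already do the work, so any benefit should be measured rather than assumed.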

>>...4008ce: 0f 18 08 prefetcht0 (%rax)

I really don't understand why it is translated to prefetcht0 instead of prefetcht2 ( see my note for Patrick ).

gcc source code looks as if prefetchnta would be a default; you would have to ask for t0 if you want that, but it may make no difference.  You would still need to look at the source code for your choice of gcc version or test that version.  If you are looking for feedback from gcc people, asking at gcc-help would make more sense.
