_mm_stream() on cached data

I have a question regarding the behavior of _mm_stream_ps (movntps) when the target address is already in the cache.

Does it simply write in the cache or does it schedule a non-temporal store?

The issue arises when doing in-place processing on large amounts of data.

In order to hide the memory latency as much as possible, I'm using _mm_prefetch to asynchronously load the data into the cache. But since the processing is done in-place, I'm wondering whether I should use streaming or regular stores when writing back the new data. I know that movntps is typically used to reduce cache pollution when writing uncached data, but what if the memory was previously cached using prefetching? What type of stores should be used when doing non-temporal in-place processing?
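Concretely, the in-place pattern in question can be sketched like this (a sketch only: the function name, the operation, and the prefetch distance are illustrative, and the buffer is assumed 16-byte aligned):

```c
#include <xmmintrin.h>  /* SSE: _mm_prefetch, _mm_stream_ps, _mm_sfence */
#include <stddef.h>

#define PF_AHEAD 512  /* prefetch distance in bytes; needs per-CPU tuning */

/* Scale a 16-byte-aligned float buffer in place, prefetching ahead with
   prefetchnta and writing the results back with non-temporal stores. */
void scale_inplace(float *data, size_t n, float factor)
{
    __m128 f = _mm_set1_ps(factor);
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        _mm_prefetch((const char *)(data + i) + PF_AHEAD, _MM_HINT_NTA);
        __m128 v = _mm_load_ps(data + i);          /* read (now cached)  */
        _mm_stream_ps(data + i, _mm_mul_ps(v, f)); /* non-temporal write */
    }
    _mm_sfence();               /* order the streaming stores */
    for (; i < n; ++i)          /* scalar tail */
        data[i] *= factor;
}
```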

I would expect that movntps frees the cache line containing the previous data and writes the new data directly to memory, but is that really what it does?

In this situation, are there any advantages to using non-temporal stores over regular ones?

Any differences between PIII and P4?




I certainly wouldn't have a complete answer to this, even if I understood fully what you're aiming for.

If you're trying to speed up things by a streaming store, I don't see why you would prefetch. On the other hand, if you are intending to let your data be cached, I don't see why you don't just write straight vectorizable C code.

Yes, there could be a difference between P-III and P4. If you're writing large blocks of contiguous data on P4, other than by streaming store, hardware prefetch will kick in automatically, and software prefetch could only slow you down. If you try to prefetch just to cover the initial misses while hardware prefetch starts up, that might work on one stepping, but might simply delay hardware prefetch on another. I'd guess that either the prefetch or the streaming store, but not both at once, is more likely to help you on P-III.

As Pentium-M has hardware prefetch, my first guess would be to stick with plain C there too, although the software prefetch may be more efficient than on P4. Pentium-M hasn't figured strongly in my experiments. I mention it partly to point out that you don't want architecture-specific optimizations which are more likely to hold you back in the future than to help you in the present.

I've tried the streaming store on P4, in hopes of circumventing the limited availability of Write Combine buffers, with negative results. If possible, you should avoid writing multiple blocks of data in the same loop (more than 2, in some situations), so that the WC buffers work for rather than against you. Streaming store didn't appear to be a useful way of circumventing the limit.

Ok, let me explain a bit more.

I am processing very large amounts of data (several megabytes), so in order to avoid thrashing the cache and to hide the memory latency, I use the non-temporal prefetch a couple of iterations in advance.

The data is not necessarily contiguous (it can be strided or even indexed by a table), so I doubt that the P4's hardware prefetch can be of any help.

Once the data is in the cache, it is read, processed, and written back to the same location. I was hoping that the non-temporal stores would help by telling the processor that the data will be used again (so I don't want it to stay in the cache).

It's interesting to note that out-of-place processing of large buffers appears to be faster than in-place processing, even though there is twice as much memory traffic!

When doing out-of-place processing, the prefetching/non-temporal stores give me as much as a 4-fold increase in performance on the PIII, while for in-place processing I lose a bit of performance.

My conclusion is to avoid these cache-friendly primitives for in-place processing (even if the amount of data is extremely large).


On my 1GHz PIII I get

Sorry for the typo:

> I was hoping that the non-temporal stores would help by telling the processor that the data will be used again (so I don't want it to stay in the cache).

Should be:

I was hoping that the non-temporal stores would help by telling the processor that the data will _NOT_ be used again (so I don't want it to stay in the cache).

The above data seems strange. While I don't know your memory access pattern, even simple linear in-place operations show a speedup of 1.2x to 1.3x when used with prefetchnta/movntps or movntq, both on a PIII 800MHz and a P4 2.4B. With less regular access patterns this should be even higher. Check that TLB priming, prefetchnta, the operation, and movntps are all present. Moreover, you may be using either too large a prefetched data block or mixing loads/prefetches with stores/streams, which is also known to hinder performance. Prefetch a block, operate, stream it back, and sfence. Please tell me if the problem persists then.
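The block-wise sequence described here (prefetch a block, operate, stream it back, sfence) might be sketched as follows; the block size, the operation, and the function name are illustrative, and the buffer is assumed 16-byte aligned:

```c
#include <xmmintrin.h>  /* SSE intrinsics */
#include <stddef.h>

#define BLOCK 1024  /* floats per block (4 KB); needs tuning */

/* Scale a buffer in place, one cacheable block at a time:
   prefetch the block, operate on it, stream it back, fence. */
void scale_blocked(float *data, size_t n, float factor)
{
    __m128 f = _mm_set1_ps(factor);
    for (size_t base = 0; base + BLOCK <= n; base += BLOCK) {
        /* 1. prefetch the whole block with prefetchnta */
        for (size_t b = 0; b < BLOCK * sizeof(float); b += 64)
            _mm_prefetch((const char *)(data + base) + b, _MM_HINT_NTA);
        /* 2. operate, and 3. stream the results back */
        for (size_t i = 0; i < BLOCK; i += 4) {
            __m128 v = _mm_load_ps(data + base + i);
            _mm_stream_ps(data + base + i, _mm_mul_ps(v, f));
        }
        /* 4. fence before moving on to the next block */
        _mm_sfence();
    }
    for (size_t i = n - n % BLOCK; i < n; ++i)  /* scalar remainder */
        data[i] *= factor;
}
```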

As for WC buffers - yes, I read all the manuals a couple of times; I do not use more than 2 buffers at once, and I use them in linear order, so they should be flushed properly. However, my guess is that this looks more like a WC flush activating with every write (starting an L1->L2 transaction), with the only exception being the first two or three writes to a cache line, which are handled together because WC does not start immediately.
Please tell me if the above guess is right or wrong. More importantly, if you know anyone who can shed light on that and show a workaround to write 16 (or even 8) bytes per cycle on average to L1 and/or L2, please talk them into writing about it here.

Last but certainly not least: when testing in-place operations on memory on the P4, turning off hardware prefetch allows for an additional 6-12% speedup; not surprising (less often interleaved reads/writes), but unfortunately impossible to use in real-life conditions.

Best regards,

Shihjong asked me to add his comments:

Minimizing the number of writes is an excellent point.

Having said that, some of the salient facts about streaming stores and regular writes are:

1. If you do regular writes to writeback (WB) memory, it is possible to achieve ~1/3 of the theoretical bus bandwidth, if you manage to avoid pitfalls in alignment, store-forwarding, and aliasing, and use no more than 4 simultaneous write streams. (By minimizing the number of stores, you need to worry less.) With in-place processing, extra care is needed to observe the store-forwarding restrictions and to avoid aliasing problems. Striving for good spatial locality (i.e. cache blocking) when processing large amounts of data is a must, to begin with.

2. If you plan to do streaming stores to WB memory, they can achieve twice the throughput of regular writes; the big catch is that you must use them properly. If you don't, you'll be worse off than with regular writes (instead of 2/3 of the bus b/w, you'll be working with 1/12 of that). To achieve the highest streaming store data rate, you'll have to stream out 64 bytes sequentially, aligned to a cache line boundary. (This allows 4 16-byte stores to be combined into a single 64-byte bus write transaction.) If you stream out 16 bytes at a time without paying attention to the WC buffers, each 16B store results in two partial bus writes.
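The full-cache-line requirement can be sketched like this: four 16-byte streaming stores fill one 64-byte WC buffer, which can then leave as a single bus transaction. The function name is illustrative; the destination is assumed 64-byte aligned:

```c
#include <xmmintrin.h>  /* SSE: _mm_stream_ps, _mm_sfence */
#include <stddef.h>

/* Fill n_lines 64-byte cache lines at dst (64-byte aligned) with value,
   emitting exactly four 16-byte streaming stores per line so each WC
   buffer is completely filled before it is flushed. */
void stream_fill(float *dst, size_t n_lines, float value)
{
    __m128 v = _mm_set1_ps(value);
    for (size_t i = 0; i < n_lines; ++i) {
        float *line = dst + i * 16;    /* 16 floats = 64 bytes */
        _mm_stream_ps(line +  0, v);
        _mm_stream_ps(line +  4, v);
        _mm_stream_ps(line +  8, v);
        _mm_stream_ps(line + 12, v);
    }
    _mm_sfence();
}
```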

3. One way to deal with this requirement may be:
(1) prefetch your non-regular access pattern into a properly-sized cacheable buffer block,
(2) do your in-place processing,
(3) stream out your aligned buffer linearly in full 64-byte chunks until the buffer is depleted,
(4) repeat from (1).
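The four steps above might be sketched like this for an indexed (gather) access pattern; the staging-buffer size, the names, and the use of a separate contiguous destination are assumptions of the sketch:

```c
#include <xmmintrin.h>  /* SSE intrinsics */
#include <stddef.h>

#define BUF_FLOATS 256  /* staging buffer: 1 KB, comfortably cacheable */

/* Gather src[idx[i]] into a small cacheable buffer, process it there
   (here: scale), then stream the results out linearly in full cache
   lines to a 16-byte-aligned contiguous destination. */
void gather_scale_stream(const float *src, const size_t *idx, size_t n,
                         float *dst, float factor)
{
    static float buf[BUF_FLOATS] __attribute__((aligned(64)));
    __m128 f = _mm_set1_ps(factor);
    for (size_t base = 0; base < n; base += BUF_FLOATS) {
        size_t m = (n - base < BUF_FLOATS) ? n - base : BUF_FLOATS;
        /* (1) gather the irregular pattern into the cacheable buffer */
        for (size_t i = 0; i < m; ++i)
            buf[i] = src[idx[base + i]];
        /* (2) process in the buffer, (3) stream it out linearly */
        size_t i = 0;
        for (; i + 4 <= m; i += 4) {
            __m128 v = _mm_mul_ps(_mm_load_ps(buf + i), f);
            _mm_stream_ps(dst + base + i, v);
        }
        for (; i < m; ++i)            /* scalar remainder of the block */
            dst[base + i] = buf[i] * factor;
        /* (4) repeat for the next block */
    }
    _mm_sfence();
}
```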

Check out the latest IA32 Optimization Manual.

Further comments in this thread:


I am putting the finishing touches on an Application Note for the P4
on memory usage; a few code samples are going to be provided, showing
real speed-ups on some real-life problems. Any information on even
better cache usage - and on what can be said publicly - is invaluable.

Information for "Shihjong":
+ 1/3 bus bandwidth on regular writes seems right, aliasing important.
+ 2/3 seems wrong, I did 93.6% on 845G + IG + 333DDR 2.5-633
looks like more is possible with separate GPU and better timings.


There is a distinction between FSB b/w and the b/w of the memory subsystem. It looks like the customer achieved 94% of the single-channel DDR memory b/w (2.7 GB/s). This would agree with 2/3 of the FSB b/w (4.2 GB/s) being 2.8 GB/s. Nice job.
about WCBs
[See pg 2-48 and 7-39]

Thanks; you're right, I was thinking of memory bandwidth. However, if an FSB limit does exist, it has to be much higher than 2/3; e.g. this would suggest 81%+ for a 533MHz FSB and 74%+ for an 800MHz FSB (check SiSoft* Sandra* 2003 RAM Buffered scores). These scores are consistent with other sites, and as for Sandra* itself, my own streaming benchmarks agree with it - at least with their "memory" scores - quite closely (+/-2%) on many systems, so I am going to believe it.

> about WCBs
> [See pg 2-48 and 7-39]
[this was from IA32 Optimization Reference Manual]
I've read these pages a few times before. My comment about the lack of WCB information was not because of a lack of ANY data, but because of the lack of usable information about circumventing the WCBs' limits. Thanks for the pages, though; one can never say enough about the Optimization Manual - the more people read it, the better.

Best regards, Anna
P.S. Big thanks to Timothy C Prince for helping in communication :)

I was wondering how you got +/-2% with the SiSandra Buffered scores. I am trying to duplicate the results using my own assignment (memcopy) code and my results are much worse: on an RDRAM 800 (860-based, 3.2 GB/s peak) system, I am not able to get more than 2.1 GB/s (copy bandwidth), whereas SiSandra gets around 2.5 GB/s. I have tried the example from the IA-32 optimization manual and only get around 1 GB/s from that. I get the best performance from just using the Intel compiler on the standard STREAM benchmark, and that is around 2.1 GB/s. Anna, would it be possible to see a copy of your streaming benchmark?
Thanks a lot for any help.


There are differences between the PIII and P4. However, if the data you want to overwrite is in the cache and the total is smaller than the cache size, in most cases they behave similarly - they evict it after a streaming store. What you can do is:
1) Unless the total size of the data is larger than the cache size, change streaming stores into normal ones.
2) Minimize the number of writes.
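Rule 1) could be captured by a simple dispatch on total size; a sketch, where the cache-size constant is illustrative (in practice it would be detected per CPU) and the buffer is assumed 16-byte aligned:

```c
#include <xmmintrin.h>  /* SSE intrinsics */
#include <stddef.h>

#define CACHE_BYTES (256 * 1024)  /* e.g. a PIII L2; detect at runtime */

/* Scale in place, picking the store type by total data size:
   regular stores if it fits in cache, streaming stores otherwise. */
void scale_dispatch(float *data, size_t n, float factor)
{
    __m128 f = _mm_set1_ps(factor);
    size_t i = 0;
    if (n * sizeof(float) <= CACHE_BYTES) {
        /* fits in cache: regular stores, keep the data resident */
        for (; i + 4 <= n; i += 4)
            _mm_store_ps(data + i, _mm_mul_ps(_mm_load_ps(data + i), f));
    } else {
        /* larger than cache: stream past it */
        for (; i + 4 <= n; i += 4)
            _mm_stream_ps(data + i, _mm_mul_ps(_mm_load_ps(data + i), f));
        _mm_sfence();
    }
    for (; i < n; ++i)  /* scalar tail */
        data[i] *= factor;
}
```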

The P4 has a significant problem dealing with cache writes, despite the theoretical information. If anyone can give a reasonable explanation of why the WC buffers slow down any writes spanning more than 4 cache lines, it will be highly appreciated.
Moreover, if anyone knows a way to achieve more than 4.5 bytes per cycle on the P4 during any type of write to more than 512B of data, I would be more than happy. Please don't hesitate to post.


P4 hardware initiates a flush of the least recently used Write Combine buffer when there are not at least 2 clean buffers. Since you are working with 6 WC buffers, you can't stably use more than 4, without premature flush being initiated automatically. Presumably, this allows you to get a new WC buffer for each logical processor immediately upon filling a buffer, and touching the new buffer initiates the write to memory of a buffer. Cache line splits (unaligned writes) apparently could force the wrong buffer to be flushed.

Hi tim.

Can you tell me where to find the application notes about memory usage?

And there is one more question. After reading the discussion above, I have a feeling that using _mm_stream_pd or _mm_stream_si128 may perform even worse than plain C code. Is that correct? I find it hard to accept. _mm_clflush should perform better. But it is kind of annoying to align to 64 bytes every time. The data processed by my program cannot even be guaranteed to be aligned to 2 bytes.
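For data with no alignment guarantee, one common pattern (a sketch, not from this thread; the name and operation are illustrative) is to handle the misaligned head and tail with ordinary stores and use streaming stores only on the 16-byte-aligned middle:

```c
#include <xmmintrin.h>  /* SSE intrinsics */
#include <stdint.h>
#include <stddef.h>

/* Scale a float buffer of arbitrary alignment in place: peel the head
   until the pointer is 16-byte aligned, stream the aligned middle,
   finish the tail with ordinary stores. */
void scale_any(float *p, size_t n, float factor)
{
    size_t i = 0;
    /* head: ordinary stores until p + i reaches 16-byte alignment */
    while (i < n && (((uintptr_t)(p + i)) & 15) != 0) {
        p[i] *= factor;
        ++i;
    }
    __m128 f = _mm_set1_ps(factor);
    for (; i + 4 <= n; i += 4)   /* aligned middle: streaming stores */
        _mm_stream_ps(p + i, _mm_mul_ps(_mm_load_ps(p + i), f));
    _mm_sfence();
    for (; i < n; ++i)           /* tail: ordinary stores */
        p[i] *= factor;
}
```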
