How to avoid cache fill on write miss?

I'm developing a realtime video processing system on a dual Pentium 4 Xeon platform. I'm running into memory bandwidth limitations, and would like to know if there's any solution to poor cached write performance. Basically, in processing a stream of video, I'm producing and using lots of intermediate results, referencing & updating a model, and outputting some small amount of metadata.

In the course of doing all of this, I'm allocating lots of temporary buffers and writing into them. They are read one or more times a short while later and deallocated (we're using an efficient allocator that keeps buffers on a "stack" to maximize the chance that the memory is still cached). What concerns me is that in writing these new buffers, I'm often getting cache misses which I understand should result in cache fills - IN SPITE OF THE FACT THAT I'M ABOUT TO OVERWRITE THE ENTIRE CACHE LINE! This is basic copy-back cache behavior, and it results in write performance slightly less than half what the external memory subsystem could support.

I can't afford to use non-temporal stores, because I typically want the write to be cached, in case I access that data soon enough (which I often do). However, I've found no other write strategy that seems to avoid the cache fill, if the address isn't already in cache.
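For context, the non-temporal path I'm ruling out looks like the following minimal sketch (SSE2 streaming stores; the function name is just illustrative). The streaming stores go through the write-combining buffers and avoid the cache fill on a write miss, but they also leave the data uncached afterwards, which is exactly the problem:

```c
#include <emmintrin.h>  /* SSE2 intrinsics: _mm_stream_si128, _mm_sfence */
#include <stddef.h>
#include <stdint.h>

/* Fill dst with a byte pattern using non-temporal (streaming) stores.
   The stores bypass the cache via write-combining buffers, so no
   cache fill occurs on a write miss -- but the data is NOT left in
   the cache. dst must be 16-byte aligned, n a multiple of 16. */
static void fill_nontemporal(void *dst, uint8_t value, size_t n)
{
    __m128i v = _mm_set1_epi8((char)value);
    char *p = (char *)dst;
    for (size_t i = 0; i < n; i += 16)
        _mm_stream_si128((__m128i *)(p + i), v);
    _mm_sfence();  /* drain the write-combining buffers */
}
```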

Please tell me there's some way to "allocate" a cache line (the Philips TriMedia processor has such an instruction) or some way to use write-combining buffers to do cached writes!

The cacheline allocation approach (referred to there as the "zalloc instruction") is discussed in a bit more detail towards the end of the following post:
http://groups.google.com/groups?q=zalloc+instruction+glew&hl=en&lr=&ie=U...

I see that cache line allocation could cause code to have nasty model-specific dependencies on cacheline sizes - even if you can query it via a special register, some programmers may be lazy and write their code to assume a certain cache line size. If the cacheline in a later model is larger, that same code might experience memory corruption.
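For what it's worth, the line size can at least be queried at run time rather than hard-coded. A sketch for x86 with GCC/Clang, assuming CPUID leaf 1 is available (it reports the CLFLUSH line size in EBX bits 15:8, in units of 8 bytes) -- though as noted, nothing stops a lazy programmer from ignoring it:

```c
#include <cpuid.h>   /* GCC/Clang helper for the CPUID instruction */

/* Query the cache line size at run time rather than assuming it.
   CPUID leaf 1 returns the CLFLUSH line size in bits 15:8 of EBX,
   in units of 8 bytes. */
static unsigned cache_line_size(void)
{
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;                    /* CPUID leaf 1 unavailable */
    return ((ebx >> 8) & 0xFF) * 8;  /* e.g. 64 on most modern x86 */
}
```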

However, providing some way to use write combining buffers for cached writes seems like a safe, deterministic approach that would definitely solve my problem.

Message Edited by videocoder on 08-02-2004 09:00 PM


Greetings from Intel Software Network Support.

We're running this by our Application Engineering team. We'll let you know how they respond.

Regards,

Lexi S.

Intel Software Network Support

http://www.intel.com/software

Contact us

Message Edited by intel.software.network.support on 11-30-2005 04:17 PM

Hi,

This is just a followup to let you know our engineering team is still working on this.

Regards,

Lexi S.

Intel Software Network Support

http://www.intel.com/software

Contact us

Message Edited by intel.software.network.support on 11-30-2005 04:17 PM

Here is the response we received from our Application Engineers:

A basic feature of the cache on the Intel Pentium 4 processor is that memory locations are assigned to cache lines by reading in those memory locations. This behavior is based on the assumption that regardless of whether the operation is a read or a write, there is a high probability that other locations on that same cache line will be read by the processor very shortly. There are, of course, exceptions to the rule, and your application appears to be one of them. Based on your description, there appears to be little that can be done at the application level to improve performance.

From your description of the buffer allocator used in your application, however, it would appear that there may be some room for improvement there. If the buffer allocator were to deliberately allocate buffers that were large enough to satisfy any buffer request, then the same buffer could be reused over and over again, maximizing the probability that the buffer locations would remain in the cache. This, of course, could be very wasteful if most of the buffers requested are considerably smaller than the maximum size. In general, you should reduce the number of physical buffers used and make them as large as is reasonable. Your allocator should use a LIFO algorithm when assigning buffers, and it sounds like the "stack" you describe does just that. A buffer request that is smaller than the last buffer used could still be satisfied from that buffer to maximize efficiency. In short, allow larger buffers than required when it makes sense.
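A minimal sketch of the kind of LIFO buffer recycler described above (pool depth, capacity, and names are purely illustrative): every request is rounded up to one fixed capacity, so the most recently freed buffer -- whose cache lines are most likely still warm -- can satisfy any request.

```c
#include <stddef.h>
#include <stdlib.h>

#define POOL_DEPTH   8
#define BUF_CAPACITY (64 * 1024)  /* large enough for any request */

static void *pool[POOL_DEPTH];    /* stack of free buffers */
static int   top = 0;             /* number of buffers on the stack */

/* Hand out the most recently freed buffer first (LIFO), so its
   cache lines have the best chance of still being resident. */
static void *buf_alloc(size_t n)
{
    if (n > BUF_CAPACITY) return NULL;   /* oversize: not handled here */
    if (top > 0) return pool[--top];     /* reuse the "hottest" buffer */
    return malloc(BUF_CAPACITY);         /* pool empty: allocate fresh */
}

static void buf_free(void *p)
{
    if (top < POOL_DEPTH) pool[top++] = p;  /* push for reuse */
    else free(p);                           /* pool full: release */
}
```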

The number of buffers in use at any one time should be kept as small as possible, and you should also avoid simultaneously using buffers whose physical addresses share the same lower 16 bits. This last caution is not an issue with the latest Pentium 4 processors, but it can cause some performance loss with earlier versions. We wish you the best of luck in your efforts.
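As an illustration only, two buffer base addresses can be screened for this 64 KB aliasing condition as below. This is a rough check: actual conflicts depend on the addresses of the individual accesses, not just the buffer bases.

```c
#include <stdint.h>

/* Returns nonzero if the two addresses share their lower 16 bits,
   i.e. differ by a multiple of 64 KB -- the aliasing condition that
   can hurt performance on early Pentium 4 cores. */
static int aliases_64k(const void *a, const void *b)
{
    return (((uintptr_t)a ^ (uintptr_t)b) & 0xFFFF) == 0;
}
```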

Regards,

Lexi S.

Intel Software Network Support

http://www.intel.com/software

Contact us

Message Edited by intel.software.network.support on 11-30-2005 04:18 PM

Thanks for the comprehensive response! I'm glad to be reassured that I'm not missing something or misunderstanding this aspect of the cache/memory subsystem.

As memory latencies continue to increase (in cycles), so will the cost of this deficiency. I hope that Intel will consider addressing it in future products by employing some sort of transparent mechanism similar to write-combining buffers, for cached writes.
