I'm developing a real-time video processing system on a dual Pentium 4 Xeon platform. I'm running into memory bandwidth limitations, and would like to know if there's any solution to poor cached write performance. Basically, in processing a stream of video, I'm producing and using lots of intermediate results, referencing and updating a model, and outputting a small amount of metadata.
In the course of doing all of this, I'm allocating lots of temporary buffers and writing into them. They are read one or more times a short while later and then deallocated (we're using an efficient allocator that keeps buffers on a "stack" to maximize the chance that the memory is still cached). What concerns me is that in writing these new buffers, I'm often getting cache misses, which I understand result in cache fills - IN SPITE OF THE FACT THAT I'M ABOUT TO OVERWRITE THE ENTIRE CACHE LINE! This is basic write-allocate behavior for a copy-back (write-back) cache: each line is read from memory before I overwrite it and written back to memory later, so the bus carries roughly twice the data I actually produce. The result is write performance slightly less than half what the external memory subsystem could support.
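For anyone unfamiliar with the allocation pattern, here is a rough sketch of the kind of stack-style buffer reuse described above. It is not the actual allocator; buf_pool, pool_acquire, pool_release and POOL_CAPACITY are made-up names for illustration, and it assumes all buffers are the same size.

```c
#include <stddef.h>
#include <stdlib.h>

#define POOL_CAPACITY 32  /* hypothetical limit on parked buffers */

typedef struct {
    void  *free_bufs[POOL_CAPACITY]; /* LIFO stack of released buffers */
    size_t count;                    /* number of buffers currently parked */
    size_t buf_bytes;                /* size of every buffer in this pool */
} buf_pool;

void pool_init(buf_pool *p, size_t buf_bytes)
{
    p->count = 0;
    p->buf_bytes = buf_bytes;
}

/* Hand out the most recently released buffer first, since it is the
   one most likely to still be resident in cache. */
void *pool_acquire(buf_pool *p)
{
    if (p->count > 0)
        return p->free_bufs[--p->count];
    return malloc(p->buf_bytes);     /* pool empty: fall back to the heap */
}

/* Park the buffer on top of the stack; free it if the stack is full. */
void pool_release(buf_pool *p, void *buf)
{
    if (p->count < POOL_CAPACITY)
        p->free_bufs[p->count++] = buf;
    else
        free(buf);
}
```

Reuse in LIFO order helps only when the buffer is still cached; a buffer that has been evicted (or one fresh from the heap) still takes the cache fill on its first write.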
I can't afford to use non-temporal stores, because I typically want the write to be cached, in case I access that data soon enough (which I often do). However, I've found no other write strategy that seems to avoid the cache fill, if the address isn't already in cache.
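To make the trade-off concrete, here is a minimal sketch of the two store strategies using SSE2 intrinsics. The function names are made up for illustration, and it assumes an x86 compiler with SSE2 enabled, a 16-byte-aligned destination, and a length that is a multiple of 16.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 intrinsics (also pulls in _mm_sfence) */

/* Ordinary cached stores. On a miss the line is filled from memory
   before being overwritten, but the data stays cached for later reads. */
void fill_cached(uint8_t *dst, size_t bytes, uint8_t value)
{
    assert(((uintptr_t)dst % 16) == 0 && (bytes % 16) == 0);
    __m128i v = _mm_set1_epi8((char)value);
    for (size_t i = 0; i < bytes; i += 16)
        _mm_store_si128((__m128i *)(dst + i), v);
}

/* Non-temporal stores. These go out through the write-combining
   buffers without a cache fill, but the data is not kept in cache,
   so a read soon afterwards has to go back to memory. */
void fill_streaming(uint8_t *dst, size_t bytes, uint8_t value)
{
    assert(((uintptr_t)dst % 16) == 0 && (bytes % 16) == 0);
    __m128i v = _mm_set1_epi8((char)value);
    for (size_t i = 0; i < bytes; i += 16)
        _mm_stream_si128((__m128i *)(dst + i), v);
    _mm_sfence();  /* order the streaming stores before later accesses */
}
```

Both functions leave the same bytes in memory; the difference is only in bus traffic and in whether the lines end up cached, which is exactly the choice I'd like not to have to make.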
Please tell me there's some way to "allocate" a cache line (the Philips TriMedia processor has such an instruction) or some way to use write-combining buffers to do cached writes!
The cacheline allocation approach (referred to as the "zalloc instruction") is discussed in a bit more detail, towards the end of the following post:
I see that cache line allocation could cause code to have nasty model-specific dependencies on cache line sizes - even if you can query the size via a special register, some programmers may be lazy and write their code to assume a certain cache line size. If the cache line in a later model is larger, that same code might experience memory corruption.
However, providing some way to use write combining buffers for cached writes seems like a safe, deterministic approach that would definitely solve my problem.
Message Edited by videocoder on 08-02-2004 09:00 PM