Can I Use Multiple Cache Lines for Performance?

I've been reading my cache and processor material. My understanding is that there are multiple cache lines, each wired for specific memory regions. Is that accurate?

Would I be correct in saying that, if I were to distribute my data structure in memory properly I could use multiple cache lines at the same time? Would it happen that each of these cache lines pre-fetches memory as well? I might have misunderstood something, but it seems to me that if I were to use memory regions properly I could effectively have multiple cache lines in the processor at the same time, suggesting that I could use multiple cache lines in my algorithm to increase performance.

Is my understanding accurate?

According to the Intel 64 and IA-32 Architectures Software Developer's Manual, System Programming Guide, Chapter 10 "Memory Cache Control", different memory regions can be assigned different types of caching (memory types). Where did you find anything about different cache regions being dedicated to different memory regions? I would think that any cache line/sector can be used for any part of memory (disregarding the caches dedicated to instructions), so you don't need to distribute anything.

There is a recommendation not to have more than one lock per cache line/sector, but that is probably only for highly contended locks, and TBB certainly does not stick to it. I'm writing this on an Intel Core Duo laptop, which apparently has 512 L1 data-cache lines and 32768 L2 unified-cache lines (for data and instructions combined), so while they are limited, they are not nearly as scarce as registers.

It is obviously best to put things that are used together temporally also spatially together in the same cache line, but I don't know if you need to concern yourself with much more than that; e.g., I doubt that many programs use PREFETCH instructions. But I am hardly an expert on these things (I even get confused by constants named for line size with values referring to sector size), so please verify for yourself or correct any mistakes (I also have not looked at other processors). For example, to what degree would a processor prefetch memory in a linear access pattern, and how effective would the cache be at not evicting older but more active entries?

I didn't know about those references, they look really useful!

On pg. 113 of "The Software Optimization Cookbook" it says:

"Each row has been designed to hold unique 2-kilobyte or 4-kilobyte blocks of memory, based upon the row number and the memory address, to speed up the time it takes to determine a cache hit. If all 256 lines could hold any memory location, the processor would have to test all 256 lines to determine a cache hit. But, by assigning each row a specific range of possible addresses, the processor tests only eight lines to determine a cache hit."

This led me to believe that the processor works the way I think it does...

Also, Raf, come onto #tbb sometime, it would be interesting to have some discussions with you!

I don't have that book. I would suppose that, for the L1 data cache on my Intel Core Duo for example, A0-A5 designate the byte within a line (64-byte lines), A6-A11 the row in the 8-way set-associative cache (64 rows of 8 ways is 512 lines or 32 kB), and A12-A31 would be the tag. So each row would correspond to 64 MB of memory (interleaved, 64 bytes at a time), and I don't see what "unique 2-kilobyte or 4-kilobyte blocks of memory" might refer to.
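The bit split described above can be sketched in a few lines of C. This is a minimal illustration, assuming the Core Duo L1 data-cache parameters just given (64-byte lines, 64 rows of 8 ways, 32-bit addresses); the function names are made up for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the address split described above: A0-A5 select the byte
 * within a 64-byte line, A6-A11 select the row (set), and the remaining
 * high bits form the tag. Parameters are assumptions from the post. */
enum { LINE_BITS = 6, SET_BITS = 6 };

static uint32_t line_offset(uint32_t a) { return a & ((1u << LINE_BITS) - 1); }
static uint32_t set_index(uint32_t a)   { return (a >> LINE_BITS) & ((1u << SET_BITS) - 1); }
static uint32_t tag_bits(uint32_t a)    { return a >> (LINE_BITS + SET_BITS); }
```

With this split, two addresses exactly 4 kilobytes apart land in the same row and differ only in the tag, which is why only the 8 ways of one row need to be searched for a hit.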

I noticed I should have been more precise in my previous post: "I would think that any cache line/sector can be used for any part of memory" was not meant to imply a fully associative cache, but any 4-kilobyte aligned range of virtual address space (A0-A11) is treated equally to any other by the cache, and at any time it might hold its capacity's worth of contiguous data. That also means that you had better keep your data together, because if you scattered related data at precise 4-kilobyte intervals you would create possible conflicts instead of avoiding them. Maybe some numbers for the L2 unified cache: A6-A17 would be the row (4096 rows of 8 ways is 32768 lines or 2 MB), each row would correspond to 1 MB of memory, and any 256-kilobyte aligned range is equal to any other.
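The conflict effect of scattering at precise intervals can be demonstrated with a toy model. The sketch below simulates an 8-way, 64-row, 64-byte-line L1 with LRU replacement (all assumptions, including the replacement policy; real hardware may differ) and counts misses:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Toy LRU model of the L1 cache discussed above (parameters assumed),
 * to show why scattering data at exact 4-kilobyte intervals hurts. */
enum { WAYS = 8, SETS = 64, LINE = 64 };

static uint32_t tags[SETS][WAYS];
static uint32_t stamp[SETS][WAYS];   /* 0 = empty; otherwise last-use time */
static uint32_t clock_now;
static int misses;

static void cache_reset(void) {
    memset(stamp, 0, sizeof stamp);
    clock_now = 0;
    misses = 0;
}

static void cache_access(uint32_t addr) {
    uint32_t set = (addr / LINE) % SETS;
    uint32_t tag = addr / (LINE * SETS);
    ++clock_now;
    for (int w = 0; w < WAYS; ++w)          /* search the 8 ways for a hit */
        if (stamp[set][w] && tags[set][w] == tag) { stamp[set][w] = clock_now; return; }
    ++misses;                               /* miss: evict least recently used */
    int victim = 0;
    for (int w = 1; w < WAYS; ++w)
        if (stamp[set][w] < stamp[set][victim]) victim = w;
    tags[set][victim] = tag;
    stamp[set][victim] = clock_now;
}
```

Touching nine blocks spaced 4096 bytes apart, twice over, misses on every single access (nine tags fighting over one 8-way row, with LRU evicting exactly the entry needed next), while nine contiguous lines miss only on the first pass.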

But if any of that is wrong (please check for yourself!), I would very much like to be educated. And thanks for the invitation, but this forum should be enough for me (provided it doesn't get enough of me!).

(Added) Oops! Any other mistakes?

As Raf suggests, the SPG manuals, volumes 1 and 2, are great resources for gleaning details such as these, and at the risk of repeating some of the useful information he passed along, here's an overview of memory caching (as opposed to some of the other caching done in the processor).

Cache types fall on a spectrum from direct mapped (where each memory line maps to exactly one cache line) to fully associative (where any cache line can hold the contents of any memory line). N-way set-associative caches lie somewhere in the middle: each cache line can hold the contents of some set of memory lines, evenly dispersed and interleaved through memory, which, as noted in the text Adrien quoted, reduces the number of cache lines the processor must examine to determine whether there was a hit. Cache line size may vary per architecture, as can the number of ways in the caches (and whether the caches are write back or write through, among other esoteric features).

Pentium 4 and Intel Xeon processors have a sectored L2 (and L3, if present) cache, which for all practical purposes means that if adjacent-sector prefetch is enabled, a request for one cache line of an associated pair of cache lines (Intel implementations all use 128-byte, 2-cache-line sectors) will also generate a prefetch for the other cache line in the pair, on the speculation that it will be needed eventually anyway. This is one of at least four kinds of hardware prefetch supported in current processors. There are a few specialized cases where application of software prefetch (in the form of an actual instruction in the stream) can hide some memory latency by starting the fetch early, but generally it is better to let the machine figure out when to prefetch, since optimal conditions vary from architecture to architecture.
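To make those "specialized cases" concrete: walking an array of pointers to scattered nodes is one pattern hardware prefetchers handle poorly, since the next address is data-dependent. A sketch using the GCC/Clang `__builtin_prefetch` intrinsic follows; the function names are invented, and the prefetch distance is a tuning guess, not a recommendation.

```c
#include <assert.h>
#include <stddef.h>

/* Sketch: software prefetch in an irregular access pattern (array of
 * pointers to scattered nodes). PREFETCH_DIST is an assumed tuning
 * parameter; the right distance depends on the machine and workload. */
#define PREFETCH_DIST 8

struct node { long payload; };

long sum_nodes(struct node *const *ptrs, size_t n) {
    long total = 0;
    for (size_t i = 0; i < n; ++i) {
        if (i + PREFETCH_DIST < n)   /* start fetching a future node early */
            __builtin_prefetch(ptrs[i + PREFETCH_DIST], 0 /* read */, 3 /* keep */);
        total += ptrs[i]->payload;
    }
    return total;
}
```

Note that `__builtin_prefetch` is only a hint; the code is correct with or without it, which is exactly why it is hard to know whether it helps without measuring on the target architecture.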

Isn't sectoring primarily a cheaper means (for tag matching and cache coherence, anyway) of providing adjacent cache lines, with prefetch only more likely because it would not evict any more-useful cache line, but not automatic (and apparently not even configurable as such)? Otherwise I can see no difference between a sector and a line...

Okay, I misunderstood the book. I'll start going through the architecture references that Raf has suggested. In my mind I thought that I could scatter pieces of a data structure through memory to trick the processor into loading parts of that data structure into separate areas of the L1 cache. I then thought that if the processor pre-fetched each line in different areas of memory in parallel, I effectively could have multiple L1 cache data streams for algorithm development.

Thanks everyone for answering my questions, and helping out while I learn these low-level details.

Yes, as you may now realize, your attempts to squeeze some extra performance out of the cache may, depending on how regular your dispersal was, have increased contention between cache lines: if those locations in virtual memory happened to resolve to the same set, they would fill up its ways faster.

Raf, this paper might help explain the intent of sectored caches, which does have something to do with reducing tag sizes. Here's another one.

Actually, reducing tag-array size (the number of tags, because of the cost in both die real estate and power consumption of associative lookup and dual-ported snooping) seems more important than reducing tag size, and it is fortunate that, for the same data capacity, doubling sector size reduces tags by 1 bit but the far more important number of tags by half; or, turned around, it is relatively cheap to add lines to a sector even if their occupancy rate is low. But my question was rhetorical: the intent of sectored caches is to *not* always load/prefetch all the lines in a sector.

Here's a real question, directly relevant to how programmers should reason about locality issues: does false sharing happen between sectors (worse because sectors are bigger) or between lines (same sector happily residing in more than one processor cache, one line dirty in one processor and another line dirty in another processor), and is that always so or dependent on how cache coherency logic is implemented? I didn't read those documents in detail, and so there may be an explicit statement that I have missed, but there are a few places that would lead me to suspect that false sharing could be between 64-byte lines, not between 128-byte sectors, in contradiction with what I think I have heard from TBB. The clearest may be in the second document, at 2.2 C).
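Whichever way that question is answered, the defensive move for a programmer is the same: pad independently written hot data out to at least the larger granularity. A sketch, assuming the 128-byte sector size discussed above (the struct name and layout are made up for the example):

```c
#include <assert.h>

/* Pad per-thread counters so each occupies its own 128-byte sector,
 * which avoids false sharing whether coherence operates on 64-byte
 * lines or 128-byte sectors. The size is an assumption from the
 * discussion above, not a universal constant. */
#define SECTOR_SIZE 128

struct padded_counter {
    volatile long value;
    char pad[SECTOR_SIZE - sizeof(long)];
};

/* One counter per thread; adjacent entries never share a sector. */
static struct padded_counter counters[4];
```

The cost is wasted space (here, well over 100 bytes per counter), so this only makes sense for data that is genuinely written frequently by different processors.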

Nicest new idea I learned from these documents (the second one, actually): decoupled sectored caches.
