Cache Friendly Data Structures

Hello everyone,

I'm starting to work on the finer details of YetiSim, and one of those details is memory management. I probably need to revise my design, but for now I am looking at what data structures I can use to increase performance.

I'm looking at the difference between, say, std::list and std::vector. It seems to me that std::list, being a linked list, would kill the cache. However, if I pass the cache_aligned_allocator, is this sufficient to make the container cache friendly? Are all STL containers cache friendly if they use the cache_aligned_allocator? Which ones should I prefer or avoid?

Are there negative performance implications if I were to simply override new/delete with the scalable_allocator, and use the cache_aligned_allocator for my containers?




While it is certainly true that traversing a std::list walks through memory and may displace cache lines that have to be reloaded later, expanding a std::vector eventually forces it to be relocated in memory, invalidating all cached copies at its old location, so it poses its own problems for optimal cache line reuse. Any design that accumulates data in some structure for later processing is a possible site for cache cooling. Some of these cases may be unavoidable, such as the structures you might use to represent the simulation network.

Does using cache_aligned_allocator make containers cache friendly? No, it makes them cache line aligned. This can be important if you're doing things like applying vector operations to the contents, where the penalties for unaligned accesses may be severe. It doesn't do much for "cache friendly" concerns, if by that you mean either keeping needed data from going cold through cache line eviction or avoiding cache line thrashing via such mechanisms as false sharing.

If you are doing object creation across a set of threads, using the scalable allocator will improve performance by trying to keep much of the memory block management local to the thread that is using it. Further restricting that to the cache_aligned_allocator gains you the benefits of the scalable allocator (which underlies the cache_aligned_allocator) but only makes sense if there's some specific advantage to aligning the data.

Okay, I've learned something about caching, thanks! I think there is a critical section of code that is "killing" performance, although initial results are quite promising. The bottleneck is the event dispatcher that dispatches events to be processed. A divide-and-conquer algorithm processes all events that have been woken up using parallel_reduce, and the join step merges together the changes that the events want to make to the simulation environment. These then get merged back with the MasterScheduler.

The first problem is that the divide-and-conquer algorithm hits its worst performance case when simulation time is incremented evenly... I am copying everything out once and copying it back once. Another problem is that I am using std::vector, so I'm going to investigate migrating to concurrent_vector in critical sections, or perhaps globally, to measure performance. I've learned from this that one scheduler will not fit all simulations, so I will be adjusting my design to factor out the scheduler and allow the user to pop in/out their own schedulers. I'm going to develop a library of schedulers for different simulation types and let the user select which one they want.

I've heard a lot about VTune, so I'm going to try it today to investigate performance. I think I may need to revise my design to consider caching issues. I migrated to the scalable_allocator and moved my STL containers to cache_aligned_allocators, but the performance didn't improve much. If you are interested, you can check the code out of SVN and try the sample simulation; run it with make check. Check out 'svn co'. The file src/test.cpp creates a number of entities; you can adjust that number higher to see the performance impact.

For now, I will try concurrent_vector and investigate how I can merge simulation results back in parallel, since that is a bottleneck too... it's proceeding serially right now.

Linked structures are generally not friendly to modern processors, because for large data structures that do not fit in cache, following each link usually requires loading a new cache line. Therefore, I always try to get by with vectors instead of lists if I can still get the same asymptotic running time. (I.e., if a linked list delivers O(n) time, and the array is O(n^2), that's a bad deal.) The attraction of linked lists is often that you can append an item in O(1) time. Most implementations of std::vector can append an item in O(1) amortized time, so it usually does just as well asymptotically, and is much friendlier to cache.

A related example that Scott Meyers pointed out to me is searching. He noted that in many cases, binary search over an array beats hashing with linked buckets, because of the cost of following links.

The TBB reference manual has ParallelMerge as an example. If results can be collected separately on each thread, then I would use std::vector instead of tbb::concurrent_vector. tbb::concurrent_vector is of interest if you really do need multiple threads accessing and growing the vector at the same time, but to support that it does considerably more gymnastics than std::vector.

The developer source release for today contains some significant space/speed improvements to tbb::concurrent_vector.

When you mentioned linked structures, I think I found my problem. I am playing with VTune to learn how it works, and I now suspect my initial hunch that the scheduler is the bottleneck is false. I can't say for sure until I can analyze my code with VTune.

Okay, each node in my state machine is a "linked" structure. I'm going to present an oversimplified version of what my node looks like:

class Node {
    std::vector<Node*> otherNodes;
};

In reality, I'm using boost::shared_ptr<Node>, not raw Node* pointers. However, each node is linked. The best performance would be obtained if the entire state machine could fit in the cache. I suspect that because each node links to others via pointers, these nodes may be located randomly in memory. Could it be that controlling where the Nodes are allocated would improve cache performance?

I plan on reading articles on caching today and examining execution with VTune. Perhaps this is not the problem, but it seems like it might be.


I'd first profile the code carefully before pushing cache optimizations. If you do determine that cache is an issue, here are some things to try:

  • Declare the fields in the node in order of decreasing size. That sometimes reduces padding. (Increasing order works just as well; the point is to group fields of the same size.)
  • If on a 64-bit machine, consider replacing pointers with 32-bit indices into a global array.
  • Walk the structure in roughly the order it will be traversed when the state machine runs, and copy the nodes into a contiguous array in that order; i.e., act like a copying garbage collector. An occasional break in the array does not hurt; if you run out of room, just allocate another array for more space.

- Arch

Alright, I've learned the importance of profiling code... the cache has nothing to do with it, yet. About 50% of execution time is spent incrementing and decrementing reference counts in boost::shared_ptr. Today I'm going to move over to const references where appropriate and revise the code base to take this new information into account.

I used gprof to gather this information; I couldn't for the life of me get VTune to work properly on my project. I've posted on the VTune forum about that.
