Linked Lists - Incompatible with Parallel Programming?

I've been asked several times why TBB does not have a concurrent list class; i.e., a list that supports concurrent access. The answer is that we'd add one if:

    • We could figure out semantics that are useful for parallel programming and

    • We could implement it reasonably efficiently on current hardware.



I usually try to avoid linked lists even for sequential programming if I'm programming for performance. My reasons are:

    • Linked lists are often misused in ways that create asymptotic slowness. Usually the culprit is searching the list. Yes, similar abuses can occur for other data structures too. But lists seem to attract this abuse.

    • Linked lists are unfriendly to cache. Adjacent items in a list tend to be scattered in memory. With cache misses costing on the order of 100x a cache hit, this can be a significant performance issue.



For parallel programming, add another flaw:

    • Traversing a linked list is inherently serial. [Theorists will point out that traversal can be done in parallel if you have a processor per node in the list. Feel free to buy that many Intel processors -- I own Intel stock.]



Two traditional attractions of linked lists are:

    1. Linked lists are about the easiest dynamic data structure to write from scratch.

    2. Prepending and appending take O(1) time.



But modern languages like C++ provide polymorphic dynamic data structures like std::vector and std::deque. You don't have to write them from scratch. Prepending or appending to a deque also takes O(1) time. Appending to a vector takes O(1) amortized time, that is, time averaged over many append operations.

Here's a speed test you might want to try. Construct a container, append n items, walk the container once, and destroy it. Here's the code:

template<typename Container>
int Iota( int n ) {
    Container container;
    // Append n items.
    for( int i=0; i<n; ++i )
        container.push_back(i);
    // Walk the container once, summing the items.
    int sum = 0;
    for( typename Container::const_iterator j=container.begin(); j!=container.end(); ++j )
        sum += *j;
    return sum;
}   // The container is destroyed on return.


I tried this fragment on a Linux box and found that std::deque was slightly faster than std::list when n>=3, and std::vector was slightly faster when n>=10. When n>=100, std::deque was more than 10x faster than std::list, and even std::vector was more than 3x faster than std::list. So for very short collections, std::list might pay off. But for big collections, it's second-rate.

Of course linked lists do have some virtues, notably when concatenating lists, splicing lists, and inserting items in the middle. I use lists when I need to do that. But getting back to parallelism, which of those operations makes any sense in parallel programming? Concurrent splicing and inserting seem awfully tricky to use correctly. For example, if I really need to insert in the middle of the list, it must be because there is something special about the insertion context. But if there are other threads inserting at the same time, how do I know the context will not be broken?

The two operations on lists that I think could be useful in parallel programming are:

    1. concatenating two lists, in constant time

    2. splitting a list into two sublists, or at least viewing it as two sublists, in constant time



For example, a parallel reduction could use "concatenate" as its reduction operation, and thus build a list of N items in O(N/P+log(P)) time. The log(P) term arises from a tree reduction at the end. The problem is the second operation. To keep a list from becoming a serial bottleneck, we need a way to traverse it in parallel. That probably means it is no longer a linked list, but some kind of (balanced?) tree structure.
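
To make the reduction idea concrete, here's a minimal sketch, assuming the lambda form of tbb::parallel_reduce (which appeared in TBB releases later than this post) and using std::list::splice as the O(1) concatenation step. The name iota_list is made up, and the pass-by-value list copies are kept for brevity:

#include <list>
#include "tbb/blocked_range.h"
#include "tbb/parallel_reduce.h"

std::list<int> iota_list( int n ) {
    return tbb::parallel_reduce(
        tbb::blocked_range<int>(0,n),
        std::list<int>(),                                  // identity: the empty list
        []( const tbb::blocked_range<int>& r, std::list<int> partial ) {
            for( int i=r.begin(); i!=r.end(); ++i )
                partial.push_back(i);                      // serial append within one chunk
            return partial;
        },
        []( std::list<int> left, std::list<int> right ) {
            left.splice( left.end(), right );              // O(1) concatenation at each join
            return left;
        } );
}

Each join is O(1), so the build matches the O(N/P+log(P)) bound above; traversing the resulting list, however, is still serial, which is exactly the bottleneck in question.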

I've had a recurring thought that we should add this kind of list, one that supports concatenation and splitting in constant time. But we really need motivating use cases before implementing it. Suggestions for good use cases or demos appreciated.

- Arch Robison

P.S. I thought about writing this blog as a political attack ad, but given the technical details, decided against it. If I had done it that way, it would have started:

Mr. Linked List is running for office. He's popular everywhere. But here's what Mr. Linked List doesn't want you to know... .

16 comments

jaredkeithwhite:

Arch,

I realize I'm a bit late to this discussion. However, I'd like to add one side note to the discussion and perhaps see where you feel my concerns stand within the roadmap of TBB.

While it is true that linked lists are a bad idea as far as data parallelism goes, given their inherently serial nature, I think that people are often looking to use a concurrent linked list for a completely different reason, apart from the usual benefits of a linked list. The real glaring oversight in TBB isn't necessarily a missing concurrent linked list. It's the inability to delete from the concurrent vector!

I have an implementation that I am working on that has the following concurrent execution requirements:
- very frequent iteration
- occasional add/delete
- memory is cheap and I don't mind buying more
- lists are fairly small (almost none have more than 10 items)

By "very frequent iteration", I mean extremely frequent. 200-300 or so times per second.
By "occasional add/delete", this could be anywhere from once every 2 seconds. The real time constraints of the add/delete operation are a little less stringent, because of other optimizations that I've done. For example, the iteration process will ignore an item if it's about to be deleted, even if it hasn't been deleted yet.

I understand the obvious reasons why concurrent_vector doesn't support deletes. It's extremely difficult to implement and would cause performance degradation for most other operations on the collection. One particular way that I've considered implementing this is with element reuse within a vector. It would allow for instances such as mine, where I have frequent iteration and infrequent add/delete. For folks with infrequent iteration and frequent add/delete, another implementation might be required. Additionally, this implementation requires that order of elements be unimportant.
- obviously this is a pretty narrow edge case, but please bear with me, I'll address other cases in a minute
- this implementation would require that concurrent_vector be non-intrusive (I'm not sure if it's intrusive or not in its current form).

Let's say you have a concurrent_vector with seven integers, 0-6. And let's say you want to delete element 3, also value 3. What we could do is push the element INDEX onto a concurrent_queue, and mark the element as deleted. The iterators would skip over it naturally.

Start: 0, 1, 2, 3, 4, 5, 6
erase[3]: 0, 1, 2, x, 4, 5, 6 -- and the concurrent_queue has index #3 on it.

Now, let's say that we have a new value to push onto the array. A new method, "push_next" would "pop_if_present" from the concurrent queue to see if an element index was available for reuse. If one was discovered, it would use that element. Otherwise, push_back would be invoked. In the event we want to add the value "20" to the list:

push_next[20]: 0, 1, 2, __ 20 __, 4, 5, 6 -- notice we reused the element #3.
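
To make that concrete, here's a minimal sketch of what I have in mind, assuming tbb::concurrent_vector, tbb::concurrent_queue, and C++11 atomics (try_pop being the newer spelling of pop_if_present, and grow_by assumed to return an iterator as in later TBB versions). ReusableVector, Slot, and push_next are made-up names, not TBB API:

#include <atomic>
#include <cstddef>
#include "tbb/concurrent_vector.h"
#include "tbb/concurrent_queue.h"

template<typename T>
class ReusableVector {
    struct Slot {
        std::atomic<bool> live{false};   // false == erased or not yet published
        T value{};
    };
    tbb::concurrent_vector<Slot> slots_;
    tbb::concurrent_queue<std::size_t> free_;   // indices available for reuse
public:
    std::size_t push_next( const T& v ) {
        std::size_t i;
        if( !free_.try_pop(i) )                          // reuse an erased slot if any...
            i = slots_.grow_by(1) - slots_.begin();      // ...else append a fresh one
        slots_[i].value = v;                             // write the payload first,
        slots_[i].live.store(true, std::memory_order_release);   // then publish it
        return i;
    }
    void erase( std::size_t i ) {                        // soft-delete: mark dead, recycle index
        slots_[i].live.store(false, std::memory_order_release);
        free_.push(i);
    }
    template<typename F>
    void for_each( F f ) {                               // iteration skips dead slots
        for( auto& s : slots_ )
            if( s.live.load(std::memory_order_acquire) )
                f(s.value);
    }
};
// Caveat: a reader racing with erase/push_next can briefly observe a recycled
// value; as described above, my workload explicitly tolerates that.

As stated above, element order is not preserved across reuse, which this scheme assumes is acceptable.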

Another implementation might allow for a delete, which would simply set the value wrapper to "deleted". In essence, it's a soft-delete. Subsequent invocations of "compact" would remove these items, which would obviously require a pretty firm write lock.

Do you have any thoughts I might draw upon? My current implementations use a concurrent_hash_map (because it's the only container that currently supports "erase"), but I'm a bit concerned about iteration performance, to say the least.

Regards,
-Jared

Andrey Marochko:

Another tribute to intrusive lists. Recently we've added two more of them into the TBB scheduler as part of the new cancellation propagation mechanism. One is the list of master thread scheduler objects, and the other is the list of task group contexts. Using them allowed us to remove locks from the normal (that is, hot) execution path.

Arch D. Robison (Intel):

Thanks for pointing out intrusive lists. I'm a big fan of them - most of the lists inside the KAI C++ compiler were intrusive. The template we had used a pointer-to-member to describe the link. For a list of type T the programmer declared a FrugalSListLink in their structure, say foo_link, and declared the list as FrugalSList<T, &T::foo_link>. For our new programmers, it was often their first introduction to the pointer-to-member feature in C++.
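
Roughly, the idea looked like this (a from-memory sketch in modern C++ style, not the actual KAI code; all names made up in the spirit of the original):

// A link object the programmer embeds in their own structure.
struct FrugalSListLink {
    void* next = nullptr;
};

// The list is parameterized by the type and by which embedded link to use.
template<typename T, FrugalSListLink T::*Link>
class FrugalSList {
    T* head_ = nullptr;
public:
    void push_front( T* item ) {
        (item->*Link).next = head_;   // thread the list through the embedded link
        head_ = item;
    }
    T* front() const { return head_; }
    static T* next( T* item ) { return static_cast<T*>( (item->*Link).next ); }
};

// Usage: declare the link member, then name it in the list type.
struct Foo {
    int value;
    FrugalSListLink foo_link;
};
// FrugalSList<Foo, &Foo::foo_link> list;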

I agree that my original posting narrowly addressed speeding up wall-clock time for classic supercomputing-type CPU-intensive applications, and did not address other important areas of concurrency.

anonymous:

I agree with Dan S. Linked lists are a vital part of parallel programming. You seem to have considered only one possible implementation of linked lists, i.e., one that emulates the behavior of std::list (using the default allocator, which is a known performance stealer in this application).
In practice, most linked lists I have seen are intrusive. For example, I have an application where most threads do lookups on a hash, but one thread wants to iterate through everything in the hash. Hash iterators have horrible performance. Solution: add a pointer to the T, and use that to iterate. This is far more scalable and less complex than making T belong to two independent containers.
Likewise, for heavily IO-bound oversubscribed tasks, CAS linked lists blow away every other method I've tried, and it is easy to build lock-free stacks, queues, etc. out of them.
Not all of us are game programmers, and we use concurrent methods liberally for speeding up things other than the CPU.
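
For concreteness, here's a minimal Treiber-style sketch of the kind of CAS linked list I mean, written with C++11 atomics (which postdate this thread). As discussed elsewhere on this page, real code must also handle ABA and safe memory reclamation:

#include <atomic>
#include <utility>

template<typename T>
class LockFreeStack {
    struct Node { T value; Node* next; };
    std::atomic<Node*> head_{nullptr};
public:
    void push( T v ) {
        Node* n = new Node{ std::move(v), head_.load(std::memory_order_relaxed) };
        // On failure the CAS refreshes n->next with the current head.
        while( !head_.compare_exchange_weak( n->next, n,
                   std::memory_order_release, std::memory_order_relaxed ) )
            ;
    }
    bool pop( T& out ) {
        Node* n = head_.load(std::memory_order_acquire);
        // On failure the CAS refreshes n with the current head. NB: reading
        // n->next here is only safe with deferred reclamation (the ABA /
        // use-after-free issue mentioned elsewhere in this discussion).
        while( n && !head_.compare_exchange_weak( n, n->next,
                        std::memory_order_acquire, std::memory_order_relaxed ) )
            ;
        if( !n ) return false;
        out = std::move(n->value);
        delete n;   // unsafe under contention without hazard pointers or similar
        return true;
    }
};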

AJ:

Just as a note, I did read your blog on volatile. I read all of the blogs here, and learn a lot from them.

Arch D. Robison (Intel):

Yes, it comes down to concurrent access vs. data parallel programming. I should have said "data parallel" in the title.

One problem we're running into with TBB (and I think other data parallel systems have it too) is how to marry data-parallel programming with OS-style ("many clients") concurrent programming. E.g., someone in the TBB forum has a nice example where agents wait on an external message/event (definitely OS-style), but want to process the message using data-parallel style processing.

A single CAS beats a single lock/unlock pair. But a lot of the non-blocking algorithms seem to use more than one CAS. Three summers ago, an intern and I looked into non-blocking algorithms and timed some, specifically dual-queues and skip-lists. The results were disappointing. Unless the machine was grossly oversubscribed with many more software threads running than hardware threads available, the locked algorithms were substantially faster. The problems seemed to arise from the Pentium 4 processor's relatively slow CAS. We also suspected that CAS-based algorithms were causing a lot of cache line thrashing.

I had a similar experience when experimenting with lock-free (or at least lock-reduced) algorithms in the core TBB scheduling algorithm. Cache-line ping-ponging seemed to be a problem. In contrast, locks not only force logical mutual exclusion, they also (if used carefully) tell waiting threads "keep your paws off my cache lines!". When a cache line has to be transferred across sockets, that could conceivably be a big cost. We came to the conclusion that about 3 CASes were as bad as one lock/unlock pair, at least for short critical sections. (Sounds like geeky poker.) The situation is probably somewhat better with Core 2, but I have not done the timings lately. Of course, for grossly oversubscribed situations, non-blocking algorithms are a win because they avoid problems where a lock holder is preempted.

I recall that the CAS-based skip list had similar issues compared to a skip list with a big fat lock around it, at least for the 2-4 core systems we were looking at. Which was depressing. Having written both a red-black tree implementation (for the KAI C++ <map>) and a skip list (for the guts of the KAI C++ compiler), I'm definitely a fan of the simplicity of skip lists.

Thanks for all the comments. My previous blog on how 'volatile' is almost useless for multi-threaded programming got no comments, so I was wondering if anyone was reading it.

anonymous:

Thanks for the reply, Arch.

I think we're coming from two rather different perspectives here, as to my mind, parallel is simply a synonym for concurrent.

Perhaps (in my terminology) data-parallelism is a better word for what you are getting at? Something where each thread/process is crunching hard on its own little sub-problem: lots of things happening in parallel but very little contention over the same resources.... In that scenario, optimizing for overall speed, you'd tend towards algorithms and data structures that were as fast as possible within a single-threaded context and then add concurrency primitives on top. First: just enough for safety, then continue increasing the complexity of locking schemes/communications overhead until your problem is (to abuse terminology) compute-bound rather than communications-bound again.

I'd agree with that preference/direction and it seems that's where you've gone with TBB -- which is probably most appropriate in a four-core world.

But it's rather disingenuous to claim that because linked lists are less than ideal in a scenario like the one above, they are "Incompatible with Parallel Programming". Especially so as I imagine I'm not alone in thinking "more parallel == more concurrent access", a scenario where linked lists can pull far, far ahead due to the ease of concurrent lock-free access.

((Side note: You're the Intel person, but I'm having a hard time imagining a scenario where CAS is more expensive than explicit locking if you compare them at the same granularity.))

((Another side note: high levels of concurrency/contention are the norm in many real-world applications... databases, web servers, pretty much anything structured as a many-clients service.... ))

((Side side note: Skip lists are a brilliant way of addressing some of the asymptotic complexities you're complaining about. And there are more cache-friendly variants of the linked list itself.))

Arch D. Robison (Intel):

Thanks for the citation. It's an excellent discussion of parallel prefix. TBB has the parallel prefix operation (tbb::parallel_scan), though I have found it slow in practice because (1) it is a two-pass algorithm and hence uses twice the memory bandwidth of a single pass and (2) TBB was lacking thread affinity support. Problem (2) is being addressed, and so I hope to see some significant improvement once I add the affinity logic to tbb::parallel_scan.

When a machine has a processor per node in a linked list, amazing things can be done. Hillis and Steele's article assumes that kind of machine, namely the CM-1 or CM-2, which had up to 2^16 processors. Cynics will note, however, that Thinking Machines switched to a more conventional MIMD architecture with the CM-5. Coarse-grain MIMD machines seem to dominate the market for now. A processor per list node would seem to be an unreasonable assumption on those kinds of machines. A sublist per node would work.

More recently, Uzi Vishkin has been promoting a very highly parallel "XMT" (http://www.umiacs.umd.edu/~vishkin/XMT/index.shtml) machine that features fast parallel prefix.

anonymous:

Data Parallel Algorithms (http://cva.stanford.edu/classes/cs99s/papers/hillis-steele-data-parallel-algorithms.pdf)

Arch D. Robison (Intel):

Yes, it is true that std::vector is not concurrency friendly. But TBB's tbb::concurrent_vector is, and indeed allows multiple readers and writers to work on it. It's not as contiguous as a std::vector, but certainly more contiguous than a std::list. A tbb::concurrent_vector of length N has O(lg(N)) discontinuities in its layout.

I should have defined what I meant by "parallel programming", and distinguished it from "concurrent programming". By parallel programming, I meant programming for parallel speedup, not just the concurrent programming taught in OS classes. OS programming != parallel programming. OS programming is about virtualizing resources, in a way that threads do not interfere with other threads. It has to be that way, because most of those threads are running independent tasks authored by independent authors. Parallel programming is about orchestrating threads to achieve a common purpose. Indeed, if a parallel program spends most of its time in the kernel, it's probably not getting much speedup. (Parallel I/O is probably the exception to this rule.)

Yes, compare-and-swap protocols can be used to concurrently prepend to a list. Indeed, we use that technique in the implementation of TBB. (Sometimes it takes some care to avoid ABA problems.) But I see compare-and-swap as useful for parallel programming only when the operations can be spread out across many cache lines. If they all contend for a single cache line, it's just another bottleneck. And no matter how clever the construction of a linked list, its traversal is inherently serial.

I agree that linked lists are useful for OS and concurrent programming, but I see them as largely (but not completely) useless for parallel programming. That's not to rule out linked structures. Something like ropes sounds like a good deal.


