Linked Lists - Incompatible with Parallel Programming?

By Arch Robison (Intel) (19 posts) on December 20, 2007 at 3:08 pm

I've been asked several times why TBB does not have a concurrent list class; i.e., a list that supports concurrent access. The answer is that we'd add one if:

I usually try to avoid linked lists even for sequential programming if I'm programming for performance. My reasons are:

For parallel programming, add another flaw:

Two traditional attractions of linked lists are:

  1. Linked lists are about the easiest dynamic data structure to write from scratch.
  2. Prepending and appending take O(1) time.

But modern polymorphic language like C++ provide dynamic data structures like std::vector and std::deque. You don't have to write them from scratch. Prepending or appending to a deque also takes O(1) time. Appending to a vector takes O(1) amortized time. Amortized time is the time averaged over many append operations.

Here's a speed test you might want to try. Construct a container, append n items, walk the container once, and destroy it. Here's the code:

template<typename Container>
int Iota( int n ) {
    Container container;

    for( int i=0; i<n; ++i )

        container.push_back(i);

    int sum = 0;

    for( typename Container::const_iterator j=container.begin(); j!=container.end(); ++j )

        sum += *j;

    return sum;

}

I tried this fragment on a Linux box and found that std::deque was slightly faster than std::list when n>=3, and std::vector was slightly faster when n>=10. When n>=100, std::deque was more than 10x faster than std::list, and even std::vector more than 3x faster than std::list. So for very short collections, std::list might pay off. But for big collections, its second-rate.

Of course linked lists do have some virtues, notably when concatentating lists, splicing lists, and inserting items in the middle. I use lists when I need to do that. But getting back to parallelism, which set of those operations make any sense in parallel programming? Concurrent splicing and inserting seems awfully tricky to use correctly. For example, if I really need to insert in the middle of the list, it must be because there is something special about the insertion context. But if there are other threads inserting at the same time, how do I know the context will not be broken?

The two operations on lists that I think could be useful in parallel programming are:

  1. concatenating two lists, in constant time
  2. splitting a list into two sublists, or at least view it as two sublists, in constant time

For example, a parallel reduction could use "concatenate" as its reduction operation, and thus build a list of N items in O(N/P+log(P)) time. The log(P) term arises from a tree reduction at the end. The problem is the second operation. To keep a list from becoming a serial bottleneck, we need a way to traverse it in parallel. That probably means it is no longer a linked list, but some kind of (balanced?) tree structure.

I've had a recurring thought that we should add this kind list, one that supports concatenation and splitting in constant time. But we really need motivating use cases before implementing it. Suggestions for good use cases or demos appreciated.

- Arch Robison

P.S. I though about writing this blog as a politcal attack ad, but given the technical details, decided against it. If I had done it that way, it would have started:

Mr. Linked List is running for office. He's popular everywhere. But here's what Mr. Linked List doesn't want you to know... .

Categories: Parallel Programming, Threading Building Blocks

Comments (16)

December 20, 2007 4:22 PM PST

Kevin Farnham
Kevin Farnham
Performance differences that relate to the number of items to be processed are indeed common. If you search a sorted floating point array for the index of the value that is closest to a given input value, a for(i=0; i<n; i++) search is faster for a small array, but "guessing" the location through interpolation based on bounding values is much faster for a large array. It's easy to see that a small linked list might be efficient. But if the list is very small, the overhead of multithreading will make a multithreaded version less efficient than a single-threaded model, I'd think.

The memory scattering problem with linked lists can also be an enormously significant issue. If the list is constructed from a large number of inserted links, with many other links deleted, the memory locations of "adjacent" list items will often be far apart -- meaning that efficient utilization of cache (which is readily possible with standard arrays or vectors) will be impossible with this kind of linked list.

Still, a lot of people like using linked lists!
December 21, 2007 5:17 AM PST


Daniel Yokomizo
For searching it would probably be better to use a skip list, as it can offer a way to parallelize speculative searching (i.e. one piece with one processor, skip N itens and search the next N piece with another processor, ad infinitum).

If you're looking for a structure with O(1) concatenation, you don't need to look any further than ropes (or cords). They're usually used as string replacements, but we can use the with any element type. They form a very nice algebra and provide some form of locality, as each leaf can be implemented with an array of N elements. As the leaf length is fixed, the cost of arbitrary splits is constant. FWIW here's a possible implementation in a Java:

class Rope {
private static final int CAPACITY = 10;
public Rope concatenate(Rope tail) {
if (this.size() + tail.size() < CAPACITY) {
// avoid creating many short ropes
return new Leaf(this.toArray());
} else {
return new Concatenate(this, tail);
}
}
public static class Leaf extends Rope {
private final E[] elements;
...
public int size() {return elements.length;}
}
public static class Concatenate extends Rope {
private final Rope head;
private final Rope tail;
...
public int size() {return head.size() + tail.size();}
}
}
December 21, 2007 5:25 AM PST


Daniel Yokomizo
Ops, I missed the split implementation:

class Leaf extends Rope {
public Rope[] split(int index) {
return new Rope[]{new Leaf(subArray(elements, 0, i)),new Leaf(subArray(elements, i, elements.length)) };
}
}
class Concatenate extends Rope {
public Rope[] split(int index) {
if (index < head.length) {
Rope[] pieces = head.split(index);
return new Rope[] {pieces[0], pieces[1].concatenate(tail)};
} else {
Rope[] pieces = tail.split(index - head.length);
return new Rope[] { head.concatenate(pieces[0]), pieces[1]};
}
return new Rope[]{new Leaf(subArray(this.elements, 0, i)),new Leaf(subArray(this.elements, i, this.elements.length)) };
}
}

We just need to maintain the rope's balance to have a logarithmic split costs. I misstyped it as constant, which is incorrect anyway.
December 21, 2007 6:57 AM PST


Lou Glassy
To the author - by the way, are you -the- Arch Robison who wrote IFP all those years ago? (If you are, I just wanted to say I tinkered around with that & found it brain-stretching. Thanks kindly.)
December 21, 2007 10:36 AM PST


Dan S.
I'm not so sure. As a matter of fact, I'd have to straight-up disagree. Executive summary: an std::vector/these other "faster" data structures cannot even safely have a single reader and writer working at once. THEY are the ones incompatible with parallel programming. Faster != more parallel.

Consider the lowly singly-linked list, which almost nobody uses in C/C++/Java/C#/Pascal and extended family. You can prepend safely with multiple readers. You can have multiple prependers with a compare-and-swap-or-re-prepend protocol. You can transactionally prepend many items with the same protocol. You can use RCU techniques to a single writer to go wild anywhere in the list without disturbing any readers.

(The same compare-and-swap-or-rebuild protocol will give you that nice transactional feeling with any purely functional data structure, too!) ( ... extended rant about the advantages of programming functionally snipped here ...)

In short you're very wrong! Couldn't be more wrong! Just look at the data structures used in reentrant operating system kernels... the more contention a resource has, the more performance is needed from it, the more likely it is to be stored in a linked list container! Just for good measure, since I seem to be using a lot of these: !!!!!
December 21, 2007 12:25 PM PST

Arch Robison (Intel)
Arch Robison (Intel)
Thanks for the pointer to ropes. Ropes sound like what I'm looking for in the was of a data structure for concatenate and split.

Yes, I'm the author of IFP. It was my master's degree project. It's been a while since I've thought about it. My big regret is not implementing proper tail recursion in it. There's been a lot of compiler/interpreter innovations since I wrote it. If I were to do it over again, I'd write a JIT for it, not an interpreter. Backus's FP, from which IFP was derived, was definitely a different way to think about programming.
December 21, 2007 1:03 PM PST

Arch Robison (Intel)
Arch Robison (Intel)
Yes, it is true that std::vector is not concurrency friendly. But TBB's tbb::concurrent_vector is, and indeed allows multiple readers and writers to work on it. It's not as contiguous as a std::vector, but certainly more contiguous than a std::list. A tbb::concurrent_vector of length N has O(lg(N)) discontinuities in its layout.

I should have defined what I meant by "parallel programming", and distinguished it from "concurrent programming". By parallel programming, I meant programming for parallel speedup, not just the concurrent programming taught in OS classes. OS programming != parallel programming. OS programming is about virtualizing resources, in way that threads do not interfere with other threads. It has be that way, because most of those threads are running independent tasks authored by independent authors. Parallel programming is about orchestrating threads to achieve a common purpose. Indeed, if a parallel program spends most of its time in the kernel, it's probably not getting much speedup. (Parallel I/O is probably the exception to this rule.)

Yes, compare-and-swap protocols can be used to concurrently prepend to a list. Indeed, we use that technique in the implementation of TBB. (Sometimes it takes some care to avoid ABA problems.) But I see compare-and-swap as useful for parallel programming only when the operations can be spread out across many cache lines. If they all contend for a single cache line, it's just another bottleneck. And no matter how clever the construction of a llinked ist, its traversal is inherently serial.

I agree that linked lists are useful for OS and concurrent programming, but I see them as largely (but not completely) useless for parallel programming. That's not to rule out linked structures. Something like ropes sound like a good deal.
December 21, 2007 1:57 PM PST


Greg Buchholz
Data Parallel Algorithms (http://cva.stanford.edu/classes/cs99s/papers/hillis-steele-d.....rithms.pdf)
December 21, 2007 2:44 PM PST

Arch Robison (Intel)
Arch Robison (Intel)
Thanks for the citation. It's an excellent discussion of parallel prefix. TBB in has the parallel prefix operation (tbb::parallel_scan), though I have found it slow in practice because (1) it is a two-pass algorithm and hence uses twice the memory bandwidth of a single pass and (2) TBB was lacking thread affinity support. Problem (2) is being addressed, and so I hope to see some significant improvement once I add the affinity logic to tbb::parallel_scan.

When a machine has a processor per node in a linked list, amazing things can be done. Hillis and Steele's article assume that kind of machine, namely the CM-1 or CM-2, which had up to 2^16 nodes. Cynics will note, however, that Thinking Machines switched to more a conventional MIMD architecture with the CM-5. Coarse-grain MIMD machines seem to dominate the market for now. The processor per list node would seem to be an unreasonable assumption on those kinds of machines. A sublist per node would work.

More recently, Uzi Vishkin has been promoting a very highly parallel "XMT" (http://www.umiacs.umd.edu/~vishkin/XMT/index.shtml) machine that features fast parallel prefix.
December 21, 2007 2:53 PM PST


Dan S.
Thanks for the reply, Arch.

I think we're coming from two rather different perspectives here, as to my mind, parallel is simply a synonym for concurrent.

Perhaps (in my terminology) data-parallelism is a roughly better word for what you are getting at? Something where each thread/process is crunching hard on its own little sub-problem: lots of things happening in parallel but very little contention over the same resources.... In that scenario, optimizing for overall speed, you'd tend towards algorithms and data structures that were as fast as possible within a single-threaded context and then add concurrency primitives on top. First: just enough for safety, then continue increasing the complexity of locking schemes/communications overhead until your problem is (to abuse terminology) compute-bound rather than communications-bound again.

I'd agree with that preference/direction and it seems that's where you've gone with TBB -- which is probably most appropriate in a four-core world.

But it's rather disingenuous to proclaim that because linked lists are less-than-ideal in a scenario like above to claim that they are "Incompatible with Parallel Programming". Especially so as I imagine I'm not alone in thinking "more parallel == more concurrent access", a scenario where linked lists can pull far, far ahead due to the ease of concurrent lock-free access.

((Side note: You're the Intel person, but I'm having a hard time imagining a scenario where CAS is more expensive than explicit locking if you compare them at the same granularity.))

((Another side note: high levels of concurrency/contention are the norm in many real-world applications... databases, web servers, pretty much anything structured as a many-clients service.... ))

((Side side note: Skip lists are a brilliant way of helping some of the asymptotic complexities you're complaining about. And there are more cache-friendly variants of the linked list itself.))
December 21, 2007 4:15 PM PST

Arch Robison (Intel)
Arch Robison (Intel)
Yes, it comes down to concurrent access vs. data parallel programming. I should have said "data parallel" in the title.

One problem we're running into with TBB (and I think other data parallel systems have it too) is how to marry data-parallel programming with OS-style ("many clients") concurrent programming. E.g., someone in the TBB forum has a nice example where agents wait on an external message/event (definitely OS-style), but want process the message using data-parallel style processing.

A single CAS beats a single lock/unlock pair. But a lot of the non-blocking algorithms seem to use more than one CAS. Three summers ago, an intern and I looked into non-blocking algorithms and timed some, specifically dual-queues and skip-lists. The results were disappointing. Unless the machine was grossly oversubscribed with many more software threads running than hardware threads available, the locked algorithms were substantially faster. The problems seemed to arise from the Pentium 4 processor's relatively slow CAS. We also suspected that CAS-based algorithms were causing a lot of cache line thrashing.

I had a similar experience when experimenting with lock-free (or at least lock-reduced) algorithms in the core TBB scheduling algorithm. Cache-line ping-ponging seemed to be a problem. In contrast, locks not only force logical mutual exclusion, they also (if used carefully) tell waiting threads "keep your paws off my cache lines!". When the cache line has to be tranferred across sockets, that could conceivably be a big cost. We came to the conclusion that about 3 CAS was as bad as one lock/unlock pair, at least for short critical sections. (Sounds like geeky poker). The situation is probably somewhat better with Core-2, but I have not done the timings lately. Of course, for grossly oversubscribed situations, non-blocking algorithms are a win because they avoid problems where a lock holder is preempted.

I recall that the CAS-based skip list had similar issues compared to a skip list with a big fat lock around it, at least for the 2-4 core systems we were looking at. Which was depressing. Having written both a red-black tree implementation (for KAI C++ <map> and a skip list (for the guts of the KAI C++ compiler), I'm definitely a fan of the simplicity of skip lists.

Thanks for all the comments. My previous blog on how 'volatile' is almost useless for multi-threaded programming got no comments, so I was wondering if anyone was reading it.
December 24, 2007 1:30 PM PST

AJ
AJTotal Points:
3,752
Status Points:
3,252
Brown Belt
Just as a note, I did read your blog on volatile. I read all of the blogs here, and learn a lot from them.
January 10, 2008 8:58 AM PST


Lance Diduck
I agree with Dan S. Linked lists are a vital part of parallel programming. You seemed to have only considered one possible implementation of linked lists, i.e. that which emulates the behavior of std::list (using the default allocator, which is a known perfomance stealer in this application).
In practice, most linked lists I have seen are intrusive. For example, I have an application where most threads do lookups on a hash, but one thread want to iterate through everything in the hash. hash iterator have horrible performance. Solution: add a pointer to the T, and use that to iterate. This is far more scalable and less complex that making T belong to two independent containers.
Likewise, for heavily IO bound oversubscribed tasks, CAS linked lists blow away every other method I've tried, and it is easy to build lock free stacks , queues, etc out of them.
Not all of us are game programmers, and we use concurrent methods liberally for speeding up things other than the CPU.
January 17, 2008 9:52 AM PST

Arch Robison (Intel)
Arch Robison (Intel)
Thanks for pointing out intrusive lists. I'm a big fan of them - most of the lists inside the KAI C++ compiler were intrusive. The template we had used a pointer-to-member to describe the link. For a list of type T the programmer declared a FrugalSListLink in their structure, say foo_link, and declared the list as FrugalSList<T,&T::foo_link> . For our new programmers, it was often their first introduction to the pointer-to-member feature in C++.

I agree that my original posting narrowly addressing speeding up wallclock time for classic supercomputing-type CPU-intensive applications, and not addressing other important areas of concurrency.
June 2, 2008 1:13 AM PDT

Andrey Marochko (Intel)
Andrey Marochko (Intel)
Another tribute to the intrusive lists. Recently we've added two more of them into the TBB scheduler as part of the new cancellation propagation machanism. One is the list of master thread scheduler objects, and the other is the list of task group contexts. Using them allowed us to remove locks from the normal (that is hot) execution path.
June 7, 2009 2:28 AM PDT

jaredkeithwhite
jaredkeithwhiteTotal Points:
375
Status Points:
325
Green Belt
Arch,

I realize I'm a bit late to this discussion. However, I'd like to add one side note to the discussion and perhaps see where you feel my concerns stand within the roadmap of TBB.

While it is true that linked lists are a bad idea as far as data parallelism goes, given their inherently serial nature, I think that often, people are looking to use a concurrent linked list for a completely different reason altogether aside from the usual benefits of a linked list. The real glaring oversight in TBB isn't necessarily a missing concurrent linked list. It's the ability to delete from the concurrent vector!

I have an implementation that I am working on that has the following concurrent execution requirements:
- very frequent iteration
- occasional add/delete
- memory is cheap and I don't mind buying more
- lists are fairly small (almost none have more than 10 items)

By "very frequent iteration", I mean extremely frequent. 200-300 or so times per second.
By "occasional add/delete", this could be anywhere from once every 2 seconds. The real time constraints of the add/delete operation are a little less stringent, because of other optimizations that I've done. For example, the iteration process will ignore an item if it's about to be deleted, even if it hasn't been deleted yet.

I understand the obvious reasons why concurrent_vector doesn't support deletes. It's extremely difficult to implement and would cause performance degredation for most other operations on the collection. One particular way that I've considered implementing this is with element reuse within a vector. It would allow for instances such as mine, where I have frequent iteration and infrequent add/delete. For folks with infrequent iteration and frequent add/delete, another implementation might be required. Additionally, this implementation requires that order of elements be unimportant.
- obviously this is a pretty narrow edge case, but please bear with me, I'll address other cases in a minute
- this implementation would require that concurrent_vector be non-intrusive (I'm not sure if it's intrusive or not in its current form).

Let's say you have a concurrent_vector with 6 integers, 0-5. And let's say you want to delete element 3, also value 3. What we could do is push the element INDEX onto a concurrent_queue, and mark the element as deleted. The iterators would skip over it naturally.

Start: 0, 1, 2, 3, 4, 5, 6
erase[3]: 0, 1, 2, x, 4, 5, 6 -- and the concurrent_queue has index #3 on it.

Now, let's say that we have a new value to push onto the array. A new method, "push_next" would "pop_if_present" from the concurrent queue to see if an element index was available for reuse. If one was discovered, it would use that element. Otherwise, push_back would be invoked. In the event we want to add the value "20" to the list:

push_next[20]: 0, 1, 2, __ 20 __, 4, 5, 6 -- notice we reused the element #3.

Another implementation might allow for a delete, which would simply set the value wrapper to "deleted". In essence, it's a soft-delete. Subsequent invocations of "compact" would remove these items, which would obviously require a pretty firm write lock.

Do you have any thoughts I might draw upon? My current implementations use a concurrent_hash_map (because it's the only container that currently supports "erase"), but I'm a bit concerned about iteration performance, to say the least.

Regards,
-Jared

Trackbacks (0)


Leave a comment  

To obtain technical support, please go to Software Support.
Name (required)*

Email (required; will not be displayed on this page)*

Your URL (optional)


Comment*