Can Parallelism Achieve Superlinear Performance Gains?

I just received the March edition of Dr. Dobb's Journal (yes, I still get the printed version). Inside, there is an article by Herb Sutter titled "Going Superlinear". The article posits what at first glance appears to be a rather curious, and perhaps far-fetched, concept: the notion that N processors can produce a performance improvement that exceeds N.

Herb's graph shows the region of performance he's calling "Superlinear":

[Figure from the article: speedup versus number of cores, including the "Typical Success" curve discussed below.]
The "Typical Success" curve is the one all experienced parallel programmers are used to coping with. It's a simple fact, as Herb says, that as more cores/processors are added, eventually:

our code becomes bound on memory or I/O or something else that means diminishing returns for adding more cores.


How is superlinear performance possible?

That indeed was my question as I read the article. I mean, Herb Sutter is a very smart guy, but: isn't a superlinear speedup impossible?

My skepticism remained as I began to read the next section ("Parallel Search: Exploiting Nonuniformity") and scanned the rest of the figures in the article. But then I saw what Herb was saying. Superlinear speedups can't happen in general. However, there are certain types of problems for which a superlinear speedup from adding more processing cores really is possible!

Don't believe it? Well, you should read Herb's article. But I'll also state the gist of it here, mostly using my own thought processes and frame of reference (no sense in merely repeating what Herb has already said well).

The way I see it, there are four characteristics of problems where a superlinear speedup is possible:

    1. an element-by-element pass through the data set is required

    2. computations are not required for all data elements

    3. processing of a specific set of data elements (one or more) completes the analysis

    4. the data elements that result in completion of the analysis are clumped



So, what kind of problems have these characteristics? Herb uses a search of an unsorted collection for a certain value as his example. In a single-threaded application, the logical approach is to just start reading through the entire collection (hence, the problem satisfies condition #1). The analysis ends when a single instance of the searched-for value is found (satisfying conditions #2 and #3).
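The structure of that search can be sketched in a few lines. This is a minimal illustration (not from Herb's article): each thread scans its own contiguous subregion, and a shared flag lets the first match stop everyone else, which is exactly how conditions #2 and #3 pay off. The function name and signature are my own invention for illustration; note that Python threads won't show a real wall-clock speedup here, so this only demonstrates the control structure.

```python
import threading

def parallel_find(data, target, num_workers=4):
    """Scan `data` for `target` with one thread per contiguous
    subregion; the first thread to find a match signals the rest
    to stop early."""
    found = threading.Event()
    results = []
    lock = threading.Lock()
    chunk = (len(data) + num_workers - 1) // num_workers

    def worker(lo, hi):
        for i in range(lo, hi):
            if found.is_set():        # another thread already found it
                return
            if data[i] == target:     # one match ends the whole analysis
                with lock:
                    results.append(i)
                found.set()
                return

    threads = [threading.Thread(target=worker,
                                args=(lo, min(lo + chunk, len(data))))
               for lo in range(0, len(data), chunk)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results[0] if results else None
```

The early-exit `Event` is what distinguishes this from an ordinary data-parallel loop: total work done is bounded by how quickly the luckiest thread finds a match, not by the size of the collection.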

Why is data clumping required?

Condition #4 is the one you might not expect; but clumped data is actually the key to attaining the superlinear performance Herb is talking about. Again, I'll direct you to Herb's article for a detailed description, while I provide my own brief explanation of how it works here...

Consider a large data set with homogeneously distributed values. In all data regions, there is about the same number of values that match the value we are searching for. In this situation, dividing the data set into N subregions (one for each of our N processing cores) will produce the standard linear speedup (ignoring losses due to threading overhead, etc.).

But what if the distribution of the value that we're searching for is clumped? Then in some of our N subregions there will be a higher density of the value we're searching for than in other subregions. The higher the density of the value we're searching for, the shorter the amount of time it will take for that thread to find a match (on average), thus ending the entire analysis.

Hence, with a clumped data set, the entire analysis will (theoretically, at least) achieve a superlinear speedup when it is run as a multithreaded application to take full advantage of multicore systems.
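To make that concrete, here is a toy cost model of my own (counting element inspections as "time steps" and ignoring all threading overhead, memory effects, etc.): a single match placed deep in the array, but at the very start of one worker's subregion, lets the parallel version finish in one step while the sequential scan needs hundreds of thousands.

```python
def seq_steps(data, target):
    """Time steps for a single thread scanning left to right."""
    for i, x in enumerate(data):
        if x == target:
            return i + 1
    return len(data)

def par_steps(data, target, n_workers):
    """Time steps when n_workers scan equal contiguous subregions
    in lock-step and the first match stops everyone."""
    chunk = len(data) // n_workers
    best = chunk                      # worst case: nobody finds it early
    for w in range(n_workers):
        lo = w * chunk
        for i in range(lo, lo + chunk):
            if data[i] == target:
                best = min(best, i - lo + 1)
                break
    return best

# A match sitting exactly at the start of worker 3's subregion:
data = [0] * 1_000_000
data[750_000] = 1
n = 4
speedup = seq_steps(data, 1) / par_steps(data, 1, n)
# speedup is far larger than n: superlinear under this cost model
```

Under a uniform distribution the expected speedup collapses back to roughly N, since every subregion would hold matches at about the same density.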

Conclusion

Now that's cool, and really interesting to think about, don't you think?

Thanks, Herb!

Kevin Farnham, O'Reilly Media
TBB Open Source Community, Freenode IRC #tbb, TBB Mailing Lists


6 comments

anonymous:

I was thinking about this years ago when someone asked me, 'For an Informix OLTP database, is it better to have fewer, faster CPUs or more, slower CPUs?' (4 faster CPUs or 8 slower CPUs).

One reason for so-called 'superlinear' speedup could be that you have more CPU cache for the task. Performance improvement does not necessarily scale linearly with total CPU cache size. Also, for a given task with a given dataset, changing the number of CPUs affects the timing of when individual threads will be accessing shared resources, hence there could be less (or more!) contention for things like memory or I/O buses.

anonymous:

I have not seen real superlinear execution. But for some algorithms the runtime depends on the ordering of execution. Multithreading adds some randomness to the execution, and you might see speedups between 14.9 and 16.43 for the same input on a 16-processor machine.

One can argue that there is a better ordering of execution than the one used by the linear version of the algorithm. But computing that better ordering would involve solving the problem itself and is therefore not feasible.

Kevin Farnham:

Clay: thanks for your comment. I do see the difference you're pointing out between processors and cores, and the idea of having all the data in memory. Most of my multithreading experience (dating back to 1993) has been on multi-processor systems, and indeed the objective was to load as big a chunk of data as possible into memory, process that, then load in the next huge data chunk.

If you read my comment to Christoph, you'll see that I now consider my Condition #1 and Condition #4 to be in conflict. You should read Herb's article. He states the problem, the conditions, and the caveats in a different way.

It was stunning to me to find the idea of "superlinear" speedup talked about. To me, that's akin to saying something can move faster than the speed of light. So, I approach it with some skepticism.

But, I also recognize that engineering is different from pure science, and engineering is what we're really talking about here. I think I'm going to be thinking about this for a while...

Kevin Farnham:

Christoph: I have thought a lot about this problem since I wrote this post, and have come to a conclusion similar to what you state in your first paragraph. If it is known that the data is clumped within its ordering in an array or collection, then the best single-thread algorithm will not be to read and test each element starting from the beginning of the array (or from any other single location, really). If you know the data is clumped, and you have an estimate of the approximate "size" of the clumps, then you'd want to hop through the data in increments that would make it highly likely that one of your hops will land inside a data clump.

So, clumped data (my Condition #4) in a way contradicts the notion that stepping through the data is the ideal method for solving the problem from a single-threaded point of view. There is a better algorithm than reading element 1, then element 2, then element 3, and so on.
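The hopping idea can be sketched like this. This is my own illustrative version (the function name, parameters, and fallback strategy are assumptions, not anything from Herb's article): step through the data in increments of roughly the clump size, so that some early hop likely lands inside a clump; successive passes at shifted offsets eventually cover every element, so nothing is missed.

```python
def strided_find(data, target, clump_size):
    """Single-threaded search that hops through `data` in steps of
    roughly the clump size, so an early hop is likely to land inside
    a clump of matches. Successive passes shift the starting offset,
    so together the passes inspect every element exactly once."""
    n = len(data)
    step = max(1, clump_size)
    for start in range(step):           # pass 1 at offset 0, then 1, ...
        for i in range(start, n, step):
            if data[i] == target:
                return i
    return None
```

In the worst case this inspects every element once, just like the linear scan, but on clumped data the expected number of inspections before the first hit drops sharply, which narrows (or removes) the gap the multithreaded version was exploiting.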

Herb Sutter talks about this aspect of looking at things in the last section of his article. At this moment, I think superlinearity probably does imply that the best single-threaded algorithm wasn't being applied.

But you say you've seen examples where you believe you've seen a superlinear speedup -- do you have any links you could point me to?

Clay B.:

Sorry Kevin, but I'm just not getting it from your description. (No offense, I know how hard it is to recap something technical like this in a much smaller space than the original.) I'm not sure about the four conditions. I'll have to read Herb's article to get more details about his example.

In my past experience, superlinear speedup is more realizable on the quoted phrase you gave: "...our code becomes bound on memory or I/O or something else that means diminishing returns for adding more cores." Once you add enough processors and the entire problem fits into the memory of the system, then you've removed all the memory access and I/O needed when running on a system of fewer processors. Plus, if you have all the data in memory, you want to use it multiple times during the computation. With fewer processors, you end up needing to cycle through the data multiple times and that really gives you the chance for superlinear speedup.

Notice that I said processors. Any time I've encountered superlinear situations in the past, it was with distributed processors that don't have shared memory. With multi-core processors and shared memory, the game will likely be totally different. (Can we have a problem of such size that it will all fit into cache? Yes, but problems of that size are pretty trivial and uninteresting.) I suspect that Sutter's conditions deal with the relevant situation of trying to get superlinear performance in the multi-core case.

anonymous:

If you know that the data is clumped, then it makes sense to spread the search in the single-threaded case over the whole dataset. For example, divide the data into N buckets and then check the first element of each bucket, then the second one, and so on. This would reduce the expected runtime compared to the linear search approach. My gut feeling is that, measured against this baseline, the speedup of the multithreaded approach is no longer superlinear.

One non-systematic example of superlinear behaviour can be seen if the duration of computation depends on the ordering of computations and if multithreading adds randomness. Then the speedup factor varies within some range, and I've seen several examples where this range reached into the superlinear region.
