I just received the March edition of Dr. Dobb's Journal (yes, I still get the printed version). Inside, there is an article by Herb Sutter titled "Going Superlinear". The article posits what at first glance appears to be a curious, perhaps even far-fetched, idea: that N processors can produce a speedup greater than a factor of N.
Herb's graph shows the region of performance he's calling "Superlinear".
The "Typical Success" curve is the one all experienced parallel programmers are used to coping with. It's a simple fact, as Herb says, that as more cores/processors are added, eventually:
our code becomes bound on memory or I/O or something else that means diminishing returns for adding more cores.
How is superlinear performance possible?
That indeed was my question as I read the article. I mean, Herb Sutter is a very smart guy, but isn't a superlinear speedup impossible?
My skepticism remained as I began to read the next section ("Parallel Search: Exploiting Nonuniformity") and scanned the rest of the figures in the article. But then I saw what Herb was saying. Superlinear speedups can't happen in general. However, there are certain types of problems for which a superlinear speedup from adding more processing cores really is possible!
Don't believe it? Well, you should read Herb's article. But I'll also state the gist of it here, mostly using my own thought processes and frame of reference (no sense in merely repeating what Herb has already said well).
The way I see it, there are four characteristics of problems where a superlinear speedup is possible:
- an element-by-element pass through the data set is required
- computations are not required for all data elements
- processing of a specific set of data elements (one or more) completes the analysis
- the data elements that result in completion of the analysis are clumped
So, what kind of problems have these characteristics? Herb uses a search of an unsorted collection for a certain value as his example. In a single-threaded application, the logical approach is to just start reading through the entire collection (hence, the problem satisfies condition #1). The analysis ends when a single instance of the searched-for value is found (satisfying conditions #2 and #3).
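Here's a minimal sketch of the kind of multithreaded search Herb is describing (my own code, not his; I'm using plain std::thread rather than TBB, and all the names are mine). Each worker scans its own subrange and raises a shared flag the moment any worker finds the target, so the others can bail out early:

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Returns the index of some element equal to target, or -1 if none is found.
std::ptrdiff_t parallel_find(const std::vector<int>& data, int target,
                             unsigned num_threads) {
    std::atomic<bool> found{false};
    std::atomic<std::ptrdiff_t> result{-1};
    std::vector<std::thread> workers;

    const std::size_t chunk = (data.size() + num_threads - 1) / num_threads;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&, t] {
            const std::size_t begin = t * chunk;
            const std::size_t end = std::min(begin + chunk, data.size());
            // Condition #1: an element-by-element pass over this subregion.
            for (std::size_t i = begin; i < end && !found.load(); ++i) {
                if (data[i] == target) {
                    // Condition #3: one hit completes the whole analysis,
                    // so signal every other worker to stop.
                    found.store(true);
                    result.store(static_cast<std::ptrdiff_t>(i));
                    return;
                }
            }
        });
    }
    for (auto& w : workers) w.join();
    return result.load();
}
```

Checking the flag on every iteration is wasteful (a real implementation would poll it less often), but it's the early-exit behavior that matters for what follows.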
Why is data clumping required?
Condition #4 is the one you might not expect; but clumped data is actually the key to attaining the superlinear performance Herb is talking about. Again, I'll direct you to Herb's article for a detailed description, while I provide my own brief explanation of how it works here...
Consider a large data set in which the values are uniformly distributed: every region contains about the same density of elements matching the value we're searching for. In this situation, dividing the data set into N subregions (one for each of our N processing cores) produces the standard linear speedup (ignoring losses to threading overhead and the like).
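To put rough (and entirely made-up) numbers on it: suppose one element in every 1,000 matches, spread evenly. A sequential scan finds a match after about 1,000 inspections. Split the work across 16 cores and every subregion has that same density, so some thread stumbles onto a match after roughly 1,000/16 ≈ 62 inspections. That's about a 16x speedup on 16 cores: linear, as expected.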
But what if the distribution of the value we're searching for is clumped? Then some of our N subregions will contain a higher density of matches than others. The denser the matches in a thread's subregion, the less time that thread needs (on average) to find one, and the first hit ends the entire analysis.
Hence, with a clumped data set, the entire analysis will (theoretically, at least) achieve a superlinear speedup when it is run as a multithreaded application to take full advantage of multicore systems.
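To convince myself, I worked through a toy calculation (my own sketch, not from Herb's article; all the numbers are invented). It counts element inspections instead of wall-clock time: the sequential scan must wade through every element ahead of the match, while the parallel scan ends as soon as the worker that owns the clump reaches it at its (much smaller) local offset:

```cpp
#include <cstddef>
#include <iostream>

int main() {
    // Toy model: a single match sitting deep inside a large collection.
    const std::size_t size      = 16'000'000;  // total elements
    const std::size_t match_pos = 10'500'000;  // where the clump lives
    const unsigned    cores     = 16;

    // Sequential scan: inspects every element up to and including the match.
    const std::size_t seq_steps = match_pos + 1;

    // Parallel scan over equal chunks: the worker that owns match_pos
    // reaches it after (match_pos % chunk) inspections of its subregion.
    const std::size_t chunk     = size / cores;
    const std::size_t par_steps = match_pos % chunk + 1;

    std::cout << "sequential inspections: " << seq_steps << '\n'
              << "parallel inspections:   " << par_steps << '\n'
              << "speedup on " << cores << " cores: "
              << static_cast<double>(seq_steps) / par_steps << '\n';
}
```

With these numbers the 16-core run comes out roughly 21x faster than the sequential one: the sequential scan slogs through more than ten million non-matching elements, while the parallel workers dismiss those same elements concurrently.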
Now that's cool, and really interesting to think about, don't you think?
Kevin Farnham
O'Reilly Media
TBB Open Source Community, Freenode IRC #tbb, TBB Mailing Lists