Why Parallel Performance Results Don't Matter...much

Let me state up front that the opinions expressed here do not necessarily reflect the views of Intel Software Network or the editorial policies of the ISN Parallel Programming Community.  It's just me and my bias.

I was reading a paper a few days ago that described how the author took four numeric computations, parallelized them, and ran them on different numbers of cores. Three of the seven pages were taken up by speedup and other performance graphs. One whole page of text described the parallel execution results and how they compared.

Ho hum.

The article was well written, and the choice of the four algorithms to be parallelized was well defended. However, I was more interested in knowing how the parallelization of the code was accomplished. Beyond a paragraph or so on that topic, there was nothing.

I hope that by now we all "get" that parallel codes are going to run faster on multi-core processors. I don't need to see those kinds of performance gains as the focus of an article anymore. What I want to know is how it was done.  What programming methodology/library did you choose and why?  Were there any problems that you didn't anticipate?  How did you solve them?  Did you try a different method of parallelization and how did that turn out?  Show me some of the parallelized code from your application (but describe it in enough detail that I might recreate it if the code weren't included). Parallel speedup/performance results are still needed, but don't let them hog the spotlight.

There are exceptions to this latter piece of advice, of course. If you get some unexpected results and can explain why (and, better still, how to fix them); if you've been able to test the application on lots of cores/threads and are interested in why the app does (or doesn't) scale as anticipated; or if you're testing different approaches to the parallelism, then I'd be more interested in hearing about the performance you achieved with your parallel code(s). Otherwise, I would much rather learn from your coding experience: what works, how you thought up your approach, and what it was like to implement that approach in code. Seeing something new, or a twist on an old method, is how we all learn.

Back in the day, when NP-Complete theory was shiny and new, you could publish articles in conferences and journals that were simply proofs of the NP-Complete nature of a defined problem.  After a while, when more people became familiar with concepts and the proof methods, the luster wore off and these papers weren't published much. For me the luster of showing off the speedup of parallel applications has worn off.

As I stated at the outset, this is just my preference for what I read in articles about parallel programming. I'm not trying to change the way you or any other author writes up their scholarly results for ISN or any other outlet. But if you get back review comments asking for more space devoted to methodology and less to results, it might have come from me. I hope you'll seriously consider taking the advice.


@andm - Some interesting and relevant points. I'm not in complete agreement with your claim that no consumer will need a cluster in their pocket. I look to what tech was available and on the horizon 10 years ago. Back then, would we have been able to foresee the iPad, tablets, ubiquitous WiFi, Facebook, and Twitter? Some of the SF from five decades ago imagined 21st-Century computers that seem laughable by today's tech standards, while other visions have still not been realized.

In 10 years will we have an AI assistant in our pocket that requires a cluster's worth of cores to run? Cloud computing is a big topic right now, and such facilities have the computation power that such an AI might need. But, as you note, if there are network bandwidth issues, it would be much better/easier/faster to have the processing power in the palm of my hand and not need to rely on the cloud.

I don't know what computation my nieces and nephew might be utilizing as a daily part of their lives when they graduate college in 10 years or so. I'm not even sure I could begin to imagine it. If the pace of technological advances over the last five or ten years is any indication, someone who fell into a coma today and woke up in that time might think we had all become wizards.

Why Parallel Programming Doesn't Matter...much

Twenty years ago, when blindingly fast 100 MHz processors were on the horizon, the visionaries of the time stretched their collective imaginations and speculated about what kind of applications could use that seemingly infinite processing power. Voice recognition, rendering maps in 3D, playing music and video simultaneously all seemed within reach. Someday, they said, we would carry 100 MHz processors in our pockets. And here we are, 20 years later, and the phone in my pocket runs at 800 MHz, has a fast GPU, has I don't know how many radios in it (cell phone, EDGE, 2G, 3G, 4G, Bluetooth, WiFi, GPS receiver), a camera with focus and flash, and who knows what else. Is there multi-threaded software running in it? I don't know, but I do know there is a lot of parallelism... lots of pixels computed in parallel by the GPU, lots of different specialized components running in parallel. At the risk of being short-sighted, I'll claim that for high-volume products the parallelism will be built into custom hardware.

Of course, Moore's projection of shrinking transistors, and the correspondingly exponential growth in the number of transistors per die, continues to hold true for the time being. So we can pack many cores onto a chip... but I fail to see how that will go into consumer devices. Soon we will have a "cluster on a chip"... but individual consumers don't need a cluster in their pocket, or a cluster in their living room. Companies that have hundreds of servers will be happy... Google, Amazon, eBay, Facebook, Yahoo!, etc., and the High Performance Computing community, of course. But a lot of this is just running the same number of serial programs with fewer physical boxes or blades.

The parts of a serial program that are intrinsically parallel are often limited. The first step to making a fast parallel version is simply to optimize the serial version. If you are going to put a lot of effort into making it faster, you might as well get it as efficient as possible in its sequential form. Once you do that, you may find that disk I/O is a bottleneck, or the network is a bottleneck, or memory bandwidth is a bottleneck. If you solve all that, and you still have an intrinsically parallel task, then you can focus on making that parallel. But how many "killer applications" are there that will drive volume sales of many-core processors? For the average business and home user, we have finally reached "fast enough" for the CPU. Maybe I'll live long enough to see how short-sighted I was in this post.

@Thomas - I have an almost opposite story. Over the years, I attended an internal conference. It seemed that there were an overwhelming number of papers that would go through all the details of optimization, use of tools, and other code modifications, only to end up with a small percentage improvement, usually no more than 10%. (None of these were parallel solutions back then.) It got so bad that I was tempted to ask what the conclusions were going to be so that I could better judge whether it was worth spending the 25 minutes of lecture to get to the results.

@yinongchen - That sounds like a great idea. To learn a new programming idea or skill, it pays to work through a solution and then attempt to apply the desired methods to a similar problem. Even when giving a written explanation of some code, it is important to verify that the student understands what is being conveyed or demonstrated.

Knowing the audience is always important. One cannot present a paper at a conference the same way one teaches an undergraduate class. On the other hand, undergraduate students would not understand the concepts without doing a hands-on assignment. I always provide a complete set of code to support the major concepts. Students must read the code and run the code, and then use the code as the basis to develop their own code to solve the problems.

Clay, I think I'm getting your point: A few years ago, I was attending a workshop about multi-core programming. In one of the talks, a parallel implementation was presented that was 3 times faster than the serial implementation, on a dual-core system! The presenter didn't even understand why I asked for an explanation for this result. I totally agree that we don't need such papers published.


My take is that it's all too leading-edge for business, where the need for parallelization is high; and since most of it is driven by concurrency and not CPU cycles, it's a different ballgame. For my scene, I learned F# for the cloud on Azure.

As an architect, the on-demand datastores by need have to be transformed from typical sources if you want any kind of concurrency, so I'm authoring a framework for that. To me it's new, so I called it the On-demand Data Tier, fed by a Data Transform Layer that takes any source and puts it onto the tier (beyond what most ETL tools are doing). All of this runs async-parallel to the standard Azure-configured sources. Since I'm using Silverlight, dynamic property changes are a concern. Add it all up and it's base-class stuff, but things are moving along... using the Intel tools when I can for multi-core, feeding things with stacks, about to add in caching.

And since nobody knows squat about F#, I can't get a job!!! This, after learning I can paste VB code-behind into F# and get it running in seconds for async+parallel. Way cool.


Being mainly concerned with workflow, I find that connecting the need and use of parallelization to the reality of a user's workflow is a bit trickier to architect. The deal is that concurrency is driving the need for async and parallelization all over the app, so I'm working on frameworks.

These are to work in a cloud... I'm using F# for this, Microsoft's only declarative language, with VB.Net code-behind since it pastes into F# for async+parallel and more, plus Intel tools, Azure, Silverlight, and Live for collaboration, along with databases and memory-mapped files, all of which get transformed by a new layer I'm adding to create cloud forms of datastores that are on-demand, or BI-style columns for example. I'm calling it the Data Transformation Layer, and it feeds the On-demand Data Tier.

This is all beside the normal Azure entity and resource configuration, which is used for the first hit by a user; that fires off the transform layer to move that user's data up to on-demand form, which is then used for their data for the rest of the session. With concurrency this is a mess to get right for parallelization, but obviously worth the trouble.

My take is that the scene is leading-edge and not many devs are really doing anything with it. At TechEd 2010, only 3 people were using F# in production (VS2010 was the first licensed release, F# 2.0), and only about 70 people showed up who were even using it at some level. I'm authoring base classes and frameworks because nothing's out there for this kind of cloud work.

Another bonus is that I can't get a job because everyone wants C# code-gurus to handle the mess that gets made when an imperative language, coded by devs without a clue about declarative style, hits the cloud and they need to debug it. Whoa baby, back away and let them play; the time wasted is huge ... &-)

@sprovidence - But don't just show me code. I hate that even more. Show me code, but only add it to enhance the description of what was done. Don't just say "The computation was made parallel. Figure X shows the parallel code."

I've read too many papers in the past 2-3 years that expect the reader to be able to wade through 20-50 lines of code and understand what was done, what the author was thinking, and decode what variables are being used for what purpose just from their names. When I encounter such examples I feel like Linus, who is reading "The Brothers Karamazov," as he tells his sister Lucy that "when I come to [a Russian name] I can't pronounce, I just bleep right over it!"

I agree with you that the way the parallelization is done is the most important part, as there are layers of scheduling and allocation mechanisms between the program and the cores. I recently implemented a multi-threaded program to test the Collatz conjecture (Half Or Triple Plus One). I assign different portions of the range to different threads, and the improvement is immediately visible.
Compared to the single-thread program, the average speedup was 3.01 and the average efficiency 75% on a four-core computer.
Below are the description and the program in C#.
The Collatz conjecture is an unsolved conjecture in mathematics named after Lothar Collatz, who first proposed it in 1937.
Take any natural number n. If n is even, divide it by 2 to get n / 2; if n is odd, multiply it by 3 and add 1 to obtain 3n + 1. Repeat the process until the result is 1.
The conjecture is that no matter what number you start with, you will always reach 1, i.e., the process always terminates.
No one has been able to prove the conjecture true or false.
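For a concrete picture of the rule, here is a minimal sketch that traces the trajectory of a single starting value (this is an illustrative standalone demo, separate from the threaded program further down):

```csharp
using System;

class CollatzDemo {
    // Print the Collatz trajectory of one starting value.
    static void Main() {
        Int64 n = 6;
        Console.Write(n);
        while (n > 1) {
            if (n % 2 == 0)
                n = n / 2;     // even: halve it
            else
                n = 3 * n + 1; // odd: triple it and add one
            Console.Write(" -> " + n);
        }
        Console.WriteLine();
    }
}
```

Starting from 6, this prints 6 -> 3 -> 10 -> 5 -> 16 -> 8 -> 4 -> 2 -> 1, reaching 1 as the conjecture predicts.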
We will write a multithreading program to validate the Collatz conjecture and evaluate the performance of the multithreading program:
1. Design a single-thread program running on a 4-core computer.
2. Design a 4-thread program running on the same 4-core computer.
3. Compute the speedup based on the average results of repeated execution.
4. Compute the efficiency on the multi-core processor.
using System;
using System.Threading;

namespace Collatz {
    class HOTPO {
        Int64 s, t;

        public HOTPO(Int64 start, Int64 terminate) {
            this.s = start;
            this.t = terminate;
        }

        public void hotpoFunc() { // Collatz conjecture: Half Or Triple Plus One
            for (Int64 i = s; i <= t; i++) {
                Int64 n = i;
                while (n > 1) {
                    if (n % 2 == 0)    // if n is even,
                        n = n / 2;     // halve it (integer division)
                    else               // if n is odd,
                        n = 3 * n + 1; // triple it and add one
                }
            }
        }
    }

    class Program {
        static void Main(string[] args) {
            Int64 n = 10; // repeat n times to build the average
            for (Int64 r = 1; r <= 20; r++) { // r is the scalability factor
                Int64 sum = 0;
                Int64 m = r * 2500; // 1/4 of the max number to validate
                Int64 t = m * 4;    // the max number to validate
                Console.Write("The program validates the HOTPO function for numbers ");
                Console.WriteLine("from 1 to " + 4 * m);

                // Single-thread version: one thread walks the whole range 1..4m
                Int64 s = 1;
                HOTPO h1 = new HOTPO(s, t);
                for (int i = 0; i < n; i++) {
                    Thread ht1 = new Thread(new ThreadStart(h1.hotpoFunc));
                    DateTime start1 = DateTime.Now;
                    ht1.Start();
                    while (ht1.IsAlive) { } // busy-wait until the thread terminates
                    Int64 ht1Time = (Int64)(DateTime.Now - start1).TotalMilliseconds;
                    sum = sum + ht1Time;
                    Console.WriteLine(i + ": Time consumed by single thread in milliseconds is " + ht1Time);
                }
                Console.WriteLine("Average time consumed by single thread in milliseconds is " + sum / n);

                // Four-thread version: split the range 1..4m into four equal chunks
                Int64 s41 = 1,         t41 = m;
                Int64 s42 = m + 1,     t42 = 2 * m;
                Int64 s43 = 2 * m + 1, t43 = 3 * m;
                Int64 s44 = 3 * m + 1, t44 = 4 * m;

                sum = 0;
                HOTPO h41 = new HOTPO(s41, t41); // for thread 1
                HOTPO h42 = new HOTPO(s42, t42); // for thread 2
                HOTPO h43 = new HOTPO(s43, t43); // for thread 3
                HOTPO h44 = new HOTPO(s44, t44); // for thread 4

                for (int i = 0; i < n; i++) {
                    Thread ht41 = new Thread(new ThreadStart(h41.hotpoFunc));
                    Thread ht42 = new Thread(new ThreadStart(h42.hotpoFunc));
                    Thread ht43 = new Thread(new ThreadStart(h43.hotpoFunc));
                    Thread ht44 = new Thread(new ThreadStart(h44.hotpoFunc));
                    DateTime start4 = DateTime.Now;
                    ht41.Start(); ht42.Start(); ht43.Start(); ht44.Start();
                    while (ht41.IsAlive || ht42.IsAlive || ht43.IsAlive || ht44.IsAlive) { }
                    Int64 ht4Time = (Int64)(DateTime.Now - start4).TotalMilliseconds;
                    sum = sum + ht4Time;
                    Console.WriteLine(i + ": Time consumed by four threads in milliseconds is " + ht4Time);
                }
                Console.WriteLine("Average time consumed by four threads in milliseconds is " + sum / n);
            }
        }
    }
}
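The speedup and efficiency steps reduce to two divisions: speedup is the average serial time over the average parallel time, and efficiency is the speedup divided by the core count. A minimal sketch of that arithmetic (the timing values here are illustrative stand-ins, not measured results):

```csharp
using System;

class SpeedupDemo {
    static void Main() {
        double t1 = 903.0;   // illustrative: average single-thread time (ms)
        double t4 = 300.0;   // illustrative: average four-thread time (ms)
        int cores = 4;

        double speedup = t1 / t4;            // how much faster the parallel run is
        double efficiency = speedup / cores; // fraction of ideal linear speedup

        Console.WriteLine("Speedup = " + speedup + ", Efficiency = " + efficiency * 100 + "%");
    }
}
```

With the speedup of 3.01 reported above, this works out to an efficiency of just over 75% on four cores, matching the numbers quoted earlier.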