Why a simple test can get parallel slowdown

 With multicore processors becoming widespread, parallel programming is moving into the mainstream. As indirect evidence, there have recently been many attempts to develop a simple parallel benchmark to see the performance benefits of multicore and multithreading. This is no doubt a good thing, but…

 I have observed a few such attempts where the parallel code used the Threading Building Blocks library (TBB). Much to the experimenters’ astonishment, not only do their simple parallel programs sometimes show no reasonable speedup, they can even be slower than their sequential counterparts! And unfortunately, not every program author could (or wanted to) understand why.

 As one of the TBB developers at Intel, in some cases I had to analyze what was wrong with these codes. And since we were told that TBB developers should blog more, I eventually decided to try it out, starting with these cases. On the one hand, they are simple and comprehensible; most developers do not start their parallel experiments with complex algorithms but with something simple, like a loop that sums all array elements. On the other hand, these examples can be interesting as well.

 Array summation is a good example to start with :). The question of why such a simple program became slower with TBB arose on the TBB forum (see “performance of parallel_for” there). Here is the relevant part of the code:


#include "tbb/blocked_range.h"
#include "tbb/parallel_reduce.h"
#include "tbb/task_scheduler_init.h"
using namespace tbb;

#define REAL double
#define MAX 10000000

class SumFoo {
    REAL *a;
    REAL sum;
    ...
public:
    void operator()(const blocked_range<size_t>& r) {
        REAL *arr = a;
        for(size_t i=r.begin(); i!=r.end(); i++)
            sum += arr[i];
    }
};

void ParallelSumFoo(REAL a[], size_t n, size_t gs) {
    SumFoo sf(a,n);
    parallel_reduce(blocked_range<size_t>(0,n,gs), sf);
    ...
}

class SerialApplyFoo {
    REAL *a;
    size_t len;
    REAL sum;
public:
    SerialApplyFoo(REAL *array, size_t length) : a(array), len(length) {
        sum = 0.0;
        for(size_t i=0; i<len; i++)
            sum += a[i];
    }
};

int main() {
    task_scheduler_init init;
    REAL *array = new REAL[MAX];
    ...
    SerialApplyFoo sf(array, MAX);
    ParallelSumFoo(array, MAX, GRAINSIZE);
    ...
}


 On a dual-core x86 system, such as one with an Intel® Core™ 2 Duo processor, instead of the expected 2x speedup, ParallelSumFoo was almost twice as slow as SerialApplyFoo: on my laptop it ran for 0.08 sec versus 0.044 sec in serial. Let’s see what was bad there.

 First, let’s examine how long the TBBfied code takes to run sequentially. There are two ways to do that: one is to explicitly specify the number of threads when creating the task_scheduler_init object for library initialization, and the other is to set the grain size of parallel_reduce equal to the array size. I used the first way:

    task_scheduler_init init(1);
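
For comparison, the second way would look like the call below; this is just a sketch assuming the ParallelSumFoo wrapper shown above. Passing the array length as the grain size makes the range indivisible, so the whole loop runs as a single task:

    // Grain size equal to the array length: the range is never split,
    // so the summation runs sequentially inside one task.
    ParallelSumFoo(array, MAX, MAX);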



 And the result was kind of shocking: 0.21 sec, almost 5 times slower than the original serial code! Now it’s no wonder that it was also slow on two cores; and TBB cannot be the direct cause of such a slowdown.

 The root of the issue is compiler optimization. The serial loop is very easy to optimize; the compiler knows everything about it, from the constant length to the fact that sum only has to be visible after the constructor of SerialApplyFoo completes. So it can generate very efficient machine code for this loop; for example, the Visual* C++ 2005 compiler produced the following:


              fldz
              lea  eax, DWORD PTR [edi+0x10]
              mov  ecx, 0x2625A0
main+0x105:   fadd QWORD PTR [eax-0x10]
              add  eax, 0x20
              sub  ecx, 0x1
              fadd QWORD PTR [eax-0x28]
              fadd QWORD PTR [eax-0x20]
              fadd QWORD PTR [eax-0x18]
              jnz  main+0x105


The magic 0x2625A0 constant is 2500000, one fourth of the array length. The compiler applied loop unrolling and replaced every four iterations with one, thus decreasing the loop overhead. The intermediate sum is accumulated in a register and only stored to memory once at the end (not shown). And there is exactly one memory load (of the needed array element) per iteration of the original loop.
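
In C++ terms, the unrolled machine code above is roughly equivalent to the following sketch (my reconstruction for illustration; it assumes the length is a multiple of four, as MAX is):

    // Approximate source-level equivalent of the unrolled loop:
    // four additions per pass, with the sum kept in a single register.
    REAL sum = 0.0;
    for(size_t i=0; i<MAX; i+=4) {
        sum += a[i];
        sum += a[i+1];
        sum += a[i+2];
        sum += a[i+3];
    }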

 The TBBfied code is not that simple for the compiler. The start and end of the loop are specified in data fields of an object passed by reference; the loop length is unknown; the result is stored in a data field of a live object. The compiler has to be more conservative:


execute+0x57: fld  QWORD PTR [edi]
              add  ecx, 0x1
              fadd QWORD PTR [edx+0x8]
              add  edi, 0x8
              fstp QWORD PTR [edx+0x8]
              cmp  ecx, DWORD PTR [esi+0x8]
              jnz  execute+0x57


In this code, there are 3 memory loads (!) and 1 store to process just one array element. Besides the element itself, the loop end value is loaded every time, as well as the intermediate sum, which is then stored back only to be reloaded at the next iteration.
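
In source terms, the effect is as if the loop body were written as follows (an illustration, not actual code): because the compiler cannot prove that the sum field and the range object do not alias something modified in the loop, it re-reads and re-writes them on every iteration:

    // What the conservative code effectively does on each iteration:
    for(size_t i=r.begin(); i!=r.end(); i++) {   // r.end() reloaded from memory each time
        this->sum = this->sum + arr[i];          // sum reloaded, then stored back
    }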

 I wrote the above considerations in the forum, but unluckily I said that the benchmark “favors” the serial case; by that I probably pointed the author in the wrong direction. He tried to improve the test by reading the array length from the console (which is good because it makes the test more realistic) and replaced the SerialApplyFoo object with the following function:


REAL SerialSum(REAL *a_, unsigned long int start, unsigned long int end) {
    REAL sum = 0.0;
    for(unsigned long int i=start; i<end; i++)
        sum += a_[i];
    return sum;
}


 But this change had next to no effect: the intermediate sum is still accumulated in a local variable, and the compiler keeps it in a register as easily as before, which is good. Slowing down the serial code shouldn’t be the goal anyway; to solve the problem, the parallel code should be improved to make the compiler’s optimization job easier. Namely, local variables should be used in the loop in operator():


    void operator()(const blocked_range<size_t>& r) {
        REAL local_sum = 0;
        const size_t end = r.end();
        for(size_t i=r.begin(); i!=end; i++)
            local_sum += a[i];
        sum += local_sum;
    }


 This is the corresponding machine code I got:


              fldz
execute+0x58: fadd QWORD PTR [edx]
              add  edx, 0x8
              sub  ecx, 0x1
              jne  execute+0x58
              fadd QWORD PTR [edi+0x8]
              fstp QWORD PTR [edi+0x8]


As in the serial code, there is just one memory load per iteration; the intermediate sum is accumulated in a register and then added to the sum field of SumFoo (see the last two instructions). And it significantly improved the speed: ParallelSumFoo now completes in 0.033 sec when both cores are used, so a speedup of 1.33 over the serial version is achieved. But why not 2x, you might ask? Well, that might be a subject for another post.
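
For completeness, here is a minimal self-contained sketch of what the whole reduction body might look like with this fix applied. The forum code elided the constructors and join(); the versions below follow the standard parallel_reduce body pattern and are my assumption, not the original author’s code:

#include "tbb/blocked_range.h"
#include "tbb/parallel_reduce.h"
using namespace tbb;

#define REAL double

class SumFoo {
    const REAL *a;
public:
    REAL sum;
    SumFoo(const REAL *array) : a(array), sum(0) {}
    // Splitting constructor: a new body starts with a zero partial sum.
    SumFoo(SumFoo& other, split) : a(other.a), sum(0) {}
    void operator()(const blocked_range<size_t>& r) {
        REAL local_sum = 0;              // accumulated in a register
        const size_t end = r.end();      // loaded once per subrange
        for(size_t i=r.begin(); i!=end; i++)
            local_sum += a[i];
        sum += local_sum;                // single store per subrange
    }
    // Combine the partial sums of the two halves of a split range.
    void join(const SumFoo& other) { sum += other.sum; }
};

REAL ParallelSumFoo(const REAL a[], size_t n, size_t gs) {
    SumFoo sf(a);
    parallel_reduce(blocked_range<size_t>(0,n,gs), sf);
    return sf.sum;
}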

 Conclusion: when developing programs with TBB, take into account that using TBB classes and functions may hinder compiler optimizations, which hits especially hard on simple algorithms with a small amount of work per iteration. Proper use of local variables helps optimization and improves parallel speedup.
