Why a simple test can get parallel slowdown

 Those who read Russian may follow this link.

 With multicore processors becoming widespread, parallel programming is moving into the mainstream. As indirect evidence, many attempts have appeared recently to develop a simple parallel benchmark that would show the performance benefits of multicore and multithreading. This is undoubtedly good, but…

 I have observed a few such attempts where the parallel code used the Threading Building Blocks library (TBB). Much to the experimenters' astonishment, their simple parallel programs not only sometimes show no reasonable speedup but can even be slower than their sequential counterparts! And unfortunately, not every author could (or wanted to) understand why.

 Being one of the TBB developers at Intel, in some cases I had to analyze what was wrong with these codes. And since we were told that TBB developers should blog more, I eventually decided to try it out, starting with these cases. On the one hand, they are simple and comprehensible: most developers do not start their parallel experiments with complex algorithms but with something simple, such as a loop summing all array elements. On the other hand, these examples can be interesting nonetheless.

 Array summation is a good example to start with :). The question of why so simple a program became slower with TBB was raised at the TBB forum (see “performance of parallel_for” there). Here is the relevant part of the code:


#define REAL double
#define MAX 10000000

class SumFoo {
    REAL *a;
    REAL sum;
    ...
    void operator()(const blocked_range<size_t>& r) {
        REAL *arr = a;
        for(size_t i=r.begin(); i!=r.end(); i++) sum += arr[i];
    }
};

void ParallelSumFoo(REAL a[], size_t n, size_t gs) {
    SumFoo sf(a,n);
    parallel_reduce(blocked_range<size_t>(0,n,gs), sf);
    ...
}
class SerialApplyFoo {
    REAL *a;
    size_t len;
    double sum;
public:
    SerialApplyFoo(REAL *array, size_t length) : a(array), len(length) {
        sum = 0.0;
        for(size_t i=0; i<len; i++)
            sum += a[i];
    }
};
int main() {
    task_scheduler_init init;
    REAL *array = new REAL[MAX];
    ...
    SerialApplyFoo sf(array, MAX);
    ParallelSumFoo(array, MAX, GRAINSIZE);
    ...
}
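
 The post does not show the timing harness itself; a minimal sketch of one, using TBB's own tbb::tick_count (the helper name TimeBoth and the output format are mine, not from the forum thread), might look like this:

#include <cstdio>
#include "tbb/tick_count.h"
using namespace tbb;

// Hypothetical helper: time one run of the serial and of the parallel version.
void TimeBoth(REAL *array, size_t n, size_t gs) {
    tick_count t0 = tick_count::now();
    SerialApplyFoo serial(array, n);      // serial summation happens in the constructor
    tick_count t1 = tick_count::now();
    ParallelSumFoo(array, n, gs);         // the parallel_reduce version
    tick_count t2 = tick_count::now();
    printf("serial: %g s, parallel: %g s\n",
           (t1 - t0).seconds(), (t2 - t1).seconds());
}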


 On a dual-core x86 system, such as one with an Intel® Core™2 Duo processor, instead of the expected 2x speedup, ParallelSumFoo was almost twice as slow as SerialApplyFoo; e.g. on my laptop it ran for 0.08 sec vs. 0.044 sec for the serial version. Let's see what went wrong there.

 First, let's examine how long the TBBfied code takes to run sequentially. There are two ways to do that: one is to explicitly specify the number of threads when creating the task_scheduler_init object for library initialization, and the other is to set the grain size of parallel_reduce equal to the array size. I used the first way:

    task_scheduler_init init(1);


 And the result was kind of shocking: 0.21 sec, almost 5 times slower than the original serial code! No wonder it was also slow on two cores; and TBB itself cannot be the direct cause of such a slowdown.
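
 For reference, the second way mentioned above needs no re-initialization of the library at all: with the grain size equal to the array length, the blocked_range is never split and the whole loop runs as a single chunk on one thread. A one-line sketch at the call site (reusing the main() shown earlier):

    // grain size == array length: parallel_reduce gets one indivisible range
    ParallelSumFoo(array, MAX, MAX);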

 The root of the issue is compiler optimization. The serial loop is very simple to optimize: the compiler knows everything about it, from the constant length to the fact that sum only needs to be visible after the constructor of SerialApplyFoo completes. So it can generate very efficient machine code for this loop; for example, the Visual* C++ 2005 compiler produced the following:


              fldz
              lea   eax, DWORD PTR [edi+010h]
              mov   ecx, 0x2625A0h
main+0x105:   fadd  QWORD PTR [eax-16]
              add   eax, 0x20h
              sub   ecx, 0x1h
              fadd  QWORD PTR [eax-40]
              fadd  QWORD PTR [eax-32]
              fadd  QWORD PTR [eax-24]
              jnz   main+0x105


 The magic 0x2625A0 constant is 2500000, one fourth of the array length. The compiler applied loop unrolling and replaced four iterations with one, thus decreasing loop overhead. The intermediate sum is accumulated in a register and stored to memory only once at the end (not shown). And there is exactly one memory load (of the needed array element) per iteration of the original loop.
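
 In C++ terms, the generated loop is roughly equivalent to the following (an illustrative sketch, not the compiler's literal output):

    REAL sum = 0.0;                      // kept in an FPU register (the fldz above)
    const REAL *p = array;
    for(size_t i = 0; i < MAX/4; i++) {  // 0x2625A0 == 2500000 iterations
        // four elements per iteration, one memory load each
        sum += p[0];
        sum += p[1];
        sum += p[2];
        sum += p[3];
        p += 4;
    }
    // sum is written to memory only once, after the loop (not shown in the listing)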

 The TBBfied code is not that simple for the compiler. The start and end of the loop come from data fields of an object passed by reference; the loop length is unknown; the result is stored in a data field of a live object. The compiler has to be more conservative:


execute+0x57: fld   QWORD PTR [edi]
              add   ecx, 0x1h
              fadd  QWORD PTR [edx+08h]
              add   edi, 0x8h
              fstp  QWORD PTR [edx+08h]
              cmp   ecx, DWORD PTR [esi+08h]
              jnz   execute+0x57


In this code, there are 3 memory loads (!) and 1 store to process just one array element. Besides the element itself, the loop end value is loaded every time, as well as the intermediate sum which is then stored back only to be reloaded at the next iteration.
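
 Again in C++ terms, what the compiler had to emit behaves roughly like this (an illustrative sketch):

    for(size_t i = r.begin(); i != r.end(); i++) { // r.end() is re-read from memory every iteration
        REAL tmp = arr[i];   // load the array element
        tmp += this->sum;    // load the current partial sum from the SumFoo object
        this->sum = tmp;     // store it back, only to reload it on the next iteration
    }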

 I wrote the above considerations in the forum, but unluckily I said that the benchmark “favors” the serial case; by that I probably pointed the author in the wrong direction. He tried to improve the test by reading the array length from the console (which is good because it makes the test more realistic) and replaced the SerialApplyFoo object with the following function:


REAL SerialSum(REAL *a_, unsigned long int start, unsigned long int end) {
    REAL sum = 0.0;
    for(unsigned long int i=start; i<end; i++)
        sum += a_[i];
    return sum;
}


 But this change has next to no effect: the intermediate sum is now accumulated in a local variable, and the compiler places it in a register just as easily as before, which is good. Slowing down the serial code should not be the goal anyway; to solve the problem, the parallel code should be improved to make it easier for the compiler to optimize, namely by using local variables in the loop in operator():


    void operator()(const blocked_range<size_t>& r) {
        REAL local_sum = 0;
        const size_t end = r.end();
        for(size_t i=r.begin(); i!=end; i++)
            local_sum += a[i];
        sum += local_sum;
    }


 This is the corresponding machine code I got:


              fldz
execute+0x58: fadd  qword ptr [edx]
              add   edx, 8
              sub   ecx, 1
              jne   execute+58h
              fadd  qword ptr [edi+8]
              fstp  qword ptr [edi+8]


 As in the serial code, there is just one memory load per iteration; the intermediate sum is accumulated in a register and then added to the sum field of SumFoo (see the last two instructions). And it significantly improved the speed: ParallelSumFoo now completes in 0.033 sec when both cores are used, so a speedup of 1.33 over the original serial code is achieved. But why not 2x, you might ask. Well, that might be the subject of another post.
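
 For completeness: parallel_reduce also needs SumFoo's constructors and its join() method, which were omitted in the snippets above (and are mentioned in the comments below). A minimal sketch of the full body class with the fixed operator(), written from the standard parallel_reduce requirements, might look like:

#include "tbb/blocked_range.h"
#include "tbb/parallel_reduce.h"
using namespace tbb;

class SumFoo {
    REAL *a;
public:
    REAL sum;
    SumFoo(REAL *array, size_t /*n*/) : a(array), sum(0) {}
    // Splitting constructor: gives a stealing thread its own zero-initialized sum.
    SumFoo(SumFoo &other, split) : a(other.a), sum(0) {}
    void operator()(const blocked_range<size_t>& r) {
        REAL local_sum = 0;
        const size_t end = r.end();
        for(size_t i = r.begin(); i != end; i++)
            local_sum += a[i];
        sum += local_sum;
    }
    // Called by parallel_reduce to combine the partial sum of a split-off copy.
    void join(const SumFoo &other) { sum += other.sum; }
};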

 Conclusion: when developing programs with TBB, you should take into account that using TBB classes and functions may hinder compiler optimizations, which hurts especially on simple algorithms with a small amount of work per iteration. Proper use of local variables helps the optimizer and improves parallel speedup.


Comments

Bartlomiej:

Yes, you don't have to access the global variable each time.
I was wrong - I'm sorry.

Alexey Kukanov (Intel):

I do not get why you have the impression that each addition has to be synchronized. No, it does not.
Most additions are local, with the result collected in SumFoo::sum, and then the results from different threads are combined by parallel_reduce in the SumFoo::join() method (which I omitted), again with no explicit synchronization. All necessary synchronization is done by TBB; it is limited and distributed (there is no central point of contention). It seems I should have made this clearer, though it is not directly the point of the post.

Bartlomiej:

I'm sorry, I didn't realize that it's a TBB class, not just a made-up example.
But again, if each addition is synchronized, parallelizing a loop that does (only or almost only) additions cannot give a speedup, regardless of other things. That's how it seems to me.

Alexey Kukanov (Intel):

Thank you for the comment, Bartlomiej.
Actually, accessing SumFoo members from its function call operator requires no synchronization: TBB guarantees that this method is never called concurrently for the same instance of SumFoo. So the example is "adequate".

Bartlomiej:

I know I'm late, but I just read this interesting post and I'd like to thank you for it.
However, I have one comment: as far as I understood, the example is actually not very adequate, since operations on the shared variable sum should be synchronized somehow, so using a local variable is crucial also in order not to lock a mutex so many times.
Still, the example is clear and inspiring.

Alexey Kukanov (Intel):

Regarding replacing r.end() with a local variable: in another similar case I checked just recently, it helped the compiler recognize that the loop is vectorizable, which also led to a significant performance improvement.

anonymous:

On my computer(s), an even more important issue is bandwidth. If I replace the simple sum with a more involved calculation (one that uses only the values already fetched from memory), I quickly approach ideal speedup. An important bottleneck seems to be memory, of which there is only one!

Arch D. Robison (Intel):

Replacing r.end() does not help further because local_sum is the only variable written by the loop. Since the compiler can tell that local_sum has no aliases, it can safely hoist the other reads out of the loop.

Alexey Kukanov (Intel):

Jim, I am not sure I fully understand what your question is. Are you asking whether you can do such processing with TBB? I think yes. Or do you wonder whether these algorithms are better suited as a benchmark? Probably yes; I also believe there should be more computation per memory operation. Or are you asking something different?

anonymous:

This is cool stuff, but I do have a question. You are processing a tight loop many times over using TBB in parallel, but what if you want to process a longer algorithm a few times in parallel? Something like an FFT on block data or the fast Hadamard transform?


