Performance degradation
This is about the performance degradation that I find in my application with changes in the tbb::task usage. I
have made two sample programs which best replicate the cahnges I made to my application program. Most importantly it
also replicates the performance degradation I find in my application as well.
// Serial Fibonnaci sum long SerialFib( long n ) { if( n < 2) return n; else return SerialFib(n-1) +
SerialFib(n-2); }
static const int CutOff = 16; // Parallel Fibonnaci sum using tbb::task struct FibTask: public tbb::task{ public: long n; long x, y; long* sum; bool
is_continuation;
FibTask( long n_, long* sum_ ) : n(n_), sum(sum_), is_continuation(false), x(0),
y(0) {}
tbb::task* execute() { if( is_continuation ) { *sum = x+y; return
NULL; } else { if( n<CutOff ) { *sum = SerialFib(n); return NULL; }
else { FibTask& a = *new(allocate_child()) FibTask( n-2, &x); FibTask& b =
*new(allocate_child()) FibTask( n-1, &y); recycle_as_continuation(); is_continuation = true; //
Set ref_count to "two children". set_ref_count(2); spawn( b ); return &a; } }
} };
long ParallelFib( long n ) { long sum; FibTask& a =
*new(tbb::task::allocate_root()) FibTask( n, &sum); tbb::task::spawn_root_and_wait(a); return sum; }
Here is the main program : Main Program 1
int main( int argc, char *argv[]) { long n = 36; int nrTask = 1; // ( 10, 100, 1000) std::cout
<< "Computing Fib( " << n << " )" << std::endl;
TaskSchedulerInit init;
tbb::tick_count t0 = tbb::tick_count::now(); for( int i = 0; i < nrTasks; ++i) long result = ParallelFib(
n); tbb::tick_count t1 = tbb::tick_count::now(); std::cout << "Result :: " << result << "
Time Taken :: " << (t1 - t0).seconds() << std::endl; return 0; }
The results of main
program 1 on a Quad core machine with nrTasks varied from 1, 10, 100, 1000 are as follows
nrTasks Total Time taken( in seconds) 1 0.071 10 0.70 100 6.99 1000 69.83
But if I change the program as
follows : Main Program 2
static const int cacheLineSize
= 64; static const int JumpFactor = cacheLineSize / sizeof(long);
int main( int argc, char *argv[]) { long n = 36; int nrTasks = 1; //(10, 100, 1000)
std::cout << "Computing Fib( "
<< n << " )" << std::endl; std::cout << "Number of Tasks submitted by input thread => "
<< nrTasks << "." << std::endl; long* pSums = reinterpret_cast<long*>(
tbb::cache_aligned_allocator<char>().allocate( nrTasks * cacheLineSize)); memset( pSums, 0, nrTasks *
cacheLineSize); TaskSchedulerInit init; tbb::tick_count t0 = tbb::tick_count::now();
EmptyTask& a = *new( tbb::task::allocate_root()) EmptyTask(); a.set_ref_count(1); for( int i = 0; i <
nrTasks; ++i) { FibTask& b = *new ( a.allocate_additional_child_of( a)) FibTask( n, &pSums[i *
JumpFactor]); a.spawn( b); } a.wait_for_all(); a.destroy( a); tbb::tick_count t1 =
tbb::tick_count::now(); std::cout << "Time Taken :: " << (t1 - t0).seconds() << std::endl;
std::cout << "Validating Results .. " << std::endl; long sum = SerialFib( n); for( int i
= 0; i < nrTasks; ++i) { if( sum != pSums[i * JumpFactor]) { std::cout << "Error in the Fibnocci
Calculation for task " << i << std::endl; throw; } } std::cout << "Validation
Passed" << std::endl; tbb::cache_aligned_allocator<char>().deallocate(
reinterpret_cast<char*>(pSums), 0); return 0; }
The results of main program 2 are as
follows
nrTasks Total Time taken( in seconds) 1 0.080 10
0.80 100 8.015 1000 80.164
Now the results of the
main program 1 are as expected. However the results of main program 2 suffer some performance degradation. It looks like
per task overhead that is causing the degradation.(or is it because of something else.?)
The difference in
main program 1 and main program 2 is that in Program 1 the main thread waits until each task is complete and in Program
2 the main thread just spawns all the tasks and then waits on all the tasks to complete.
Are there ways where
the main program 2 could be made to perform as main program 1, keeping the fact that the main thread should spawn all
the tasks before it can starts working on the taks spawned?
| |
Re: Performance degradation
What happens if you use parallel_for instead in the second program?
| |
Re: Performance degradation
What happens if you use parallel_for instead in the second program?
here is the third program using parallel_for( I hope this was what you meant).
class ApplyTasks
{ public: void operator()( const tbb::blocked_range<size_t>& r ) const { for( size_t i =
r.begin(); i != r.end(); ++i ) sum[i * JumpFactor] = ParallelFib(n); }
ApplyTasks( long n_,
long* sum_) : n(n_), sum(sum_) {}
long n; long* sum; };
int main( int argc,
char *argv[]) { long n = 36; int nrTasks = 1; //(10, 100, 1000)
std::cout << "Computing
Fib( " << n << " )" << std::endl; std::cout << "Number of Tasks submitted by input thread
=> " << nrTasks << "." << std::endl;
long* pSums = reinterpret_cast<long*>(
tbb::cache_aligned_allocator<char>().allocate( nrTasks * cacheLineSize)); memset( pSums, 0, nrTasks *
cacheLineSize);
TaskSchedulerInit init; tbb::tick_count t0 = tbb::tick_count::now(); tbb::parallel_for( tbb::blocked_range<size_t>( 0, nrTasks), ApplyTasks( n, pSums)); tbb::tick_count t1 = tbb::tick_count::now(); std::cout << "Time Taken :: " << (t1 - t0).seconds()
<< std::endl;
std::cout << "Validating Results .. " << std::endl; long sum =
SerialFib( n); for( int i = 0; i < nrTasks; ++i) { if( sum != pSums[i * JumpFactor]) { std::cout
<< "Error in the Fibnocci Calculation for task " << i << std::endl; throw; } } std::cout << "Validation Passed" << std::endl; tbb::cache_aligned_allocator<char>().deallocate(
reinterpret_cast<char*>(pSums), 0); return 0; }
And here are the results
nrTasks Total Time taken( in seconds) 1
0.078 10 0.78 100 7.8 1000 78.3
| |
Re: Performance degradation
Are you sure that there is performance degradation? Here is the results for the first program I obtained on my
quad-core: Computing Fib( 36 ) Number of Tasks submitted by input thread => 1. Time Taken ::
0.0789379 Number of Tasks submitted by input thread => 10. Time Taken :: 0.797832 Number of Tasks
submitted by input thread => 100. Time Taken :: 8.13045 Number of Tasks submitted by input thread =>
1000. Time Taken :: 79.8137
And here is for the second program: Computing Fib( 36 ) Number of
Tasks submitted by input thread => 1. Time Taken :: 0.0798096 Number of Tasks submitted by input thread
=> 10. Time Taken :: 0.796995 Number of Tasks submitted by input thread => 100. Time Taken ::
8.133 Number of Tasks submitted by input thread => 1000. Time Taken :: 79.9169
I do not see any
significant performance degradation. Please, re-measure the first program.
| |
Re: Performance degradation
May this is what you meant because this seems to be fine.
Here is program 4
class
SpawnTasks { public: void operator()( const tbb::blocked_range<size_t>& r ) const { for(
size_t i = r.begin(); i != r.end(); ++i ) { FibTask& b = *new(
barrier->allocate_additional_child_of(*barrier)) FibTask( n, &sum[i * JumpFactor]); barrier->Spawn(
b); } }
SpawnTasks( long n_, long* sum_, EmptyTask* barrier_) : n(n_), sum(sum_),
barrier(barrier_) {}
long n; long* sum; EmptyTask* barrier; };
int
main( int argc, char *argv[]) { long n = argc>1 ? strtol(argv[1],0,0) : 36; int nrTasks = argc>2
? strtol(argv[2],0,0) : 100;
std::cout << "Computing Fib( " << n << " )" <<
std::endl; std::cout << "Number of Tasks submitted by input thread => " << nrTasks << "."
<< std::endl; long* pSums = reinterpret_cast<long*>(
tbb::cache_aligned_allocator<char>().allocate( nrTasks * cacheLineSize)); memset( pSums, 0, nrTasks *
cacheLineSize); TaskSchedulerInit init; tbb::tick_count t0 = tbb::tick_count::now();
EmptyTask& a = *new( tbb::task::allocate_root()) EmptyTask(); a.set_ref_count(1); tbb::parallel_for(
tbb::blocked_range<size_t>( 0, nrTasks), SpawnTasks( n, pSums, &a)); a.wait_for_all(); a.destroy(
a); tbb::tick_count t1 = tbb::tick_count::now(); std::cout << "Time Taken :: " << (t1 -
t0).seconds() << std::endl; std::cout << "Validating Results .. " << std::endl;
long sum = SerialFib( n); for( int i = 0; i < nrTasks; ++i) { if( sum != pSums[i * JumpFactor]) {
std::cout << "Error in the Fibnocci Calculation for task " << i << std::endl; throw; } } std::cout << "Validation Passed" << std::endl;
tbb::cache_aligned_allocator<char>().deallocate( reinterpret_cast<char*>(pSums), 0); return 0; }
And here are the results
1 0.071 10 0.708 100 7.06 1000 70.66
I have no clue whats going on any way. Why is this better than program 2?
| |
Re: Performance degradation
Are you sure that there is performance degradation? Here is the
results for the first program I obtained on my quad-core: Computing Fib( 36 ) Number of Tasks submitted by
input thread => 1. Time Taken :: 0.0789379 Number of Tasks submitted by input thread => 10. Time
Taken :: 0.797832 Number of Tasks submitted by input thread => 100. Time Taken :: 8.13045 Number of
Tasks submitted by input thread => 1000. Time Taken :: 79.8137
And here is for the second program: Computing Fib( 36 ) Number of Tasks submitted by input thread => 1. Time Taken :: 0.0798096 Number
of Tasks submitted by input thread => 10. Time Taken :: 0.796995 Number of Tasks submitted by input thread
=> 100. Time Taken :: 8.133 Number of Tasks submitted by input thread => 1000. Time Taken ::
79.9169
I do not see any significant performance degradation. Please, re-measure the first program.
I did re-measure everything again now. The results are still the same.
I use Visual Studio 2005. Does this
have to do anything with the compiler optimizations?
| |
Re: Performance degradation
And here are the results
1
0.071 10 0.708 100 7.06 1000 70.66
I have no clue whats going on any way. Why is this better than program 2?
I still get the same results for this program:
Computing Fib( 36 ) Number of Tasks submitted by
input thread => 1. Time Taken :: 0.0798366 Number of Tasks submitted by input thread => 10. Time
Taken :: 0.995838 Number of Tasks submitted by input thread => 100. Time Taken :: 7.97388 Number of
Tasks submitted by input thread => 1000. Time Taken :: 79.7281
They are all pretty much the same on
my quad-core machine...
| |
Re: Performance degradation
FYI. Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40 GHz 2.39 GHz, 3.86GB of RAM is the hardware that I use.
| |
Re: Performance degradation
I still get the same results for this program:
Computing Fib( 36 ) Number of Tasks submitted by
input thread => 1. Time Taken :: 0.0798366 Number of Tasks submitted by input thread => 10. Time
Taken :: 0.995838 Number of Tasks submitted by input thread => 100. Time Taken :: 7.97388 Number of
Tasks submitted by input thread => 1000. Time Taken :: 79.7281
They are all pretty much the same on
my quad-core machine...
Strange that the same code behaves differently on our machines. More confusions now. As I said before I have no
clue :)
| |
Re: Performance degradation
I have no clue whats going on any way. Why is this better than program
2?
There is some difference as to how tasks are allocated and spawned. In program 2 all root fib tasks are
allocated in one thread, thus some potential for false sharing. Also all the tasks are initially placed into single task
deque, thus some increased contention during stealing. Version with parallel_for allocate and spawn tasks in a
distributed manner, so no above-mentioned problems. BUT I do NOT think that above-mentioned problems account for
such a big performance difference, because most of the time is spent in fibonachi calculation anyway.
| |
Re: Performance degradation
There is some difference as to how tasks are allocated and spawned. In program 2 all root fib tasks are allocated
in one thread, thus some potential for false sharing. Also all the tasks are initially placed into single task deque,
thus some increased contention during stealing. Version with parallel_for allocate and spawn tasks in a distributed
manner, so no above-mentioned problems. BUT I do NOT think that above-mentioned problems account for such a big
performance difference, because most of the time is spent in fibonachi calculation anyway.
You may also try following code:
struct SpawnTask : tbb::task
{
int count;
int n;
long* result;
SpawnTask(int count, int n, long* result)
: count(count)
, n(n)
, result(result)
{}
virtual tbb::task* execute()
{
if (count == 1)
{
set_ref_count(2);
spawn_and_wait_for_all(*new(allocate_child()) FibTask( n, result));
}
else if (count == 2)
{
set_ref_count(3);
spawn(*new(allocate_child()) FibTask( n, result));
spawn_and_wait_for_all(*new(allocate_child()) FibTask( n, result + JumpFactor));
}
else
{
int count2 = count / 2;
set_ref_count(3);
spawn(*new(allocate_child()) SpawnTask(count2, n, result));
spawn_and_wait_for_all(*new(allocate_child()) SpawnTask(count - count2, n, result + count2 * JumpFactor));
}
return 0;
}
};
p.s. I also test on Q6600.
| | |