November 9, 2009 10:53 PM PST
Performance degradation
This is about the performance degradation that I find in my application with changes in the tbb::task usage. I have made two sample programs which best replicate the cahnges I made to my application program. Most importantly it also replicates the performance degradation I find in my application as well.
// Serial Fibonnaci sum long SerialFib( long n ) { if( n < 2) return n; else return SerialFib(n-1) + SerialFib(n-2); }
static const int CutOff = 16; // Parallel Fibonnaci sum using tbb::task struct FibTask: public tbb::task{ public: long n; long x, y; long* sum; bool is_continuation;
long ParallelFib( long n ) { long sum; FibTask& a = *new(tbb::task::allocate_root()) FibTask( n, &sum); tbb::task::spawn_root_and_wait(a); return sum; }
Here is the main program : Main Program 1
int main( int argc, char *argv[]) { long n = 36; int nrTask = 1; // ( 10, 100, 1000) std::cout << "Computing Fib( " << n << " )" << std::endl;
TaskSchedulerInit init; tbb::tick_count t0 = tbb::tick_count::now(); for( int i = 0; i < nrTasks; ++i) long result = ParallelFib( n); tbb::tick_count t1 = tbb::tick_count::now(); std::cout << "Result :: " << result << " Time Taken :: " << (t1 - t0).seconds() << std::endl; return 0; }
The results of main program 1 on a Quad core machine with nrTasks varied from 1, 10, 100, 1000 are as follows
nrTasks Total Time taken( in seconds) 1 0.071 10 0.70 100 6.99 1000 69.83
But if I change the program as follows : Main Program 2
static const int cacheLineSize = 64; static const int JumpFactor = cacheLineSize / sizeof(long);
int main( int argc, char *argv[]) { long n = 36; int nrTasks = 1; //(10, 100, 1000)
std::cout << "Computing Fib( " << n << " )" << std::endl; std::cout << "Number of Tasks submitted by input thread => " << nrTasks << "." << std::endl;
TaskSchedulerInit init; tbb::tick_count t0 = tbb::tick_count::now(); EmptyTask& a = *new( tbb::task::allocate_root()) EmptyTask(); a.set_ref_count(1); for( int i = 0; i < nrTasks; ++i) { FibTask& b = *new ( a.allocate_additional_child_of( a)) FibTask( n, &pSums[i * JumpFactor]); a.spawn( b); } a.wait_for_all(); a.destroy( a); tbb::tick_count t1 = tbb::tick_count::now(); std::cout << "Time Taken :: " << (t1 - t0).seconds() << std::endl;
std::cout << "Validating Results .. " << std::endl; long sum = SerialFib( n); for( int i = 0; i < nrTasks; ++i) { if( sum != pSums[i * JumpFactor]) { std::cout << "Error in the Fibnocci Calculation for task " << i << std::endl; throw; } } std::cout << "Validation Passed" << std::endl; tbb::cache_aligned_allocator<char>().deallocate( reinterpret_cast<char*>(pSums), 0); return 0; }
The results of main program 2 are as follows
nrTasks Total Time taken( in seconds) 1 0.080 10 0.80 100 8.015 1000 80.164
Now the results of the main program 1 are as expected. However the results of main program 2 suffer some performance degradation. It looks like per task overhead that is causing the degradation.(or is it because of something else.?)
The difference in main program 1 and main program 2 is that in Program 1 the main thread waits until each task is complete and in Program 2 the main thread just spawns all the tasks and then waits on all the tasks to complete.
Are there ways where the main program 2 could be made to perform as main program 1, keeping the fact that the main thread should spawn all the tasks before it can starts working on the taks spawned?
Are you sure that there is performance degradation? Here is the results for the first program I obtained on my quad-core: Computing Fib( 36 ) Number of Tasks submitted by input thread => 1. Time Taken :: 0.0789379 Number of Tasks submitted by input thread => 10. Time Taken :: 0.797832 Number of Tasks submitted by input thread => 100. Time Taken :: 8.13045 Number of Tasks submitted by input thread => 1000. Time Taken :: 79.8137
And here is for the second program: Computing Fib( 36 ) Number of Tasks submitted by input thread => 1. Time Taken :: 0.0798096 Number of Tasks submitted by input thread => 10. Time Taken :: 0.796995 Number of Tasks submitted by input thread => 100. Time Taken :: 8.133 Number of Tasks submitted by input thread => 1000. Time Taken :: 79.9169
I do not see any significant performance degradation. Please, re-measure the first program.
Are you sure that there is performance degradation? Here is the results for the first program I obtained on my quad-core: Computing Fib( 36 ) Number of Tasks submitted by input thread => 1. Time Taken :: 0.0789379 Number of Tasks submitted by input thread => 10. Time Taken :: 0.797832 Number of Tasks submitted by input thread => 100. Time Taken :: 8.13045 Number of Tasks submitted by input thread => 1000. Time Taken :: 79.8137
And here is for the second program: Computing Fib( 36 ) Number of Tasks submitted by input thread => 1. Time Taken :: 0.0798096 Number of Tasks submitted by input thread => 10. Time Taken :: 0.796995 Number of Tasks submitted by input thread => 100. Time Taken :: 8.133 Number of Tasks submitted by input thread => 1000. Time Taken :: 79.9169
I do not see any significant performance degradation. Please, re-measure the first program.
I did re-measure everything again now. The results are still the same.
I use Visual Studio 2005. Does this have to do anything with the compiler optimizations?
I have no clue whats going on any way. Why is this better than program 2?
I still get the same results for this program:
Computing Fib( 36 ) Number of Tasks submitted by input thread => 1. Time Taken :: 0.0798366 Number of Tasks submitted by input thread => 10. Time Taken :: 0.995838 Number of Tasks submitted by input thread => 100. Time Taken :: 7.97388 Number of Tasks submitted by input thread => 1000. Time Taken :: 79.7281
They are all pretty much the same on my quad-core machine...
Computing Fib( 36 ) Number of Tasks submitted by input thread => 1. Time Taken :: 0.0798366 Number of Tasks submitted by input thread => 10. Time Taken :: 0.995838 Number of Tasks submitted by input thread => 100. Time Taken :: 7.97388 Number of Tasks submitted by input thread => 1000. Time Taken :: 79.7281
They are all pretty much the same on my quad-core machine...
Strange that the same code behaves differently on our machines. More confusions now. As I said before I have no clue :)
I have no clue whats going on any way. Why is this better than program 2?
There is some difference as to how tasks are allocated and spawned. In program 2 all root fib tasks are allocated in one thread, thus some potential for false sharing. Also all the tasks are initially placed into single task deque, thus some increased contention during stealing. Version with parallel_for allocate and spawn tasks in a distributed manner, so no above-mentioned problems. BUT I do NOT think that above-mentioned problems account for such a big performance difference, because most of the time is spent in fibonachi calculation anyway.
There is some difference as to how tasks are allocated and spawned. In program 2 all root fib tasks are allocated in one thread, thus some potential for false sharing. Also all the tasks are initially placed into single task deque, thus some increased contention during stealing. Version with parallel_for allocate and spawn tasks in a distributed manner, so no above-mentioned problems. BUT I do NOT think that above-mentioned problems account for such a big performance difference, because most of the time is spent in fibonachi calculation anyway.
You may also try following code:
struct SpawnTask : tbb::task
{
int count;
int n;
long* result;
SpawnTask(int count, int n, long* result)
: count(count)
, n(n)
, result(result)
{}
virtual tbb::task* execute()
{
if (count == 1)
{
set_ref_count(2);
spawn_and_wait_for_all(*new(allocate_child()) FibTask( n, result));
}
else if (count == 2)
{
set_ref_count(3);
spawn(*new(allocate_child()) FibTask( n, result));
spawn_and_wait_for_all(*new(allocate_child()) FibTask( n, result + JumpFactor));
}
else
{
int count2 = count / 2;
set_ref_count(3);
spawn(*new(allocate_child()) SpawnTask(count2, n, result));
spawn_and_wait_for_all(*new(allocate_child()) SpawnTask(count - count2, n, result + count2 * JumpFactor));
}
return 0;
}
};
They look exactly like the results you have got. :)
However I would want you to try running this file TestScheduler.cpp. What I have done here is that I have commented the options 2, 3 and 4 in your code. I know this might sound weird but the results when I run these are as follows
When comment out everything except #1 I still get the same result - 8 seconds. Btw, I switched to MSVC2005, and tried to play with compiler switches. Results are the same - all variants take roughly 8 seconds.
Humm... try to investigate assembly of FibTask::execute() when everything except #1 is commented out and when it is not. Is there any difference? Also you may try to turn off LTCG (Link time code generation) and move FibTask::execute() to another cpp file.
When comment out everything except #1 I still get the same result - 8 seconds. Btw, I switched to MSVC2005, and tried to play with compiler switches. Results are the same - all variants take roughly 8 seconds.
Humm... try to investigate assembly of FibTask::execute() when everything except #1 is commented out and when it is not. Is there any difference? Also you may try to turn off LTCG (Link time code generation) and move FibTask::execute() to another cpp file.
Hi Dmitriy,
I think this suggestion of yours helped in some way. I did the following steps
* Moved the FibTask, Spawner and SpawnTask classes to a file named FibTask.h. I only kept only the declarations of the methods int the .h file and moved the implementation of the methods to FibTask.cpp.
* I also moved SerialFib and ParallelFib declarations to FibTask.h and their implementation to FibTask.cpp
* Now only my main function lies in the file TestMain.cpp and the code is unchanged.
And then I switched off /GL and /LTCG compiler options and ran the two tests( one where only #1 is run and other where all #1,2,3,4 are run). I switched on /GL and /LTCG and ran the two tests again.
Here are the results now
1. Program compiled by switching off /GL and/LTCG options
And then I switched off /GL and /LTCG compiler options and ran the two tests( one where only #1 is run and other where all #1,2,3,4 are run).
So what has helped in your suggestion is that you told me to move the implementation of FibTask, Spawner and SpawnTask to another .cpp file.
And now all the different options(i.e #1, #2, #3 , #4) perform equally(as you expect) and in all of them each task takes only 0.070 sec ( as I want).
Atleast the problem is solved now. But Im not clear as to why it got solved by moving implementation to another cpp file?
Is this because of the task's vtable replication or something? Because I have ran into that problem when I tried to wrap the tbb::task class(by deriving from tbb::task) and overrided a virtual method note_affinity providing the implementation in the .h file itself. But the solution came in from the comments on top of the function void task::note_affinity( affinity_id ) defined in task.cpp file.