Shankar
Total Points:
535
Status Points:
35
Brown Belt
November 9, 2009 10:53 PM PST
Performance degradation

This is about a performance degradation that I see in my application after changing how it uses tbb::task. I have written two sample programs that closely replicate the changes I made to my application. Most importantly, they also reproduce the performance degradation I see there.

#include "tbb/task.h"

// Serial Fibonacci sum
long SerialFib( long n ) {
    if( n < 2)
        return n;
    else
        return SerialFib(n-1) + SerialFib(n-2);
}

static const int CutOff = 16;
// Parallel Fibonacci sum using tbb::task
struct FibTask: public tbb::task{
public:
    long n;
    long x, y;
    long* sum;
    bool is_continuation;

    // Initializers listed in declaration order
    FibTask( long n_, long* sum_ ) :
        n(n_), x(0), y(0), sum(sum_), is_continuation(false)
    {}

    tbb::task* execute()
    {
        if( is_continuation ) {
            // Second run: both children have finished, so combine their results.
            *sum = x+y;
            return NULL;
        }
        else
        {
            if( n<CutOff ) {
                // Small problem: compute serially to limit task overhead.
                *sum = SerialFib(n);
                return NULL;
            }
            else {
                FibTask& a = *new(allocate_child()) FibTask( n-2, &x);
                FibTask& b = *new(allocate_child()) FibTask( n-1, &y);
                recycle_as_continuation();
                is_continuation = true;
                // Set ref_count to "two children".
                set_ref_count(2);
                spawn( b );
                return &a;
            }
        }
    }
};

long ParallelFib( long n ) {
    long sum;
    FibTask& a = *new(tbb::task::allocate_root()) FibTask( n, &sum);
    tbb::task::spawn_root_and_wait(a);
    return sum;
}

Here is the main program:
Main Program 1

int main( int argc, char *argv[])
{
    long n = 36;
    int nrTasks = 1; // ( 10, 100, 1000)
    long result = 0; // declared outside the loop so it is still in scope when printed
    std::cout << "Computing Fib( " << n << " )" << std::endl;

    TaskSchedulerInit init; // presumably a wrapper around tbb::task_scheduler_init
    tbb::tick_count t0 = tbb::tick_count::now();
    for( int i = 0; i < nrTasks; ++i)
        result = ParallelFib( n);
    tbb::tick_count t1 = tbb::tick_count::now();
    std::cout << "Result :: " << result << " Time Taken :: " << (t1 - t0).seconds() << std::endl;
    return 0;
}

The results of main program 1 on a quad-core machine, with nrTasks set to 1, 10, 100 and 1000, are as follows:

nrTasks    Total time taken (in seconds)
1          0.071
10         0.70
100        6.99
1000       69.83

But if I change the program as follows:
Main Program 2

static const int cacheLineSize = 64;
// Stride, in longs, that puts each result on its own cache line
static const int JumpFactor = cacheLineSize / sizeof(long);

int main( int argc, char *argv[])
{
    long n = 36;
    int nrTasks = 1; //(10, 100, 1000)

    std::cout << "Computing Fib( " << n << " )" << std::endl;
    std::cout << "Number of Tasks submitted by input thread => " << nrTasks << "." << std::endl;

    long* pSums = reinterpret_cast<long*>( tbb::cache_aligned_allocator<char>().allocate( nrTasks * cacheLineSize));
    memset( pSums, 0, nrTasks * cacheLineSize);

    TaskSchedulerInit init;
    tbb::tick_count t0 = tbb::tick_count::now();
    // EmptyTask only serves as a parent to wait on.
    EmptyTask& a = *new( tbb::task::allocate_root()) EmptyTask();
    // 1 = the extra reference that wait_for_all() requires; each child adds one more.
    a.set_ref_count(1);
    for( int i = 0; i < nrTasks; ++i) {
        FibTask& b = *new ( a.allocate_additional_child_of( a)) FibTask( n, &pSums[i * JumpFactor]);
        a.spawn( b);
    }
    a.wait_for_all();
    a.destroy( a);
    tbb::tick_count t1 = tbb::tick_count::now();
    std::cout << "Time Taken :: " << (t1 - t0).seconds() << std::endl;

    std::cout << "Validating Results .. " << std::endl;
    long sum = SerialFib( n);
    for( int i = 0; i < nrTasks; ++i) {
        if( sum != pSums[i * JumpFactor]) {
            std::cout << "Error in the Fibonacci calculation for task " << i << std::endl;
            // A bare `throw;` with no active exception would call std::terminate.
            return 1;
        }
    }
    std::cout << "Validation Passed" << std::endl;
    tbb::cache_aligned_allocator<char>().deallocate( reinterpret_cast<char*>(pSums), 0);
    return 0;
}
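
The JumpFactor stride in the program above gives each result its own 64-byte cache line, so that tasks writing neighbouring results do not share a line. A rough standard-C++ sketch of the same idea follows, assuming a 64-byte line as the program does. PaddedSum, jumpFactor and slotsOnDistinctLines are illustrative names, not part of the program, and alignas needs C++11, which is newer than the compilers used in this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

static const std::size_t kCacheLineSize = 64; // assumed, as in the program above

// One result per cache line: alignas pads the struct to a full line.
struct alignas(64) PaddedSum {
    long value;
};

// Index stride (in longs) that Program 2 uses to reach the next cache line.
std::size_t jumpFactor() {
    return kCacheLineSize / sizeof(long);
}

// True if consecutive slots start exactly one cache line apart.
bool slotsOnDistinctLines(const PaddedSum* slots, std::size_t count) {
    for (std::size_t i = 1; i < count; ++i) {
        std::uintptr_t prev = reinterpret_cast<std::uintptr_t>(&slots[i - 1]);
        std::uintptr_t cur  = reinterpret_cast<std::uintptr_t>(&slots[i]);
        if (cur - prev != kCacheLineSize) return false;
    }
    return true;
}
```

An array of PaddedSum plays the same role as pSums indexed by i * JumpFactor: one writer per line, at the cost of unused padding bytes.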

The results of main program 2 are as follows:

nrTasks    Total time taken (in seconds)
1          0.080
10         0.80
100        8.015
1000       80.164

The results of main program 1 are as expected. The results of main program 2, however, show some performance degradation. It looks like a per-task overhead is causing it (or is it something else?).

The difference between the two programs is that in Program 1 the main thread waits for each task to complete before submitting the next one, whereas in Program 2 the main thread spawns all the tasks first and then waits for all of them to complete.

Is there a way to make main program 2 perform like main program 1, given that the main thread must spawn all the tasks before it can start working on the spawned tasks?
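
One way to read the two tables above is to divide each total by the task count and compare. The following is a minimal sketch, not part of either program. The helper names perTaskSeconds and relativeSlowdown are made up for illustration, and the inputs are the 1000-task totals reported above.

```cpp
#include <cassert>
#include <cstdio>

// Average wall-clock time per submitted task, in seconds.
double perTaskSeconds(double totalSeconds, int nrTasks) {
    return totalSeconds / nrTasks;
}

// Fractional slowdown of `candidate` relative to `baseline` (0.10 means 10% slower).
double relativeSlowdown(double baseline, double candidate) {
    return (candidate - baseline) / baseline;
}

// Prints the comparison for the 1000-task rows of the two tables.
void printComparison() {
    const double program1 = 69.83;   // Main Program 1, nrTasks = 1000
    const double program2 = 80.164;  // Main Program 2, nrTasks = 1000
    std::printf("per task: %.4f s vs %.4f s, slowdown %.1f%%\n",
                perTaskSeconds(program1, 1000),
                perTaskSeconds(program2, 1000),
                100.0 * relativeSlowdown(program1, program2));
}
```

With these inputs the per-task cost is roughly 0.070 s against 0.080 s, a difference of about 15%.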
Raf Schietekat
Total Points:
16,765
Status Points:
16,765
Black Belt
November 9, 2009 11:27 PM PST
#1
What happens if you use parallel_for instead in the second program?


Shankar
Total Points:
535
Status Points:
35
Brown Belt
November 10, 2009 1:00 AM PST
#2 Reply to #1
Quoting - Raf Schietekat
What happens if you use parallel_for instead in the second program?

Here is the third program, using parallel_for (I hope this is what you meant).

class ApplyTasks {
public:
    // Each index in the range runs one complete ParallelFib and stores its result.
    void operator()( const tbb::blocked_range<size_t>& r ) const {
        for( size_t i = r.begin(); i != r.end(); ++i )
            sum[i * JumpFactor] = ParallelFib(n);
    }

    ApplyTasks( long n_, long* sum_) : n(n_), sum(sum_) {}

    long n;
    long* sum;
};

int main( int argc, char *argv[])
{
    long n = 36;
    int nrTasks = 1; //(10, 100, 1000)

    std::cout << "Computing Fib( " << n << " )" << std::endl;
    std::cout << "Number of Tasks submitted by input thread => " << nrTasks << "." << std::endl;

    long* pSums = reinterpret_cast<long*>( tbb::cache_aligned_allocator<char>().allocate( nrTasks * cacheLineSize));
    memset( pSums, 0, nrTasks * cacheLineSize);

    TaskSchedulerInit init;
    tbb::tick_count t0 = tbb::tick_count::now();
    tbb::parallel_for( tbb::blocked_range<size_t>( 0, nrTasks), ApplyTasks( n, pSums));
    tbb::tick_count t1 = tbb::tick_count::now();
    std::cout << "Time Taken :: " << (t1 - t0).seconds() << std::endl;

    std::cout << "Validating Results .. " << std::endl;
    long sum = SerialFib( n);
    for( int i = 0; i < nrTasks; ++i) {
        if( sum != pSums[i * JumpFactor]) {
            std::cout << "Error in the Fibonacci calculation for task " << i << std::endl;
            // A bare `throw;` with no active exception would call std::terminate.
            return 1;
        }
    }
    std::cout << "Validation Passed" << std::endl;
    tbb::cache_aligned_allocator<char>().deallocate( reinterpret_cast<char*>(pSums), 0);
    return 0;
}

And here are the results:

nrTasks    Total time taken (in seconds)
1          0.078
10         0.78
100        7.8
1000       78.3



Dmitriy Vyukov
Total Points:
25,382
Status Points:
25,382
Black Belt
November 10, 2009 1:07 AM PST
#3 Reply to #2
Are you sure that there is performance degradation?
Here is the results for the first program I obtained on my quad-core:
Computing Fib( 36 )
Number of Tasks submitted by input thread => 1.
Time Taken :: 0.0789379
Number of Tasks submitted by input thread => 10.
Time Taken :: 0.797832
Number of Tasks submitted by input thread => 100.
Time Taken :: 8.13045
Number of Tasks submitted by input thread => 1000.
Time Taken :: 79.8137

And here is for the second program:
Computing Fib( 36 )
Number of Tasks submitted by input thread => 1.
Time Taken :: 0.0798096
Number of Tasks submitted by input thread => 10.
Time Taken :: 0.796995
Number of Tasks submitted by input thread => 100.
Time Taken :: 8.133
Number of Tasks submitted by input thread => 1000.
Time Taken :: 79.9169

I do not see any significant performance degradation.
Please, re-measure the first program.



Shankar
Total Points:
535
Status Points:
35
Brown Belt
November 10, 2009 1:16 AM PST
#4 Reply to #2
Maybe this is what you meant, because this one seems to be fine.

Here is program 4:

class SpawnTasks {
public:
    // Each index in the range allocates one FibTask as a child of the barrier and spawns it.
    void operator()( const tbb::blocked_range<size_t>& r ) const {
        for( size_t i = r.begin(); i != r.end(); ++i ) {
            FibTask& b = *new( barrier->allocate_additional_child_of(*barrier)) FibTask( n, &sum[i * JumpFactor]);
            barrier->spawn( b);
        }
    }

    SpawnTasks( long n_, long* sum_, EmptyTask* barrier_) : n(n_), sum(sum_), barrier(barrier_) {}

    long n;
    long* sum;
    EmptyTask* barrier;
};

int main( int argc, char *argv[])
{
    long n = argc>1 ? strtol(argv[1],0,0) : 36;
    int nrTasks = argc>2 ? strtol(argv[2],0,0) : 100;

    std::cout << "Computing Fib( " << n << " )" << std::endl;
    std::cout << "Number of Tasks submitted by input thread => " << nrTasks << "." << std::endl;

    long* pSums = reinterpret_cast<long*>( tbb::cache_aligned_allocator<char>().allocate( nrTasks * cacheLineSize));
    memset( pSums, 0, nrTasks * cacheLineSize);

    TaskSchedulerInit init;
    tbb::tick_count t0 = tbb::tick_count::now();
    EmptyTask& a = *new( tbb::task::allocate_root()) EmptyTask();
    a.set_ref_count(1);
    tbb::parallel_for( tbb::blocked_range<size_t>( 0, nrTasks), SpawnTasks( n, pSums, &a));
    a.wait_for_all();
    a.destroy( a);
    tbb::tick_count t1 = tbb::tick_count::now();
    std::cout << "Time Taken :: " << (t1 - t0).seconds() << std::endl;

    std::cout << "Validating Results .. " << std::endl;
    long sum = SerialFib( n);
    for( int i = 0; i < nrTasks; ++i) {
        if( sum != pSums[i * JumpFactor]) {
            std::cout << "Error in the Fibonacci calculation for task " << i << std::endl;
            // A bare `throw;` with no active exception would call std::terminate.
            return 1;
        }
    }
    std::cout << "Validation Passed" << std::endl;
    tbb::cache_aligned_allocator<char>().deallocate( reinterpret_cast<char*>(pSums), 0);
    return 0;
}

And here are the results:

nrTasks    Total time taken (in seconds)
1          0.071
10         0.708
100        7.06
1000       70.66


I have no clue what's going on anyway. Why is this better than program 2?


Shankar
Total Points:
535
Status Points:
35
Brown Belt
November 10, 2009 1:27 AM PST
#5 Reply to #3
I re-measured everything just now. The results are still the same.

I use Visual Studio 2005. Could this have anything to do with compiler optimizations?


Dmitriy Vyukov
Total Points:
25,382
Status Points:
25,382
Black Belt
November 10, 2009 1:27 AM PST
#6 Reply to #4
Quoting - Shankar
And here are the results

1                0.071
10              0.708
100            7.06
1000          70.66


I have no clue whats going on any way. Why is this better than program 2?

I still get the same results for this program:

Computing Fib( 36 )
Number of Tasks submitted by input thread => 1.
Time Taken :: 0.0798366
Number of Tasks submitted by input thread => 10.
Time Taken :: 0.995838
Number of Tasks submitted by input thread => 100.
Time Taken :: 7.97388
Number of Tasks submitted by input thread => 1000.
Time Taken :: 79.7281

They are all pretty much the same on my quad-core machine...



Shankar
Total Points:
535
Status Points:
35
Brown Belt
November 10, 2009 1:31 AM PST
#7 Reply to #5
FYI, the hardware I use is an Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40 GHz, 2.39 GHz, with 3.86 GB of RAM.


Shankar
Total Points:
535
Status Points:
35
Brown Belt
November 10, 2009 1:34 AM PST
#8 Reply to #6
Strange that the same code behaves differently on our machines. Now I am even more confused.
As I said before, I have no clue :)


Dmitriy Vyukov
Total Points:
25,382
Status Points:
25,382
Black Belt
November 10, 2009 1:47 AM PST
#9 Reply to #4
Quoting - Shankar
I have no clue whats going on any way. Why is this better than program 2?

There are some differences in how the tasks are allocated and spawned.
In program 2 all root Fib tasks are allocated in one thread, so there is some potential for false sharing. Also, all the tasks are initially placed into a single task deque, so there is some increased contention during stealing.
The version with parallel_for allocates and spawns tasks in a distributed manner, so it avoids both problems.
BUT I do NOT think these problems account for such a big performance difference, because most of the time is spent in the Fibonacci calculation anyway.




Dmitriy Vyukov
Total Points:
25,382
Status Points:
25,382
Black Belt
November 10, 2009 1:48 AM PST
#10 Reply to #9
You may also try the following code:

struct SpawnTask : tbb::task
{
    int count;
    int n;
    long* result;

    SpawnTask(int count, int n, long* result)
        : count(count)
        , n(n)
        , result(result)
    {}

    virtual tbb::task* execute()
    {
        if (count == 1)
        {
            set_ref_count(2);
            spawn_and_wait_for_all(*new(allocate_child()) FibTask( n, result));
        }
        else if (count == 2)
        {
            set_ref_count(3);
            spawn(*new(allocate_child()) FibTask( n, result));
            spawn_and_wait_for_all(*new(allocate_child()) FibTask( n, result + JumpFactor));
        }
        else
        {
            int count2 = count / 2;
            set_ref_count(3);
            spawn(*new(allocate_child()) SpawnTask(count2, n, result));
            spawn_and_wait_for_all(*new(allocate_child()) SpawnTask(count - count2, n, result + count2 * JumpFactor));
        }
        return 0;
    }
};

P.S. I also test on a Q6600.
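
To see how SpawnTask divides the work, its recursion can be replayed sequentially without TBB. This is only an illustration under that simplification: collectLeafOffsets and leafOffsets are hypothetical helpers that record, for each FibTask the real code would create, the result slot it would write to (in units of JumpFactor).

```cpp
#include <cassert>
#include <vector>

// Replays SpawnTask's splitting for `count` tasks starting at slot `first`.
// Each leaf corresponds to one FibTask; its slot index is appended to `out`.
// Like the original, this assumes count >= 1.
void collectLeafOffsets(int count, int first, std::vector<int>& out) {
    if (count == 1) {
        out.push_back(first);
    } else if (count == 2) {
        out.push_back(first);
        out.push_back(first + 1);
    } else {
        int count2 = count / 2;                                   // left half
        collectLeafOffsets(count2, first, out);
        collectLeafOffsets(count - count2, first + count2, out);  // right half
    }
}

// Convenience wrapper returning the slots used for `count` tasks.
std::vector<int> leafOffsets(int count) {
    std::vector<int> out;
    collectLeafOffsets(count, 0, out);
    return out;
}
```

For a count of 5 this yields slots 0 through 4, so every FibTask writes to its own slot. In the real task tree the two halves run in parallel instead of one after the other.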


Shankar
Total Points:
535
Status Points:
35
Brown Belt
November 10, 2009 2:21 AM PST
#11 Reply to #10
What is the count that you pass to the constructor here? Also, how is the main program that uses SpawnTask written? Does it use parallel_for?


Dmitriy Vyukov
Total Points:
25,382
Status Points:
25,382
Black Belt
November 10, 2009 2:33 AM PST
#12 Reply to #11
Quoting - Shankar
what is count here that you pass to the constructor. Also how is the main program written that uses SpawnTasks. Does it use parallel_for?

No, just:
SpawnTask& a = *new(tbb::task::allocate_root()) SpawnTask(nrTasks, n, pSums);
tbb::task::spawn_root_and_wait(a);



Dmitriy Vyukov
Total Points:
25,382
Status Points:
25,382
Black Belt
November 10, 2009 2:34 AM PST
#13 Reply to #12
Run the following test:

int main()
{
    long n = 36;
    tbb::task_scheduler_init init;
    int task_counts[] = {10, 50, 100};

    std::cout << "count:\t";
    for (int idx = 0; idx != 3; idx += 1)
        std::cout << task_counts[idx] << "\t";
    std::cout << std::endl;

    for (int test_count = 0; test_count != 3; test_count += 1)
    {
        std::cout << "#1:\t";
        for (int idx = 0; idx != 3; idx += 1)
        {
            int nrTasks = task_counts[idx];
            tbb::tick_count t0 = tbb::tick_count::now();

            for( int i = 0; i < nrTasks; ++i)
                long result = ParallelFib( n);

            tbb::tick_count t1 = tbb::tick_count::now();
            std::cout << std::setprecision(4) << (t1 - t0).seconds() << "\t";
        }
        std::cout << std::endl;

        std::cout << "#2:\t";
        for (int idx = 0; idx != 3; idx += 1)
        {
            int nrTasks = task_counts[idx];
            tbb::tick_count t0 = tbb::tick_count::now();

            long* pSums = reinterpret_cast<long*>( tbb::cache_aligned_allocator<char>().allocate( nrTasks * cacheLineSize));
            memset( pSums, 0, nrTasks * cacheLineSize);
            tbb::empty_task& a = *new( tbb::task::allocate_root()) tbb::empty_task();
            a.set_ref_count(1);
            for( int i = 0; i < nrTasks; ++i) {
                FibTask& b = *new ( a.allocate_additional_child_of( a)) FibTask( n, &pSums[i * JumpFactor]);
                a.spawn( b);
            }
            a.wait_for_all();
            a.destroy( a);
            tbb::cache_aligned_allocator<char>().deallocate( reinterpret_cast<char*>(pSums), 0);

            tbb::tick_count t1 = tbb::tick_count::now();
            std::cout << std::setprecision(4) << (t1 - t0).seconds() << "\t";
        }
        std::cout << std::endl;

        std::cout << "#3:\t";
        for (int idx = 0; idx != 3; idx += 1)
        {
            int nrTasks = task_counts[idx];
            tbb::tick_count t0 = tbb::tick_count::now();

            long* pSums = reinterpret_cast<long*>( tbb::cache_aligned_allocator<char>().allocate( nrTasks * cacheLineSize));
            memset( pSums, 0, nrTasks * cacheLineSize);
            tbb::empty_task& a = *new( tbb::task::allocate_root()) tbb::empty_task();
            a.set_ref_count(1);
            tbb::parallel_for( tbb::blocked_range<size_t>( 0, nrTasks), SpawnTasks( n, pSums, &a));
            a.wait_for_all();
            a.destroy( a);
            tbb::cache_aligned_allocator<char>().deallocate( reinterpret_cast<char*>(pSums), 0);

            tbb::tick_count t1 = tbb::tick_count::now();
            std::cout << std::setprecision(4) << (t1 - t0).seconds() << "\t";
        }
        std::cout << std::endl;

        std::cout << "#4:\t";
        for (int idx = 0; idx != 3; idx += 1)
        {
            int nrTasks = task_counts[idx];
            tbb::tick_count t0 = tbb::tick_count::now();

            long* pSums = reinterpret_cast<long*>( tbb::cache_aligned_allocator<char>().allocate( nrTasks * cacheLineSize));
            memset( pSums, 0, nrTasks * cacheLineSize);
            SpawnTask& a = *new(tbb::task::allocate_root()) SpawnTask(nrTasks, n, pSums);
            tbb::task::spawn_root_and_wait(a);
            tbb::cache_aligned_allocator<char>().deallocate( reinterpret_cast<char*>(pSums), 0);

            tbb::tick_count t1 = tbb::tick_count::now();
            std::cout << std::setprecision(4) << (t1 - t0).seconds() << "\t";
        }
        std::cout << std::endl;
    }
}
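
The same measurement pattern can be written without TBB's tick_count. The sketch below uses std::chrono from C++11 (newer than the toolchains in this thread) and a hypothetical helper, timeSeconds. It repeats a workload and returns the elapsed wall-clock seconds, like each (t1 - t0).seconds() in the test above.

```cpp
#include <cassert>
#include <chrono>

// Serial Fibonacci, as in the first post; used here only as a workload.
long SerialFib(long n) {
    return n < 2 ? n : SerialFib(n - 1) + SerialFib(n - 2);
}

// Runs `work` `repeats` times and returns the elapsed wall-clock seconds.
template <typename Work>
double timeSeconds(int repeats, Work work) {
    const auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < repeats; ++i)
        work();
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}
```

A call such as timeSeconds(nrTasks, [&]{ sink = SerialFib(n); }) would time nrTasks serial runs. Writing the result to a variable the compiler cannot discard (sink here is an assumed volatile) keeps the workload from being optimized away.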



Dmitriy Vyukov
Total Points:
25,382
Status Points:
25,382
Black Belt
November 10, 2009 2:43 AM PST
#14 Reply to #13
Quoting - Dmitriy Vyukov
Run following test:


My results are:
count:  10      50      100
#1:     0.9601  3.988   7.95
#2:     0.7952  3.998   7.968
#3:     0.7949  4.003   7.92
#4:     0.7897  3.953   7.921
#1:     0.8067  3.987   7.979
#2:     0.8078  4.074   8.31
#3:     0.8302  4.206   8.468
#4:     0.8568  4.282   8.651
#1:     0.8695  4.392   8.858
#2:     0.8912  4.458   8.983
#3:     0.9028  4.527   9.135
#4:     0.9097  4.583   9.253

MSVC2008, release build, Q6600, TBB 2.2.
On the first run, all variants take about 7.9 seconds.



Shankar
Total Points:
535
Status Points:
35
Brown Belt
November 10, 2009 5:07 AM PST
#15 Reply to #14
Here are the results:

count:  10      50      100
#1:     0.8022  4.003   8.152
#2:     0.8     4.119   8.009
#3:     0.7991  3.989   7.98
#4:     0.8006  3.986   7.981
#1:     0.7984  3.993   7.986
#2:     0.7992  4.004   8.002
#3:     0.7984  3.994   7.992
#4:     0.8007  4.145   7.996
#1:     0.7987  3.992   7.973
#2:     0.801   4.006   8.006
#3:     0.7987  3.996   7.983
#4:     0.7992  3.985   8

They look just like the results you got. :)

However, I would like you to try running this file, TestScheduler.cpp.
What I have done here is comment out options #2, #3 and #4 in your code. I know this might sound weird, but the results when I run it are as follows:
count:  10        50       100
#1:     0.7038  3.512   7.043
#1:     0.697    3.517   6.991
#1:     0.7063  3.512   7.036

I would like to know whether you get the same results as I do.


Dmitriy Vyukov
Total Points:
25,382
Status Points:
25,382
Black Belt
November 10, 2009 7:36 AM PST
|Best Answer
#16 Reply to #15
Hmm... this is getting interesting.

When I comment out everything except #1, I still get the same result: 8 seconds.
Btw, I switched to MSVC2005 and tried playing with the compiler switches. The results are the same: all variants take roughly 8 seconds.

Hmm... try inspecting the assembly of FibTask::execute() when everything except #1 is commented out and when it is not. Is there any difference? You may also try turning off LTCG (link-time code generation) and moving FibTask::execute() to another cpp file.



Shankar
Total Points:
535
Status Points:
35
Brown Belt
November 10, 2009 10:02 PM PST
Rate
 
#17 Reply to #16
Hi Dmitriy,

I think this suggestion of yours helped in some way. I did the following:

* Moved the FibTask, Spawner and SpawnTask classes to a file named FibTask.h. I kept only the method declarations in the .h file and moved the method implementations to FibTask.cpp.

* Also moved the SerialFib and ParallelFib declarations to FibTask.h and their implementations to FibTask.cpp.

* Now only my main function lives in TestMain.cpp, and its code is unchanged.


Then I switched off the /GL (compiler) and /LTCG (linker) options and ran the two tests (one where only #1 is run and another where all of #1, #2, #3 and #4 are run).
After that I switched /GL and /LTCG back on and ran the two tests again.
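
The reorganization described in the steps above amounts to keeping only declarations in the header and defining the virtual execute() out of line. A single-file sketch of that layout follows. Task is a stand-in base class, not tbb::task, SumTask and runSum are made-up examples, and the comments mark which part would live in which file.

```cpp
#include <cassert>

// ---- would live in FibTask.h: declarations only ----
struct Task {                      // stand-in for a task base class (not tbb::task)
    virtual ~Task() {}
    virtual Task* execute() = 0;
};

struct SumTask : Task {            // hypothetical task, for illustration
    long a, b;
    long* out;
    SumTask(long a_, long b_, long* out_);
    virtual Task* execute();       // declared here, defined in the .cpp part
};

// ---- would live in FibTask.cpp: definitions ----
SumTask::SumTask(long a_, long b_, long* out_) : a(a_), b(b_), out(out_) {}

Task* SumTask::execute() {
    *out = a + b;
    return 0;                      // no follow-up task to run
}

// ---- would live in TestMain.cpp: only uses the declarations ----
long runSum(long a, long b) {
    long result = 0;
    SumTask t(a, b, &result);
    Task* base = &t;               // call through the base, as a scheduler would
    base->execute();
    return result;
}
```

When the definitions sit in a separate .cpp file and link-time code generation is off, the compiler normally cannot inline them into callers in other files. Whether that accounts for the timings in this thread is not established.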


Here are the results now


1. Program compiled by switching off /GL and/LTCG options

option A Running only # 1 ( #2, #3, #4 commented)
count: 10 50 100
#1: 0.7955 3.979 7.961
#1: 0.7957 3.981 7.997
#1: 0.8337 4.331 7.994

option B Running #1, #2, #3, #4
count: 10 50 100
#1: 0.7576 3.771 7.517
#2: 0.75 3.742 7.499
#3: 0.7944 3.835 7.518
#4: 0.7524 3.705 7.483
#1: 0.7438 3.688 7.367
#2: 0.7357 3.664 7.37
#3: 0.7383 3.686 7.404
#4: 0.7541 3.761 7.48
#1: 0.7537 3.724 7.427
#2: 0.7537 3.731 7.454
#3: 0.7507 3.738 7.504
#4: 0.7526 3.72 7.853

2. Program compiled by switching on /GL and/LTCG options

option A Running only # 1 ( #2, #3, #4 commented)
count: 10 50 100
#1: 0.6983 3.493 6.994
#1: 0.6994 3.493 6.981
#1: 0.7032 3.508 7.013

option B Running #1, #2, #3, #4
count: 10 50 100
#1: 0.7027 3.519 6.995
#2: 0.6967 3.494 7
#3: 0.6969 3.491 7
#4: 0.7008 3.489 6.993
#1: 0.6984 3.508 7.001
#2: 0.6977 3.497 6.99
#3: 0.7027 3.492 6.99
#4: 0.6986 3.488 6.989
#1: 0.7008 3.498 7.016
#2: 0.6981 3.49 6.99
#3: 0.7016 3.495 6.985
#4: 0.6992 3.495 7.112


So the part of your suggestion that helped was moving the implementation of FibTask, Spawner and SpawnTask to another .cpp file.

Now all the options (#1, #2, #3 and #4) perform equally (as you expected), and with /GL and /LTCG on each task takes only 0.070 sec (as I want).

At least the problem is solved now. But I am not clear on why moving the implementation to another cpp file solved it.

Is this because of replication of the task's vtable or something similar? I ran into that problem before, when I wrapped the tbb::task class (by deriving from tbb::task) and overrode the virtual method note_affinity with the implementation in the .h file itself. The solution came from the comments above the function void task::note_affinity( affinity_id ) in task.cpp.



Dmitriy Vyukov
Total Points:
25,382
Status Points:
25,382
Black Belt
November 11, 2009 1:08 AM PST
#18 Reply to #17
Quoting - Shankar
Atleast the problem is solved now.


Cool!

Quoting - Shankar
But Im not clear as to why it got solved by moving implementation to another cpp file?

I think the compiler applies some optimization to FibTask::execute() based on some sophisticated condition.




