# An example to show performance data for different implementations of Pi calculating

Here is a simple example showing performance data for different implementations of calculating Pi. Yes, it is an extremely simple example, but it illustrates my previous post well: you can control threads easily with traditional Windows* API programming, OpenMP* code is concise and effective, and the TBB code is organized around templates. At the very least it shows how to use OpenMP* and TBB in your code.

I used the Intel® C/C++ Compiler 11.066, which has already been released.

[Source begin]

#include <windows.h>
#include <stdio.h>
#include <iostream>
#include <time.h>
#include <omp.h>
#include "tbb/task_scheduler_init.h"
#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"
#include "tbb/spin_mutex.h"
#include "tbb/tick_count.h"

const int num_steps = 100000000;
const int num_threads = 4; // My laptop is T61

double step = 0.0, pi = 0.0;

static tbb::spin_mutex myMutex;
static CRITICAL_SECTION cs;

void Serial_Pi()
{
    double x, sum = 0.0;
    int i;

    step = 1.0/(double) num_steps;
    for (i=0; i<num_steps; i++) {
        x = (i+0.5)*step;
        sum = sum + 4.0/(1.0 + x*x);
    }
    pi = step * sum;
}

DWORD WINAPI threadFunction(LPVOID pArg) // thread entry; the signature was lost from the listing and is reconstructed here
{
    double partialSum = 0.0, x;  // local to each thread
    int myNum = *((int *)pArg);

    step = 1.0/(double) num_steps;
    for (int i=myNum; i<num_steps; i+=num_threads)  // stride by num_threads
    {
        x = (i + 0.5)*step;
        partialSum += 4.0 / (1.0 + x*x);  // compute partial sums in each thread
    }
    EnterCriticalSection(&cs);
    pi += partialSum * step;  // add partial sum to the global result
    LeaveCriticalSection(&cs);
    return 0;
}

void WinThread_Pi() // launcher; the name and CreateThread call were lost from the listing and are reconstructed here
{
    HANDLE threadHandles[num_threads];
    int tNum[num_threads];

    InitializeCriticalSection(&cs);
    step = 1.0 / num_steps;
    for ( int i=0; i<num_threads; ++i )
    {
        tNum[i] = i;
        threadHandles[i] = CreateThread(NULL,           // Security attributes
                                        0,              // Stack size
                                        threadFunction, // Thread function
                                        &tNum[i],       // Thread argument
                                        0,              // Run immediately
                                        NULL);          // Thread id
    }
    WaitForMultipleObjects(num_threads, threadHandles, TRUE, INFINITE);
}

void OpenMP_Pi()
{
    double x, sum = 0.0;
    int i;

    step = 1.0 / (double)num_steps;
    #pragma omp parallel for private(x) reduction(+:sum) //schedule(static,4)
    for (i=0; i<num_steps; i++)
    {
        x = (i + 0.5)*step;
        sum = sum + 4.0/(1.0 + x*x);
    }
    pi = sum*step;
}

class ParallelPi {
public:
    void operator() (const tbb::blocked_range<int>& range) const {
        double x, sum = 0.0;
        for (int i = range.begin(); i < range.end(); ++i) {
            x = (i+0.5)*step;
            sum = sum + 4.0/(1.0 + x*x);
        }
        tbb::spin_mutex::scoped_lock lock(myMutex);  // serialize the update of the shared total
        pi += step * sum;
    }
};

void TBB_Pi()
{
    step = 1.0/(double) num_steps;
    tbb::parallel_for(tbb::blocked_range<int>(0, num_steps), ParallelPi(), tbb::auto_partitioner());
}

int main()
{
    clock_t start, stop;
    tbb::task_scheduler_init tbb_init;  // start TBB's worker threads

    // Computing pi by using serial code
    pi = 0.0;
    start = clock();
    Serial_Pi();
    stop = clock();
    printf ("Computed value of Pi by using serial code: %12.9f\n", pi);
    printf ("Elapsed time: %.2f seconds\n", (double)(stop-start)/CLOCKS_PER_SEC);

    // Computing pi by using Windows Threads
    pi = 0.0;
    start = clock();
    WinThread_Pi();  // call reconstructed; it was lost from the listing
    stop = clock();
    printf ("Computed value of Pi by using WinThreads: %12.9f\n", pi);
    printf ("Elapsed time: %.2f seconds\n", (double)(stop-start)/CLOCKS_PER_SEC);

    // Computing pi by using OpenMP
    pi = 0.0;
    start = clock();
    OpenMP_Pi();
    stop = clock();
    printf ("Computed value of Pi by using OpenMP: %12.9f\n", pi);
    printf ("Elapsed time: %.2f seconds\n", (double)(stop-start)/CLOCKS_PER_SEC);

    // Computing pi by using TBB
    pi = 0.0;
    start = clock();
    TBB_Pi();
    stop = clock();
    printf ("Computed value of Pi by using TBB: %12.9f\n", pi);
    printf ("Elapsed time: %.2f seconds\n", (double)(stop-start)/CLOCKS_PER_SEC);

    return 0;
}

[End of source]

Here are the results:

Computed value of Pi by using serial code:  3.141592654
Elapsed time: 0.78 seconds
Computed value of Pi by using WinThreads:  3.141592654
Elapsed time: 0.55 seconds
Computed value of Pi by using OpenMP:  3.141592654
Elapsed time: 0.42 seconds
Computed value of Pi by using TBB:  3.141592654
Elapsed time: 0.44 seconds

That is why I recommend using OpenMP* or TBB instead of WinThreads.


My results on core2duo E6750, 4Gb, FSB1066.

const int num_steps = 0x7FFFFFF0;

VC8 compiler:

Computed value of Pi by using serial code: 3.141592654
Elapsed time: 26.06 seconds
Computed value of Pi by using WinThreads: 3.141592662
Elapsed time: 13.19 seconds
Computed value of Pi by using OpenMP: 3.141592654
Elapsed time: 13.25 seconds
Computed value of Pi by using TBB: 3.141592654
Elapsed time: 12.61 seconds

and VC8 with OpenMP schedule(static,4) (it was helpful for VC8 OpenMP):
Computed value of Pi by using OpenMP: 3.141592654
Elapsed time: 12.55 seconds

Intel C++ 11 (has great unrolling techniques - that's why we have smaller elapsed times):

Computed value of Pi by using serial code: 3.141592654
Elapsed time: 12.91 seconds
Computed value of Pi by using WinThreads: 3.141592663
Elapsed time: 6.42 seconds
Computed value of Pi by using OpenMP: 3.141592654
Elapsed time: 6.53 seconds
Computed value of Pi by using TBB: 3.141592654
Elapsed time: 6.31 seconds

with OpenMP schedule(static,4) (the OpenMP results confused me; here the situation is inverted):
Computed value of Pi by using OpenMP: 3.141592654
Elapsed time: 12.53 seconds

But I found some differences in the WinThread code:
1. x = (i + 0.5f) / num_steps; -- you used / num_steps instead of *step
2. you used the f suffix -- that means the float data type, not double, and it needs extra time for conversion

and I got the following results:

Computed value of Pi by using serial code: 3.141592654
Elapsed time: 12.91 seconds
Computed value of Pi by using WinThreads: 3.141592654
Elapsed time: 6.26 seconds
Computed value of Pi by using OpenMP: 3.141592654
Elapsed time: 6.63 seconds
Computed value of Pi by using TBB: 3.141592654
Elapsed time: 6.31 seconds

{
    double partialSum = 0.0, x; // local to each thread
    int myNum = *((int *)pArg);

    step = 1.0 / (double)num_steps;

    for (int i=myNum; i<num_steps; i+=num_threads)
    ...

To provide a fair assessment you would:

wait for other threads to complete

Then rerun the test with the order of the tests reversed.

Jim Dempsey

I tried a longer run:
Computed value of Pi by using serial code: 3.141592654
Elapsed time: 7.94 seconds
Computed value of Pi by using WinThreads: 3.141592654
Elapsed time: 4.50 seconds
Computed value of Pi by using OpenMP: 3.141592654
Elapsed time: 4.16 seconds
Computed value of Pi by using TBB: 3.141592654
Elapsed time: 4.17 seconds

The results did not change (or changed only minimally) when I moved tbb::task_scheduler_init tbb_init; inside the clock's start-to-stop interval.

Using "const int num_steps = 2147483647" causes an incorrect result, but "2147483600" is fine. Was that caused by the integer boundary?

TBB is excellent, since it does optimizations for best cache utilization, best thread scheduling, etc.

I don't know OpenMP in more detail, but it relies on the C++ compiler, which also does C++ language-level optimization; beyond that I don't know :-(

WinThreads relies on the OS scheduler, with no C++ language-level optimization on top.

Clearly Intel are biased when making such comparisons, as the vendors of TBB. How about doing the decent thing and explaining why your WinThreads implementation is slower? Since the algorithms are all conceptually the same, the difference is presumably some startup overhead that is included in the WinThreads timing but not in the others. Also, 0.44 s seems quite a short run time. How about running for 60 seconds? And what machine are you on?

Basically this is not very scientific in my view.

Peter Wang (Intel),
and, if possible, one more result with const int num_steps = 2147483647; //0x7FFFFFFF

Thanks to ilnarb.

Yes. I have to insert "tbb::task_scheduler_init tbb_init;" into the measured elapsed time.

Regards, Peter

http://www.ShawnDrewry.com

to the previous post as well. Thank you so much and have a great new year Intel :-)

Shawn

I think I might have to take a few free non-degree courses in order to learn that kind of markup language, because that doesn't look like standard HTML

I think there is a difference between Win threads and OpenMP/TBB: you create threads with CreateThread and include that time in the elapsed time. OpenMP probably pre-creates its threads at program start, and TBB creates them at the line tbb::task_scheduler_init tbb_init;