An example showing performance data for different implementations of calculating Pi

Last time I posted the topic "Compare Windows* threads, OpenMP*, Intel® Threading Building Blocks for parallel programming" (/en-us/blogs/2008/12/16/compare-windows-threads-openmp-intel-threading-building-blocks-for-parallel-programming) and listed their advantages and disadvantages.

 

Here is a simple example showing performance data for different implementations of calculating Pi. Yes, this is an extremely simple example, but it illustrates my previous post well: you can control threads precisely with traditional Windows* API programming, OpenMP* code is concise and effective, and TBB code is optimized through templates. At the least, it shows how to use OpenMP* and TBB in your code.
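For reference, every version below approximates the same midpoint-rule sum; in LaTeX notation:

\[
\pi = \int_0^1 \frac{4}{1+x^2}\,dx \;\approx\; \sum_{i=0}^{N-1} \frac{4}{1+x_i^2}\,\Delta x,
\qquad x_i = \left(i + \tfrac{1}{2}\right)\Delta x,\quad \Delta x = \frac{1}{N},
\]

where N is num_steps. The parallel versions differ only in how the N terms of the sum are divided among threads.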

 

I used the Intel® C/C++ Compiler 11.066, which has already been released.

 

[Source begin]

#include <windows.h>
#include <stdio.h>
#include <iostream>
#include <time.h>
#include <omp.h>

#include "tbb/task_scheduler_init.h"
#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"
#include "tbb/spin_mutex.h"
#include "tbb/tick_count.h"

const int num_steps = 100000000;
const int num_threads = 4; // My laptop is a T61

double step = 0.0, pi = 0.0;

static tbb::spin_mutex myMutex;
static CRITICAL_SECTION cs;

void Serial_Pi()
{
   double x, sum = 0.0;
   int i;

   step = 1.0/(double) num_steps;
   for (i=0; i<num_steps; i++){
      x = (i+0.5)*step;
      sum = sum + 4.0/(1.0 + x*x);
   }
   pi = step * sum;
}

DWORD WINAPI threadFunction(LPVOID pArg)
{
        double partialSum = 0.0, x;  // local to each thread
        int myNum = *((int *)pArg);

        step = 1.0/(double) num_steps;
        for ( int i=myNum; i<num_steps; i+=num_threads )  // each thread takes every num_threads-th step
        {
                x = (i + 0.5)*step;
                partialSum += 4.0 / (1.0 + x*x);  // compute a partial sum in each thread
        }

        EnterCriticalSection(&cs);
          pi += partialSum * step;  // add the partial sum to the global result
        LeaveCriticalSection(&cs);

        return 0;
}

void WinThread_Pi()
{
        HANDLE threadHandles[num_threads];
        int tNum[num_threads];

        InitializeCriticalSection(&cs);
        step = 1.0 / num_steps;
        for ( int i=0; i<num_threads; ++i )
        {
                tNum[i] = i;
                threadHandles[i] = CreateThread( NULL,            // Security attributes
                                                 0,               // Stack size
                                                 threadFunction,  // Thread function
                                                 (LPVOID)&tNum[i],// Data for thread func()
                                                 0,               // Thread start mode
                                                 NULL);           // Returned thread ID
        }
        WaitForMultipleObjects(num_threads, threadHandles, TRUE, INFINITE);

        for ( int i=0; i<num_threads; ++i )  // release the kernel objects
                CloseHandle(threadHandles[i]);
        DeleteCriticalSection(&cs);
}

void OpenMP_Pi()
{
        double x, sum = 0.0;
        int i;

        step = 1.0 / (double)num_steps;

        omp_set_num_threads(num_threads);
#pragma omp parallel for private(x) reduction(+:sum) //schedule(static,4)
        for (i=0; i<num_steps; i++)
        {
                x = (i + 0.5)*step;
                sum = sum + 4.0/(1.0 + x*x);
        }

        pi = sum*step;
}

class ParallelPi {

public:

        void operator() (const tbb::blocked_range<int>& range) const {
                double x, sum = 0.0;
                for (int i = range.begin(); i < range.end(); ++i) {
                        x = (i+0.5)*step;
                        sum = sum + 4.0/(1.0 + x*x);
                }
                tbb::spin_mutex::scoped_lock lock(myMutex);  // serialize updates of the global result
                pi += step * sum;
        }
};

void TBB_Pi ()
{
        step = 1.0/(double) num_steps;
        tbb::parallel_for (tbb::blocked_range<int> (0, num_steps), ParallelPi(), tbb::auto_partitioner());
}

int main()
{
        clock_t start, stop;

        // Computing pi by using serial code
        pi = 0.0;
        start = clock();
        Serial_Pi();
        stop = clock();
        printf ("Computed value of Pi by using serial code: %12.9f\n", pi);
        printf ("Elapsed time: %.2f seconds\n", (double)(stop-start)/CLOCKS_PER_SEC);

        // Computing pi by using Windows threads
        pi = 0.0;
        start = clock();
        WinThread_Pi();
        stop = clock();
        printf ("Computed value of Pi by using WinThreads: %12.9f\n", pi);
        printf ("Elapsed time: %.2f seconds\n", (double)(stop-start)/CLOCKS_PER_SEC);

        // Computing pi by using OpenMP
        pi = 0.0;
        start = clock();
        OpenMP_Pi();
        stop = clock();
        printf ("Computed value of Pi by using OpenMP: %12.9f\n", pi);
        printf ("Elapsed time: %.2f seconds\n", (double)(stop-start)/CLOCKS_PER_SEC);

        // Computing pi by using TBB
        tbb::task_scheduler_init tbb_init;

        pi = 0.0;
        start = clock();
        TBB_Pi();
        stop = clock();
        printf ("Computed value of Pi by using TBB: %12.9f\n", pi);
        printf ("Elapsed time: %.2f seconds\n", (double)(stop-start)/CLOCKS_PER_SEC);

        return 0;
}

[End of source]
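A side note on the timing: the listing includes tbb/tick_count.h but never uses it, and clock() has limited resolution (on non-Windows C runtimes it also measures CPU time rather than wall time). Here is a small sketch using TBB's own wall-clock timer instead; Timed() is a hypothetical helper name of mine, not part of the original code:

#include "tbb/tick_count.h"

// Times a call such as Serial_Pi or TBB_Pi with tbb::tick_count, which
// measures wall-clock time, and returns the elapsed seconds.
template <typename F>
double Timed(F f)
{
    tbb::tick_count t0 = tbb::tick_count::now();
    f();
    tbb::tick_count t1 = tbb::tick_count::now();
    return (t1 - t0).seconds();
}

// Usage: printf("Elapsed time: %.2f seconds\n", Timed(Serial_Pi));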

 

Here are the results:

Computed value of Pi by using serial code:  3.141592654
Elapsed time: 0.78 seconds
Computed value of Pi by using WinThreads:  3.141592654
Elapsed time: 0.55 seconds
Computed value of Pi by using OpenMP:  3.141592654
Elapsed time: 0.42 seconds
Computed value of Pi by using TBB:  3.141592654
Elapsed time: 0.44 seconds

 

That is why I recommend using OpenMP* or TBB instead of WinThreads.


20 comments

Ilnar wrote:

My results on a Core 2 Duo E6750, 4 GB RAM, FSB 1066.

const int num_steps = 0x7FFFFFF0;

VC8 compiler:

Computed value of Pi by using serial code: 3.141592654
Elapsed time: 26.06 seconds
Computed value of Pi by using WinThreads: 3.141592662
Elapsed time: 13.19 seconds
Computed value of Pi by using OpenMP: 3.141592654
Elapsed time: 13.25 seconds
Computed value of Pi by using TBB: 3.141592654
Elapsed time: 12.61 seconds

and VC8 with OpenMP schedule(static,4) (it was helpful for VC8 OpenMP):
Computed value of Pi by using OpenMP: 3.141592654
Elapsed time: 12.55 seconds

Intel C++ 11 (it has great unrolling techniques - that's why the elapsed times are smaller):

Computed value of Pi by using serial code: 3.141592654
Elapsed time: 12.91 seconds
Computed value of Pi by using WinThreads: 3.141592663
Elapsed time: 6.42 seconds
Computed value of Pi by using OpenMP: 3.141592654
Elapsed time: 6.53 seconds
Computed value of Pi by using TBB: 3.141592654
Elapsed time: 6.31 seconds

with OpenMP schedule(static,4) (the OpenMP results confused me - here the situation is inverted):
Computed value of Pi by using OpenMP: 3.141592654
Elapsed time: 12.53 seconds

BUT, I found some differences in the WinThread code:
1. x = (i + 0.5f) / num_steps; -- you used / num_steps instead of * step
2. you used the f suffix -- that means the float data type, not double, and it needs extra time for conversion

and I got the following results:

Computed value of Pi by using serial code: 3.141592654
Elapsed time: 12.91 seconds
Computed value of Pi by using WinThreads: 3.141592654
Elapsed time: 6.26 seconds
Computed value of Pi by using OpenMP: 3.141592654
Elapsed time: 6.63 seconds
Computed value of Pi by using TBB: 3.141592654
Elapsed time: 6.31 seconds

DWORD WINAPI threadFunction(LPVOID pArg)
{
        double partialSum = 0.0, x; // local to each thread
        int myNum = *((int *)pArg);

        step = 1.0 / (double)num_steps;

        for (int i=myNum; i<num_steps; i+=num_threads)
        {
                x = (i + 0.5)*step;
                partialSum += 4.0 / (1.0 + x*x);
        }

        EnterCriticalSection(&cs);
        pi += partialSum * step;
        LeaveCriticalSection(&cs);

        return 0;
}

jimdempseyatthecove wrote:

Also, your WinThread_Pi example is using 5 threads not 4 as the others.

To provide a fair assessment you would:

spawn 3 threads
call threadFunction directly with the address of a variable containing 3
wait for the other threads to complete

Then rerun the test with the order of the tests reversed.

Jim Dempsey
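Here is a minimal sketch of the rework Jim describes, so that exactly num_threads threads do the work: spawn num_threads-1 workers and let the main thread compute the last share itself. WinThread_Pi_Fair is a hypothetical name of mine; the globals are those from the article.

void WinThread_Pi_Fair()
{
        HANDLE threadHandles[num_threads - 1];
        int tNum[num_threads];

        InitializeCriticalSection(&cs);
        step = 1.0 / num_steps;
        for ( int i=0; i<num_threads-1; ++i )  // spawn only num_threads-1 workers
        {
                tNum[i] = i;
                threadHandles[i] = CreateThread(NULL, 0, threadFunction,
                                                (LPVOID)&tNum[i], 0, NULL);
        }
        tNum[num_threads-1] = num_threads-1;
        threadFunction((LPVOID)&tNum[num_threads-1]);  // main thread computes its own share
        WaitForMultipleObjects(num_threads-1, threadHandles, TRUE, INFINITE);
        for ( int i=0; i<num_threads-1; ++i )
                CloseHandle(threadHandles[i]);
        DeleteCriticalSection(&cs);
}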

Peter Wang (Intel) wrote:

I tried a longer run -
Computed value of Pi by using serial code: 3.141592654
Elapsed time: 7.94 seconds
Computed value of Pi by using WinThreads: 3.141592654
Elapsed time: 4.50 seconds
Computed value of Pi by using OpenMP: 3.141592654
Elapsed time: 4.16 seconds
Computed value of Pi by using TBB: 3.141592654
Elapsed time: 4.17 seconds

Peter Wang (Intel) wrote:

The result did not change (or only minimally) when I moved tbb::task_scheduler_init tbb_init; inside the clock() start-to-stop interval.

Use "const int num_steps = 2147483647" will cause incorrect result, but if you use "2147483600" - that is ok. That was caused by integer boundary?

TBB is excellent, since it optimizes for the best cache utilization, the best thread scheduling, etc.

I don't know OpenMP in more detail, but it relies on the C++ compiler - which also does C++ language-level optimization, but I don't know :-(

WinThreads relies on the OS scheduler, with no C++ language-level optimization on top.
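A likely answer to the integer-boundary question above (my analysis, not from the original thread): 2147483647 is INT_MAX, and the Windows-threads loop advances i by num_threads instead of by 1, so it can land within num_threads of INT_MAX and then overflow a signed int, which is undefined behavior; the stride-1 loops stop exactly at the boundary. A small sketch of the arithmetic:

#include <limits.h>
#include <stdio.h>

// With num_steps = INT_MAX, the thread that starts at myNum = 0 reaches
// i = INT_MAX - 3 = 2147483644; its next i += 4 overflows a signed int
// (undefined behavior -- in practice i often wraps negative and the loop
// keeps running, corrupting the sum). With num_steps = 2147483600 every
// thread exits before the boundary, which matches the observation above.
int main()
{
    const int num_threads = 4;
    int i = INT_MAX - 3;  // last value the myNum = 0 thread sees
    printf("i = %d; i + %d would exceed INT_MAX = %d\n", i, num_threads, INT_MAX);
    return 0;
}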

anonymous wrote:

Clearly Intel is biased when making such comparisons, as the vendor of TBB. How about doing the decent thing and explaining why your WinThreads implementation is slower? Since the algorithms are all conceptually the same, the difference is presumably some startup overhead that is included in the WinThreads timing but not in the others. Also, 0.44 s seems quite a short run time. How about running for 60 seconds? And what machine are you on?

Basically this is not very scientific in my view.

Ilnar wrote:

Peter Wang (Intel),
could you please post your updated results in the comments?
And, if possible, one more result with const int num_steps = 2147483647; //0x7FFFFFFF

Peter Wang (Intel) wrote:

Thanks to ilnarb.

Yes. I have to include "tbb::task_scheduler_init tbb_init;" in the measured elapsed time.

Regards, Peter

anonymous wrote:

I didn't see my name linked to my blog here, so if you can, please add

http://www.ShawnDrewry.com

to the previous post as well. Thank you so much and have a great new year Intel :-)

Shawn

anonymous wrote:

I think I might have to take a few free non-degree courses in order to learn that kind of markup language, because that doesn't look like standard HTML

Ilnar wrote:

I think there is a difference between Win threads and OpenMP/TBB: you create the threads with CreateThread and include that time in the elapsed time. Perhaps OpenMP pre-creates its threads at program start, and TBB creates them at the line tbb::task_scheduler_init tbb_init;
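One way to check this hypothesis (a sketch of mine, not from the thread) is to force both runtimes to create their thread pools before starting the clock, so the timed region covers only the computation:

#include <stdio.h>
#include <time.h>
#include <omp.h>
#include "tbb/task_scheduler_init.h"

int main()
{
    // Warm up OpenMP: an empty parallel region makes the runtime create
    // its thread pool here rather than inside the timed region.
    #pragma omp parallel
    { }

    // Warm up TBB: the worker threads are created when the scheduler is
    // initialized, not when parallel_for first runs.
    tbb::task_scheduler_init tbb_init;

    clock_t start = clock();
    // ... call OpenMP_Pi() or TBB_Pi() from the article here ...
    clock_t stop = clock();
    printf("Elapsed time: %.2f seconds\n", (double)(stop - start) / CLOCKS_PER_SEC);
    return 0;
}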
