Submitted by Peter Wang (Intel) on

Last time I posted the topic "Compare Windows* threads, OpenMP*, Intel® Threading Building Blocks for parallel programming" (/en-us/blogs/2008/12/16/compare-windows-threads-openmp-intel-threading-building-blocks-for-parallel-programming), which listed their advantages and disadvantages.

Here is a simple example showing performance data for different implementations of calculating Pi. Yes, it is an extremely simple example, but it illustrates my previous post well: you can control threads precisely with the traditional Windows* API, OpenMP* code is concise and effective, and TBB code is expressed through optimized templates. At the very least it shows how to use OpenMP* and TBB in your own code.
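The math behind the example is only implicit in the code, so for reference: all four versions approximate the same midpoint-rule integral,

```latex
\[
\pi \;=\; \int_0^1 \frac{4}{1+x^2}\,dx
\;\approx\; \sum_{i=0}^{n-1} \frac{4}{1+x_i^2}\cdot\frac{1}{n},
\qquad x_i = \frac{i+0.5}{n},
\]
```

with n = num_steps; the four versions differ only in how the sum over i is split across threads.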

I used Intel® C/C++ Compiler 11.066, which has already been released.

[Source begin]

#include <windows.h>
#include <stdio.h>
#include <iostream>
#include <time.h>
#include <omp.h>
#include "tbb/task_scheduler_init.h"
#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"
#include "tbb/spin_mutex.h"
#include "tbb/tick_count.h"

const int num_steps = 100000000;
const int num_threads = 4; // my laptop is a T61

double step = 0.0, pi = 0.0;

static tbb::spin_mutex myMutex;
static CRITICAL_SECTION cs;

// Serial baseline: midpoint-rule integration of 4/(1+x^2) over [0,1]
void Serial_Pi()
{
    double x, sum = 0.0;
    int i;

    step = 1.0 / (double)num_steps;
    for (i = 0; i < num_steps; i++) {
        x = (i + 0.5) * step;
        sum = sum + 4.0 / (1.0 + x * x);
    }
    pi = step * sum;
}

DWORD WINAPI threadFunction(LPVOID pArg)
{
    double partialSum = 0.0, x; // local to each thread
    int myNum = *((int *)pArg);

    step = 1.0 / (double)num_steps;
    for (int i = myNum; i < num_steps; i += num_threads) // stride by num_threads
    {
        x = (i + 0.5) * step;
        partialSum += 4.0 / (1.0 + x * x); // compute a partial sum in each thread
    }
    EnterCriticalSection(&cs);
    pi += partialSum * step; // add the partial sum to the global final answer
    LeaveCriticalSection(&cs);
    return 0;
}

void WinThread_Pi()
{
    HANDLE threadHandles[num_threads];
    int tNum[num_threads];

    InitializeCriticalSection(&cs);
    step = 1.0 / num_steps;
    for (int i = 0; i < num_threads; ++i)
    {
        tNum[i] = i;
        threadHandles[i] = CreateThread(NULL,             // Security attributes
                                        0,                // Stack size
                                        threadFunction,   // Thread function
                                        (LPVOID)&tNum[i], // Data for thread func()
                                        0,                // Thread start mode
                                        NULL);            // Returned thread ID
    }
    WaitForMultipleObjects(num_threads, threadHandles, TRUE, INFINITE);
    for (int i = 0; i < num_threads; ++i) // release the thread handles
        CloseHandle(threadHandles[i]);
    DeleteCriticalSection(&cs);
}

void OpenMP_Pi()
{
    double x, sum = 0.0;
    int i;

    step = 1.0 / (double)num_steps;
    omp_set_num_threads(4);
#pragma omp parallel for private(x) reduction(+:sum) //schedule(static,4)
    for (i = 0; i < num_steps; i++)
    {
        x = (i + 0.5) * step;
        sum = sum + 4.0 / (1.0 + x * x);
    }
    pi = sum * step;
}

class ParallelPi {
public:
    void operator() (const tbb::blocked_range<int>& range) const {
        double x, sum = 0.0;
        for (int i = range.begin(); i < range.end(); ++i) {
            x = (i + 0.5) * step;
            sum = sum + 4.0 / (1.0 + x * x);
        }
        tbb::spin_mutex::scoped_lock lock(myMutex); // serialize the final accumulation
        pi += step * sum;
    }
};

void TBB_Pi()
{
    step = 1.0 / (double)num_steps;
    tbb::parallel_for(tbb::blocked_range<int>(0, num_steps), ParallelPi(), tbb::auto_partitioner());
}

int main()
{
    clock_t start, stop;

    // Computing pi by using serial code
    pi = 0.0;
    start = clock();
    Serial_Pi();
    stop = clock();
    printf("Computed value of Pi by using serial code: %12.9f\n", pi);
    printf("Elapsed time: %.2f seconds\n", (double)(stop - start) / CLOCKS_PER_SEC);

    // Computing pi by using Windows threads
    pi = 0.0;
    start = clock();
    WinThread_Pi();
    stop = clock();
    printf("Computed value of Pi by using WinThreads: %12.9f\n", pi);
    printf("Elapsed time: %.2f seconds\n", (double)(stop - start) / CLOCKS_PER_SEC);

    // Computing pi by using OpenMP
    pi = 0.0;
    start = clock();
    OpenMP_Pi();
    stop = clock();
    printf("Computed value of Pi by using OpenMP: %12.9f\n", pi);
    printf("Elapsed time: %.2f seconds\n", (double)(stop - start) / CLOCKS_PER_SEC);

    // Computing pi by using TBB
    tbb::task_scheduler_init tbb_init;
    pi = 0.0;
    start = clock();
    TBB_Pi();
    stop = clock();
    printf("Computed value of Pi by using TBB: %12.9f\n", pi);
    printf("Elapsed time: %.2f seconds\n", (double)(stop - start) / CLOCKS_PER_SEC);

    return 0;
}

[End of source]

Here are the results:

Computed value of Pi by using serial code: 3.141592654

Elapsed time: 0.78 seconds

Computed value of Pi by using WinThreads: 3.141592654

Elapsed time: 0.55 seconds

Computed value of Pi by using OpenMP: 3.141592654

Elapsed time: 0.42 seconds

Computed value of Pi by using TBB: 3.141592654

Elapsed time: 0.44 seconds

That is why I recommend using OpenMP* or TBB instead of Windows threads.

## Comments (20)

Ilnar said on

My results on a Core 2 Duo E6750, 4 GB RAM, FSB 1066.

const int num_steps = 0x7FFFFFF0;

VC8 compiler:

Computed value of Pi by using serial code: 3.141592654

Elapsed time: 26.06 seconds

Computed value of Pi by using WinThreads: 3.141592662

Elapsed time: 13.19 seconds

Computed value of Pi by using OpenMP: 3.141592654

Elapsed time: 13.25 seconds

Computed value of Pi by using TBB: 3.141592654

Elapsed time: 12.61 seconds

and VC8 with OpenMP schedule(static, 4) (it was helpful for the VC8 OpenMP version):

Computed value of Pi by using OpenMP: 3.141592654

Elapsed time: 12.55 seconds

Intel C++ 11 (it has great unrolling techniques, which is why the elapsed times are smaller):

Computed value of Pi by using serial code: 3.141592654

Elapsed time: 12.91 seconds

Computed value of Pi by using WinThreads: 3.141592663

Elapsed time: 6.42 seconds

Computed value of Pi by using OpenMP: 3.141592654

Elapsed time: 6.53 seconds

Computed value of Pi by using TBB: 3.141592654

Elapsed time: 6.31 seconds

with OpenMP schedule(static, 4) (the OpenMP results confused me; here the situation is inverted):

Computed value of Pi by using OpenMP: 3.141592654

Elapsed time: 12.53 seconds

BUT, I found some differences in the WinThread code:

1. x = (i + 0.5f) / num_steps; -- you used / num_steps instead of * step

2. you used the f suffix -- that means the float data type, not double, and it needs extra time for conversion

and I got the following results:

Computed value of Pi by using serial code: 3.141592654

Elapsed time: 12.91 seconds

Computed value of Pi by using WinThreads: 3.141592654

Elapsed time: 6.26 seconds

Computed value of Pi by using OpenMP: 3.141592654

Elapsed time: 6.63 seconds

Computed value of Pi by using TBB: 3.141592654

Elapsed time: 6.31 seconds
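Ilnar's second point can be demonstrated in isolation (a sketch of mine, not code from the thread): with the f suffix the midpoint offset is computed in float, whose 24-bit mantissa cannot hold a large index plus 0.5, so the +0.5 silently vanishes. This is consistent with the WinThreads value drifting to 3.141592662/3.141592663 above.

```cpp
// For i near 100,000,000 a float cannot represent both the index and
// the 0.5 offset, so (i + 0.5f) rounds back to i and the midpoint
// shift is lost; double keeps it exactly.
float midpoint_float(int i)   { return i + 0.5f; } // float: offset lost for large i
double midpoint_double(int i) { return i + 0.5;  } // double: offset preserved
```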

DWORD WINAPI threadFunction(LPVOID pArg)
{
    double partialSum = 0.0, x; // local to each thread
    int myNum = *((int *)pArg);

    step = 1.0 / (double)num_steps;
    for (int i = myNum; i < num_steps; i += num_threads)
    {
        x = (i + 0.5) * step;              // double math, * step
        partialSum += 4.0 / (1.0 + x * x);
    }
    EnterCriticalSection(&cs);
    pi += partialSum * step;
    LeaveCriticalSection(&cs);
    return 0;
}

jimdempseyatthecove said on

Also, your WinThread_Pi example is using 5 threads, not 4 as the others.

To provide a fair assessment you would:

spawn 3 threads

call threadFunction directly with the address of a variable containing 3

wait for the other threads to complete

Then rerun the test with the order of the tests reversed.

Jim Dempsey
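Jim's suggestion can be sketched portably (std::thread here for brevity, where the post uses the Win32 API; the names compute_pi, partial, and the smaller kSteps are mine): the calling thread takes one share of the work itself, so only num_threads - 1 extra threads are spawned.

```cpp
#include <cassert>
#include <cmath>
#include <mutex>
#include <thread>
#include <vector>

constexpr int kSteps = 1000000; // smaller than the post's 100000000
constexpr int kThreads = 4;

static double g_pi = 0.0;
static std::mutex g_mutex;

// Same strided partial-sum kernel as threadFunction in the post.
void partial(int myNum) {
    const double step = 1.0 / kSteps;
    double sum = 0.0;
    for (long long i = myNum; i < kSteps; i += kThreads) {
        const double x = (i + 0.5) * step;
        sum += 4.0 / (1.0 + x * x);
    }
    std::lock_guard<std::mutex> lock(g_mutex);
    g_pi += sum * step;
}

double compute_pi() {
    g_pi = 0.0;
    std::vector<std::thread> workers;
    for (int t = 0; t < kThreads - 1; ++t) // spawn only N-1 workers...
        workers.emplace_back(partial, t);
    partial(kThreads - 1);                 // ...the main thread takes the last share
    for (auto& w : workers) w.join();
    return g_pi;
}
```

This way exactly four threads do the work, matching the OpenMP and TBB runs.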

Peter Wang (Intel) said on

I tried running for a longer time:

Computed value of Pi by using serial code: 3.141592654

Elapsed time: 7.94 seconds

Computed value of Pi by using WinThreads: 3.141592654

Elapsed time: 4.50 seconds

Computed value of Pi by using OpenMP: 3.141592654

Elapsed time: 4.16 seconds

Computed value of Pi by using TBB: 3.141592654

Elapsed time: 4.17 seconds

Peter Wang (Intel) said on

The results did not change (or changed only minimally) when moving tbb::task_scheduler_init tbb_init; inside the clock's start-to-stop interval.

Using "const int num_steps = 2147483647" causes an incorrect result, but "2147483600" is OK. Was that caused by the integer boundary?

TBB is excellent, since it optimizes for the best cache utilization, the best thread scheduling, etc.

I don't know OpenMP in more detail, but it relies on the C++ compiler, which also performs C++ language-level optimization; I don't know the specifics :-(

WinThreads relies on the OS scheduler, with no further C++ language-level optimization.
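A likely explanation for the 2147483647 failure (my reading, not stated in the thread): in the WinThreads loop for (int i = myNum; i < num_steps; i += num_threads), the final i += 4 of some threads needs a value above INT_MAX, which overflows a signed int (undefined behavior). With 2147483600, every thread's final increment stays at or below INT_MAX. Computing the post-loop counter value in 64-bit arithmetic shows this:

```cpp
#include <climits>

// Value i would hold after the loop's final "i += num_threads",
// computed in 64-bit so the check itself cannot overflow.
long long next_after_last(long long num_steps, int my_num, int num_threads) {
    long long iters = (num_steps - my_num + num_threads - 1) / num_threads;
    return my_num + iters * num_threads;
}
```

For num_steps = 2147483647, thread 0's counter would need 2147483648 (greater than INT_MAX); for num_steps = 2147483600 all four threads stay within range, matching the observed behavior.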

Anonymous said on

Clearly Intel is biased when making such comparisons, as the vendor of TBB. How about doing the decent thing and explaining why your WinThreads implementation is slower? Since the algorithms are all conceptually the same, the difference is presumably some startup overhead that is included in the WinThreads timing but not the others. Also, 0.44 s seems quite a short run time. How about running for 60 seconds? And what machine are you on?

Basically this is not very scientific in my view.

Ilnar said on

Peter Wang (Intel),

could you please get in comments updated results?

and, if possible, one more set of results with const int num_steps = 2147483647; // 0x7FFFFFFF

Peter Wang (Intel) said on

Thanks to ilnarb.

Yes. I have to include "tbb::task_scheduler_init tbb_init;" in the measured elapsed time.

Regards, Peter

Anonymous said on

I didn't see my name linked to my blog here, so if you can, please add

http://www.ShawnDrewry.com

to the previous post as well. Thank you so much and have a great new year Intel :-)

Shawn

Anonymous said on

I think I might have to take a few free non-degree courses in order to learn that kind of markup language, because that doesn't look like standard HTML

Ilnar said on

I think there is a difference between Win threads and OpenMP/TBB: you create threads using CreateThread and include that time in the elapsed time. Perhaps OpenMP pre-creates threads at program start, and TBB creates them at the line tbb::task_scheduler_init tbb_init;.
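Ilnar's point can be checked with a small portable sketch (std::thread rather than the post's CreateThread; the function name is mine): time only the creation and joining of no-op threads to see the startup cost that the WinThreads measurement includes but the OpenMP/TBB measurements may not.

```cpp
#include <chrono>
#include <thread>
#include <vector>

// Milliseconds spent purely on creating and joining n no-op threads.
double thread_startup_ms(int n) {
    const auto t0 = std::chrono::steady_clock::now();
    std::vector<std::thread> workers;
    for (int i = 0; i < n; ++i)
        workers.emplace_back([] {}); // the threads do no work at all
    for (auto& w : workers) w.join();
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

Subtracting this overhead from the WinThreads timing would make the comparison with the pre-created thread pools fairer.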
