For-loop performance, what's wrong?

tbbphreak:

Hello,

I'm new to TBB and just started experimenting with it using the tutorials. My first attempt is to test the performance of a simple loop over a big array of floats, once with TBB and once without. Comparing the time required by each technique, the result was surprising. Check for yourself, and correct me if I'm doing something wrong:

#include "tbb/task_scheduler_init.h"
#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"
#include "tbb/tick_count.h"
#include <stdio.h>

using namespace tbb;

#define BIGARRSIZE 100000

float big_arr[BIGARRSIZE];

void Foo(float *a)
{
(*a)++;
}

class ApplyFoo {
public:
void operator()(const blocked_range<size_t>& r) const {
for(size_t i = r.begin(); i != r.end(); i++) {
Foo(&big_arr[i]);
}
}
};

int main()
{
tick_count t0, t1;
int nthreads = 2;

task_scheduler_init init(task_scheduler_init::deferred);
if (nthreads >= 1)
init.initialize(nthreads);

t0 = tick_count::now();
parallel_for(blocked_range<size_t>(0, BIGARRSIZE), ApplyFoo(), auto_partitioner());
t1 = tick_count::now();
printf("\n*** work took %g seconds ***", (t1 - t0).seconds());

if (nthreads >= 1)
init.terminate();

t0 = tick_count::now();
for (int i = 0; i < BIGARRSIZE; i++)
Foo(&big_arr[i]);
t1 = tick_count::now();
printf("\n*** work took %f seconds ***", (t1 - t0).seconds());

printf("\n");
}

Thanks.

tbbphreak:

An update to this piece of code:

#include <stdio.h>
#include <windows.h>
#include <omp.h>
#include "tbb/task_scheduler_init.h"
#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"
#include "tbb/tick_count.h"

using namespace tbb;

#define BIGARRSIZE 10000000

float big_arr[BIGARRSIZE];

void Foo(float *a)
{
(*a)++;
}

class ApplyFoo {
public:
void operator()(const blocked_range<size_t>& r) const {
for(size_t i = r.begin(); i != r.end(); i++) {
Foo(&big_arr[i]);
}
}
};

void ApplyFooRange(int start, int end)
{
for(int i = start; i <= end; i++)
Foo(&big_arr[i]);
}

DWORD WINAPI FooPart1(LPVOID param)
{
ApplyFooRange(0, BIGARRSIZE / 2 - 1);

return 0;
}

DWORD WINAPI FooPart2(LPVOID param)
{
ApplyFooRange(BIGARRSIZE / 2, BIGARRSIZE - 1);

return 0;
}

DWORD threadIDs[4];
HANDLE hThreads[4];

int main()
{
tick_count t0, t1;
int nthreads = 2;

task_scheduler_init init(task_scheduler_init::deferred);
if (nthreads >= 1)
init.initialize(nthreads);

t0 = tick_count::now();
parallel_for(blocked_range<size_t>(0, BIGARRSIZE), ApplyFoo(), auto_partitioner());
t1 = tick_count::now();
printf("\n*** work took %g seconds ***", (t1 - t0).seconds());

if (nthreads >= 1)
init.terminate();

t0 = tick_count::now();
for (int i = 0; i < BIGARRSIZE; i++)
Foo(&big_arr[i]);
t1 = tick_count::now();
printf("\n*** work took %f seconds ***", (t1 - t0).seconds());

omp_set_num_threads(2);

t0 = tick_count::now();
#pragma omp parallel for shared(big_arr)
for (int i = 0; i < BIGARRSIZE; i++)
Foo(&big_arr[i]); // each iteration touches a distinct element, so no atomic is needed
t1 = tick_count::now();
printf("\n*** work took %f seconds ***", (t1 - t0).seconds());

t0 = tick_count::now();
hThreads[0] = CreateThread(NULL, 0, FooPart1, NULL, 0, &threadIDs[0]);
hThreads[1] = CreateThread(NULL, 0, FooPart2, NULL, 0, &threadIDs[1]);

WaitForMultipleObjects(2, hThreads, TRUE, INFINITE);

t1 = tick_count::now();
printf("\n*** work took %f seconds ***", (t1 - t0).seconds());

CloseHandle(hThreads[0]);
CloseHandle(hThreads[1]);

printf("\n");
}

Try it yourself. Really impressive! :)

tbbphreak:

Hope my conclusion helps. As the complexity of the threaded inner loop increases and the array size grows by orders of magnitude, the performance gain becomes very apparent, and TBB makes a real difference. Creating threads manually can be slightly faster, but as the code becomes more complicated, TBB rules and rocks. My vote: TBB is awesome!
