Can OpenMP parallelize function level

Hi all,

I would like to decide which threading method to use, and I need to know a few things...

First: Can OpenMP 3.0 or TBB parallelize at the function level?

Second: How can I tell whether a function is large enough to be worth parallelizing? Is there a scale or method for knowing?


Quoting - m_enayah@hotmail.com

Hi all,

I would like to decide which threading method to use, and I need to know a few things...

First: Can OpenMP 3.0 or TBB parallelize at the function level?

OpenMP 3.0 can parallelize at the function level. Here is an example for the Intel C++ compiler, which supports the taskq extensions:

#pragma omp parallel
#pragma intel omp taskq
{
    #pragma intel omp task
    f1();
    #pragma intel omp task
    f2();
    #pragma intel omp task
    f3();
}

And OpenMP 3.0 includes similar task functionality.

All about lock-free algorithms, multicore, scalability, parallel computing and related topics: http://www.1024cores.net

Quoting - m_enayah@hotmail.com

Second: How can I tell whether a function is large enough to be worth parallelizing? Is there a scale or method for knowing?

The separate "work item" for parallel execution should be no less than roughly 10,000 cycles. When I measured TBB, the overhead per work item was around 500-600 cycles, so a 10,000-cycle work item results in roughly 5% overhead.

I expect the situation is similar for OpenMP.

All about lock-free algorithms, multicore, scalability, parallel computing and related topics: http://www.1024cores.net

Quoting - m_enayah@hotmail.com

Hi all,

I would like to decide which threading method to use, and I need to know a few things...

First: Can OpenMP 3.0 or TBB parallelize at the function level?

Second: How can I tell whether a function is large enough to be worth parallelizing? Is there a scale or method for knowing?

Thanks for the reply. What I mean by function level is that I need some control and data dependency, like a factorial function. Do OpenMP or TBB handle that?

Quoting - m_enayah@hotmail.com

Thanks for the reply. What I mean by function level is that I need some control and data dependency, like a factorial function. Do OpenMP or TBB handle that?

Actual recursion has to be confined to a single thread, or it gets very complicated (semaphores, etc.), which brings back the question of how much work must be done between dependency checks to make threading worthwhile.

Quoting - tim18

Quoting - m_enayah@hotmail.com

Thanks for the reply. What I mean by function level is that I need some control and data dependency, like a factorial function. Do OpenMP or TBB handle that?

Actual recursion has to be confined to a single thread, or it gets very complicated (semaphores, etc.), which brings back the question of how much work must be done between dependency checks to make threading worthwhile.

What I have is a for loop, and inside this for loop I have two functions. Each function is independent of the other; after both functions run, I have two outputs that are needed for the next round of the loop. Can OpenMP 3.0 still do that, or should I find another method, like POSIX threads?

Thanks for your help, guys. I appreciate it.

Quoting - m_enayah@hotmail.com

What I have is a for loop, and inside this for loop I have two functions. Each function is independent of the other; after both functions run, I have two outputs that are needed for the next round of the loop. Can OpenMP 3.0 still do that, or should I find another method, like POSIX threads?

The task directive quoted above will take care of assigning the functions to individual threads as effectively as pthreads.

Quoting - tim18

Quoting - m_enayah@hotmail.com

What I have is a for loop, and inside this for loop I have two functions. Each function is independent of the other; after both functions run, I have two outputs that are needed for the next round of the loop. Can OpenMP 3.0 still do that, or should I find another method, like POSIX threads?

The task directive quoted above will take care of assigning the functions to individual threads as effectively as pthreads.

Now I understand; that should work for the example provided.

What if the two functions use the same array (data)? What would the code for that look like?

Quoting - m_enayah@hotmail.com

What if the two functions use the same array (data)? What would the code for that look like?

Shared data are OK to read. For writes, the threads must operate on different cache lines to get satisfactory performance (i.e., avoid false sharing). If the threads operate on neighboring cache lines, it is important to use your OpenMP implementation's affinity facilities to map those threads, as much as possible, onto the same last-level cache. The threads can also be given their own temporary copies of arrays via the various OpenMP private clauses.


Quoting - m_enayah@hotmail.com

Now I understand; that should work for the example provided.

What if the two functions use the same array (data)? What would the code for that look like?

This question is too general to get a concrete answer; it depends. If you want a concrete answer or example, post the high-level code of your algorithm.

All about lock-free algorithms, multicore, scalability, parallel computing and related topics: http://www.1024cores.net

Quoting - Dmitriy Vyukov

This question is too general to get a concrete answer; it depends. If you want a concrete answer or example, post the high-level code of your algorithm.

Okay, here is my source code. We have two portions (1 and 2) inside the for loop. The portions don't depend on each other, but I use the same functions inside both, and those functions use the same global arrays.

Each for-loop iteration depends on the one before, so what I want is to parallelize just the sections inside each iteration.

Thanks

for (round = 1; round <= 16; round++)
{
    if (round % 2 != 0)
    {
        // First portion
        Expansion(mid);          // Performing expansion on mid
        xor_oneE(17 - round);    // XOR expansion[48] with z[48] to get xor1[48]
        substitution();          // Substitution on xor1[48] to get sub[32]
        permutation();           // Permutation on sub[32] to get p[32]
        xor_two(left);           // XOR left[32] with p[32] to get xor2[32]
        for (i = 0; i < 32; i++) left[i] = xor2[i];   // Dump xor2[32] into left[32]

        // Second portion
        Expansion(mid);          // Performing expansion on mid
        xor_oneE(round);         // XOR expansion[48] with z[48] to get xor1[48]
        substitution();          // Substitution on xor1[48] to get sub[32]
        permutation();           // Permutation on sub[32] to get p[32]
        xor_three(right);        // XOR left[32] with p[32] to get xor3[32]
        for (i = 0; i < 32; i++) right[i] = xor3[i];  // Dump xor3[32] into right[32]
    }
    else
    {
        // same as above
    }
}

Without seeing all of the code it would be hard to say for certain, but it looks as if Expansion(mid) in the first portion would conflict with Expansion(mid) in the second portion when executed concurrently (and likewise for the other branch of if (round % 2 != 0)).

The situation is similar for xor_oneE, substitution, and permutation.

There is no way for us to determine whether these functions store their results or temporary data in the same (or different) locations.

What I suggest you do is get pencil and paper out and diagram the unrolled loop as a time sequence. Look at what code is, or can be made, parallel-safe, and pay attention to how the sequencing must be arranged (or can be rewritten). Have time advancing to the right and what can be performed in parallel going down.

Try to increase the parallelization if possible (this may require a rewrite or separate internal functions).

Note, the primary reason for diagramming is to discover (as I suspect) that the tail end of your original loop can be overlapped with the front end of the next iteration. Your loop is relatively small (either 16 or 8 iterations, depending on how you perform the unrolling). If this loop is the choke point in your application, then it will warrant additional work to make it optimal.

Once you have this chart, ask yourself whether the code is to be optimized for a specific number of cores, a minimum number of cores, or an unknown number of cores (1:n). You may find it necessary to write multiple code paths depending on the number of available cores.

Jim Dempsey

www.quickthreadprogramming.com
