Task Parallelization

Hello friends,

    I have successfully done threading and data parallelization, but I am really interested in task parallelization. How do I do it? How do I start?

Please guide me.

Your question has to be addressed in the context of the operating system (OS). Task parallelization requires support from the OS and, to a lesser extent, from the compiler. The issues are largely independent of the programming language used and, therefore, of the compiler.

You might look into some of the more frequently advocated parallel tasking models, such as OpenMP task or tbb::task, which are meant to apply across more than one OS, with support from the compiler. This might help you to clarify your question.

I want to specify a specific core for executing a task. How do I do that? And is it possible in Intel Composer?

OpenMP and tbb have affinity setting options to limit execution to a specified group of cores. You might accomplish what you requested indirectly by selecting a thread ID, or more directly at a lower level (e.g. pthreads), but usually this limits the portability of your program without any compensating benefit.
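For example, with the Intel OpenMP runtime the `KMP_AFFINITY` environment variable can restrict threads to specific logical processors without touching the source code (the program name below is hypothetical):

```shell
# Pin the 4 OpenMP threads to logical processors 0-3 (Intel OpenMP runtime)
export KMP_AFFINITY="granularity=fine,proclist=[0,1,2,3],explicit"
export OMP_NUM_THREADS=4
./my_openmp_program
```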

Can anyone give me sample code for task parallelization? (Two different tasks, e.g. suppose CPU cores 1 and 2 are doing addition and CPU cores 3 and 4 are doing a multiplication operation. That kind of parallel tasking is required.)

Thanks in advance!!!

Required by a homework assignment? Surely you have relevant course material. As you state it here, it makes little sense.

Hello TimP (Intel),
Is it possible using Intel Composer XE, as Swapnil said? That is, parallelization of two different tasks using OpenMP.

In case you haven't looked at them, a search command such as
task example site:software.intel.com
should show several papers on a similar subject, covering both OpenMP and TBB.

Hi guys,

I have copied and pasted the following TBB code,
and when I try to compile it I get the following error:
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cctype>
#include "parallel_reduce.h"
#include "blocked_range.h"
#include "task_scheduler_init.h"
#include "tick_count.h"

using namespace tbb;

// Uncomment the line below to enable the auto_partitioner
#define AUTO_GRAIN

struct Sum {
    float value;
    Sum() : value(0) {}
    Sum( Sum& s, split ) {value = 0;}
    void operator()( const blocked_range<float*>& range ) {
        float temp = value;
        for( float* a=range.begin(); a!=range.end(); ++a ) {
            temp += *a;
        }
        value = temp;
    }
    void join( Sum& rhs ) {value += rhs.value;}
};

float ParallelSum( float array[], size_t n ) {
    Sum total;
#ifndef AUTO_GRAIN
    parallel_reduce( blocked_range<float*>( array, array+n, 1000 ),
                     total );
#else /* AUTO_GRAIN */
    parallel_reduce( blocked_range<float*>( array, array+n ),
                     total, auto_partitioner() );
#endif /* AUTO_GRAIN */
    return total.value;
}

//! Problem size
const int N = 1000000;

//! Number of threads to use.
static int MaxThreads = 4;

//! If true, print out bits of the array
static bool VerboseFlag = false;

//! Parse the command line.
static void ParseCommandLine( int argc, char* argv[] ) {
    int i = 1;
    if( i<argc )
        MaxThreads = strtol( argv[i++], 0, 0 );
}

int main( int argc, char* argv[] ) {
    ParseCommandLine( argc, argv );

    // Allocate and initialize the input array
    float* input = new float[N];
    for( size_t i = 0; i < N; ++i ) {
        input[i] = (float)(rand() % 10);
    }

    if( VerboseFlag ) {
        printf(" Input: ");
        for ( size_t i = 0; i < 7; ++i ) {
            printf("%7.2f ", input[i]);
        }
        printf("...\n");
    }

    // Try different numbers of threads
    for( int p=1; p<=MaxThreads; ++p ) {

        task_scheduler_init init(p);

        tick_count t0 = tick_count::now();
        float output = ParallelSum(input, N);
        tick_count t1 = tick_count::now();

        if( VerboseFlag ) {
            printf("Output: %7.2f\n", output);
        }
        printf("%2d threads: %.3f msec\n", p, (t1-t0).seconds()*1000);
    }
    return 0;
}

# icc -O2 -g -DNDEBUG sample2.c -ltbb
sample2.c(5): catastrophic error: cannot open source file "parallel_reduce.h"
#include "parallel_reduce.h"
^

compilation aborted for sample2.c (code 4)

How can I solve this error? I am new to parallel programming and Intel Composer.

Hi Sigehere,
Just change the following lines:
#include "parallel_reduce.h"
#include "blocked_range.h"
#include "task_scheduler_init.h"
#include "tick_count.h"

with these lines:
#include "tbb/parallel_reduce.h"
#include "tbb/blocked_range.h"
#include "tbb/task_scheduler_init.h"
#include "tbb/tick_count.h"

I hope it will work.

>>...I want to specify specific core for executing task. how will i do it?

It is OS dependent. For example, on Windows platforms you call the SetThreadAffinityMask Win32 API function.

It can also be done with OpenMP, in which case it is OS independent, provided there is an OpenMP library for a C/C++ compiler that builds code for that OS.

>>...and Is it possible in intel composer?

No. You need to use an OS-dependent API.

Hi Sergey,
I am using an Intel Core i7 (4 cores, 8 threads) with 8 GB RAM and the Ubuntu 12.04 operating system.
Can you explain how I can achieve parallel tasking on such a system (i.e. I want to run two different operations on different cores of the processor)?
Please explain how I can do it.
I am new to parallel programming.

>>>I am using an Intel Core i7 (4 cores, 8 threads) with 8 GB RAM and the Ubuntu 12.04 operating system.>>>

You need to consult the Linux APIs that deal with thread scheduling and dispatching; they are accessible to the application programmer.
I recommend that you read this chapter first, and later the whole book: http://oreilly.com/catalog/linuxkernel/chapter/ch10.html
I do not know offhand how you are supposed to call the system functions related to process scheduling on Linux.

I'd like to repeat a recommendation to start with OpenMP or TBB to learn enough to know whether you have a reason to dig into pthreads programming.
Note the number of tutorials on the general area, such as
https://computing.llnl.gov/tutorials/parallel_comp/

https://computing.llnl.gov/tutorials/pthreads/

Hello friends,
Does anybody have an idea about how to use the CPU caches (L1, L2, ...) in our multithreaded application with Intel Composer XE?

>>...Anybody have any idea about how to use CPU cache L1, L2...

Please take a look at Intel Manuals for more information located at: www.intel.com/content/www/us/en/processors/architectures-software-develo...

Hi Sergey Kostrov & Swapnil,
Can you elaborate on how I can use the CPU cache in my program?
At the OS level, I know the cache is maintained automatically, based on which memory addresses are accessed frequently,
but if we could deliberately keep a specific part of my program's data in the CPU cache, it would help to optimize my code.
Please give me a proper solution for using the cache.

>> Sergey Kostrov
I have seen this link, which contains many reference manuals. Can you tell me which one is most useful for manually managing the CPU cache in my program?

Thanks to all of you for your responses, in advance!!!
:)

Quote:

Sigehere S. wrote:

Hi Sergey Kostrov & Swapnil,
Can you elaborate on how I can use the CPU cache in my program?
At the OS level, I know the cache is maintained automatically, based on which memory addresses are accessed frequently,
but if we could deliberately keep a specific part of my program's data in the CPU cache, it would help to optimize my code.
Please give me a proper solution for using the cache.

>> Sergey Kostrov
I have seen this link, which contains many reference manuals. Can you tell me which one is most useful for manually managing the CPU cache in my program?

Thanks to all of you for your responses, in advance!!!
:)

As Sergey advised, please consult the Intel processor manuals. They are the definitive source of knowledge for programmers.
Please follow this link: http://blogs.msdn.com/b/oldnewthing/archive/2009/12/08/9933836.aspx

Hi everybody,

>>...I have seen this link, which contains many reference manuals. Can you tell me which one is most useful for manually managing the CPU
>>cache in my program?

I will provide more details soon ( that is, chapters, pages, etc ).

Hi iliyapolak,
You gave the following link:
>>Please follow this link: http://blogs.msdn.com/b/oldnewthing/archive/2009/12/08/9933836.aspx
It is useful for finding the CPU cache sizes/limits, thanks for that.
But I am interested in using the CPU cache in my own program manually or forcefully.
Do you have any idea about it?

Hi Sergey Kostrov,
First, thanks for the immediate response.
Yes, Sergey Kostrov, please give me some more details as early as possible,
because I am new to Intel Composer XE and reading all the documents takes a lot of time. I know it would be helpful, but it requires a lot of time. I have read the Intel Optimization Manual; it is really helpful, but I did not find a section on manual CPU cache control.
If you have already read these documents, then telling me which portion of the manual is useful would really help me.
Thanks!!!

>>>But, I am interested in using the CPU cache in my own program manually or forcefully>>>
C/C++ are not cache-aware; you need to optimize your programs yourself, and you can use the Intel manuals for that.
I will do some research on the web to find useful information.

Quote:

iliyapolak wrote:

>>>But, I am interested in using the CPU cache in my own program manually or forcefully>>>
C/C++ are not cache-aware; you need to optimize your programs yourself, and you can use the Intel manuals for that.
I will do some research on the web to find useful information.

Read this article: http://people.redhat.com/drepper/cpumemory.pdf

>>>But, I am interested in using the CPU cache in my own program manually or forcefully.
Do you have any idea about it? >>>

It can be done with smart cache-aware programming; that means you need to find and exploit spatial and temporal locality in your program. Arrays are very good candidates for this. Sadly, I cannot help you much more here, as I am not an expert on the CPU cache and its optimization.
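To make the spatial/temporal-locality point concrete, a classic cache-aware technique is loop blocking (tiling). The sketch below (my own illustrative code, not from the Intel manuals) transposes a matrix in TILE x TILE blocks so that a block of both arrays stays resident in cache while it is processed:

```c
#include <stddef.h>

#define TILE 32  /* tile edge; chosen so a block of src and dst fits in L1 */

/* Naive transpose: writes to dst walk memory with stride n floats,
   so cache lines are often evicted before they are reused. */
void transpose_naive(const float *src, float *dst, size_t n) {
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < n; ++j)
            dst[j * n + i] = src[i * n + j];
}

/* Tiled transpose: processes the matrix in TILE x TILE blocks,
   exploiting spatial locality in both src and dst. */
void transpose_tiled(const float *src, float *dst, size_t n) {
    for (size_t ii = 0; ii < n; ii += TILE)
        for (size_t jj = 0; jj < n; jj += TILE)
            for (size_t i = ii; i < ii + TILE && i < n; ++i)
                for (size_t j = jj; j < jj + TILE && j < n; ++j)
                    dst[j * n + i] = src[i * n + j];
}
```

Both functions produce identical results; only the memory-access order (and thus cache behavior) differs, which is the whole point of the technique.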

Very good podcast by Scott Meyers :http://skillsmatter.com/podcast/home/cpu-caches-and-why-you-care

>>...I have read the Intel Optimization Manual; it is really helpful, but I did not find a section on manual CPU cache control...

Do you have the April 2012 edition? I wonder if you looked at Chapter 7, 'Optimizing Cache Usage'?

Here are a couple of tips:

- Warm up your data before processing (it is a very simple procedure, very helpful when some data has been paged out to the virtual-memory paging file).
- Use a PREFETCH instruction (it really improves performance when used in MemCpy or StrCpy style functions).

Here is a small example:
...
RTbool FastMemCopy128( RTvoid *pvDst, RTvoid *pvSrc, RTint iNumOfBytes )
{
    ...
    RTint iPageSize = 4096;
    RTint iCacheLineSize = 32;
    ...
    for( RTint i = 0; i < iNumOfBytes; i += iPageSize )
    {
        RTint j;

        for( j = i + iCacheLineSize; j < ( i + iPageSize ); j += iCacheLineSize )
        {
            _mm_prefetch( ( RTchar * )pvSrc + j, _MM_HINT_NTA );
        }
        ...
    }
    ...
    return ( RTbool )bOk;
}

...

Thanks, I got it. I hope it will be helpful to me.

>>Thanks, I got it. I hope it will be helpful to me.

Please take a look at:

Forum topic: A problem with 'prefetcht0' instruction ( AT&T inline-assembler syntax )
Web-link: http://software.intel.com/en-us/forums/topic/280798

Also, try searching the Intel forums with the keyword prefetch, because there were lots of discussions on that subject in the past.

>>Also, try to search the Intel forums with a key-word prefetch
Sure, I will read, thanks Sergey.

Hi Sergey,
I have written a simple piece of code using the _mm_prefetch function; the objective of my sample code is just to find the aggregate sum of an array.
CODE:
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>   /* gettimeofday */
#include <xmmintrin.h>  /* _mm_prefetch */
#include <omp.h>
#define SIZE 400000000
#define LOOP 20000

struct timeval starttime;
// start-time function implementation
void startTimer()
{
    gettimeofday(&starttime, 0);
}
// end-time function implementation
double endTimer()
{
    struct timeval endtime;
    gettimeofday(&endtime, 0);

    return (endtime.tv_sec - starttime.tv_sec)*1000.0 + (endtime.tv_usec - starttime.tv_usec)/1000.0;
}

int main ()
{
    long int sum = 0;
    int *A = (int *)malloc(sizeof(int)*SIZE);
    int i, j;
    for (i = 0; i < SIZE; i++)
    {
        A[i] = 1;
    }

    startTimer();
    #pragma omp parallel for reduction (+:sum) private(j)
    for (i = 0; i < LOOP; i++)
    {
        if (i + 1 < LOOP)  /* do not prefetch past the end of A */
            _mm_prefetch((const char *)&A[(i+1)*LOOP], _MM_HINT_T0);
        for (j = 0; j < LOOP; j++)
        {
            sum += A[(i*LOOP)+j];
        }
    }
    printf("Result = %ld\n", sum);
    printf("Total Time Required = %lf ms\n", endTimer());

    return 0;
}

#shell script:
icc -O1 -openmp sum.c -o sum_O1
icc -O2 -openmp sum.c -o sum_O2
icc -O3 -openmp sum.c -o sum_O3
icc -O -openmp sum.c -o sum_O
icc -Os -openmp sum.c -o sum_Os
icc -O0 -openmp sum.c -o sum_O0
icc -fast -openmp sum.c -o sum_fast
icc -Ofast -openmp sum.c -o sum_Ofast
icc -fno-alias -openmp sum.c -o sum_fno_alias
icc -fno-fnalias -openmp sum.c -o sum_fno_fnalias

But the time required after using _mm_prefetch is the same as the time required before using _mm_prefetch.
Is there any option missing from the {icc} commands?
Can you tell me where I am missing something?

I am using Ubuntu 12.04 (Intel i7 / 8 GB RAM).
More details about my processor are at the following link:
http://ark.intel.com/products/64899/Intel-Core-i7-3610QM-Processor-6M-Ca...

Can you give me any suggestions to improve this code?

Thanks.

I'm not surprised if _mm_prefetch makes little difference here. You are depending primarily on automatic hardware prefetch, if you haven't turned it off (in BIOS setup or by MSR), and hardware prefetch ought to do the job well.
Nit-picks: if you run on a multiple-socket platform, one of your prefetches appears to prefetch to the wrong CPU. When that doesn't happen, you may accelerate the first cache line for the next inner loop but delay the effectiveness of hardware prefetch.

Hi everybody,

>>...
>>#pragma omp parallel for reduction (+:sum)
>>for(i=0;i<LOOP;i++)
>>{
>>_mm_prefetch(&A[(i+1)*LOOP],3);
>>for(j=0;j<LOOP;j++)
>>{
>>sum += A[(i*LOOP)+j];
>>}
>>}
>>...

Please take a look at a partial example of FastMemCopy128 function which I posted a couple of days ago. You're using _mm_prefetch in a different way and it doesn't look good.

We constantly have discussions on the applications and usefulness of the _mm_prefetch intrinsic function, or the prefetch instruction (as inline assembler in C/C++ code). Since Intel invented it, prefetch should work; however, it has to be applied and used properly. Your case is more complex because the _mm_prefetch intrinsic is used inside an OpenMP construct (is that the reason for the problem?), and I have never tried to do the same.

Quote:

TimP (Intel) wrote:

I'm not surprised if _mm_prefetch makes little difference here. You are depending primarily on automatic hardware prefetch, if you haven't turned it off (in BIOS setup or by MSR), and hardware prefetch ought to do the job well.
Nit-picks: if you run on a multiple-socket platform, one of your prefetches appears to prefetch to the wrong CPU. When that doesn't happen, you may accelerate the first cache line for the next inner loop but delay the effectiveness of hardware prefetch.

Does that mean we cannot use hardware as well as software prefetch in the same application? If we disable hardware prefetch in the BIOS, then software prefetch will obviously do an effective job; I agree with you.

That means there is no way to gain more optimization by using hardware and software prefetch in the same application.

Suppose, in my code above: when the inner loop starts to execute, the CPU has not yet cached that data in the CPU cache memory. Is that right or wrong? If we tell the processor that the data for the next loop iteration will be needed, it should help to reduce latency.
I am trying to reduce the latency of memory access.
If I am thinking in the wrong direction, then please tell me how this code is executed by the CPU.
Thanks

Quote:

Sergey Kostrov wrote:

Hi everybody,

>>...
>>#pragma omp parallel for reduction (+:sum)
>>for(i=0;i<LOOP;i++)
>>{
>>_mm_prefetch(&A[(i+1)*LOOP],3);
>>for(j=0;j<LOOP;j++)
>>{
>>sum += A[(i*LOOP)+j];
>>}
>>}
>>...

Please take a look at a partial example of FastMemCopy128 function which I posted a couple of days ago. You're using _mm_prefetch in a different way and it doesn't look good.

We constantly have discussions on the applications and usefulness of the _mm_prefetch intrinsic function, or the prefetch instruction (as inline assembler in C/C++ code). Since Intel invented it, prefetch should work; however, it has to be applied and used properly. Your case is more complex because the _mm_prefetch intrinsic is used inside an OpenMP construct (is that the reason for the problem?), and I have never tried to do the same.

Hi Sergey,
Did you get good results with FastMemCopy128? Can you post a full sample so I can study it?
Because with my code I don't get any improvement.
Please give me some sample code with memory optimization.
Thanks

@Tim
If an SoA or hybrid-SoA data layout were implemented where applicable, coupled with the prefetch instruction, it would be interesting to see how much such an approach could improve performance.

@Sigehere

Please try using an SoA or hybrid-SoA approach for your data layout. This layout is most effective for nicely vectorizable data such as 3D or 4D vectors, but you can try it on your data set. Below is a very interesting link.
http://software.intel.com/en-us/articles/how-to-manipulate-data-structur...

The sample which was presented appears to conform to the preferred organization of a stride 1 inner vectorizable loop inside an outer parallel loop. I don't see the relevance of discussing array of structures here.

If you wish to combine hardware and software prefetch, you might start by examining what icc does with options such as
-xHost -ansi-alias -openmp -opt-prefetch -opt-report
If your objective is to cover the initial iterations of a loop by software prefetch, the category under which icc implements that for certain targets is called initial value prefetch. The original Pentium 4 presented some cases where this effect could be accelerated by methods resembling what is shown in this thread, but the interaction of software and hardware prefetch was generally bad and the characteristics of hardware prefetch had to be changed. The point well taken is that the hardware prefetch doesn't become effective until the loop has traversed several cache lines, but that problem should be negligible in a case as large as this.

>>...Do you get good result with FastMemCopy128 can you post any full sample code...

I could post test results with and without prefetch to demonstrate that it works and improves performance.

>>>The sample which was presented appears to conform to the preferred organization of a stride-1 inner vectorizable loop inside an outer parallel loop. I don't see the relevance of discussing array of structures here>>>

For Sigehere's example SoA is not applicable, but it is for vectorizable data sets, for example float coordinates of vertices and float coordinates of light sources.
When allocating memory for large data sets of vertices, I think the preferred option would be hybrid SoA, because the structures are located in close vicinity to each other.
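For readers unfamiliar with the terms, the sketch below contrasts the two layouts (names are mine, purely illustrative). In AoS, one vertex's fields are adjacent in memory; in SoA, each field gets its own contiguous array, which gives stride-1, easily vectorized accesses when only one field is needed:

```c
#include <stddef.h>

/* Array of Structures (AoS): iterating over just x also pulls
   y, z, w into cache, wasting bandwidth. */
struct VertexAoS { float x, y, z, w; };

/* Structure of Arrays (SoA): each component is contiguous, so a
   loop over x touches only cache lines that hold x values. */
struct VerticesSoA { float *x, *y, *z, *w; };

float sum_x_aos(const struct VertexAoS *v, size_t n) {
    float s = 0.0f;
    for (size_t i = 0; i < n; ++i) s += v[i].x;   /* stride 16 bytes */
    return s;
}

float sum_x_soa(const struct VerticesSoA *v, size_t n) {
    float s = 0.0f;
    for (size_t i = 0; i < n; ++i) s += v->x[i];  /* stride 4 bytes */
    return s;
}
```

Both functions compute the same sum; the SoA version simply reads one quarter of the memory, which is why it tends to vectorize and cache better.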

Note: SoA stands for Structure of Arrays

>>...SoA is not applicable...

As you can see in Sigehere's example, just one block of memory for a 1-D array is created:

>>...
>>int *A = ( int * )malloc( sizeof( int ) * SIZE );

Quote:

Sergey Kostrov wrote:

Note: SoA stands for Structure of Arrays

>>...SoA is not applicable...

As you can see in Sighere's example just one block of memory for 1-D array is created:

>>...
>>int *A = ( int * )malloc( sizeof( int ) * SIZE );

SoA layout usage makes more sense for perfectly vectorized data sets.

>>>SoA layout usage makes more sense for perfectly vectorized data sets.>>>

Although in Sigehere's case, I think that for the sake of curiosity a hybrid-SoA approach could be tested. Packing his data into an SoA aligned on 16-byte boundaries, designing it as 3D or 4D vectors, and filling a 1-D array with such structures, maybe such a data-set design could improve CPU cache performance.

>>...SoA layout usage makes more sense...

Is it relevant to the problem with the application of a 'prefetch' instruction in a piece of code (with the OpenMP construct posted by Sigehere) which calculates the sum of the elements of a large vector?

Quote:

Sergey Kostrov wrote:

>>...SoA layout usage makes more sense...

Is it relevant to the problem with the application of a 'prefetch' instruction in a piece of code (with the OpenMP construct posted by Sigehere) which calculates the sum of the elements of a large vector?

I simply stated that the SoA approach is more relevant to perfectly vectorized data, for example float coordinates of vertices.

>>>>...Do you get good result with FastMemCopy128 can you post any full sample code...
>>
>>I could post test results with and without prefetch to demonstrate that it works and improves performance.

Finally I managed to complete these tests; a couple of posts with test results follow...

[ Test 1 with _mm_prefetch ]
[ Vector size is 1024KB ( 1MB ) ]

Data Size: 1048576 bytes
Test-Case 2b [ CrtMemcpy ] completed in 2142 ticks
Test-Case 3a [ FastMemCopy128 ] completed in 1344 ticks Note: 9.3% faster than FastMemCopy128 without _mm_prefetch

[ Test 2 without _mm_prefetch ]
[ Vector size is 1024KB ( 1MB ) ]

Data Size: 1048576 bytes
Test-Case 2b [ CrtMemcpy ] completed in 2141 ticks
Test-Case 3a [ FastMemCopy128 ] completed in 1469 ticks

[ Test 3 with _mm_prefetch ]
[ Vector size is 4096KB ( 4MB ) ]

Data Size: 4194304 bytes
Test-Case 2b [ CrtMemcpy ] completed in 8484 ticks
Test-Case 3a [ FastMemCopy128 ] completed in 5468 ticks Note: 7.7% faster than FastMemCopy128 without _mm_prefetch

[ Test 4 without _mm_prefetch ]
[ Vector size is 4096KB ( 4MB ) ]

Data Size: 4194304 bytes
Test-Case 2b [ CrtMemcpy ] completed in 8485 ticks
Test-Case 3a [ FastMemCopy128 ] completed in 5891 ticks

Here are results from two tests which demonstrate performance improvements when unrolling of loops is applied:

[ Test 5 for Rolled and Unrolled loops ]
[ Vector size is 256KB ]

Data Size: 262144 bytes
Test-Case 2a [ 1-in-1 ] completed in 6531 ticks
Test-Case 2a [ 4-in-1 ] completed in 3187 ticks
Test-Case 2a [ 8-in-1 ] completed in 3094 ticks

[ Test 6 for Rolled and Unrolled loops ]
[ Vector size is 1024KB ( 1MB ) ]

Data Size: 1048576 bytes
Test-Case 2a [ 1-in-1 ] completed in 20015 ticks
Test-Case 2a [ 4-in-1 ] completed in 12469 ticks
Test-Case 2a [ 8-in-1 ] completed in 11625 ticks

And one more thing to mention: for tests 1, 2, 3 and 4 with the CrtMemcpy and FastMemCopy128 functions, dynamically allocated memory was aligned to a 16-byte boundary.
