openmp question

Hello
icpc 10.0 (icpc -v), Mac OS X 10.4

double *const opt = new double[...];
#pragma omp parallel shared(opt)
{
opt[threadSpecificIndex] = value;
cout << opt[threadSpecificIndex];
}

Once one thread writes a value to an element of opt (threadSpecificIndex is never the same for 2 different threads), is the same thread supposed to read that array element and find the same value it has written? Is there no "flushing" required?

The behavior seems different depending on the optimization level (no optim vs -O3).
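For reference, the writing thread always sees its own store; a flush (implied here by the barrier) only matters when another thread reads the element. A minimal sketch of the cross-thread case, with illustrative values:

#include <omp.h>
#include <cstdio>

int main() {
    double *const opt = new double[omp_get_max_threads()];
    #pragma omp parallel shared(opt)
    {
        const int tid = omp_get_thread_num();
        const int nt  = omp_get_num_threads();
        opt[tid] = tid * 1.5;     // this thread's own write...
        double mine = opt[tid];   // ...is always visible to this thread, no flush needed
        #pragma omp barrier       // a barrier implies a flush
        // only after the flush is reading another thread's element well defined
        double other = opt[(tid + 1) % nt];
        std::printf("tid %d: mine=%g other=%g\n", tid, mine, other);
    }
    delete[] opt;
}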
rds,


In this example, you have no control over the order in which the results go to stdout, so it will vary with optimization, from one run to another, etc. If you sorted the results by thread, you should not see any variation.
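For instance, a deterministic variant buffers one value per thread and prints after the region (a sketch; the stored value is a stand-in for the real result):

#include <omp.h>
#include <vector>
#include <iostream>

int main() {
    std::vector<double> results(omp_get_max_threads());
    #pragma omp parallel
    {
        const int tid = omp_get_thread_num();
        results[tid] = tid * 2.0; // stand-in for the real per-thread value
    }
    // printed sequentially after the region, so the order is deterministic
    for (std::size_t t = 0; t < results.size(); ++t)
        std::cout << "thread " << t << ": " << results[t] << '\n';
}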

Thanks. Please let me rephrase, clarify the problem, and expand the question into several:

double *const A = new double[ nbrThreads*N + 1 ];

1. I cannot allocate A simply on the stack, because the number of threads and N are determined only at runtime. N is of the order of 500000. Otherwise, I could have done
double A[N][nbrOfOmpThreads];
Below, I will access A linearly as A[i*nbrThreads+thread] instead of A[i][thread].
I wonder whether there will be a perf difference between a stack allocation and a heap allocation?

2.
int *const timestep = new int[nbrThreads];
int *const spacestep = new int[nbrThreads];
the threads in the // region will read the position reached by other threads (on which they depend for values) by reading timestep[otherthread] and spacestep[otherthread].

3.
#pragma omp parallel shared(opt,timestep,spacestep)
If I understand the OpenMP spec, by default all variables in scope before the start of the // region are shared anyway, unless one adds default(none)?
I still like to write the shared clause, because it shows clearly the variables I intend to share. The pointers opt, timestep, spacestep are shared; any element of these arrays is always written by exactly 1 thread, never more, but it is read by other threads.
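For comparison, a default(none) version would make every sharing decision explicit (a sketch; the actual region follows below):

// default(none) forces every variable used in the region into an explicit
// clause; forgetting one becomes a compile error instead of a silent share.
#pragma omp parallel default(none) shared(opt, timestep, spacestep)
{
// ... same body as below ...
}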

#pragma omp parallel shared(opt,timestep,spacestep)
{
const int thr = omp_get_thread_num();
const int dependent = (thr+1)%nbrThreads; // thread which depends on current thread
const int dependee = (thr+nbrThreads-1)%nbrThreads; // thread on which this thread depends
int& ss = spacestep[thr];
int& ssdependent = spacestep[dependent];
int& ssdependee = spacestep[dependee];
int& ts = timestep[thr];
int& tsdependent = timestep[dependent];
int& tsdependee = timestep[dependee];

All these variables are thread-local. Each thread is both a dependent and a dependee.

while (ts>=0) { // Back through time
for (ss=0; ss<=N; ss++) { // down across space

// wait until the dependee has finished with the nodes this thread needs
while ( tsdependee>ts && ssdependee<=ss+1 );
// wait until the dependent has used our nodes before we override them at the next timestep
while ( tsdependent>ts && ssdependent<=ss );
....
// proceed with work
}
ts -= nbrThreads;
}

spacestep and timestep are updated for the current thread via the references ss and ts.
Other threads, at some point, will see the new values. No deadlock occurs.

The above works in Debug mode but not in Release with optimization. I suspect the waiting doesn't happen; perhaps the while() gets optimized away?
In practice, not much time is spent in those loops. Is there still a point in making the thread sleep for some time? In OpenMP, how can that be done?
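For reference, one way to keep the compiler from caching the tested values is an explicit flush inside the spin loop (a sketch using the names above; this is the OpenMP-level alternative to volatile):

// The flush forces re-reads of the shared arrays on every iteration,
// so the optimizer cannot hoist the tests out of the loop.
while ( tsdependee>ts && ssdependee<=ss+1 )
{
#pragma omp flush // with no list, flushes the thread's entire shared view
}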

Thanks,

finjulhich,

Make the control arrays spacestep and timestep volatile (the contents of the arrays may change without notice).

Also, verify that your program is not performing useless work while waiting for other threads to complete.

Your particular style will (may) work well if each thread takes an equal amount of time to finish. This is not always the case, as there are other factors beyond the control of the programmer; principal among these is that the operating system may run something else during your program's run time. You may need to consider dynamic scheduling of the work, as sketched below.
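For reference, dynamic scheduling in OpenMP is a clause on a worksharing loop; a generic sketch, not tied to this algorithm (process() and n are placeholders):

#include <omp.h>

void process(int); // placeholder for the real per-item work

void run(int n) {
    // chunks of 64 iterations go to whichever thread is free, so a thread
    // delayed by the OS does not stall the whole team
    #pragma omp parallel for schedule(dynamic, 64)
    for (int i = 0; i < n; ++i)
        process(i);
}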

Also, by having each thread take adjacent array locations you inhibit the ability of the compiler to generate vector instructions (e.g. work on two of your doubles at a time).

And cache lines also reference adjacent packets of memory, e.g. 64 bytes (8 doubles), so your threads will be evicting each other's cache lines.

Your program may run significantly faster if you reorganize how you split up the work; a sketch follows.
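For instance, giving each thread its own contiguous block instead of interleaving by thread keeps each cache line (and any vectorization) private to one thread; a sketch using the indexing from the earlier posts:

// Interleaved (current): neighbours from other threads share each cache line.
inline double loadInterleaved(const double* A, int nbrThreads, int i, int thr) {
    return A[i*nbrThreads + thr];
}
// Blocked: thread thr owns the contiguous range A[thr*N .. thr*N+N-1],
// so its cache lines stay private and the inner loop can vectorize.
inline double loadBlocked(const double* A, int N, int i, int thr) {
    return A[thr*N + i];
}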

Jim Dempsey

www.quickthreadprogramming.com

Thanks for your answers.

> finjulhich,
> Make the control arrays spacestep and timestep volatile (the contents of the arrays may change without notice).

I changed to

volatile int* timestep = new volatile int[nbrThreads];
volatile int* spacestep = new volatile int[nbrThreads];
...
#pragma omp parallel shared(opt,timestep,spacestep)
{
volatile int& ss = spacestep[thr];
volatile int& ssdependent = spacestep[dependent];
volatile int& ssdependee = spacestep[dependee];
volatile int& ts = timestep[thr];
volatile int& tsdependent = timestep[dependent];
volatile int& tsdependee = timestep[dependee];
...

And the wrong result in optimized mode still happens. More precisely, I had

while (ts>=0) {
for (ss=0, SSi=Sinit; ss<=spacestepmax; ss++, SSi*=fr) {
while ( tsdependee>ts && ssdependee<=ss+1 )
os << "ss=" << ss << std::endl;
os << ts << std::endl;
while ( tsdependent>ts && ssdependent<=ss )
os << " ss=" << ss << std::endl;
os << ts << std::endl;
// do work
}
ts -= nbrthreads;
}

os is a thread-specific file output stream. With these printouts, even in -O3, it works fine.
If I comment out the 2nd printout after the first while (the os << ts line), then it fails.

> Also, verify that your program is not performing useless work while waiting for other threads to complete. Your particular style will (may) work well if each thread takes an equal amount of time to finish. This is not always the case, as there are other factors beyond the control of the programmer. Principal among these is the operating system may run something else during your program run time. You may need to consider dynamic scheduling of the work...

I have a different, working algorithm that splits up the work vertically, instead of alternating each time step, but it has #pragma omp barrier at the frontiers. I wanted to see if this time-step-alternating thread solution was faster.
In practice, no other CPU-intensive processes will be running at the same time, at least for now.

> Also, by having each thread take adjacent array locations you inhibit the ability of the compiler to generate vector instructions (e.g. work on two of your doubles at a time).

Given the nature of the algo, it is likely the threads will never be far from each other at the space level (I'm not sure about this). That is why I thought to place adjacent doubles from consecutive threads.

> And cache lines also reference adjacent packets of memory, e.g. 64 bytes (8 doubles), so your threads will be evicting the cache of the other threads.

Basically, 1 thread's node (t,s) will = previous thread's node (t,s)*const0 + previous thread's node (t,s+1)*const1.
Ideally, I would get the 3 nodes in the cache line (actually that would happen only if the threads are synced perfectly).

> Your program may run significantly faster if you reorganize how you split up the work.

That is the algo I started with first,

After some more investigation,

while ( tsdependee>ts && ssdependee<=ss+1 );
os << std::endl;
while ( tsdependent>ts && ssdependent<=ss );

Removing os << std::endl; in -O3 triggers the failure. Also, os << without the endl doesn't help.

Seems the flushing makes the difference....

As a test, temporarily do

#define Debug_os

#ifdef Debug_os
// testing potential rtl problem with os<<...
#else
os << ...
#endif

You may have a problem with using a non-multi-threaded version of the string template library, i.e. one that uses a static formatting buffer as opposed to a stack-based formatting buffer.

In your code:

  while ( tsdependee>ts && ssdependee<=ss+1 )
os << "ss=" << ss << std::endl;

If the above while performs the statement following it, how does it break out?

Does this imply << modifies one or more of the test variables in the while statement?

As a test for the compiler optimization issue, introduce some "noise" into your program, something that will thwart the compiler from joining code fragments.

while (ts>=0) {
for (ss=0, SSi=Sinit; ss<=spacestepmax; ss++, SSi*=fr) {
while ( tsdependee>ts && ssdependee<=ss+1 )
{
os << "ss=" << ss << std::endl;
if(ss==0xF000F000) // impossible index
exit(0);
}
os << ts << std::endl;
while ( tsdependent>ts && ssdependent<=ss )
{
os << " ss=" << ss << std::endl;
if(ss==0xFEEEFEEE) // impossible index
exit(0);
}
os << ts << std::endl;
// do work
}
ts -= nbrthreads;
}

You might find you only need the {}'s around the statement following the while.

Jim Dempsey


>> You may have a problem with using a non-multi-threaded version of the string template library.

I doubt there is an issue there... I put the output stream there just for "debugging". Please let me show why:

In non optimized mode (no -O), all works fine.

In optimized mode, this (os is std::ofstream, just to see what's going on) works

while (ts>=0) {
for (ss=0, SSi=Sinit; ss<=spacestepmax; ss++, SSi*=fr) {
while ( tsdependee>ts && ssdependee<=ss+1 );
while ( tsdependent>ts && ssdependent<=ss );
os << std::endl;
// work
}
ts -= nbrthreads;
}

but neither this (which is how I want it to be, no os output)

while (ts>=0) {
for (ss=0, SSi=Sinit; ss<=spacestepmax; ss++, SSi*=fr) {
while ( tsdependee>ts && ssdependee<=ss+1 );
while ( tsdependent>ts && ssdependent<=ss );

// work
}
ts -= nbrthreads;
}

nor this (no endl in the os output, also just to see what's going on)

while (ts>=0) {
for (ss=0, SSi=Sinit; ss<=spacestepmax; ss++, SSi*=fr) {
while ( tsdependee>ts && ssdependee<=ss+1 );
while ( tsdependent>ts && ssdependent<=ss );



os << 5;
// work
}
ts -= nbrthreads;
}

does.

>>In your code:

  while ( tsdependee>ts && ssdependee<=ss+1 )
os << "ss=" << ss << std::endl;

>> If the above while performs the statement following it, how does it break out? Does this imply << modifies one or more of the test variables in the while statement?

The while loop is really the synchronization attempt. ts and ss are refs to the current thread's entries in the shared arrays. The volatile tsdependee and ssdependee will be changed by the dependee thread (hopefully in a short time), so the current thread should see the new values and break out of the while loop.

I would have done

while( tsdependee>ts && ssdependee<=ss+1 )
// some openmp statement for making the thread sleep some usecs

if i could.

PS: I tried

while (ts>=0) {
for (ss=0, SSi=Sinit; ss<=spacestepmax; ss++, SSi*=fr) {
while ( tsdependee>ts && ssdependee<=ss+1 ){}
while ( tsdependent>ts && ssdependent<=ss ){}

// work
}
ts -= nbrthreads;
}

but that's the same.

Thanks Jim,

finjulhich,

I would venture to guess that the compiler is registerizing one or more of tsdependee, ssdependee, tsdependent, ssdependent, and that the use of os << ... told the compiler that it could not rely on the preserved values of the registers, and therefore it generated code to look at the memory values instead. Try inserting a call to a dummy function that simply returns, but which is not within the scope of global optimizations. That is, you do not want the optimizer to examine the dummy code to determine which registers are preserved across the call.

 while ( tsdependee>ts && ssdependee<=ss+1 ){DummyFunction();}
while ( tsdependent>ts && ssdependent<=ss ){DummyFunction();}

You could confirm this hypothesis by placing a break on the 1st while (before you add the DummyFunction) and then opening a Disassembly window. Then examine whether the while clause is referencing registers or memory for any of the shared variables. If the optimized code is rearranged too much to make sense of, then try the following

for (ss=0, SSi=Sinit; ss<=spacestepmax; ss++, SSi*=fr) {
DummyFunction();
while ( tsdependee>ts && ssdependee<=ss+1 ){}
while ( tsdependent>ts && ssdependent<=ss ){}
DummyFunction();

You should be able to see the calls to DummyFunction. The code that is sandwiched between the calls should be examined. Only ts and ss should be candidates for registerization; the other four must not be registerized. Note, it would be permitted for the while statement to move the contents of memory for one of the shared variables into a register and then immediately compare its value. The keyword volatile should prevent this.
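A minimal version of such a dummy, assuming it lives in its own translation unit and is compiled without interprocedural optimization so the optimizer cannot see which registers the call preserves:

// noise.cpp -- compile separately, without -ipo, so the caller must assume
// any registers holding shared values may have been clobbered by the call.
void DummyFunction() {}

// in the file that uses it: extern void DummyFunction();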

If you have problems deciphering this then send the snippet of code to the forum.

Jim Dempsey


Thanks Jim,

I volatilized the opt, timestep and spacestep arrays. It still fails, unfortunately.

The compilation line is

icpc -xT -O3 -ipo -no-prec-div -openmp -parallel -fp-model fast=2 -fpic -fvisibility=hidden -I../api -I../common -DFINH_DYNAMICLIB="__attribute__((visibility("default")))" -c -o Tree.o Tree.cpp

Attached is 1 file, Tree.h, that contains the implementation of the method. It is included from Tree.cpp.

rds,

Attachment: tree.h (3.24 KB)

I don't see the file attached....

Insert the call to a dummy function (or a system function such as Sleep(0);) before and after the two while statements. Set a break on the 1st Sleep, then run to the break point. Then click

Debug | Windows | Disassembly

Select the text in the disassembly window from the first call to Sleep through the 2nd call to Sleep, then paste it into a forum message for review. Also verify that the problem occurs with Sleep(0); in the code.

If problem persists I will assist in examining the assembly code.

Jim Dempsey


Finjulhich,

I haven't fully analyzed your code, however I do have an observation.

Your master thread (thr==0) is not available to perform work in the major while loop. Is this by design?

Jim Dempsey


Jim,

I don't see why you conclude that.

#pragma omp parallel shared(opt,timestep,spacestep)
{

if (thr==0) {
for (ss=0; ss<=itmIdx; ss++, SSi*=fr)
opt[ss*nbrThreads+0] = cT::payoff(SSi,X);
ts -= nbrThreads;
Sinit *= ft;
}

while (ts>=0) {
. ....
}

}

The block inside the if(thr==0) gets executed only in the 0 thread, but the big while loop gets run in all threads.

PS: I will add Sleep() and locate the assembly. Thank you so much for your assistance,

I'm writing this on vista64 C2Duo... I don't know the equivalent of a sleep(); there's no unistd.h, so I put a dummy function from another translation unit.
I compiled Debug mode with full optimization, like in Release mode, and I debugged in Debug mode.

while (ts>=0) {

00000001800CCD26 mov rax,qword ptr [ts]

00000001800CCD2D mov eax,dword ptr [rax]

00000001800CCD2F test eax,eax

00000001800CCD31 jl 00000001800CD11F

const int spacestepmax = itmIdx

00000001800CCD37 mov rax,qword ptr [rbp+1A8h]

00000001800CCD3E mov eax,dword ptr [rax]

00000001800CCD40 mov edx,dword ptr [rbp+238h]

00000001800CCD46 cmp edx,eax

00000001800CCD48 jl 00000001800CC9CC

00000001800CCD4E mov rax,qword ptr [rbp+1A8h]

00000001800CCD55 mov eax,dword ptr [rax]

00000001800CCD57 mov dword ptr [rbp+234h],eax

00000001800CCD5D mov eax,dword ptr [rbp+234h]

00000001800CCD63 mov dword ptr [rbp+21Ch],eax

for (ss=0, SSi=Sinit; ss<=spacestepmax; ss++, SSi*=fr) {

00000001800CCD69 mov rax,qword ptr [rbp+190h]

00000001800CCD70 mov dword ptr [rax],0

00000001800CCD76 fld qword ptr [rbp+180h]

00000001800CCD7C fstp qword ptr [rbp+188h]

00000001800CCD82 mov rax,qword ptr [rbp+190h]

00000001800CCD89 mov eax,dword ptr [rax]

00000001800CCD8B mov edx,dword ptr [rbp+21Ch]

00000001800CCD91 cmp eax,edx

00000001800CCD93 jle 00000001800CCDDA

00000001800CCD95 jmp 00000001800CD03E

00000001800CCD9A mov rax,qword ptr [rbp+190h]

00000001800CCDA1 mov eax,dword ptr [rax]

00000001800CCDA3 add eax,1

00000001800CCDA6 mov rdx,qword ptr [rbp+190h]

00000001800CCDAD mov dword ptr [rdx],eax

00000001800CCDAF fld qword ptr [rbp+188h]

00000001800CCDB5 fld tbyte ptr [rbp+148h]

00000001800CCDBB fmulp st(1),st

00000001800CCDBD fstp qword ptr [rbp+188h]

00000001800CCDC3 mov rax,qword ptr [rbp+190h]

00000001800CCDCA mov eax,dword ptr [rax]

00000001800CCDCC mov edx,dword ptr [rbp+21Ch]

00000001800CCDD2 cmp eax,edx

00000001800CCDD4 jg 00000001800CD03E

globalf();

00000001800CCDDA call globalf (1800010C3h)

while ( tsdependee>ts && ssdependee<=ss+1 ){}

00000001800CCDDF mov rax,qword ptr [rbp+1B8h]

00000001800CCDE6 mov eax,dword ptr [rax]

00000001800CCDE8 mov rdx,qword ptr [rbp+1A8h]

00000001800CCDEF mov edx,dword ptr [rdx]

00000001800CCDF1 cmp eax,edx

00000001800CCDF3 jle 00000001800CCE3D

00000001800CCDF5 mov rax,qword ptr [rbp+1A0h]

00000001800CCDFC mov eax,dword ptr [rax]

00000001800CCDFE mov rdx,qword ptr [rbp+190h]

00000001800CCE05 mov edx,dword ptr [rdx]

00000001800CCE07 add edx,1

00000001800CCE0A cmp eax,edx

00000001800CCE0C jg 00000001800CCE3D

00000001800CCE0E mov rax,qword ptr [rbp+1B8h]

00000001800CCE15 mov eax,dword ptr [rax]

00000001800CCE17 mov rdx,qword ptr [rbp+1A8h]

00000001800CCE1E mov edx,dword ptr [rdx]

00000001800CCE20 cmp eax,edx

00000001800CCE22 jle 00000001800CCE3D

00000001800CCE24 mov rax,qword ptr [rbp+1A0h]

00000001800CCE2B mov eax,dword ptr [rax]

00000001800CCE2D mov rdx,qword ptr [rbp+190h]

00000001800CCE34 mov edx,dword ptr [rdx]

00000001800CCE36 add edx,1

00000001800CCE39 cmp eax,edx

00000001800CCE3B jle 00000001800CCE0E

while ( tsdependent>ts && ssdependent<=ss ) {}

00000001800CCE3D mov rax,qword ptr [rbp+1B0h]

00000001800CCE44 mov eax,dword ptr [rax]

00000001800CCE46 mov rdx,qword ptr [rbp+1A8h]

00000001800CCE4D mov edx,dword ptr [rdx]

00000001800CCE4F cmp eax,edx

00000001800CCE51 jle 00000001800CCE95

00000001800CCE53 mov rax,qword ptr [rbp+198h]

00000001800CCE5A mov eax,dword ptr [rax]

00000001800CCE5C mov rdx,qword ptr [rbp+190h]

00000001800CCE63 mov edx,dword ptr [rdx]

00000001800CCE65 cmp eax,edx

00000001800CCE67 jg 00000001800CCE95

00000001800CCE69 mov rax,qword ptr [rbp+1B0h]

00000001800CCE70 mov eax,dword ptr [rax]

00000001800CCE72 mov rdx,qword ptr [rbp+1A8h]

00000001800CCE79 mov edx,dword ptr [rdx]

00000001800CCE7B cmp eax,edx

00000001800CCE7D jle 00000001800CCE95

00000001800CCE7F mov rax,qword ptr [rbp+198h]

00000001800CCE86 mov eax,dword ptr [rax]

00000001800CCE88 mov rdx,qword ptr [rbp+190h]

00000001800CCE8F mov edx,dword ptr [rdx]

00000001800CCE91 cmp eax,edx

00000001800CCE93 jle 00000001800CCE69

globalf();

00000001800CCE95 call globalf (1800010C3h)

With the volatilization of opt[] also, it's much harder to reproduce the error on the Mac (Xeon 5100, 2 dual-cores), and I have failed totally to reproduce it on a single Core2Duo 4300 so far.

As, even with the volatile, I reproduced it at least once in optimized mode, I doubt the problem is gone.

Thanks,

The generated code sequence for those two while statements is strange: for each while statement it performs the test twice. The code may run OK; maybe it is a loop-unrolling thing.

I think there is a race condition going on with your code. Will look closer.

Jim


Thanks, here is a schematic of the algorithm

0 1 2 N

X X X ...... X 0
X X X 1
X X 2
..........
XXXX itmIdx
0 0 0
0

ts aka timestep goes backwards from N to 0, and ss aka spacestep goes from 0 to either itmIdx or N. Thread 0 takes the N column, thread 1 the N-1 column, and so on; then thread 0 takes the N-nbrThreads column, and so on.

Each node depends on the node at the right column and same row, the node at the right column and row below, and 2 constants p0, p1.

rds,

Here is a potential problem with the algorithm:

a) the for(ss=0 loop exits with ss > spacestepmax
b) you exit the loop and decrement ts
c) note, ss is still greater than 0 (it's > spacestepmax)

Momentarily it will look like you've completed your space-step loop for the new ts (at least until you begin the for(ss=0 loop).

Perhaps you could consider something like:

const int spacestepmin=0;
while (ts>=0) {
const int spacestepspan = itmIdx
const int spacestepmax = spacestepmin + spacestepspan;
for (ss=spacestepmin, SSi=Sinit; ss<=spacestepmax; ss++, SSi*=fr) {
...
}
Sinit *= ft;
spacestepmin += spacestepspan;
ts -= nbrThreads;
}

Run a test to check whether this fixes the problem.

Then consider omitting ts in the two while statements, as ss is a composite of ts and the former ss.

Jim Dempsey


Additional comments:

Programming a compute-intensive wait loop as done in your code is counterproductive. Intuitively, your current code {} may look like the lowest-latency way to respond to other thread(s) completing a step. The problem is that your synchronization variables are, or tend to be, contained within the same cache line, and there are performance issues relating to cache line sharing.

Additionally, your code assumes it is the only thing running on your system. Unless this code is running on a system without an operating system (e.g. an embedded system), other things are running on the system. Therefore, you should code in a manner that takes into consideration that other code may intervene with yours; a busy-wait delays the completion of that other code and thus delays the completion of a step in your own code.

Consider the following changes:

const int SpinMax = 250;// pick a reasonable number for spincount
int SpinCount;

#ifdef _DIAGNOSTIC
static int MaxSpinCountDependee = 0;
struct BlockMin128
{
union
{
int i;
char c[128];
};
};

...
volatile BlockMin128 *const timestep = new volatile BlockMin128[nbrThreads];
volatile BlockMin128 *const spacestep = new volatile BlockMin128[nbrThreads];
...
timestep[t].i = N-t;
spacestep[t].i = 0;
...
volatile int& ss = spacestep[thr].i;
volatile int& ssdependent = spacestep[dependent].i;
volatile int& ssdependee = spacestep[dependee].i;
volatile int& ts = timestep[thr].i;
volatile int& tsdependent = timestep[dependent].i;
volatile int& tsdependee = timestep[dependee].i;

// inner wait loops
static int MaxSpinCountDependent = 0;
#endif

SpinCount = 0;
while ( tsdependee>ts && ssdependee<=ss+1 )
{
if(++SpinCount>SpinMax)
Sleep(0);
else
_mm_pause();
}

#ifdef _DIAGNOSTIC
MaxSpinCountDependee =
max(MaxSpinCountDependee,SpinCount);
#endif

SpinCount = 0;
while ( tsdependent>ts && ssdependent<=ss )
{
if(++SpinCount>SpinMax)
Sleep(0);
else
_mm_pause();
}

#ifdef _DIAGNOSTIC
MaxSpinCountDependent =
max(MaxSpinCountDependent,SpinCount);
#endif

Jim Dempsey


A paste got in the wrong place.

const int SpinMax = 250;// pick a reasonable number for spincount
int SpinCount;

#ifdef _DIAGNOSTIC
static int MaxSpinCountDependee = 0;
static int MaxSpinCountDependent = 0;
#endif

struct BlockMin128
{
union
{
int i;
char c[128];
};
};

...
volatile BlockMin128 *const timestep = new volatile BlockMin128[nbrThreads];
volatile BlockMin128 *const spacestep = new volatile BlockMin128[nbrThreads];
...
timestep[t].i = N-t;
spacestep[t].i = 0;
...
volatile int& ss = spacestep[thr].i;
volatile int& ssdependent = spacestep[dependent].i;
volatile int& ssdependee = spacestep[dependee].i;
volatile int& ts = timestep[thr].i;
volatile int& tsdependent = timestep[dependent].i;
volatile int& tsdependee = timestep[dependee].i;

// inner wait loops

SpinCount = 0;
while ( tsdependee>ts && ssdependee<=ss+1 )
{
if(++SpinCount>SpinMax)
Sleep(0);
else
_mm_pause();
}

#ifdef _DIAGNOSTIC
MaxSpinCountDependee =
max(MaxSpinCountDependee,SpinCount);
#endif

SpinCount = 0;
while ( tsdependent>ts && ssdependent<=ss )
{
if(++SpinCount>SpinMax)
Sleep(0);
else
_mm_pause();
}

#ifdef _DIAGNOSTIC
MaxSpinCountDependent =
max(MaxSpinCountDependent,SpinCount);
#endif


Reply to your 11-13-2007, 6:05 AM post,

Yes, as spacestep and timestep are separate, there will always be a possibility that other threads see a situation where one has been updated but not the other.

The above didn't work because I think that situation can still happen.

I am now encoding both spacestep and timestep in 1 int variable (location) like this:

. . . itmIdx+1 0
. . . . 1
. . . . 2
.
. . . 2itmIdx itmIdx

starting from 0 to an end number at node (0,0). A complication is that the algo has 2 parts, 1 rectangular and the other triangular; it's a trapezoid, basically.

I just need to find out how to decode the number into timestep and spacestep.

Once I get that running correctly, I will study your following posts.

PS: on Windows, I can't use sleep()... any idea which other primitive? No omp runtime lib function can help?

A million thanks,

Actually, a much easier encoding of my location is location = (1+itmIdx)*ts + ss.
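With that encoding (and 0 <= ss <= itmIdx), decoding would be one divide and one modulo, e.g.:

// location = (1+itmIdx)*ts + ss, with ss strictly less than 1+itmIdx, so:
const int ts = location / (1 + itmIdx);
const int ss = location % (1 + itmIdx);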

Finjulhic,

The C++ RTL function is Sleep(0), not sleep(0).

In the sample code I sent, your wait process iterates a tunable number of times (250) issuing _mm_pause(); if the stall takes longer than that, it performs Sleep(0);.

_mm_pause() (an intrinsic on MSVC which may be available on Intel C/C++) issues the equivalent of

_asm pause

The pause instruction prevents a performance penalty, not only for the waiting thread but also for the other threads which are working to catch up to the ts&ss phase.

The Sleep(0) means "no sleep time, but run the scheduler", i.e. you release the remainder of your time slice.

Now you may ask "Why would I want to release my time slice in a compute bound application?". The answer to this is:

If the division of work is such that all steps are approximately equal in duration, then the wait time will be less than SpinCountMax (250 in the sample) iterations, and therefore under normal circumstances you will not issue the Sleep(0). However, if the operating system runs something else (it always is) and you exceed SpinCountMax iterations, then it is likely that you have arrived at a situation where, if you continue to run your wait loop, you will delay the completion of the other scheduled work, and thus delay the resumption of your own code. Therefore, adding Sleep(0) may increase the performance of your application.

This said, you may also have to add something to avoid the Sleep(0) once any of your threads encounters a Sleep(0). A simple way would be to have what amounts to a one-shot reset value for SpinCountMax: if you enter the Sleep(0) section, you set the one-shot reset value to 2500. Before the wait loop, you set SpinCountMax from the one-shot reset, then reset the one-shot value back to your default SpinCountMax (250). The correct values can be determined using the diagnostic code in the sample I posted.
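A minimal sketch of that one-shot, reusing the Sleep(0)/_mm_pause() wait from the sample above (SpinDefault and the variable names are illustrative):

const int SpinDefault = 250;
int SpinCountMaxOneShot = SpinDefault; // per-thread

// --- before each wait loop ---
int SpinCountMax = SpinCountMaxOneShot; // arm this wait from the one-shot
SpinCountMaxOneShot = SpinDefault;      // and re-arm the default for the next wait
int SpinCount = 0;
while ( tsdependee>ts && ssdependee<=ss+1 )
{
if(++SpinCount>SpinCountMax)
{
Sleep(0);                       // release the rest of the time slice
SpinCountMaxOneShot = 2500;     // the next wait spins longer before yielding
}
else
_mm_pause();
}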

Yes, this is some additional work, but it is relatively easy to integrate into your application and has payback. For example, assume that while you are running your application in production mode, you want to run a quick test sample using the same (or a similarly written) application. Written as-was, a large amount of time would have been wasted running those null while loops.

>>I just need to find out how to decode the number to timestep and spacestep.<<

Change your ts coded value to increment (both ts and ss advance in sequence) and encode the two into two bitfields of appropriate widths. Then both ts and ss can be set at the same time. More importantly, ts(n):ss(0) is < ts(n-1):ss(last). You can continue to use your negative indexing by changing your member function that obtains ts (return N - bitfield).

Use a size that is either 32 bits or 64 bits, such that the value can be written in one write. If 64-bit, this will require a 64-bit aligned location for the composite variable.
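A sketch of such a composite (the helper names and the 32/32 bit split are illustrative):

#include <stdint.h>

// Store N-ts so the time component increases as the algorithm walks backwards;
// high 32 bits: time, low 32 bits: ss. One aligned 64-bit write publishes both.
inline int64_t encodeLocation(int N, int ts, int ss) {
    return ((int64_t)(N - ts) << 32) | (uint32_t)ss;
}
inline int decodeTs(int N, int64_t loc) { return N - (int)(loc >> 32); }
inline int decodeSs(int64_t loc)        { return (int)(uint32_t)loc; }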

Jim Dempsey


>>Actually, a much easier encoding of my location is location = (1+itmIdx)*ts + ss.<<

Since ts (in your original code) is decrementing, consider using an incrementing value for the ts component.

location = (spacestepmax+1)*(N-ts) + ss;

Jim Dempsey


The algorithm now works in various cases, tested with up to 4 omp threads, optimized. As expected, it fails miserably in terms of performance. Here are the times on 2 C2D (4MB L2 cache, 2.66GHz) Xeon 5100s:

N = 10 000 steps, itm=6000
sequential => time=0.0717189 result=68.524
2 omp threads => time=0.67872 result=68.524
3 omp threads => time=2.6478 result=68.524
4 omp threads => time=2.27354 result=68.524

Source file attached. (PS: I am using an int64_t for the location number, because with 500 000 steps you get higher than 32 bits. I'm not sure it is portable to the other compilers/platforms: intel/win32, intel/win64, intel/linux64, VS2005/32, VS2005/64; the TR1 extensions include it, but it worked for me.)

For the record, here are the times with my other, more complex algo (which splits the work vertically over the threads, not alternating horizontally):
2 thr => 0.0734222s result=68.524
3 thr => 0.0595891s result=68.524
4 thr => 0.052913s result=68.524

And with 64000 steps, the timings are better: (seq 2.93872s, 2thr 2.2961s, 3thr 1.89911s, 4thr 1.62283s)

I will now study the source and comments you posted before,

Thanks again,

Attachment: Tree.h (5.98 KB)

Time to use the profiler.

Jim


Hi

>>> C++ rtl Sleep()
I can't find it anywhere. Do you know which include file? I use Intel C++ on vista64 and on Mac OS X. For Mac OS X, I just used sleep() from <unistd.h>.

As I have merged ts and ss into 1 long (64-bit on my platforms), so that the value is written in 1 write, I didn't use BlockMin128 from your post (I think that was just to write ts and ss together atomically). I did however use SpinMax and the MaxSpinCount counters inside the omp // region, as thread-local vars.

I used _mm_malloc() for the locations[nbrThreads] array:
volatile long *const locations = (long*)_mm_malloc(sizeof(volatile long)*nbrThreads, 8);
And I aligned it on 8 (64-bit).
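One caveat worth noting: memory from _mm_malloc() must be released with _mm_free(), not free() or delete[] (a sketch; the header is xmmintrin.h on the Intel and Microsoft compilers):

#include <xmmintrin.h> // _mm_malloc / _mm_free

volatile long *const locations =
    (long*)_mm_malloc(sizeof(long)*nbrThreads, 8); // 8-byte (64-bit) aligned
// ... parallel work ...
_mm_free((void*)locations); // matching deallocator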

Attached is the source file. The results are interesting (10000 N):
seq: time=0.0785069s result=110.986

2 threads: time=1.6507 result=110.986
thread 0: MaxSpinCountDependent=0 MaxSpinCountDependee=251
thread 1: MaxSpinCountDependent=0 MaxSpinCountDependee=251

3 threads: time=2.88039 result=110.986
thread 0: MaxSpinCountDependent=0 MaxSpinCountDependee=256
thread 1: MaxSpinCountDependent=0 MaxSpinCountDependee=255
thread 2: MaxSpinCountDependent=0 MaxSpinCountDependee=255

4 threads: time=2.21 result=110.986
thread 0: MaxSpinCountDependent=592 MaxSpinCountDependee=254
thread 1: MaxSpinCountDependent=604 MaxSpinCountDependee=256
thread 2: MaxSpinCountDependent=592 MaxSpinCountDependee=256
thread 3: MaxSpinCountDependent=601 MaxSpinCountDependee=254

It's as slow as before. I'll increase the Max from 250 to slightly above, say 260, and see what happens.

Attachment: Tree.h (5.98 KB)

Why does your code care if the dependent has caught up with the current thread? The dependent thread should be able to fall multiple steps behind. Your only concern should be not progressing beyond the thread you depend on (the dependee).


Say there are 3 threads.

0 1 2 ...... N-5 N-4 N-3 N-2 N-1 N

thread 0 will do N, N-3, ...; thread 1 N-1, N-4, ...; and thread 2 N-2, N-5, ...

We have only 1 array for thread 0, another one for thread 1, and a 3rd for thread 2. The 3 arrays are encoded as a single array: (row0, thr0) (row0, thr1) (row0, thr2) (row1, thr0)...

At step N-3, thr0 doesn't want to override the entry at row0 (which so far contains row0 at step N) until thr1 (the dependent for thr0) has finished using that value, because thr0 is the dependee for thr1.

Finjulhic,

That is what you intend the code to do. From the timing, it looks like that is not what is happening; it looks as if, on average, one of the threads is running while three are waiting.

A suggestion would be to get the program running (performing the test repeatedly), then while running set a break point at the first useful statement following the two while wait loops. Record where each thread is located and the pertinent values of their states. Then remove the break point, continue, and set the break point again. Repeat to build up a statistical sample. I think you will find that you tend to have threads in the while loops. More importantly, the table of state values may shed light on why your states are not advancing in the manner you thought. Note, when the 1st thread hits the break point, the other threads will not stop immediately; there is some slew in the time it takes for the debugger to pause the other threads.

Sorry, I do not fully understand your algorithm (as few lines as it is). It appears that with 4 threads (0:3), 0:2 are blocked until 3 advances to the next group, at which point thread 2 can proceed while 0:1 wait, and 3 can proceed but only for one row's worth of work (and then waits till 0 completes), but 0 is waiting for 1. My gut feeling is that at most you will observe 2 threads running at a time, but in work-unit time intervals it will look to each thread like run, wait, wait, wait, run, wait, wait, wait, run, ...

A 10x or larger performance hit would indicate something like this is going on.

The statistical samples of the states will shed some light on what is happening.

Jim Dempsey.


Thanks Jim, attached is a complete example with 3 threads. The top figure shows the 3 arrays as they evolve backwards in time from 32.
double opt[17][3]; // perhaps a better encoding is opt[3][17], 17 is the height.
The last vertical elem is 0.

We start at timestep 32, and thread 0 does the if(thr==0) bit. To calc (31,0), thr1 is waiting for (32,0) and (32,1) to be ready. thr2 is also waiting, for (31,0) and (31,1), to calc (30,0).
thr0 continues independently with (32,1)...(32,15), (32,16)=0, but thr1 (if running at the same time, big IF) follows, and thr2 follows too. thr0, before it can start up again at space=0 and override opt[] spacestep 0, has to make sure thr1 has actually finished (31,0)...

Is this what you had understood? Does the OS and other work interruption totally invalidate this? I gather it's not the right implementation then? Or does this algo just have no chance of working, so that I should stick to my other algo, which does scale?

rds,

Attachment: tree.GIF (51.44 KB)

Just how much processing time is involved in your inner loop? (cT::exercise() and cT::treeStepBack(...))

Have you run a profiler on the code?

Jim


cT::exercise() is a static inline returning an enum, so it should be compiled away at template instantiation; that if should not appear at all. (I removed the if and recompiled with just the 1 case; the times were the same.)

cT::treeStepBack(double opt0, double p0, double opt1, double p1, double implicit) is also static inline:

{
const double v = opt0*p0 + opt1*p1;
return v>implicit ? v : implicit;
}

Does the implementation match the idea? Perhaps it's just impossible to get scaling with this algorithm? (The other one, which works for me, is vertical.)

I will try to run VTune (vista64; it's only 1 C2D, so 2 threads max) and try to view the states you mention in your previous post.
The other box is a Mac (2 C2D Xeons); I'll try Shark (it can sample L2 cache misses).

Thanks,

Finjulhich,

Then it would appear that the body of the code being executed (per thread) x the number of threads is less than the time it takes to start the threads for the omp parallel section: a negative net gain.

You might try inverting your wait and run loops (making adjustments accordingly). Between while(ts and for(ss, wait until you can make a full run of ss (0:spacestepmax-1), then don't wait in the for(ss loop. The negative consequence of this is that it will add a small amount of latency to get the 2nd thread going (and 3rd, 4th), but the positive consequence is that there will be less interference during the run of that inner loop. You will have to flesh out the details.
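A rough sketch of that inversion, with per-column wait conditions that are one reading of the column assignment described earlier (they are part of the details to flesh out):

while (ts >= 0) {
// hoisted waits: once per column instead of once per node
while (tsdependee > ts) _mm_pause();   // dependee has finished column ts+1 entirely
while (tsdependent >= ts) _mm_pause(); // dependent has consumed our previous column

for (ss = 0, SSi = Sinit; ss <= spacestepmax; ss++, SSi *= fr) {
// full, uninterrupted run over space -- no per-node synchronization
// ... work ...
}
ts -= nbrThreads;
}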

Jim Dempsey

