Problem with OpenMP addition into a shared array

Solution found

I have solved this problem myself, after a little discussion with a colleague. The first implementation described in this post is not thread safe. The second implementation, described below in post #4, is thread safe. The fundamental principle is that you should not allow two threads to add into (or write) one array element at the same time. I was deluded into thinking that the first implementation would work because I was specifying that an "array" was shared. In fact, only the pointer to the array was shared, not the actual data in the array, which had been allocated using malloc. The problem was evidently caused by collisions between two threads trying to modify the same data element of the array.

This may be of interest to others who encounter similar issues.

Although a solution has been found, a secondary issue is that it might be nice if either OpenMP or the Intel compiler provided a mechanism to protect the elements of an array from collision, or at least gave a warning if they were not protected. Or do they?
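
For reference, OpenMP does provide one mechanism at the granularity of a single update: #pragma omp atomic, which makes one scalar read-modify-write (such as an array-element increment) indivisible. Below is a minimal sketch of what the inner update of the first implementation's accumulate() would look like with it. This keeps the outer loop parallel and is typically cheaper than a critical section, but it still pays a per-element synchronization cost, so it is no free lunch; and as the replies below note, race detection (warnings) is the job of tools such as the Intel thread checkers rather than the compiler itself.

    for (i = 0; i < rowmax; i++)
    {
        for (j = 0; j < colmax; j++)
        {
            stuff = f2(i, j, var);
            #pragma omp atomic
            out[i][j] += stuff;   // this single += is now protected against concurrent updates
        }
    }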

Original problem - see below for solution

I have some OpenMP code that performs additions into the elements of a shared 2-D array, but it produces random errors, as if the threads were interfering / conflicting with one another. The code that performs the additions is in an external function. A pointer to the shared array is passed into the external function. The calling function is parallel, but the external function that does the additions is not. At issue is how a single iteration of a parallel for loop executes the external function and whether multiple threads can interfere with the additions into the shared array. If I make the external code section that performs the addition omp critical, the random errors go away, but so does the speed gain -- parallelism is destroyed. If I run the code with the #pragma omp commented out, it also works, again destroying the parallel speed.

The logic is described below. Obviously it is wrong or not thread safe. Please help me fix this if you can.
Sorry, I don't know how to make my indents stick.

existing logic (not exact code)
--------------------------------------------------
excerpt from calling function, containing parallel for loop
--------------------------------------------------

float var1, var2;
int i, j, iloop;
float** outxy; // shared accumulator

//... function allocate_matrix() defines row pointers so outxy may be indexed as outxy[i][j]
outxy = allocate_matrix(rows, columns);
//... zero out accumulator
for (i = 0; i < rows; i++)
{
    for (j = 0; j < columns; j++)
    {
        outxy[i][j] = 0;
    }
}
//... calculate local variables that are independent of loop index
var1 = xxx;

#pragma omp parallel for shared(outxy, nloop, rows, columns) private(iloop, var1, var2)
for (iloop = 0; iloop < nloop; iloop++) // intent is to have one thread for each iloop index
{
    //... calculate local variables that depend on iloop index
    var2 = yyy;
    //... for this value of iloop, perform addition into outxy elements using external function accumulate()
    accumulate(outxy, iloop, rows, columns, var1, var2); // called in parallel: each call is by one thread?
}

//... process (outxy)
free_matrix(outxy);
.
.
.

------------------------------------------------
external function that does accumulation
------------------------------------------------

void accumulate(float** out, int loop, int rowmax, int colmax, float x, float y)
{
    int i, j;
    float var, stuff;
    //... calculate local variables for this function
    var = f1(x, y, loop);
    //... perform addition for one value of "loop" in calling function
    for (i = 0; i < rowmax; i++)
    {
        for (j = 0; j < colmax; j++)
        {
            //... add stuff into an element of the out matrix
            stuff = f2(i, j, var);
            out[i][j] += stuff;
        }
    }
}

------------------------

Thanks in advance for your help.


The Intel Parallel Studio (specifically the Parallel Inspector) (Windows only) seems a good tool for this. The Parallel Inspector will find any possible race condition etc. Also the parallel debugger extension might be useful here to debug the issue.

What do you mean by "external function"? Is it in another .dll?

Thanks,
Jennifer

Quoting - Jennifer Jiang (Intel)

The Intel Parallel Studio (specifically the Parallel Inspector) (Windows only) seems a good tool for this. The Parallel Inspector will find any possible race condition etc. Also the parallel debugger extension might be useful here to debug the issue.

What do you mean by "external function"? Is it in another .dll?

Thanks,
Jennifer

By "external function" I mean simply a call to another C++ function; it is probably bad terminology. The function is compiled along with the others. There are no DLLs. The point is that a function is called within the parallel for loop, but its code is not contained explicitly within the parallel for loop. The question is: is this function call thread safe, and if not, how do I make it so?

I don't have Parallel Studio. I have only the compiler version 11.

The problem with using a debugger is that I do not know what to look for at this point. If I did, I could simply put print statements in, but the errors are random incorrect pixels among mostly good pixels, and I am not sure I could catch them. I was hoping that someone could look at the code logic and determine whether or not it meets OpenMP requirements.

I have been working on this while waiting for a reply, and I have some additional information. I found another way that works but is not ideal to me. If I take the "#pragma omp parallel for" out of the outer loop in the calling program and put a "#pragma omp parallel for" outside the double loop in the accumulate() function, then the accumulate function processes each row of "out" in a separate thread, and each thread sums only over the columns of the "out" array.

I imagine that something about passing the argument outxy to the function accumulate() is causing the parallelism to break. I suspect that it has something to do with the fact that the loops are not actually acting on the same piece of memory, or that writing into a 2-D array like that (pointers to pointers) is not thread safe. I tried putting "#pragma omp flush" at various places, and that did not work either.

I don't really like this solution because:

(1) it is about 20% slower than before, because with the non-parallel accumulate() more work was being done uninterrupted by each thread;

(2) it seems logical to make accumulate() a separate function that does not have to be parallelized.

If anyone can explain to me why the function call in the original code example, in which the entire outxy array is incremented serially by one thread inside the accumulate() function, does not work, and propose a fix, I would be grateful.

In the meanwhile, I will forge ahead and report improved results if I obtain them.

Second implementation that is thread safe

OK, I think I am done with this, and I think I have an explanation.

The second implementation, putting the #pragma omp parallel for in the accumulate function, so the threads are looping over rows of outxy, guarantees that no element outxy[i][j] will be added into at the same time. This is because each row is done by a separate thread, and no two threads operate on the same row. So, it is thread safe.

The first implementation, where the threads controlled the outer loop, allowed two threads to add into the same element outxy[i][j] at the same time, so it was not thread safe. The failure was random because this happened only occasionally. As a matter of fact, sometimes the results were perfect, but usually there were random bad pixels.
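
To spell out the failure mode with a sketch (made-up values, one possible interleaving): the += on an array element is a separate load, add, and store, and two threads can interleave those steps so that one contribution is silently dropped.

    // possible interleaving of out[i][j] += stuff for two threads:
    //   thread A loads  out[i][j]          -> 5.0
    //   thread B loads  out[i][j]          -> 5.0
    //   thread A adds its stuff (2.0)      -> 7.0 (in a register)
    //   thread B adds its stuff (3.0)      -> 8.0 (in a register)
    //   thread A stores 7.0 to out[i][j]
    //   thread B stores 8.0 to out[i][j]   -> thread A's contribution is lost
    // the correct result would have been 5.0 + 2.0 + 3.0 = 10.0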

The funny thing is that this code has existed, in a thread-unsafe state, for several years, and this problem was not noticed. The reason is that we had been running it on older computers until now, and we were using an older version of the compiler. In that situation, everything worked as we thought it should. Now, with new computers and the new compiler, the old code failed.

If anyone has any input as to exactly why the first implementation used to work and now it doesn't, please add to this thread. In the meanwhile, I am sticking to the second implementation that works.

For the record, the outline code for the version that works is listed below.

--------------------------------------------------
revised logic (not exact code)
--------------------------------------------------
excerpt from calling function. no longer parallel
--------------------------------------------------

float var1, var2;
int i, j, iloop;
float** outxy; // shared accumulator

//... function allocate_matrix() defines row pointers so outxy may be indexed as outxy[i][j]
outxy = allocate_matrix(rows, columns);
//... zero out accumulator
for (i = 0; i < rows; i++)
{
    for (j = 0; j < columns; j++)
    {
        outxy[i][j] = 0;
    }
}
//... calculate local variables that are independent of loop index
var1 = xxx;

for (iloop = 0; iloop < nloop; iloop++)
{
    //... calculate local variables that depend on iloop index
    var2 = yyy;
    //... for this value of iloop, perform addition into outxy elements using function accumulate()
    accumulate(outxy, iloop, rows, columns, var1, var2);
}

//... process (outxy)
free_matrix(outxy);
.
.
.

--------------------------------------------------------------------------
function that does accumulation. Parallel in loop over rows of out matrix
---------------------------------------------------------------------------

void accumulate(float** out, int loop, int rowmax, int colmax, float x, float y)
{
    int i, j;
    float var, stuff;
    //... calculate local variables for this function
    var = f1(x, y, loop);
    //... perform addition for one value of "loop" in calling function
    #pragma omp parallel for shared(out, rowmax, colmax, var) private(i, j, stuff)
    for (i = 0; i < rowmax; i++)
    {
        for (j = 0; j < colmax; j++)
        {
            //... add stuff into an element of the out matrix
            stuff = f2(i, j, var);
            out[i][j] += stuff;
        }
    }
}

------------------------

Quoting - smithpd
[...]
I don't have Parallel Studio. I have only the compiler version 11.

The problem with using a debugger is that I do not know what to look for at this point. I could simply put print statements in if I did, but the errors are random incorrect pixels in mostly good pixels, and I am not sure I can catch it. I was hoping that someone could look at the code logic and determine that it does or does not meet OpenMP requirements.

Intel Compiler Pro v11.1 comes with the very same Parallel Debug Extensions to the Microsoft VS Debugger as in Intel Parallel Studio (Windows) that should help you to detect exactly such problems in OpenMP code. The Linux version contains a GUI-driven Debugger with same parallel debug capabilities as on Windows. For details please see http://software.intel.com/en-us/articles/parallel-debugger-extension/ (Windows) or http://software.intel.com/en-us/articles/idb-linux/ (Linux).
Although you seem to have found a solution by now, I'd encourage you to try these debugger features and would be very interested to hear back from you if you find them useful.

Thanks, Thomas

In the first case you had thread distribution at the outer loop index level, with multiple threads accumulating into the same out array and the potential for multiple threads to concurrently attempt out[i][j] += stuff; with the same i and j.

In the second case you moved the scope of the parallel for one nesting level deeper (to the for(i=0... loop). This stops two threads from read/modify/writing the same cell of out.

This will work fine as long as rowmax is relatively large (or happens to be an integral multiple of the number of threads). When rowmax is small consider adding an index to out such that you perform

out[loop][i][j] += stuff;

Then after the outer parallel for (at loop level) has completed, consolidate all out[loop][i][j]'s where loop>0 into loop==0.

Use outer parallel for on index i, and inner two loops with loop and j.
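
A minimal sketch of that consolidation step, reusing names from the earlier outline (nloop stands for the number of outer-loop iterations, and out is assumed here to have been allocated with the extra leading index):

    // after the parallel for over "loop" has completed, fold everything into out[0]
    #pragma omp parallel for private(loop, j)
    for (i = 0; i < rowmax; i++)
    {
        for (loop = 1; loop < nloop; loop++)
        {
            for (j = 0; j < colmax; j++)
            {
                out[0][i][j] += out[loop][i][j];   // no two threads share a row i, so no collisions
            }
        }
    }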

Jim Dempsey

www.quickthreadprogramming.com

Quoting - jimdempseyatthecove

In the first case you had thread distribution at the outer loop index level, with multiple threads accumulating into the same out array and the potential for multiple threads to concurrently attempt out[i][j] += stuff; with the same i and j.

In the second case you moved the scope of the parallel for one nesting level deeper (to the for(i=0... loop). This stops two threads from read/modify/writing the same cell of out.

This will work fine as long as rowmax is relatively large (or happens to be an integral multiple of the number of threads). When rowmax is small consider adding an index to out such that you perform

out[loop][i][j] += stuff;

Then after the outer parallel for (at loop level) has completed, consolidate all out[loop][i][j]'s where loop>0 into loop==0.

Use outer parallel for on index i, and inner two loops with loop and j.

Jim Dempsey

Thanks, Jim. Hey, is your cove the La Jolla cove? I used to live near there.

I have two comments about that extra index you suggested.

First, FYI, rowmax is large in our case, so the extra index will not work. Typically rowmax is 1000 - 4000 and we store 16 bits per pixel, so the image we are accumulating into is up to 32 MB at present. We are about to try an 8000 x 8000 problem, a factor of 4 greater. The outer loop is presently over, say, 1000-2000 partial images that we are adding together, so we would need to store up to 64 GB in memory now, and four times that for the 8000 problem!

Second, I want to be sure that the second method is always thread safe. I think that, the way I have done the second implementation, no more than one thread can ever access any one row of the accumulation matrix outxy. The omp parallel for loop is over rows. As I understand it, this means that a single thread is assigned to any given row, is it not? Then there is a default synchronizing barrier (wait) for the threads to complete at the end of the omp parallel for loop. If so, isn't the second method guaranteed to be thread safe?

Some other thoughts about this problem in general....

Contemplating this after the fact, I think the reason the first implementation failed is that the shared variable was a pointer to the data, not the data itself. All the examples I see on the web use scalar variables, which I think are protected against collision if they are shared. If you are making repetitive operations on the elements of an array, I am unaware of any way that OpenMP can protect the entire array using the "shared" keyword. For some reason unknown to me, the first implementation did actually work on our older computers with an older version of the compiler. It suddenly started failing with some newer computers and the newer compiler. I don't know which change caused it, or why. It could be that the compiler recognized the pointer as pointing to an array, or maybe the older computers treated cache differently. In any event, the non-OpenMP branch of the code works fine with the first implementation because we do not have to worry about collisions between multiple threads.

I wonder if OpenMP has, or plans to have, a method to protect arrays from collisions. I wonder if Intel could add a switch that recognizes an array, or maybe they did so and abandoned it in version 11.

FYI, I tried "#pragma omp critical" on the line in the inner loop of the accumulation. It gave correct results but took a long time to compute, 9 times longer than with no OpenMP whatsoever. Setting and releasing locks in the inner loop is obviously not a good idea.

I live in what used to be my father's house known in the area as "The Cove".

What you might consider doing then is method 3..

Create n arrays of out, one for each thread team member number. This will reduce the storage requirement from (1000:2000) * 32 MB to n * 32 MB. Then, after you perform the outermost parallel for loop, reduce the n arrays into array[0].

4th method (requires no additional memory)

Just inside the parallel for loop (that runs through the images), divide the out array into n slices (one slice for each thread), deriving the slice number from omp_get_thread_num(). Note, you can avoid the repeated calls by using firstprivate: external to the for loop, int threadNum = -1; and internal to the for loop, if (threadNum < 0) threadNum = omp_get_thread_num(); (or use TLS). Or bear the overhead of the function calls. With the out array divided up in this manner, no two threads will access the same cells.
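
One way to read method 4 is sketched below, under the assumption that every thread walks the full image loop but only ever adds into its own block of rows of outxy; the block arithmetic and the reuse of f1/f2 and yyy from the earlier outline are illustrative only, not Jim's exact code. The price is that the per-image setup work is repeated by every thread; the benefit is no extra memory and no two threads touching the same cell.

    #pragma omp parallel shared(outxy, nloop, rows, columns, var1) private(iloop, i, j)
    {
        int nthreads = omp_get_num_threads();
        int tnum     = omp_get_thread_num();
        // each thread owns a contiguous block of rows of outxy
        int rowBegin = (rows * tnum)       / nthreads;
        int rowEnd   = (rows * (tnum + 1)) / nthreads;

        for (iloop = 0; iloop < nloop; iloop++)        // every thread sees every image
        {
            float var2 = yyy;                          // per-image setup, repeated by every thread in this sketch
            float var  = f1(var1, var2, iloop);
            for (i = rowBegin; i < rowEnd; i++)        // but each thread adds only into its own rows
            {
                for (j = 0; j < columns; j++)
                {
                    outxy[i][j] += f2(i, j, var);
                }
            }
        }
    }   // implicit barrier: all rows of outxy are complete here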

If you find threads finishing up with a large skew, then set schedule(static, 1) or try schedule(dynamic). Try schedule(dynamic) first.

Don't use critical, it will slow you down. Program to avoid critical.

Jim Dempsey

www.quickthreadprogramming.com

Thanks again, Jim. Those are some interesting ideas, and you have enriched my understanding of OpenMP. I really think, though, that implementation #2, looping over rows of the out matrix, is best for me. That is because (a) it is the simplest code, which can be understood at a glance, (b) it requires no additional memory, and (c), I think now that the additional run time compared to the first (incorrect) method is nil, or at least not large enough to warrant making the code more complicated.

So, I have done #2, and it works fine.

Thanks again.

One other quick and dirty (clean) experiment

add

#pragma omp parallel (your shared/private here)
{ // scope of parallel region
    for(... outer loop here
    {
        // (omit parallel from line)
        #pragma omp for
        for(... // inner parallel loop as you have now
        {
            ...
        } // end of #pragma omp for
    } // end of for(... outer loop here
} // end scope of parallel region

What the above will do is reduce the number of times a thread pool is assembled/disassembled.

Jim Dempsey.

www.quickthreadprogramming.com

You may need to add #pragma omp barrier too

#pragma omp parallel (your shared/private here)
{ // scope of parallel region
    for(... outer loop here
    {
        // (omit parallel from line)
        #pragma omp for
        for(... // inner parallel loop as you have now
        {
            ...
        } // end of #pragma omp for
        // barrier should be implicit here
        // if you have problems add
        #pragma omp barrier
    } // end of for(... outer loop here
} // end scope of parallel region

www.quickthreadprogramming.com

#pragma omp parallel (your shared/private here)
{ // scope of parallel region
    for(... outer loop here
    {
        #pragma omp master
        {

        }
        #pragma omp barrier
        // (omit parallel from line)
        #pragma omp for
        for(... // inner parallel loop as you have now
        {
            ...
        } // end of #pragma omp for
        // barrier should be implicit here
        // if you have problems add
        #pragma omp barrier
    } // end of for(... outer loop here
} // end scope of parallel region

www.quickthreadprogramming.com

I was off on Friday. I'll try those ideas today.

The last method is not friendly to other apps running on the system (but you can vary the block time if this becomes an issue).

Jim Dempsey

www.quickthreadprogramming.com

Jim,

Of the three solutions you proposed above, I tried the first and the second. The barrier was not needed, but each of these solutions was slower than my implementation #2, in which the outer loop had no OpenMP and the multi-threading was only in the outer for of the inner loop. The time to run your recent suggestions was about 13% greater than the time for my implementation #2.

I am not sure exactly what is causing the slowdown, but I think it could be two things. First, I did not list the complete code in the problem statement of this thread because I was trying to simplify the description here and find something that would work, which I did. One factor missing from my description is that, with OpenMP in the outer loop, I had to malloc and free memory for another array inside the outer loop so that data which was read into that array (in the loop) could be private to a thread. That repeated malloc / free takes time. With no OpenMP in the outer loop, I can share that memory array and malloc / free it outside the loop. A second possibility is that OpenMP must transfer all of its thread information into the function accumulate(), which contains the inner loops.

In addition to eliminating the memory allocation in the outer loop and the passing of thread information, another advantage of my implementation #2 is that I can now return from within the outer loop if there is an error in the inner loop. I could not do that with OpenMP. Both this and the malloc / free required a duplicate section of code to implement OpenMP in the outer loop, which has been eliminated with my implementation #2.

So, for all these reasons, I think my implementation #2 is the preferred solution. It is faster, simpler, and more functional. My previous dislike of implementation #2 was clearly unfounded.

Thanks again for your help and interest in this problem.

The malloc and free were likely the cause of the slowdown. Do the malloc once (or once each time the temporary array needs to grow). Error bailout can be done with a flag.

#pragma omp parallel
{
    double* YourArray = new double[yourSize];
    bool error = false;
    #pragma omp for
    for (int n = 0; n < N; n++)
    {
        if (error) { n = N; continue; }
        // read in
        ...
        if (error) { n = N; continue; }
        ... // process buffer
    }
    if (YourArray) delete [] YourArray;
}

I would think your biggest choke point might be not being able to overlap the I/O with processing.
Look at the OpenMP tasking available in the 3.0 spec. This might make it a little easier to overlap the I/O with the processing (e.g. double-buffer the input and output).

This can be well worth the effort (assuming your wait time is absurdly long).

Jim Dempsey

www.quickthreadprogramming.com

Thanks again, Jim.

In your last example, does the outer for loop (#pragma omp for) branch to multiple threads, each operating on different index values n? If so, the different threads cannot read into the same array, because the array contents are a function of n. Each needs its own private storage for the buffer. That is why I put the malloc inside the loop. Am I missing something? If they do not branch to multiple threads, what is the point of informing OpenMP about this outer loop?

I suppose the next step is to allocate nthreads arrays and rotate between them. But really, as I indicated previously, I think there is little to gain by adding complexity. Another factor I should mention is that the time to load the data from disk into the buffer is small compared with the time to do the accumulation in the inner loop, which involves some arithmetic operations, so messing with the outer loop is not too productive.

#pragma omp parallel

Begins parallel region (all threads execute)

Each instance of "double* YourArray;" is inside parallel region and therefore is thread private.

#pragma omp for

*** within parallel region ***
Slices the iteration space of the next for amongst the threads:
#pragma omp for
for (int n = 0; n < N; n++)
{

Each thread has a separate copy of n
and
each thread has a different slice of the iteration space [0, N).

You do not need to deal with threadprivate since each copy of YourArray (the pointer) is on the private stack of its thread.

In the above, you may (or may not) need to enclose portion(s) of your read-in in a critical section, e.g. if all datasets for n are stored in a single large file (each read of the file is serial, but all other code runs in parallel).
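
A sketch of that last point, with hypothetical helper names (read_dataset_n, process_dataset and file are illustrative): serialize only the shared-file read and keep the processing parallel.

    #pragma omp for
    for (int n = 0; n < N; n++)
    {
        #pragma omp critical(read_input)
        {
            read_dataset_n(file, n, YourArray);   // one thread at a time touches the shared file
        }
        process_dataset(YourArray, n);            // the expensive part still runs in parallel
    }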

Jim Dempsey

www.quickthreadprogramming.com

Jim,

You suggested in post #16 that I could allocate the "buffer" between the "#pragma omp parallel" and the "#pragma omp for". I tried this and it did not work properly. It ran and gave correct results, but it allocated only one array and ran in only one thread, so OpenMP was defeated. Code and results are shown below.

The only conditions in which I can get it to run in multiple (8) threads are:
(1) I allocate the memory inside the "#pragma omp for" loop, and
(2) I put no statements whatsoever between the "#pragma omp parallel" and the "#pragma omp for".
Observation (2) makes me wonder what the use is of splitting the "#pragma omp parallel" and the "#pragma omp for".

I did this testing with a stripped down version of my actual code. Below are code listings of the relevant sections for both inside-loop and outside-loop memory allocation. Each code case is followed by some output from the code. If you have some time, please let me know why this is not working properly when allocation is done outside the for loop.

Recall that I have a solution that works, implementation #2, in which this outer loop is not parallelized and the OpenMP is moved further down in the code. So, this investigation is mainly of academic interest. It might, however, improve my understanding of OpenMP.

Code for allocation inside for loop (runs in 8 threads)

abort = 0;
c_rows_save = 0;
int tmax, tnum;
#pragma omp parallel default(none) private(c_rows, rawImageEquiv, sin_count, tnum) shared(tmax, inpt, lookup, out_xy, fc, angles, sino_index, cout, cout_prefix, abort, c_rows_save)
//....Loop over angles
#pragma omp for
for (c_rows = 0; c_rows < angles; c_rows++)
{
    tmax = omp_get_num_threads( );
    tnum = omp_get_thread_num( );
    rawImageEquiv = (f_type **)malloc( fc.total_sins * sizeof(f_type*) );
    if ( rawImageEquiv == NULL )
        cout << cout_prefix << "rawImageEquiv allocation FAILED inside loop for thread "
             << tnum << " of " << tmax << ", row " << c_rows << " of " << angles << endl;
    else
        cout << cout_prefix << "rawImageEquiv allocated successfully inside loop for thread "
             << tnum << " of " << tmax << ", row " << c_rows << " of " << angles << endl;

    //....Set rawImageEquiv pointers to point to the raw image rows
    //....applicable to the particular angle (c_rows)
    for ( sin_count = 0; sin_count < fc.total_sins; sin_count++ )
    {
        rawImageEquiv[sin_count] = sino_index[sin_count].image[c_rows];
    }
    //....Back project at an angle. Fan beam done with in-plane geometry
    if (abort == 1)
    {
        if (rawImageEquiv) free(rawImageEquiv);
        continue;
    }
    if ( !feld_project_angle(inpt, lookup, out_xy, c_rows, fc, rawImageEquiv) )
    {
        if (abort == 0)
        {
            c_rows_save = c_rows;
            abort = 1;
        }
    }
    if (rawImageEquiv) free(rawImageEquiv);
} // end loop over angles
if (abort == 1) //....Free all storage and return in error mode
{
    cout << cout_prefix << "ERROR> feldkamp projection failed for angle " << c_rows_save << endl;
    return 0; //.... Must return outside the loop for OMP
}
cout << cout_prefix << "Completed loop over angles, sinograms, OpenMP." << endl;

Output of interest from allocation inside loop

rawImageEquiv allocated successfully inside loop for thread 2 of 8, row 180 of 720
rawImageEquiv allocated successfully inside loop for thread 3 of 8, row 270 of 720
rawImageEquiv allocated successfully inside loop for thread 1 of 8, row 90 of 720
rawImageEquiv allocated successfully inside loop for thread 0 of 8, row 0 of 720
rawImageEquiv allocated successfully inside loop for thread 4 of 8, row 360 of 720
rawImageEquiv allocated successfully inside loop for thread 7 of 8, row 630 of 720
rawImageEquiv allocated successfully inside loop for thread 6 of 8, row 540 of 720
rawImageEquiv allocated successfully inside loop for thread 5 of 8, row 450 of 720
rawImageEquiv allocated successfully inside loop for thread 1 of 8, row 91 of 720
rawImageEquiv allocated successfully inside loop for thread 3 of 8, row 271 of 720
rawImageEquiv allocated successfully inside loop for thread 7 of 8, row 631 of 720
...
rawImageEquiv allocated successfully inside loop for thread 5 of 8, row 539 of 720
rawImageEquiv allocated successfully inside loop for thread 7 of 8, row 719 of 720
rawImageEquiv allocated successfully inside loop for thread 1 of 8, row 179 of 720
Completed loop over angles, sinograms, OpenMP.
Run time of slice reconstruction = 3.156 seconds.

Code for allocation outside loop (runs in only one thread)

abort = 0;
c_rows_save = 0;
int tmax, tnum;
#pragma omp parallel default(none) private(c_rows, rawImageEquiv, sin_count, tnum) shared(tmax, inpt, lookup, out_xy, fc, angles, sino_index, cout, cout_prefix, abort, c_rows_save)
tmax = omp_get_num_threads( );
tnum = omp_get_thread_num( );
rawImageEquiv = (f_type **)malloc( fc.total_sins * sizeof(f_type*) );
if ( rawImageEquiv == NULL )
    cout << cout_prefix << "rawImageEquiv allocation FAILED outside loop for thread "
         << tnum << " of " << tmax << endl;
else
    cout << cout_prefix << "rawImageEquiv allocated successfully outside loop for thread "
         << tnum << " of " << tmax << endl;
//....Loop over angles
#pragma omp for
for (c_rows = 0; c_rows < angles; c_rows++)
{
    //....Set rawImageEquiv pointers to point to the raw image rows
    //....applicable to the particular angle (c_rows)
    for ( sin_count = 0; sin_count < fc.total_sins; sin_count++ )
    {
        rawImageEquiv[sin_count] = sino_index[sin_count].image[c_rows];
    }
    //....Back project at an angle. Fan beam done with in-plane geometry
    if (abort == 1)
    {
        continue;
    }
    if ( !feld_project_angle(inpt, lookup, out_xy, c_rows, fc, rawImageEquiv) )
    {
        if (abort == 0)
        {
            c_rows_save = c_rows;
            abort = 1;
        }
    }
} // end loop over angles
if (rawImageEquiv) free(rawImageEquiv);
if (abort == 1) //....Free all storage and return in error mode
{
    cout << cout_prefix << "ERROR> feldkamp projection failed for angle " << c_rows_save << endl;
    return 0; //.... Must return outside the loop for OMP
}
cout << cout_prefix << "Completed loop over angles, sinograms, OpenMP." << endl;

Output of interest, allocation outside loop

rawImageEquiv allocated successfully outside loop for thread 0 of 8
outside allocation, thread 0 of 8, row 0 of 720
outside allocation, thread 0 of 8, row 1 of 720
outside allocation, thread 0 of 8, row 2 of 720
outside allocation, thread 0 of 8, row 3 of 720
outside allocation, thread 0 of 8, row 4 of 720
outside allocation, thread 0 of 8, row 5 of 720
outside allocation, thread 0 of 8, row 6 of 720
outside allocation, thread 0 of 8, row 7 of 720
outside allocation, thread 0 of 8, row 8 of 720
outside allocation, thread 0 of 8, row 9 of 720
outside allocation, thread 0 of 8, row 10 of 720
...
outside allocation, thread 0 of 8, row 717 of 720
outside allocation, thread 0 of 8, row 718 of 720
outside allocation, thread 0 of 8, row 719 of 720
Completed loop over angles, sinograms, OpenMP.
Run time of slice reconstruction = 23.891 seconds.

Add braces {} around outer parallel region

#pragma omp parallel...
{
    ...
    #pragma omp for
    for(...
    {
    }
    ...
} // end pragma omp parallel...

Without the braces the scope of the parallel region is the next statement (just the tmax= line)
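
To illustrate the point (a sketch, not the exact code above): without the braces, only the single statement after the pragma forms the parallel region, which is why everything after it ran serially and only one buffer was ever allocated.

    #pragma omp parallel
        tmax = omp_get_num_threads();   // the parallel region is just this one statement

    tnum = omp_get_thread_num();        // already outside the region: runs on one thread, returns 0
    rawImageEquiv = (f_type **)malloc( fc.total_sins * sizeof(f_type*) );   // allocated once, not once per thread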

Jim Dempsey

www.quickthreadprogramming.com

LOL. I feel stupid for omitting the braces.

OK, I put the braces in, and it worked with the outside-loop array allocation. Allocating outside the loop gives one array per thread, whereas allocating inside the loop gives one array per loop index. This saves about 0.5 seconds out of 3.5 seconds, so that aspect of it works.

But I still have a problem with the logic of the outer loop, and the run time is slower when I try to create threads at that point. Referring to your post #10, if my code is essentially as you outlined it, I have variant 1:

// variant 1
//-----------

#pragma omp parallel (your shared/private here)
{ // scope of parallel region
    // allocate array here. Creates one array per thread
    for(... outer loop here // entire loop runs for each thread
    {
        // (omit parallel from line)
        #pragma omp for
        for(... // inner parallel loop as you have now
        {
            ...
        } // end of #pragma omp for
    } // end of for(... outer loop here
    // free array, one per thread
} // end scope of parallel region

Then it is thread safe, and it runs in 3.4 seconds.

Note that the entire outer loop is run separately by each thread, but the accumulation section is partitioned between threads, so the results are OK in the end. I verified this with print statements inside the outer loop. I think this code is strange and hard to follow. It also runs a little more slowly than the other examples below.

---------------------

If I parallelize the outer loop but not the inner loop, I have variant 2:

// variant 2
//-----------

#pragma omp parallel (your shared/private here)
{ // scope of parallel region
    // allocate array here, one per thread
    #pragma omp for // (omit parallel from line)
    for(... outer loop here) // this loop is now parallel and distributed between threads
    {
        for(... // inner loop, non-parallel
        {
            ...
        } // end of inner non-parallel loop
    } // end of parallel outer loop
    // free array, one per thread
} // end scope of parallel region

This is basically my old code that failed because it is not thread safe -- more than one thread can try to accumulate numbers into the same element of the outxy array at the same time, so they collide randomly and something bad happens when they do collide. It runs in 2.9 seconds and gives random errors.

----------------------

If I parallelize both outer and inner loops by putting "#pragma omp for" in front of each loop, then only 1/(number of threads) projections are accumulated in the inner loop. It runs 8x faster but the results are completely wrong.

----------------------

If I parallelize only the inner loop, my implementation #2 as reported in post #4, then I have variant 3:

// variant 3
//-----------

// allocate array only once and share it within the outer loop
for(... non-parallel outer loop)
{
    // populate array and call projection function
    #pragma omp parallel for
    for(... // inner parallel loop as you have now
    {
        ...
    } // end of #pragma omp parallel for
} // end of for(... outer loop here
// free array, only one

This is thread safe, it also runs in 2.9 seconds, and it is considerably simpler code for a variety of reasons.

It seems to me that variant 3 is the way to go, and nothing is to be gained by messing with the outer loop. Please correct me if I am wrong.

Try the following in the code with the outer loop as parallel (working but more complex)

Use threadprivate to create a persistent temporary array, one per thread. Use two variables: one a pointer to the array and the second the allocation extent.

// in static data
int allocationSize = 0;
double* Array = 0;
#pragma omp threadprivate(allocationSize, Array)
------------------
// in your subroutine
#pragma omp parallel ...
{
    if (allocationSize < sizeYouNeedNow)
    {
        if (Array) delete [] Array;
        Array = new double [sizeYouNeedNow];
        if (!Array) Barf();
        allocationSize = sizeYouNeedNow;
    }
    ...

The temp array will now persist across calls (reducing the number of malloc/free calls).

Note, the array will persist until program termination. You can add a cleanup routine.

void Cleanup()
{
    if (Array)
    {
        delete [] Array;
        Array = NULL;
        allocationSize = 0;
    }
}

Jim Dempsey

www.quickthreadprogramming.com

Thanks again for your help, Jim.

Sorry, I do not understand why I would want to do the above. The array allocation is not the problem. Regardless, the array in question is small, of order N rather than N**2, and it has a fixed, known size. The outxy that we are slicing over in the inner loop is of order N**2, so there is a benefit in slicing it between threads with one row of order N processed in a thread. I should add that N could be 4000. The problem with your suggested code is that the outer loop is awkward in variant #1 (see my comment above) and does not provide any benefit in terms of speed compared with variant #3, which I am now using. From what I have seen in testing, I doubt that any degree of messing around with the outer loop will make it run faster. Again, if I am wrong, please tell me how doing something in the outer loop improves efficiency and keeps the inner loop accumulation into outxy thread safe.

Since the discussion seems to have strayed from it, and maybe I never described it well enough, here is a brief description of the algorithm:

Outer loop over M projection angles (M ~= N)
{
    Collect an equivalent image to be projected.
    The equivalent image is of size L rows x N columns, with L ~= N/10. The L rows of the L x N projection image are collected as an array of row pointers from a non-contiguous set of image rows stored elsewhere. This defines an equivalent contiguous projection image that can be indexed as equivalent_image[i][j].
    The pointer collection operation is simple indexing and very fast. There is no need to parallelize it. Then call
    project(equivalent_image, ...);
}

project(equivalent_image, ...)
{
    First inner loop over N rows of a reconstruction image outxy of size N x N
        Innermost loop over N columns of outxy
            calculate the quantity to be added into outxy[irow][icol] by interpolation in equivalent_image
            outxy[irow][icol] += quantity; // must be thread safe
}

With the above logic, if there is no parallelism in the outer loop over angles, then the equivalent images array is allocated only once, and its elements are collected for each projection angle. The collection must be done in any case.

Assume a thread pool of P threads.

Your current method requires

M * (N/P) number of (create OpenMP thread team/discharge thread team (implicit barrier)) and one instance of outxy

The technique I am trying to guide you towards is

1 number of (create OpenMP thread team/discharge thread team (implicit barrier)) and P instances of outxy

Now it may be the case that the time spent in (create OpenMP thread team / discharge thread team (implicit barrier)) is a very small percentage of the overall time to compute a projection angle and not worth going after.

However,

If the outer loop is set with a chunk size of 1 (i.e. each thread picks the next image projection angle, then computes that projection without the benefit of other threads) then there is no implicit synchronization barrier.

What this means is:

Consider the app humming along with 8 or 16 or 32, ... threads computing projection angles (one at a time using P threads), and consider another app, say with 1 thread, running on the system and taking computation resources. This one thread will compete with one of your application's threads (one at a time). This is going to introduce a completion-time skew for that thread's (or threads') slice of the inner-loop outxy array. This in turn requires the other threads, which finish first, to either burn time in KMP_BLOCKTIME or go to sleep and then wake up (neither of which is productive).

Same situation with the outer loop on M, chunk 1 (each thread does one projection angle): the interrupted thread still takes longer to complete its current projection angle, but the other threads do not wait (no burn time, no sleep time). The thread being interrupted (preempted) will likely end up processing fewer projection angles than the other threads. The only adverse situation is when M is relatively small (in which case you parallelize the inner loop).
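
For concreteness, a sketch of the arrangement being described: P private copies of outxy so the outer angle loop can be distributed with a chunk size of 1, followed by a fold into the shared array. The allocation helpers and the critical-based fold are illustrative choices (not Jim's exact code), and as discussed above this costs P copies of outxy in memory.

    #pragma omp parallel shared(outxy, nloop, rows, columns, var1) private(iloop, i, j, var2)
    {
        // each thread accumulates into its own private copy of the result (P copies in total)
        float** out_private = allocate_matrix(rows, columns);
        for (i = 0; i < rows; i++)
            for (j = 0; j < columns; j++)
                out_private[i][j] = 0;

        #pragma omp for schedule(dynamic, 1)    // each thread grabs the next projection angle as it frees up
        for (iloop = 0; iloop < nloop; iloop++)
        {
            var2 = yyy;                                                // per-angle setup as in the outline
            accumulate(out_private, iloop, rows, columns, var1, var2); // the serial accumulate() from the first outline
        }

        // fold the private copies into the shared result; the critical keeps the folds from overlapping
        #pragma omp critical(reduce_outxy)
        {
            for (i = 0; i < rows; i++)
                for (j = 0; j < columns; j++)
                    outxy[i][j] += out_private[i][j];
        }
        free_matrix(out_private);
    }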

Jim Dempsey

www.quickthreadprogramming.com

OK, I understand the idea, but P instances of outxy is not so good in my opinion, for reasons explained below.

outxy is a large array of size N x N, where N is typically 4000, and now I am working with 8080. So outxy is 8080 x 8080 x 4 bytes / pixel = 261,145,600 bytes, or say 250 MB. With other stuff going on plus code size, the code runs in 1 GB when solving a single problem. For this large problem P is 8, so that would give us 2 GB. That is doable with 64 bits, but I am not crazy about the concept. I have a better way to use memory, which is MPI....

FYI, this is a hybrid method. I also have MPI running as well. OpenMP and MPI are optional, selected in the build configuration. I can have several MPI jobs running on one board and OpenMP running at the same time. It is limited by memory: smaller memory per job --> more MPI jobs. With a proper selection of the number of MPI jobs, most memory is occupied and, with any luck, most CPU cycles are being utilized. What OpenMP does in this case is fill in the cracks if the CPU is being underutilized. That way the CPU is always at 100%. I did not mention this before because it was not crucial to finding and fixing my bug.

Anyway, I find that with 8 threads this program runs about 8x faster than the single-thread, non-OpenMP case. That is with the excessive thread creation / discharge you have noted. In view of this, it seems to me that the thread overhead is minimal.

Does this change your view of it?

I will be gone until Monday.

Yes that fills the cracks.

Make the outer loop distribute the NxN array work across MPI (even to the same system), and make the inner loop OpenMP. On a memory-tight system with several cores, the OpenMP will speed things up.

If you can work it such that the NxN arrays are "sticky" with respect to the MPI nodes then you might get by by passing MPI messages containing the array ID in place of the contents of the array. Your NAS filesystem might be able to do some of this for you.

Jim

www.quickthreadprogramming.com
