OpenMP refusing to thread

Steve, Colleagues,

Can anyone suggest why OpenMP would not thread the following (simple) double loop?

CALL OMP_SET_NUM_THREADS(4)
MemForEachStack = 16000000
call kmp_set_stacksize_s(MemForEachStack)
!$omp parallel do default(firstprivate), shared(ReceiverDistribution)
xloop:  &
do i = NumXLimit1,NumXLimit2
	do j=NumYLimit1,NumYLimit2
		ReceiverDistribution(j,i) = sqrt(float(i*j))
	end do
end do xloop
!$OMP end parallel do 

I am experimenting with OpenMP multithreading in our routines and have managed to multithread the smaller one. But I have been unable to multithread our largest routine (large for us: ~10,000 lines of code with ~5,000 lines of supporting subroutines). I am attempting to declare a parallel do section around a set of nested loops. The code within the inner loop (~500 lines) is rather elaborate, with calls to subroutines. I have "use omp_lib" at the start of the routine. I eventually worked out the correct list of variables for the shared clause and got the code to run, but OpenMP would not multithread the outer loop. I get no compiler errors. The code links and runs correctly, but with only one thread: writing omp_get_thread_num() to standard output within the loop always shows zero.

If I turn on the vector optimization report, it shows the thousands of instances where the code has been successfully vectorized. But I can get no information about (non)threading.

Finally, I commented out ALL the original double-loop code and substituted the simple double loop listed above. Still no multithreading. I get no messages from OpenMP, though I have the report level set to 2. (But I don't think that will help, since I think it only reports success, not the reason for failure.) The code runs without incident, though (of course) the results are different, since I'm putting nonsense into the array ReceiverDistribution rather than doing the more elaborate work.

There is still a lot of code before and after the simple double loop listed above. But I cannot understand how it could be suppressing threading, however complex it is.

I am using compiler release 14.0.3.202. Worryingly, I get exactly the same behavior with beta 15.0.070.

David

 


The OpenMP compiler reports are turned on by, e.g., /Qopenmp-report2. default(firstprivate) seems a strange choice and may be unsuitable here, where i and j must be private (and will be so by default).
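To illustrate the scoping point, here is a minimal sketch of the same test loop with explicit data-sharing clauses in place of default(firstprivate). The module name, subroutine name, and dummy-argument names are hypothetical and not taken from the original routine. default(none) is used only so that the compiler rejects any variable left unscoped; it is not claimed to cure the missing threading.

```fortran
module receiver_fill_sketch
   implicit none
contains
   ! Hypothetical wrapper (not from the original code): fills
   ! a(j,i) = sqrt(real(i*j)) over the given index limits.
   subroutine fill_distribution(a, x1, x2, y1, y2)
      integer, intent(in)    :: x1, x2, y1, y2
      real,    intent(inout) :: a(y1:y2, x1:x2)
      integer :: i, j

      ! default(none) requires every variable used in the region to be
      ! scoped explicitly, so a forgotten variable is a compile-time error.
      ! The array and the loop limits are shared; the loop indices are private.
      !$omp parallel do default(none) shared(a, x1, x2, y1, y2) private(i, j)
      do i = x1, x2
         do j = y1, y2
            a(j, i) = sqrt(real(i*j))
         end do
      end do
      !$omp end parallel do
   end subroutine fill_distribution
end module receiver_fill_sketch
```

A caller would use it as `call fill_distribution(ReceiverDistribution, NumXLimit1, NumXLimit2, NumYLimit1, NumYLimit2)`, assuming the array bounds match the limits passed in.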

Check to be sure you are not linking with the OpenMP stubs.

Jim Dempsey

www.quickthreadprogramming.com

Jim,

Quite sure I'm not. Here's what VS 2013 lists as what it's passing to the compiler:

/nologo /debug:full /O3 /Oy- /assume:buffered_io /free /Qopenmp /Qopenmp-report2 /Qvec-report0 /module:"Release 2.8\\" /object:"Release 2.8\\" /Fd"Release 2.8\vc100.pdb" /traceback /libs:static /threads /c

The other (very) indirect evidence I have that something is wrong is that when I run this routine in VTune to tally hotspots (from within VS 2013), it takes > 20 minutes to complete (!). The small test data set I'm using generates a run that completes in 15 seconds OUTSIDE of VTune. The elapsed time for VTune is 1400 seconds; the total CPU time is reported as 190 seconds, with "WaitForSingleObject" taking 167 sec and my routine and its subroutine taking a total of 13 seconds. Strange. Nothing indicates why the elapsed time is 1400 seconds but the CPU time is only 190 seconds. A difference like this usually seems to be caused by VTune's own monitoring/accounting activities, but I've never seen one this large.

David

 

If OpenMP compilation has done something nonsensical with your default(firstprivate), that may explain poor threaded performance, and might also be an opportunity for improved compile-time warnings.

4 threads is not normally so many as to require working around the normal VTune setup, e.g. by arbitrarily reducing your expected run time setting so as to increase sampling interval.

If you are running 4 threads but those threads aren't concurrent, the VTune timeline would show the threads active at different times.

David,

Things to check:

Check to see if the environment variable OMP_NUM_THREADS is set to 1.

Check to see if the code listed above is called from within a parallel region .AND. nested parallelism is disabled. You can call the function omp_in_parallel() preceding the !$omp parallel... listed above. If it returns .TRUE. then you are within a parallel region (and attempting to nest).

Note, I am not telling you to enable nested parallelism. That has a different set of issues above and beyond a first conversion project.
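The checks above could be gathered into a small diagnostic along these lines. This is only a sketch: the module and procedure names are hypothetical, and it assumes the file is compiled with OpenMP enabled (e.g. /Qopenmp) so that omp_lib is available. The runtime routines it calls (omp_in_parallel, omp_get_max_threads, omp_get_dynamic, omp_get_num_threads) are standard OpenMP library routines.

```fortran
module omp_state_sketch
   use omp_lib
   implicit none
contains
   ! Hypothetical helper: call it immediately before the
   ! "!$omp parallel do" that refuses to thread.
   subroutine report_omp_state(label)
      character(len=*), intent(in) :: label
      write (*, '(a)')    'OpenMP state at: '//label
      ! .TRUE. here means the loop would be a nested parallel region.
      write (*, '(a,l2)') '  already inside a parallel region:', omp_in_parallel()
      ! Upper bound on the team size the next parallel region may get.
      write (*, '(a,i0)') '  max threads for next region: ', omp_get_max_threads()
      write (*, '(a,l2)') '  dynamic thread adjustment:', omp_get_dynamic()
   end subroutine report_omp_state

   ! Hypothetical probe: opens a trivial parallel region and returns
   ! the team size the runtime actually granted at this call site.
   function probe_team_size() result(team)
      integer :: team
      integer :: n
      n = 1
      !$omp parallel shared(n)
      !$omp single
      n = omp_get_num_threads()
      !$omp end single
      !$omp end parallel
      team = n
   end function probe_team_size
end module omp_state_sketch
```

For example, `call report_omp_state('before xloop')` followed by `write (*,*) probe_team_size()` just ahead of the directive would show whether the runtime is prepared to hand out more than one thread at that point in the routine.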

Jim Dempsey

www.quickthreadprogramming.com

Jim,

Thanks for the suggestions, but all those check out. As it happens, if we put a simple single loop anywhere in this large routine, it is not threaded. If, however, we call a throw-away subroutine that contains the loop, it is threaded there. So it isn't a project-wide problem. Something about all that code in the large routine is preventing threading. We've checked the compiler flags and parameters, and they are the same for the big routine and the little throw-away. Unhappily, we've run out of resources and must back away from attempting to thread this code.

David
