OpenMP threading analysis


I am threading a large piece of code that repeatedly calls a routine (in the present case, hundreds of times). Within that routine I have used an !$omp parallel do construct to thread a doubly nested do loop; the threading directives surround the outer loop. I use a "schedule(dynamic)" clause to balance the work load (or so the documentation says), and specify the use of 4 cores. I could use some help interpreting VTune performance data. The time required by "_kmp_barrier" is by far the largest block of time. For example, VTune reports:

  1. _kmp_barrier       17.53
  2. _kmp_x86_pause      4.52
  3. my routine          4.05

VTune also reports a histogram showing wall time and the number of cores running simultaneously. As I might expect, the largest time is for 1 core running (the master thread). But the wall times for 2, 3, 4 cores running simultaneously are not about the same. '2' is 5x larger than '3', and '4' is virtually no time. I would think the wall time for 2 and 3 cores running simultaneously would be small, and the wall time for 4 cores running simultaneously would be 2nd largest, after '1'.

Does this mean the work load is wildly imbalanced?

Why is so much time taken up by the omp barrier?




Can you show the loop?

It appears that the work is unbalanced amongst the threads. When a thread finishes its chunk of the parallel loop, it attempts (depending on the schedule) to get another chunk of work; with dynamic scheduling the chunk size is fixed by the schedule clause (1 by default), whereas with guided scheduling it shrinks as the loop progresses. When no additional work remains, the thread keeps spinning at the barrier until the remainder of the threads reach the barrier or until KMP_BLOCKTIME expires. Then what was the main thread of the parallel region continues running (and may loop around to reenter the parallel region), and those threads that have not timed out keep spinning for the remainder of the block time or until the next parallel region is entered. Should a thread's block time expire between parallel regions, it is suspended, and consequently more work is required to restart it for the next parallel region.

The barrier time that the non-main threads consume while the main thread is NOT inside a parallel region is inconsequential (except for power consumption). Excessive barrier time while some threads are still inside the parallel region is an indication of an unbalanced work load.
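One way to check for this kind of imbalance without a profiler is to accumulate the time each thread spends doing useful work inside the loop. The following is only a minimal sketch: the program name, the trip count n, and the square-root loop standing in for the real inner loop are all hypothetical, chosen so that iterations have uneven cost.

```fortran
program thread_balance
   use omp_lib
   implicit none
   integer, parameter :: n = 400            ! hypothetical outer trip count
   integer :: i, j, tid, nthreads
   double precision :: t0, s, total
   double precision, allocatable :: busy(:)

   nthreads = omp_get_max_threads()
   allocate(busy(0:nthreads-1))
   busy  = 0.0d0
   total = 0.0d0

   !$omp parallel do default(none) shared(busy) private(i, j, tid, t0, s) &
   !$omp& schedule(dynamic, 1) reduction(+:total)
   do i = 1, n
      tid = omp_get_thread_num()
      t0  = omp_get_wtime()
      s   = 0.0d0
      ! Stand-in for the real inner loop: the cost grows with i,
      ! so the outer iterations are deliberately uneven.
      do j = 1, 1000 * i
         s = s + sqrt(dble(j))
      end do
      total = total + s
      ! Each thread updates only its own slot, so no synchronization is needed.
      busy(tid) = busy(tid) + (omp_get_wtime() - t0)
   end do
   !$omp end parallel do

   do tid = 0, nthreads - 1
      print '(a, i0, a, f8.4, a)', 'thread ', tid, ' busy ', busy(tid), ' s'
   end do
   print '(a, es12.5)', 'checksum ', total
end program thread_balance
```

If the per-thread busy times are close to each other but still much smaller than the wall time of the region, the time is probably going to scheduling overhead or waiting rather than to an uneven split of the work.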

Knowing more about how your loop(s) behave may be useful for us to ascertain what is happening with your program.

Jim Dempsey


Here is the code of the threaded loop:

	!$omp parallel do if(NumXLimit2 > 50), &
	!$omp& default(private), &
	!$omp& shared(acosarray, EmitterVerts, EmitterVertsMin, EmitterVertsMax, PointArray, BlockerResolution, &
	!$omp&        blockerverts, NumBlockers, BlockerPointer, BlockerVertsmin, BlockerVertsmax, blockernormals, &
	!$omp&        GlazingMapping, GlassTransmittance, ConfigFactors, RaySpacingDegreesInv), &
	!$omp& schedule(dynamic, 1)
	xloopBlocker1: do i = NumXLimit1, NumXLimit2
		! (a bit of code)
		yloopBlocker1: do j = NumYLimit1, NumYLimit2
			! (working code here)
		end do yloopBlocker1
	end do xloopBlocker1
	!$omp end parallel do

I have tried various values of chunk (instead of 1) in the schedule clause, but it doesn't affect things very much. I had thought that dynamic scheduling would balance the work load. The number of iterations in the outer loop is in the hundreds. It is my understanding that the threads are created the first time the routine is called, and that they then persist and are reused each time the routine is called again. I also assume that (re)encountering the OpenMP routine omp_set_num_threads() is harmless. Additionally, removing the if-clause has little effect.



I'm using the environment variable KMP_AFFINITY=scatter to make sure that the threads are running on the 4 physical cores of the machine I have here.


Can you make sure that the variables (NumXLimit1, NumXLimit2, NumYLimit1, NumYLimit2) are declared as shared? Since they are not in the shared list and you have default(private), these variables might contain garbage data inside the parallel region.
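As a rough illustration of that suggestion, here is a minimal sketch in which the loop limits appear in the shared list alongside default(private). The names nx1, nx2, ny1, ny2 and counts are hypothetical stand-ins for the poster's NumXLimit/NumYLimit variables and work arrays, not the actual code.

```fortran
program shared_limits
   implicit none
   integer :: i, j, nx1, nx2, ny1, ny2
   integer :: counts(100)

   nx1 = 1;  nx2 = 100          ! hypothetical outer limits
   ny1 = 1;  ny2 = 50           ! hypothetical inner limits
   counts = 0

   ! The limits are read-only inside the region, so sharing them is safe.
   ! With default(private) and no shared entry, each thread would see its
   ! own uninitialized copy of ny1 and ny2 in the inner loop.
   !$omp parallel do default(private) &
   !$omp& shared(counts, nx1, nx2, ny1, ny2) &
   !$omp& schedule(dynamic, 1)
   do i = nx1, nx2
      do j = ny1, ny2
         counts(i) = counts(i) + 1     ! row i is touched by one thread only
      end do
   end do
   !$omp end parallel do

   ! True when every row saw the full inner trip count.
   print *, all(counts == ny2 - ny1 + 1)
end program shared_limits
```

Listing the limits as firstprivate instead would also give every thread initialized values, at the cost of one copy per thread.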


schedule(dynamic) relies heavily on the OS scheduler to place threads dynamically. Among other things, I don't expect KMP_AFFINITY=scatter to help out much, unless it is done in a more specific way to compensate for HyperThreading by spreading the work out 1 thread per core.

In cases of work imbalance (at least when no chunk could get more than twice the average work) I often use schedule(runtime) and try out various options like OMP_SCHEDULE=dynamic,2  or auto or guided.  Not knowing your target platform, I'd have no clue which of those might be best.  guided may show gains with a good choice of affinity (not scatter).
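The schedule(runtime) approach can be sketched as follows. This is an illustrative stand-alone program, not the poster's routine: the trip count n and the square-root workload are hypothetical, while omp_get_schedule and omp_get_wtime are standard OpenMP library routines.

```fortran
program runtime_schedule
   use omp_lib
   implicit none
   integer, parameter :: n = 400              ! hypothetical outer trip count
   integer(kind=omp_sched_kind) :: sched
   integer :: chunk, i, j
   double precision :: t0, s, total

   ! Report the schedule the runtime picked up from OMP_SCHEDULE.
   ! The kind is printed as a number; its meaning is defined in omp_lib.
   call omp_get_schedule(sched, chunk)
   print '(a, i0, a, i0)', 'schedule kind = ', sched, ', chunk = ', chunk

   total = 0.0d0
   t0 = omp_get_wtime()
   !$omp parallel do default(none) private(i, j, s) &
   !$omp& schedule(runtime) reduction(+:total)
   do i = 1, n
      s = 0.0d0
      ! Stand-in workload whose cost grows with i (deliberately uneven).
      do j = 1, 1000 * i
         s = s + sqrt(dble(j))
      end do
      total = total + s
   end do
   !$omp end parallel do

   print '(a, f8.4, a)', 'loop time ', omp_get_wtime() - t0, ' s'
   print '(a, es12.5)', 'checksum ', total
end program runtime_schedule
```

Built with OpenMP enabled (for example `gfortran -fopenmp` or `ifort -qopenmp`), the same binary can then be run with OMP_SCHEDULE set to "dynamic,2", "guided", or "auto" to compare loop times without recompiling.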

Even if your work is balanced, a bad choice of schedule or insufficient chunk size could increase barrier time.

Good catch Roman.

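I'd suggest temporarily using default(none) so that the compiler reports every variable that has no explicit data-sharing attribute. That gives you the list of variables you were expecting to be private; take care of those that should not be before going back to default(private).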
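A minimal sketch of what the default(none) pass might look like, with every variable referenced in the region listed explicitly. All names here (nx1, nx2, ny1, ny2, tmp, result) are hypothetical and only stand in for the variables of the real routine.

```fortran
program default_none_demo
   implicit none
   integer :: i, j, nx1, nx2, ny1, ny2
   double precision :: tmp
   double precision :: result(200)

   nx1 = 1;  nx2 = 200          ! hypothetical loop limits
   ny1 = 1;  ny2 = 1000
   result = 0.0d0

   ! With default(none), any variable referenced in the region but missing
   ! from these lists is rejected at compile time, which is the point of
   ! the exercise: the compiler enumerates what has to be classified.
   !$omp parallel do default(none) &
   !$omp& shared(result, nx1, nx2, ny1, ny2) &
   !$omp& private(i, j, tmp) &
   !$omp& schedule(dynamic, 1)
   do i = nx1, nx2
      tmp = 0.0d0
      do j = ny1, ny2
         tmp = tmp + dble(i) / dble(j)
      end do
      result(i) = tmp              ! one writer per element, so shared is safe
   end do
   !$omp end parallel do

   print '(a, es12.5)', 'sum = ', sum(result)
end program default_none_demo
```

Deleting, say, ny2 from the shared list should make the build fail with a message naming that variable, which is the kind of list the suggestion above is after.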

Jim Dempsey