2015 installation & Threading

Steve (& colleagues)

1. As noted elsewhere in this forum, the installation of 2015 erases whatever was in the "Configuration Properties / General / Target Name" box and puts in $(ProjectName), for all configurations. Damned annoying... having to go back and type in all those names again.

2. In an effort to thread our larger code projects using OpenMP and the "parallel do" construct, I continue to encounter what appear to be thread-scheduling problems. I establish 4 threads. Generally, I would expect the histogram that shows simultaneously used logical CPUs to show 1 CPU having the greatest amount of execution time (it's running all the serial code as well), and 4 CPUs to have the next highest amount of execution time. That is, either 1 CPU is running, or (nearly) 4 are. I do not see that. In most cases the histogram shows execution time for simultaneous CPUs, from greatest to least, as 1, 2, 3, 4, with 4 always vanishingly small.

I don't understand why 2 or 3 CPUs run simultaneously much more often than 4. The schedule is set to dynamic, so the work should be apportioned among the threads in a way that "automatically" balances the load. I am using the environment variable for affinity to place each thread on a physical processor (I have 4 physical, 4 virtual). Nothing I do with scheduling affects this behavior, so I assume the difficulty is with what the OS is doing to schedule or allow threads to work. It is clear that I do not know enough about the interaction of Fortran/OpenMP and the OS (Win7 in my case). Can you recommend a resource that I can study to understand what is happening? Aside from the scheduling clause in the OpenMP statements, there are apparently other things that affect how threads behave.
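For reference, a minimal sketch of the kind of construct I'm describing (the loop size, array, and kernel below are placeholders, not the actual code); thread count and affinity are assumed to be set in the environment:

PROGRAM dyn_sched_sketch
   ! Sketch only: n, a and the SQRT kernel are stand-ins for the real work.
   ! Thread count and affinity are assumed to be set in the environment, e.g.
   !   set OMP_NUM_THREADS=4
   !   set KMP_AFFINITY=granularity=fine,scatter   (one thread per physical core)
   USE omp_lib
   IMPLICIT NONE
   INTEGER, PARAMETER :: n = 1000000
   REAL(8) :: a(n)
   INTEGER :: i
!$OMP PARALLEL DO SCHEDULE(DYNAMIC) PRIVATE(i) SHARED(a)
   DO i = 1, n
      a(i) = SQRT(DBLE(i))            ! stand-in for the real loop body
   END DO
!$OMP END PARALLEL DO
   PRINT *, 'threads =', omp_get_max_threads(), '  a(n) =', a(n)
END PROGRAM dyn_sched_sketch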

I have raised this issue on this forum previously and several folks have offered good suggestions. But none of those suggested changes affected the thread / work load behavior I am seeing.

David

P.S. This is always accompanied by large "kmp_barrier" times -- far beyond the fractions I see when this behavior is not present.

D

This type of behavior is expected whenever the work performed in the parallel region is relatively small compared to the overhead of thread pool management. It also happens when the work cannot be partitioned evenly (usually due to mismatched assumptions between you and the compiler).

To confirm your item 2), write a loop around a parallel region in a test routine, where each iteration of the outer loop increases the amount of work inside the parallel region. The test routine should not be your program code, but rather one with which you can produce known workloads. As the loop runs, gather your statistics of per-core usage.

If you have trouble gathering those statistics in loop form, you may find it easier to have the program receive a command-line argument that establishes the amount of work in the parallel region. Then write a batch file that runs the program with various loads and collects the CPU times after each run.
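A minimal sketch of such a test harness, with an arbitrary SIN kernel standing in for the known workload (all names here are invented for illustration):

PROGRAM omp_work_sweep
   ! Sketch: the amount of work is taken from the command line so that a batch
   ! file can sweep over loads; the SIN kernel is an arbitrary known workload.
   USE omp_lib
   IMPLICIT NONE
   INTEGER :: nwork, i, istat
   CHARACTER(LEN=32) :: arg
   REAL(8) :: t0, t1, s

   CALL get_command_argument(1, arg, STATUS=istat)
   IF (istat /= 0) arg = '100000'
   READ (arg, *) nwork

   s  = 0.0D0
   t0 = omp_get_wtime()
!$OMP PARALLEL DO SCHEDULE(DYNAMIC) REDUCTION(+:s)
   DO i = 1, nwork
      s = s + SIN(DBLE(i))            ! known, tunable amount of work
   END DO
!$OMP END PARALLEL DO
   t1 = omp_get_wtime()

   PRINT '(A,I10,A,F10.6,A,F20.6)', 'nwork=', nwork, '  wall=', t1 - t0, '  s=', s
END PROGRAM omp_work_sweep

A batch file can then run the executable with increasing work sizes and the per-core statistics collected after each run.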

The "kmp_barrier" times may include the time spent in the KMP_BLOCKTIME, which occurs after(between) parallel regions.

! Per-thread timing sketch (assumes USE omp_lib, and that iThread, Time0 and a
! zero-initialized ThreadRunTime(0:) array are declared in the enclosing scope)
!$OMP PARALLEL PRIVATE(iThread, Time0)
iThread = omp_get_thread_num()
Time0 = omp_get_wtime()
!$OMP DO ...            ! your OMP DO statement here (sans PARALLEL)
DO ...                  ! your loop
   ...
END DO
!$OMP END DO
ThreadRunTime(iThread) = ThreadRunTime(iThread) + omp_get_wtime() - Time0
!$OMP END PARALLEL
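For completeness, here is a self-contained sketch (not from the original post) of how the pattern above might be wired up, with ThreadRunTime as a zero-initialized, 0-based shared array and a trivial stand-in workload:

PROGRAM thread_time_sketch
   ! Sketch built around the timing pattern above; the loop body is a trivial
   ! stand-in, and ThreadRunTime is the per-thread accumulator it refers to.
   USE omp_lib
   IMPLICIT NONE
   REAL(8), ALLOCATABLE :: ThreadRunTime(:)
   REAL(8) :: Time0, x
   INTEGER :: iThread, i

   ALLOCATE (ThreadRunTime(0:omp_get_max_threads()-1))
   ThreadRunTime = 0.0D0
   x = 0.0D0

!$OMP PARALLEL PRIVATE(iThread, Time0)
   iThread = omp_get_thread_num()
   Time0 = omp_get_wtime()
!$OMP DO SCHEDULE(DYNAMIC) REDUCTION(+:x)
   DO i = 1, 1000000
      x = x + SIN(DBLE(i))            ! stand-in work
   END DO
!$OMP END DO
   ThreadRunTime(iThread) = ThreadRunTime(iThread) + omp_get_wtime() - Time0
!$OMP END PARALLEL

   DO iThread = 0, omp_get_max_threads()-1
      PRINT *, 'thread', iThread, '  accumulated time =', ThreadRunTime(iThread)
   END DO
   PRINT *, 'x =', x
END PROGRAM thread_time_sketch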

Without seeing your program (and running it), it is difficult for us to offer you satisfactory suggestions.

Jim Dempsey

www.quickthreadprogramming.com

Jim,

Excellent. That is exactly what is happening: the small time for the work involved is being swamped by the thread overhead. I guess it's a warning about reading too much into test runs that are not large enough. And it seems likely (though it would be great to be certain) that the barrier time is being tallied into the elapsed time between parallel regions. I see the large barrier times in those programs that repeatedly pass over large sections of code in which one or more parallel sections exist. Perhaps the thread idling time is accounted as barrier time, since all but the master thread wait until the next parallel section is encountered.

David

The purpose of KMP_BLOCKTIME is to reduce the overhead of starting the next parallel region. The otherwise idle threads will spin in the meantime and accumulate barrier time, which should not degrade performance as far as wall time is concerned (as apparently you don't use HyperThreading). You could verify by setting KMP_BLOCKTIME=0.
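For example (assuming the Intel runtime; kmp_set_blocktime/kmp_get_blocktime are Intel extensions declared in Intel's omp_lib, and the equivalent environment setting is KMP_BLOCKTIME=0, in milliseconds):

PROGRAM blocktime_check
   ! Sketch: kmp_set_blocktime is an Intel extension; setting it to 0 lets idle
   ! threads sleep immediately after a parallel region instead of spinning.
   USE omp_lib
   IMPLICIT NONE
   CALL kmp_set_blocktime(0)
   PRINT *, 'blocktime now', kmp_get_blocktime(), 'ms'
END PROGRAM blocktime_check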

Please read Tim's suggestion fully. IOW, setting KMP_BLOCKTIME=0, while reducing barrier time, is not necessarily good for performance. KMP_BLOCKTIME=0 is useful only under a very limited number of scenarios. These include, but are not limited to:

a) Your program has a mix of threading toolkits, each with its own thread pool (e.g. Fortran OpenMP and C++ TBB). In this case, eliminating the block time on exit from one system's parallel region before entering a parallel region of the other generally improves performance (not always, but generally).

b) There is a fairly long gap between parallel regions (200 to 500 ms) and you are running on a notebook/tablet. Removing the block time could conserve battery capacity.

Jim Dempsey

www.quickthreadprogramming.com

Understood. In my applications there is no reason to do anything but let the threads idle between parallel regions. As it happens, if I set KMP_BLOCKTIME to zero, the performance is (in some cases considerably) worse. Presumably this is because the threads have to be awakened when encountering another parallel section.

David
