Slow OpenMP performance when two applications run concurrently

Hi,

I'm using two different applications (one of them mine, the other from a third party) at the same time on my computer. The problem is that this third-party app causes my program to run up to 10x slower than the single-threaded version (speedup = 0.1) when I use 5 or 6 threads. If the third-party app is not running, the speedup is 4 or 5 with 5-6 threads.

I've noticed that both of them use libiomp5md.dll, and my application is slowed down in the parts of my code that have nested parallel regions and criticals. So I'm thinking it could be an interprocess synchronization problem. Could it be? The CPU monitor shows normal activity (5 threads, 80% of my machine), but with Process Explorer I can see no page faults, no I/O operations and no memory changes.

PS: When running my application alone, I can use up to 60-70 threads (I only have 6 physical) without losing performance.

Thanks in advance!


It should help to set OMP_NUM_THREADS in each application so that you don't ask for more threads than your hardware supports.
You would need to launch the applications in separate shell windows, each with KMP_AFFINITY set to its own group of cores (paying attention to the cache vs. core configuration of your CPUs).
If you are running with hyperthreading enabled, Windows 7 (with SP) and later have a better scheduler, but by itself the OS scheduler can't pay attention to cache locality. In a critical region, all but one core may be freed up long enough for the other application to grab them, thus delaying exit from the critical.
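A possible layout, as a sketch (the application names and the 12-core machine are assumptions; the proclist syntax is the Intel runtime's KMP_AFFINITY form):

```shell
# Shell 1: first application pinned to cores 0-5 (app name is hypothetical)
export OMP_NUM_THREADS=6
export KMP_AFFINITY="granularity=fine,proclist=[0,1,2,3,4,5],explicit"
# ./my_app

# Shell 2: second application pinned to cores 6-11
export OMP_NUM_THREADS=6
export KMP_AFFINITY="granularity=fine,proclist=[6,7,8,9,10,11],explicit"
# ./third_party_app
```

Because environment variables are per-process, each shell's settings only affect the application launched from it.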

The number of threads is not more than the number of physical cores. The CPU is an AMD, so no KMP_AFFINITY is allowed. Is it possible that my resources are OK (CPU, RAM, buses, cache...) but there is some interprocess synchronization mess? They should not interfere... could they?

I haven't had access to an AMD CPU for a long time, but it used to be possible to use the basic numerical settings, like
KMP_AFFINITY="proclist=[0-4],explicit"

No, you should arrange so that the instances of OpenMP use different groups of cores, probably associated with different cache locality. The applications still won't run as fast together as they would separately.

>>I can use up to 60-70 threads ( I only have 6 physical) without losing performance
Do not run more threads than you have physical threads (unless you finesse it with I/O tasks, which is not necessarily easy to do).

>>parts that my code has nested parallel regions
When you fully subscribe threads (e.g. 6 on your 6-thread processor) refrain from using nested parallel regions.
Note, each nest level instantiates a thread team (more software threads) for the additional team members of each thread created at the new nest level.
Two nest levels of 6 threads each require 6*6 = 36 threads (plus the OpenMP watchdog thread).
If your code is written to require nested levels, then you will have to take care in managing the number of threads you permit at each nest level, e.g. 2 at the outermost level, 3-4 at the second level, 1 at deeper levels, or some other combination you determine through testing.
If this becomes unmanageable then consider using the OpenMP Task capability (more programming on your part, but better results).
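As a back-of-the-envelope sketch of that budgeting arithmetic (plain C++, no OpenMP calls; the helper name and the greedy split are my own, not any OpenMP API): pick per-level team sizes whose product stays at or below the hardware thread count.

```cpp
#include <vector>

// Choose a team size for each nest level so that the product of the
// team sizes does not exceed the hardware thread budget. Illustrative only.
std::vector<int> splitThreadBudget(int hwThreads, int nestLevels)
{
    std::vector<int> perLevel(nestLevels, 1);
    int product = 1;
    bool grew = true;
    while (grew) {
        grew = false;
        for (int level = 0; level < nestLevels; ++level) {
            // Total software threads if this level's team grows by one.
            int candidate = product / perLevel[level] * (perLevel[level] + 1);
            if (candidate <= hwThreads) {
                product = candidate;
                ++perLevel[level];
                grew = true;
            }
        }
    }
    return perLevel; // e.g. {3, 2} for 6 hardware threads and 2 levels
}
```

Each value would then be fed to omp_set_num_threads (or a num_threads clause) at the corresponding nest level.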

>> this 3rd app causes my program to be up to x10 times slower than one thread
When running multiple OpenMP applications, consider either reducing the thread count per application (TimP's suggestion), and/or setting KMP_BLOCKTIME to 0 and then letting each application duke it out for CPU time.

Jim Dempsey

www.quickthreadprogramming.com

I said 60-70 threads to show you that it is not a resource problem. In fact, this happens even with nthreadsApp1 + nthreadsApp2 < nThreadsPhysical; run times increase 4x, 5x, 6x... Also, moving the two applications to some other computers makes things work fine, but I'm having this problem with two Xeon 4x biprocessors and the same binaries... The number of threads is always < the OMP_NUM_THREADS variable; I control this inside my application.

Another weird thing: in Microsoft's Process Explorer, my slow, concurrent application shows no context switches for 30-40 seconds, no memory allocation changes (not even one byte) and no I/O operations. Could VTune help here?

Does the performance monitor show all hardware threads active?

If not .AND. (nthreadsApp1 + nthreadsApp2 <= nThreadsPhysical), check to see if you have the KMP_AFFINITY environment variable set to 'disabled'; if not, set it to disabled.

VTune can be configured to show all applications. Something will likely show up. One potential situation is:

Both applications are using a DLL that instantiates its own OpenMP thread pool. MKL is one such example. If both applications use MKL and MKL is configured to use OpenMP, then you may have nthreadsApp1 + 6 MKL threads in app1 + nthreadsApp2 + 6 MKL threads in app2. You can configure MKL to be single threaded.

Multi-thread the app - single-thread MKL
or
Single-thread the app - multi-thread MKL

Newer versions of MKL (I think) resolve this issue; older versions created a separate thread team.
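For reference, the environment-variable route to single-threading MKL (variable names are from the Intel MKL documentation; the domain-string syntax has varied between MKL versions, so treat the second line as a sketch):

```shell
export MKL_NUM_THREADS=1    # single-thread all of MKL in this process
# or restrict per domain (older MKL releases used "MKL_ALL=1" instead):
export MKL_DOMAIN_NUM_THREADS="MKL_DOMAIN_ALL=1"
```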

Jim Dempsey

www.quickthreadprogramming.com

>>...Another weird thing is in Microsofts process explorer, my slow and concurrent application has not any context switches
>>in 30, 40 seconds, no memory allocation changes (not even one byte) and not IO operations.

Please check 'Update Speed' setting in the Windows Task Manager:

[ Menu ] -> View -> 'Update Speed'

Also, do you use any log-files? If Yes, could it be a problem here?

mkl_domain_set_num_threads is already set to 1 (MKL_ALL), and there are no log files. Also, I've realized that:

Application with 8 threads alone (8 threads): 54 seconds
Same execution with 9, 10, 11 threads (> #physical threads): ~60-70 s
Same execution (8 threads) with another concurrent OMP app using only one thread (8+1 threads): ~170 s
Same execution (8 threads) with another concurrent OMP app using only one thread (8+1 threads), but without criticals in the code: ~59 s

So, as I was starting to suspect, OpenMP is messing something up in the synchronization of criticals. Is this normal?

Thanks in advance!

In the 8-thread app, when the first of these 8 threads enters a critical section, if that thread (the owner of the critical section) is time-sliced with the other application, then progress through the critical section will drop to ~50% (assuming multiple context switches).

Possible work arounds:

1) Use named critical sections (appropriately). This will alleviate unnecessary waits when the app has multiple different critical sections.
2) Set the environment variable OMP_WAIT_POLICY=PASSIVE, or KMP_BLOCKTIME=0 or 1.
3) Remove the critical section and replace it with omp_lock_t locks, then use omp_test_lock (remember to initialize the lock with omp_init_lock):
#if defined(USE_omp_locks)
while (!omp_test_lock(&YourLock))
    Sleep(0); // release the time slice to the other application
#else
#pragma omp critical(YourCriticalName)
{
#endif
...

#if defined(USE_omp_locks)
omp_unset_lock(&YourLock);
#else
} // #pragma omp critical(YourCriticalName)
#endif
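For option 2, a sketch of the settings (OMP_WAIT_POLICY is standard OpenMP; KMP_BLOCKTIME is specific to the Intel OpenMP runtime):

```shell
export OMP_WAIT_POLICY=PASSIVE  # workers sleep instead of spin-waiting
export KMP_BLOCKTIME=0          # Intel runtime: spin 0 ms before sleeping
```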

Jim Dempsey

www.quickthreadprogramming.com

But can a critical interfere with other processes? In that case (time-slicing) the time should be at most 2x slower.
Regarding the locks, these criticals are embedded in a fine-grained parallel loop (very quick iterations). Criticals are not very scalable here, but they are necessary. I think locks would slow everything down even more.

I've tested also with VTUNE (vtune.png) using OMP_WAIT_POLICY=PASSIVE and KMP_BLOCKTIME=0. The first time-trace is my application alone and the second one my application with another one using only one thread. Notice the area in blue (with OpenMP). With VTune the problem is kind of masked.

PS: For some reason, when I launch my app concurrently with the other one, this message is shown: "warning: Accurate CPU time detection was disabled. The profiled application produces a large number of events that tax the ability of Event Tracing for Windows to keep up with the logging frequency. Several buffers with events were lost during collection. Please see the Troubleshooting section of the product documentation for details."
and the results in the png are cut off there because of that.

Attachment: vtune.png (108.72 KB)

Criticals from one application do not contend for the lock variable with criticals of a different application; however, they do contend for use of processor cores.

If, for example, both applications have KMP_BLOCKTIME set to 200 ms, then when the first thread of app A enters an app-A contended critical section and the other threads immediately attempt to enter, app A has all threads in the run state. When app B does the same thing, it too has all threads in the run state. This would effectively halve the performance of both applications.
However, this is not the whole story. When context switches occur, threads of app B, running on the same hardware threads (core/HT) as app A, evict the L1, L2, L3 cache loads of app A. App B experiences the same effect.

If you intend to run more than one parallel application concurrently, then I suggest you introduce a mechanism to quickly determine the number of such applications and their thread requirements. On Windows you could use a memory-mapped file. A simple example would be:

Each app selects (loop by loop) number of HW threads / number of apps.

You could then extend this to a finer grain where each app can observe the requirements of the other app and prioritize the number of threads it chooses to use in the upcoming loops.
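A minimal sketch of that division (plain C++; in a real setup the app count would come out of the shared memory-mapped file, here it is just a parameter, and the function name is mine):

```cpp
#include <algorithm>
#include <thread>

// Threads this app should use when `runningApps` parallel apps share the box.
int threadsPerApp(int runningApps)
{
    int hw = static_cast<int>(std::thread::hardware_concurrency());
    if (hw == 0) hw = 1;  // hardware_concurrency() may legally return 0
    return std::max(1, hw / std::max(1, runningApps));
}
```

The result would then be passed to omp_set_num_threads before the next parallel region.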

Note, do not affinity-pin your threads unless you attempt to separate the thread sets,
e.g app A 0 to 7 (omp thread to logical processor)
app B 7 to 0 (omp thread to logical processor)

Then you have an issue of what does app C do.

Jim Dempsey

www.quickthreadprogramming.com

Hi everybody,

>>I'm using two different applications (one of them mine and the other one from a 3rd company) at the same time in my computer.
>>The thing is that this 3rd app causes my program to be up to x10 times slower...

Any chance that the '3rd app' boosts priorities, for example to 'Above Normal' or 'High', for all working threads? Could you verify it?

>>The first time-trace is my application alone and the second one my application with another one using only one thread.

Could you run your application for a fixed number of iterations of the outermost loop (one run without the other app, another run with it)? In your .PNG, the top chart, which you reported was made without the other app running, took 100.33 s in the shaded area; the bottom chart, reported with the other app running, took 90.7 s in the shaded area. IOW, with the additional app running, your application ran faster!

A lock and a critical section are basically the same thing and should have the same overhead. The difference here is that I suggested you use omp_test_lock, which permits you to control what to do when the "critical section" is in use. In the code suggestion I made earlier I used Sleep(0). An alternate approach is:

for (int i = 0; !omp_test_lock(&YourLock); )
{
    if (i == SpinMax)
        Sleep(0);
    else
        ++i;
}

You determine what to use for SpinMax. The larger the SpinMax, the fewer the Sleep(0)'s *** but also the more interference with the other app, and consequently the greater the interference with the thread owning the lock (higher probability of context switching between apps).

You can also modify the above to something like

static volatile int WaitingThreads = 0;
...
while (!omp_test_lock(&YourLock))
{
    #pragma omp atomic
    WaitingThreads = WaitingThreads + 1;
    if (WaitingThreads >= nThreads - 1)
        Sleep(0);
    #pragma omp atomic
    WaitingThreads = WaitingThreads - 1;
}

IOW, sleep only when you are the last thread attempting to obtain the lock (or use nThreads-2, -3, ...).
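The spin-then-yield pattern above can be sketched with std::atomic_flag instead of omp_test_lock, so it runs without an OpenMP runtime; Sleep(0) from the Windows code becomes the portable std::this_thread::yield(). kSpinMax, the function name, and the counter are illustrative, not from the original post.

```cpp
#include <atomic>
#include <thread>
#include <vector>

constexpr int kSpinMax = 100;
std::atomic_flag gLock = ATOMIC_FLAG_INIT;
long gCounter = 0;  // protected by gLock

void guardedIncrement(int iterations)
{
    for (int it = 0; it < iterations; ++it) {
        // test_and_set() returning false == lock acquired (the inverse of
        // omp_test_lock(), which returns nonzero on success).
        for (int spin = 0; gLock.test_and_set(std::memory_order_acquire); ) {
            if (spin == kSpinMax)
                std::this_thread::yield();  // give the core away, keep yielding
            else
                ++spin;
        }
        ++gCounter;  // critical section body
        gLock.clear(std::memory_order_release);
    }
}
```

Spinning briefly keeps the fast path cheap when the lock is uncontended, while yielding at kSpinMax hands the core to the other application instead of burning it in a wait loop.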

Jim Dempsey

www.quickthreadprogramming.com

The priority of all processes is the same; I've checked it. I've also verified that this happens with two instances of my own program running at the same time, because of the criticals.

In vtune.png the brown CPU area is bigger in the lower chart because of the VTune warning I showed before. The real parallel time is marked in blue (each region delimited by two triangles). The conclusion from this chart is that there is no bottleneck here; all parallel regions are slower when they have criticals and another OpenMP application is running (even if that application runs with only one thread).
