Compiler 12.1 OpenMP performance at multicore systems - libiomp5md.dll

I have a problem with the new version of the Intel C++ Compiler, 12.1.0.43 (Parallel Studio 2011 SP1). My x86 application shows a performance degradation of about 30-40% on multicore systems (two X5690 processors, Windows 7 x64) compared to the same code compiled with an earlier version of the C++ compiler.

With a smaller number of cores (a single i7-M640) performance is almost the same.

After some experiments I discovered that simply replacing the OpenMP DLL libiomp5md.dll with the earlier version restores the previous performance. In particular, I replaced libiomp5md.dll version 5.0.2011.606 with version 5.0.2011.325.

Therefore the question is: what changed in libiomp5md.dll that could cause such a degradation? How can I restore the previous performance?

As a note: I compared performance on relatively simple problems where a small number of cores was sufficient, so the growing number of cores most likely caused the degradation. But I want to minimize this negative effect!

Regards,
Michael

Michael,

Can you profile the app using each library on the 2P system?
You should be able to identify the routine in libiomp5md.dll that is introducing the delay.
A 2P system may have to reach out through RAM for some synchronization, whereas a 1P system may be able to synchronize within the last-level cache. The difference you see between library versions may be due to thread juxtapositions in your application (causing competing threads to reside on each processor as opposed to one processor) .OR. due to a bug fix in the library .OR. an inefficiency introduced into the newer library. Posting your identification of the routine in libiomp5md.dll may lead someone at Intel (reading this post) to explain/fix the problem.

Jim Dempsey

www.quickthreadprogramming.com

Jim,

Unfortunately the result was obtained in a very large and complicated project. It is difficult to extract a test case based on this project.

Nevertheless I performed one more test on another dual-processor computer: 2x Intel 5160 (4 cores in total). In this case I also see a significant difference when the versions of libiomp5md.dll mentioned in my first post are swapped. With the latest version of libiomp5md.dll I can clearly see a degradation on the level of 10-15% in this case. It is not as noticeable as on the 2x X5690 system, but still quite significant.

Should I create and submit a report based on these observations?

Michael

Michael,

I think that without a simple reproducer, submitting to Premier Support would be futile. They almost always require a reproducer. Your better route is to publicize this issue here on this forum (as you have) with the purpose of canvassing other users about their experiences. I seem to recall similar issues with libiomp5md(mt).dll on 1P versus nP systems, but I cannot recall which version reported the issue. Perhaps Steve L. or someone else can add to this observation. Running VTune or another profiler on the two system/libiomp5md.dll combinations may help to identify the root cause: critical section, event, scheduler, adverse cache interaction, memory alignment, ...

There was also a reported issue where aligned malloc did not honor the alignment request. Although this is a C runtime library and/or O/S issue, the different libraries may have accidentally caused one to take a heavier performance hit. The profiler may yield some insight into what is happening. I know that this is not your job... your job is to worry about the potential problems of retrograding the .dll version.
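If you want to rule that alignment issue in or out quickly, a minimal check along these lines will do it (the 1 KB size and the 64-byte alignment are arbitrary example values, not anything from your code):

#include <malloc.h>
#include <stdio.h>
#include <stdint.h>

// Quick sanity check: does _aligned_malloc honor the requested alignment?
int main(void)
{
    void *p = _aligned_malloc(1024, 64);   // example size / alignment
    if (p == NULL || ((uintptr_t)p % 64) != 0)
        printf("alignment request NOT honored: %p\n", p);
    else
        printf("alignment OK: %p\n", p);
    _aligned_free(p);
    return 0;
}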

Jim Dempsey

www.quickthreadprogramming.com

Hi Michael,

As Jim mentioned, without a reproducer it would be hard to understand where the problem is. Could you list the OpenMP constructs mostly used in the program (function names, etc.)?

thanks,
--Vladimir

Michael,

Perhaps you can do the following to shed some light on the problem _without_ sending in code.

Using a profiler, make test runs using each library; have each test run sufficiently long to produce reasonably accurate data (say a 20 second run). Usually the first report screen of a profiler lists routine names and percent of run time sorted high to low by run time. Capture this report, preferably as text as opposed to a screenshot (text is easier to read). If you can capture the entire list it would be preferable (the sum of the lesser routines may eat up the 10-15% difference).
What we would be looking for is one of the libiomp5md routines incurring additional overhead.

Jim Dempsey

www.quickthreadprogramming.com

Hi Jim, Vladimir,

Sorry for the delay; it looks like I was probably affected by

DPD200134977: C++, Fortran issue with libiomp5md.lib, reduction

In rare cases my application had random hangs inside one of the OpenMP constructs. They occurred if libiomp5md.dll 5.0.2011.325 was loaded, while exactly the same application worked just fine with libiomp5md.dll 5.0.2011.606. In both cases the rest of the application remained untouched.

I performed the requested test with Intel VTune Amplifier XE 2011. I believe the results show some basic difference between these two versions of libiomp5md.dll, in particular in how aggressively OpenMP threads utilize the available CPU cores.

For reference: tests were performed on a computer with 2 X5690 CPUs, HT = On, SpeedStep and TurboBoost = Off, so 24 logical cores are available in this system. In both cases only libiomp5md.dll was replaced. OpenMP is used in a DLL; this DLL is loaded by a GUI application written in Delphi. The GUI part is also multithreaded; typically DLL functions are called from different secondary threads of the GUI in order not to block the UI during computations. I tested a rather complicated algorithm requiring calls to many different DLL functions (with OpenMP parallel constructs) in some external loop. The test run takes about 20-30 sec.

A) libiomp5md.dll 5.0.2011.325 Summary of Lightweight Hotspots analysis

Elapsed Time: 22.565s

CPU Time: 336.113s

Instructions Retired: 599,580,000,000

CPI Rate: 1.938

Paused Time: 0s

B) libiomp5md.dll 5.0.2011.606 Summary of Lightweight Hotspots analysis

Elapsed Time: 29.971s

CPU Time: 88.789s

Instructions Retired: 143,308,000,000

CPI Rate: 1.874

Paused Time: 0s

We can immediately see a significantly lower CPU Time in the second case!

Do you still need more detailed information on particular functions inside libiomp5md.dll?

At the moment I have a feeling that version 606 doesn't like it when an OpenMP parallel region is called by different threads of the application, as happens in my case.

Regards,
Michael

Are you trying to have multiple independent instances of OpenMP running under various parent threads? OpenMP isn't necessarily well adapted to such a situation, but the newer library may be noticing what is happening and limiting the number of threads.
Did you try using affinity options (e.g. setting non-overlapping affinity masks via KMP_AFFINITY for the various OpenMP instances)?
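For example, a sketch of what I mean (the processor list is only a placeholder; kmp_set_defaults is an Intel extension and must run before the runtime initializes, i.e. before the first parallel region of that instance):

#include <omp.h>   // Intel's omp.h declares the kmp_* extensions

void pin_this_instance(void)
{
    // Placeholder list; give each OpenMP instance a non-overlapping set.
    kmp_set_defaults("KMP_AFFINITY=granularity=fine,proclist=[0-5],explicit");
}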

In my case instances of OpenMP never run concurrently. They are launched sequentially, one after another. The parent threads of the GUI application are often different, but they are synchronized in sequential order. Thus I do not see any need for KMP_AFFINITY.

I checked the values of omp_get_num_threads, omp_get_max_threads, omp_get_thread_limit, omp_get_dynamic, kmp_get_blocktime, and omp_get_schedule.

All these functions return the same values for both versions of libiomp5md.dll.
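(For reference, a minimal sketch of the kind of check I ran; this is illustrative rather than my exact code:)

#include <stdio.h>
#include <omp.h>   // Intel's omp.h also declares the kmp_* extensions

void dump_omp_settings(void)
{
    omp_sched_t kind;
    int modifier;
    omp_get_schedule(&kind, &modifier);
    printf("max=%d limit=%d dynamic=%d blocktime=%d schedule=(%d,%d)\n",
           omp_get_max_threads(), omp_get_thread_limit(),
           omp_get_dynamic(), kmp_get_blocktime(), (int)kind, modifier);
    #pragma omp parallel
    {
        #pragma omp master
        printf("threads inside region = %d\n", omp_get_num_threads());
    }
}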

-Michael

Michael,

You provided a total application summary report.
With VTune you can also get a function by function summary report.
Example screen shot below.
In your case you will want to find the function names in each report with large relative differences in instructions retired.

Jim

www.quickthreadprogramming.com

>> GUI part is also multithreaded, typically DLL functions are called from different secondary threads of GUI in order not to block UI during computations.

Are you saying the GUI part is coded using, say, pthreads, or beginthread, or ... (non-OpenMP threads),
and that each of these threads may concurrently call the DLL,
and that each called DLL function will use OpenMP under the assumption that the OpenMP thread pool is entirely owned (by its call context)?

If so, you should understand that OpenMP is not designed to operate this way.

What you have here is a situation where multiple app GUI and non-GUI threads have a "main" context (i.e. running outside a parallel region). Then any number of these threads may concurrently enter its first parallel region under the assumption (or effect) that it is the first parallel region of the application. Internally this may cause adverse effects for OpenMP. Externally, as observed by your application, you potentially have one thread in each of these concurrent DLL calls at its independent main level with omp_get_thread_num() == 0. Would this cause any programming errors?

Jim Dempsey

www.quickthreadprogramming.com

Jim,

I created 2 reports. Unfortunately it is not easy to copy them as text in a readable form (why does Amplifier have no export of the current report to HTML, for example?). I hope you will find the required information (I am new to Amplifier XE; I used only the bundled Composer version before). So the screenshots are below. I am afraid they are not very helpful, since the problem may be connected with the basic design of my software.

Regards,
Michael
==================

A) "Old" libiomp5md.dll 5.0.2011.325

Jim,

B) "New" libiomp5md.dll 5.0.2011.606

>> Are you saying the GUI part is coded using, say, pthreads, or beginthread, or ... (non-OpenMP threads),

Yes, some wrapper around beginthread/endthread (non-OpenMP).

>> and that each of these threads may concurrently call the DLL,

Not at all! By design I avoid concurrent calls to DLL functions having OpenMP parallel constructs. These calls are serialized. Other functions that do not have OpenMP inside are called, of course.

>> and that each called DLL function will use OpenMP under the assumption that the OpenMP thread pool is entirely owned (by its call context)?

Yes, it is designed in this way. I was assuming that serialization (see answer #2) makes this possible.

>> If so, you should understand that OpenMP is not designed to operate this way.

Some time ago I really had a problem when, by accident (a programming error), I had concurrent calls to two DLL functions with OpenMP. It caused stability problems; after I serialized such calls the problem was gone, and I have no problems with correct results, stability, etc.

Are there any recommendations on how to use OpenMP in a DLL for my scenario? If yes, where can I find a description?

>> Externally, as observed by your application, you potentially have one thread in each of these concurrent DLL calls at its independent main level with omp_get_thread_num() == 0. Would this cause any programming errors?

Even with a single thread for every parallel region the program should work correctly, with an obvious impact on performance.

Regards,
Michael

The guide.gvs file from the OpenMP profile is already a text file, a more concise summary, by parallel region, of the time spent by each thread in the major OpenMP functions.
.csv export (a text file readable as a spreadsheet) is a design feature of Amplifier; I don't know why restoring it is such a low priority.

Great, thanks! I hope it will help.

--Vladimir

Michael,

The total run times for the two programs differ ~10:1 old:new.
__kmp_fork_barrier, __kmp_x86_pause, and __kmp_yield on old are ~100x those on new.

In the "old" run I notice many of the functions have duplicate names!!!!
It looks like you have two "omp" libraries loaded,
e.g. two different versions of libiomp5md.dll, or a combination of libiomp5md.dll and some other lib(i)omp(5)(md).dll
(letters in () are subject to change or elimination).

Jim Dempsey

www.quickthreadprogramming.com

Could you define the environment variable KMP_BLOCKTIME=200 and try again? Or you can play with values, e.g. in the range 100-500.
--Vladimir

Vladimir,

KMP_BLOCKTIME is intended to reduce adverse interactions between applications, as opposed to within a single application. While reducing KMP_BLOCKTIME may improve this particular application, assuming this is a symptom of oversubscription, it will not correct the underlying problem. The problem may be one of these two hypotheses:

a) One or more components of this application were built with different versions of the OpenMP library, and Windows Side-by-Side is dutifully loading the components to use their respective libraries. The end result is potentially thread unavailability to either or each library, in a manner similar to the interaction with a different application. In this case, reducing KMP_BLOCKTIME may improve performance, but it will not fix the underlying problem.

b) The two libraries are partially co-mingled (as opposed to running independently). The result is a near-deadlock, as observed in the excessive times indicated in the earlier post.

A potential way to determine a) or b) would be to count the number of threads used by the application.

a) ~= number of user application created threads + 2x number of threads expected for OpenMP thread pool
b) ~= number of user application created threads + 1x number of threads expected for OpenMP thread pool

Note: ~= because OpenMP allocates the specified/default number of threads for its thread pool plus 0 or more helper threads. In either case, Michael can compare the total number of threads each version uses.
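A sketch of one way to count them on Windows (Toolhelp snapshot API; call it while, or right after, a parallel region is active, and compare the totals for the two library versions):

#include <windows.h>
#include <tlhelp32.h>

// Returns the number of threads currently belonging to this process,
// or -1 on failure.
int CountProcessThreads(void)
{
    DWORD pid = GetCurrentProcessId();
    HANDLE snap = CreateToolhelp32Snapshot(TH32CS_SNAPTHREAD, 0);
    if (snap == INVALID_HANDLE_VALUE)
        return -1;

    THREADENTRY32 te;
    te.dwSize = sizeof(te);
    int count = 0;
    if (Thread32First(snap, &te)) {
        do {
            if (te.th32OwnerProcessID == pid)
                ++count;
        } while (Thread32Next(snap, &te));
    }
    CloseHandle(snap);
    return count;
}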

Jim Dempsey

www.quickthreadprogramming.com

Jim,

>> Looks like you have two "omp" libraries loaded

Is that possible?? In the "Module Path" column I see the same path/file name for all entries, including the duplicates. I also checked the list of loaded modules again; the loaded file is correct.

It seems to me that the duplicate entries appeared due to a bug in Amplifier. Before preparing the screenshot I played with different presentations of the result and tried to perform a Compare operation.

Below is another run with the "old" version of libiomp5md.dll - a fresh report.

Any recommendations for my scenario of using an OpenMP DLL from a non-OpenMP multithreaded GUI? Calls to OpenMP DLL functions are serialized, but typically performed from different secondary threads of the GUI application.

Regards,
Michael

Vladimir,

I already checked that in both versions ("old" and "new" libiomp5md.dll) the function kmp_get_blocktime() returns 200.

Thus there is no difference in this parameter; it cannot be the reason for the slower execution.

Regards,
Michael

Jim,

I checked that both versions return the same values for functions:

omp_get_num_threads() -> 1 outside a parallel region and
omp_get_num_threads() -> 24 inside a parallel region

Other functions return the same results in all cases (inside and outside parallel regions, "old" and "new" versions):

omp_get_max_threads() -> 24
omp_get_thread_limit() -> 32768
omp_get_dynamic() -> 0
kmp_get_blocktime() -> 200

omp_get_schedule( kind, modifier ) -> kind = 1, modifier = 0

Anything else we could check?

What about submitting my application in binary form for testing libiomp5md.dll? I have a DEMO version that is able to run this test case without limitations. If you are interested, please send me instructions privately on how to do this.

Regards,
Michael

Michael,

The blocktime can still be the reason for the slow execution, because it is not a static parameter; it may be changed by the library dynamically, e.g. when the number of threads exceeds the number of available compute units. So please try the KMP_BLOCKTIME setting suggested by Vladimir; this setting will prevent the OpenMP library from changing its behavior dynamically.

The library is not perfect: when it sees that the number of threads becomes large, it decides that this may cause oversubscription, and it drops the blocktime to 0. The old library didn't do this and caused problems in real oversubscription cases. In your case there is no real oversubscription, so it is better to keep the blocktime at some reasonable value, not 0. It is not simple for the library to distinguish "real" oversubscription from "virtual" oversubscription where most threads are sleeping, so the change in the library hurt your particular case, I think.

Regards,
Andrey

Michael,

Do not use omp_get_num_threads() inside a parallel region, as this will return the number of threads available to the current OpenMP parallel region. What I am interested in is the total number of threads, and in particular the total number of OpenMP threads in potentially (multiple) concurrent parallel regions. Postulation a) in my prior message assumes multiple OpenMP thread pools (one per version of the DLL component, or per caller to the DLL component (assuming multiple versions was the reason for the duplicate function names in the VTune report)).

Sorry about (())'s

Jim

www.quickthreadprogramming.com

Jim,

It is unlikely that multiple OpenMP runtimes are executing in the application. By default the application should abort in this case. The user has to explicitly set the KMP_DUPLICATE_LIB_OK environment variable in order to be able to work with multiple OpenMP runtimes concurrently.

The duplicated names in Amplifier's report may be a problem with Amplifier itself.

- Andrey

Andrey,

Super! That was exactly what I was asking for. This change affected my application. I ended up with

kmp_set_defaults( "KMP_BLOCKTIME=200" );

in the DLL_PROCESS_ATTACH case of DllMain, since kmp_set_blocktime affects only the calling thread's settings.
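For reference, the fragment looks roughly like this (the other DllMain cases are omitted):

#include <windows.h>
#include <omp.h>   // kmp_set_defaults is an Intel extension declared here

BOOL WINAPI DllMain(HINSTANCE hinstDLL, DWORD fdwReason, LPVOID lpvReserved)
{
    if (fdwReason == DLL_PROCESS_ATTACH) {
        // Process-wide default; kmp_set_blocktime() would affect only the
        // calling thread, which is not enough in my scenario.
        kmp_set_defaults("KMP_BLOCKTIME=200");
    }
    return TRUE;
}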

Performance is restored, thank you.

Regards,
Michael

Jim,

No problems with ((()))'s, I also often overuse them.

It looks like Andrey's guess was correct; please see my answer.

Thank you for the efforts,
Michael

Andrey's guess may or may not be correct. Your application, with presumably a single OpenMP context, should not oversubscribe threads. You may need to reduce the OpenMP thread pool size by some number of your non-OpenMP threads. KMP_BLOCKTIME controls inter-process interactions (due to both processes being fully subscribed) but should not affect intra-process behavior (due to a single thread pool not exceeding the total processing resources).

A shorter KMP_BLOCKTIME value is a convenience to release time to other processes (either single- or multi-threaded). If a reduced KMP_BLOCKTIME increases the performance within a single process, this is indicative that the process is oversubscribed.

To confirm or rule out multiple pools, see if you can get the .DLL name(s) of the duplicated entries in VTune. This information might (should) be visible in one of the views.

The total number of threads, via a report by thread, will also be indicative of multiple thread pools.

If you observe multiple thread pools (DLL libraries), then use one of the tuning options to generate a call tree analysis. The report will point to which thread root is calling which library. Also note, it might be possible to combine a differently versioned static library with the DLL library (although I have not tested this possibility).

Jim Dempsey

www.quickthreadprogramming.com

Jim,

Thanks a lot for the detailed explanations. I checked the situation again and can confirm that my application really does create an additional thread pool at some point. Thus the total number of OMP Worker threads grows from 24 to 48.

The typical situation when this happens looks as follows: one non-OpenMP thread calling a DLL function with an OpenMP parallel construct finishes, and immediately another non-OpenMP thread starts. This thread also calls a DLL function with an OpenMP parallel construct. It is very important that the time difference be very small; this is why I started to notice this only on the 3.5GHz X5690 system.

If I insert Sleep(200) before the second call to the DLL, the additional thread pool is not created, and the application runs with 24 OMP Worker threads.

So I can conclude at this point that the current implementation of OpenMP doesn't like calls from different non-OpenMP threads, even if these calls are properly serialized.
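A stripped-down sketch of the pattern that triggers the second pool (the names are illustrative, not my real code):

#include <windows.h>
#include <process.h>
#include <omp.h>

// Illustrative stand-in for a DLL function with an OpenMP parallel construct.
void dll_compute(void)
{
    #pragma omp parallel
    {
        /* ... computational work ... */
    }
}

unsigned __stdcall worker(void *arg)
{
    (void)arg;
    dll_compute();
    return 0;
}

int main(void)
{
    HANDLE h1 = (HANDLE)_beginthreadex(NULL, 0, worker, NULL, 0, NULL);
    WaitForSingleObject(h1, INFINITE);   // first master thread has exited...
    CloseHandle(h1);

    // ...but starting the next master immediately still produced a second
    // 24-thread pool on my system. Uncommenting the Sleep(200) kept the
    // total at 24 OMP Worker threads.
    // Sleep(200);
    HANDLE h2 = (HANDLE)_beginthreadex(NULL, 0, worker, NULL, 0, NULL);
    WaitForSingleObject(h2, INFINITE);
    CloseHandle(h2);
    return 0;
}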

I guess that a major revision of the threading model in the GUI part of my application may be necessary. For example, only one computational non-OpenMP thread obtaining computational tasks of different kinds and managing their proper execution. With this approach all DLL functions with OpenMP parallelization will always be called in the context of the same external thread.

Regards,
Michael

Michael,

This may be a workaround hack.

With a single OpenMP thread pool you would normally want a KMP_BLOCKTIME of some reasonable amount (~200ms). With your app creating multiple pools, setting KMP_BLOCKTIME to 0 would "mitigate" the situation somewhat (as you observed) but is not really what you want. If you do not mind experimenting, try using KMP_BLOCKTIME=200, then:

int saveBlockTime = kmp_get_blocktime();   // Intel extension
kmp_set_blocktime(0);
#pragma omp parallel
{
    // Dummy all-thread region: with blocktime 0 the workers release
    // their spin-waits as soon as the region ends.
    if (kmp_get_blocktime() == 9999) printf("Not going to happen\n");
}
kmp_set_blocktime(saveBlockTime);
callYourDllHere(args);

See if this gives you the performance back (when multiple threads can call OpenMP).

Jim Dempsey

www.quickthreadprogramming.com

I forgot to mention (you probably figured it out anyway):
The intention is to have a reasonable block time for parallel regions in your application
... except for the last region before your call into the DLL.
... this parallel region has 0 block time.

*** Note ***

This assumes you have mutex-ized the calls to the DLL
*** and only call from the "main" of your app and any non-OpenMP threads launched ***
should you call from within a parallel region, all bets are off.
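i.e., something along these lines (a sketch; dll_compute stands in for any DLL entry point that opens a parallel region):

#include <windows.h>

extern void dll_compute(void);        // placeholder DLL entry point

// One process-wide lock serializing every DLL call with a parallel region.
// Call InitializeCriticalSection(&g_dllLock) once at startup.
static CRITICAL_SECTION g_dllLock;

void call_dll_serialized(void)
{
    EnterCriticalSection(&g_dllLock);
    dll_compute();
    LeaveCriticalSection(&g_dllLock);
}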

Jim Dempsey

www.quickthreadprogramming.com

Hi Michael,

I've been using a very simple solution to track down an "unexplained" problem or performance degradation: a Logging API, instead of code profilers or performance analyzers.

Usually it is a very time consuming process, but you need to UNDERSTAND the problem, right? That means that after some time you won't have a choice.

So, this is what I recommend, and this is what I've done many times in the past:

1. Create a txt-log file with as simple a Logging API as possible (you don't need a new complicated software subsystem which could bring other problems);

2. Integrate Logging calls into the software subsystem of your application which experiences the performance degradation;

3. Start with just two Logging API calls, that is, when processing "Starts" and when processing "Ends";

4. Test as thoroughly as possible with the "Right-DLL";

5. Replace the "Right-DLL" with the "Wrong-DLL";

6. Test as thoroughly as possible with the "Wrong-DLL";

7. Compare execution times (as I understand it you already have some statistics, but it looks like they didn't help);

8. Narrow down your search, that is, add a couple more Logging API calls;

9. Repeat steps 4, 5, 6 and 7;

10. Compare results and try to identify all parts with different execution times;

11. Repeat steps 4, 5, 6, 7 and 8, and so on...

Overall, it could take many, many hours, or even days and weeks, of careful testing and analyzing. But I truly believe that you'll finally find the couple of code lines "responsible" for the performance degradation.

Remember that Internet activity, gaming, online chatting, paging to a virtual file, etc., can affect execution times! Your tests must be done in comparable environments.

Here is an example:

...

uiTicksStart = SysGetTickCount();
...
// Some part of the code to be tested
...
uiTicksEnd = SysGetTickCount();

LogToFile( RTU("Completed in: %ld ticks\n"), ( RTint )( uiTicksEnd - uiTicksStart ) );
...

Also, I think it is a real problem for your project that you don't have isolated test cases.

Best regards,
Sergey

Senior C++ Software Developer

PS: Sometimes even simple output to a console window can help to identify the problem.

Michael,

The following thoughts are offered with respect to getting you running with your current code base, as opposed to waiting for a libiomp5 fix.

After thinking about your problem and symptoms, I think I can now offer better advice (difficult, since I am doing this by proxy).

The release of the kmp_blocktime should come _after_ the DLL call.
However..... consider the situation:

// (arbitrary thread outside a parallel region)
for(...)
{
    dll_fn1();
    dll_fn2();
    ...
    dll_fnn();
}

Where each of the above functions uses parallel regions.

When these functions have short-lived parallel regions you would likely not want a short block time; therefore it may be advantageous to place the release (kmp_set_blocktime(0) followed by an all-thread dummy parallel region) after the for loop.

When these functions are long-lived you may or may not want long block times (experimentation warranted).

Added to this foray, you apparently have other threads doing the same thing, with your current "fix" being a mutex before each call into the DLL.

In the case above, where you have the series of short-lived functions, the mutex and per-call release of the thread pool will adversely affect your performance.

Considering the above train of thought, something like the following may be worth investigating:

1) Remove the mutex.
2) Code all app-created threads as if they were cooperating separate OpenMP processes (there is actually no code change, just an awareness of your situation).
3) Add a global variable:

volatile long countOfActiveSessions = 0;

4) Add a shared function:

void EnteringSession()
{
    _InterlockedIncrement(&countOfActiveSessions);
    int nThreads = (omp_get_num_procs()   // .or. omp_get_num_threads()
                    + (omp_get_num_procs() / 2))
                   / countOfActiveSessions;
    if (nThreads == 0)
        ++nThreads;
    omp_set_num_threads(nThreads);
}

5) Add a shared function:

void ExitingSession()
{
    _InterlockedDecrement(&countOfActiveSessions);
    int old_blocktime = kmp_get_blocktime();
    kmp_set_blocktime(0);
    #pragma omp parallel
    {
        // Dummy all-thread region: with blocktime 0 the workers stop
        // spinning as soon as the region ends.
        if (countOfActiveSessions < 0)
            printf("Not going to happen\n");
    }
    kmp_set_blocktime(old_blocktime);
}

6) Then prior to the for(...) loop containing the series of DLL calls insert a call to EnteringSession(), and following that loop insert a call to ExitingSession().
.OR.
prior to a single call (or short run of calls) to the DLL insert a call to EnteringSession(), and following the call (or run of calls) insert a call to ExitingSession().

Not seeing your application, it is hard to ascertain whether the above is the best technique to apply, but it may be a good starting point for experimentation.

Remaining unknowns:

Does (would) each "session" use all the threads?
e.g. parallel sections with a small number of sections.

Are any of the calls nested in the DLL, or are any of the calls to the DLL made from a nested region?
Note that creating sessions for each nest level may or may not be appropriate. Some examination and experimentation may be warranted.

You may need to expand EnteringSession/ExitingSession to take an argument containing a load or weight value for the session. The calling weight is added to the number of sessions, and the ratio of the weight to the new session count is used to prorate the reapportionment of the number of threads to use.

This should get you on track to optimizing your application.

Jim Dempsey

www.quickthreadprogramming.com

Jim,

Thank you for the proposal, it may be quite useful!

At the moment adding

kmp_set_defaults( "KMP_BLOCKTIME=200" );

in DllMain restored the previous performance...

...but it also restored the stability issue that I mentioned in my first posts. I believe it is connected with the additional thread pool created by OpenMP when the next parallel construct is called too quickly from a different thread.

I will try the proposed approach and also try to redesign the GUI part in order to have just one "Computing" thread receiving the different work tasks. It will require some time to implement, of course, since the project is quite complicated.

Thank you again,
Michael

Hi Sergey,

Thank you for the proposal. In fact, any logging of this kind is problematic for my application, since it affects the timing a bit. The problem often disappears when I introduce even a tiny delay before starting another non-OpenMP parent thread.

I performed similar tests using not log files but a kind of messaging system based on shared memory. These tests plus Amplifier XE helped to reveal the situations where an additional (unnecessary) thread pool is created by the OpenMP system; now I am trying to find a way to avoid this.

And of course I have a set of test cases, but they are also bound to the main GUI application, since any test requires rather complicated preconfigured data structures that have to be loaded from custom databases. There is an obvious disadvantage: I cannot submit such test cases for independent analysis; I can use them only internally.

Best regards,
Michael

Hi Michael,

I just want to clarify a little bit the way the OpenMP runtime works with threads. When one master thread creates worker threads for a parallel region, these threads cannot be re-used by another master thread while the first master thread is alive. This happens because the OpenMP runtime keeps these threads for re-use by the first master thread, as the OpenMP specification (indirectly) requires; the runtime cannot know that you are not going to launch the next parallel region from the same master thread. When the first master thread dies, its worker threads may be re-used in a parallel region of another master thread, and you possibly observed this situation in the Sleep(200) case.

So in order to re-use the same thread pool for many parallel regions, you should either launch all regions from the same master thread, or ensure that the previous master is dead before starting a parallel region in another master thread.
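As a sketch of the first option (a single dedicated compute thread that owns the pool; the names are illustrative, shutdown is omitted, and submissions from different GUI threads must still be serialized, e.g. with a mutex):

#include <windows.h>
#include <process.h>

typedef void (*Task)(void *);

static HANDLE g_taskReady;   // auto-reset event: CreateEvent(NULL, FALSE, FALSE, NULL)
static HANDLE g_taskDone;    // auto-reset event, signaled after each task
static Task   g_task;        // one-slot "queue" for brevity
static void  *g_arg;

// The only thread that ever enters an OpenMP parallel region. It stays
// alive for the life of the app, so one worker pool is created and re-used.
static unsigned __stdcall compute_thread(void *unused)
{
    (void)unused;
    for (;;) {
        WaitForSingleObject(g_taskReady, INFINITE);
        g_task(g_arg);                 // calls into the DLL, OpenMP inside
        SetEvent(g_taskDone);
    }
    return 0;
}

// GUI threads submit work and block until it finishes.
void run_on_compute_thread(Task t, void *arg)
{
    g_task = t;
    g_arg  = arg;
    SetEvent(g_taskReady);
    WaitForSingleObject(g_taskDone, INFINITE);
}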

Regards,
Andrey

Michael,

I think I have some additional improvements to suggest.

Use the session technique, with or without weights, as outlined before, with the following changes.

volatile long nSessions = 0;
long nProcs = 0;                        // initialized at start of program
__declspec(thread) long priorThreadCount = 0;

void releaseThreadPool()
{
    int oldBlockTime = kmp_get_blocktime();
    kmp_set_blocktime(0);
    #pragma omp parallel
    {
        // Dummy region: the workers exit with blocktime 0 and stop spinning.
        if (nProcs < 0)
            printf("not going to happen");
    }
    kmp_set_blocktime(oldBlockTime);
}

void setThreadCount()
{
    long currentThreadCount =
        nProcs / (nSessions + (nSessions / 2));   // or your weight function
    if (currentThreadCount < priorThreadCount)
        releaseThreadPool();
    priorThreadCount = currentThreadCount;
    omp_set_num_threads(currentThreadCount);
} // void setThreadCount()

void EnterSession()
{
    _InterlockedIncrement(&nSessions);
    setThreadCount();
}

void ExitSession()
{
    _InterlockedDecrement(&nSessions);
    releaseThreadPool();
}

...
// (some thread outside a parallel region)
EnterSession();
for(...)
{
    setThreadCount();
    DLLfunc1();
    setThreadCount();
    DLLfunc2();
    ...
    setThreadCount();
    DLL_lastFuncInLoop();
}
ExitSession();

----------- .AND./.OR. -------------

// (some thread outside a parallel region)
for(...)
{
    EnterSession();
    setThreadCount();
    DLLfunc1();
    setThreadCount();
    DLLfunc2();
    ...
    setThreadCount();
    DLL_funcn();
    ExitSession();
    otherLongNonOpenMPfunction();
}

Jim Dempsey

www.quickthreadprogramming.com

Jim, Andrey,

I am very grateful for this fruitful discussion and for all the advice. I have finally concluded that I need to redesign the threading approach in the GUI part (non-OpenMP). I will use the thread pool/task paradigm: all DLL calls to functions with OpenMP parallel constructs will be performed from the same worker thread. I believe it will make the current OpenMP implementation happier.

Indeed, I can see the creation of another 24 worker threads if an OpenMP DLL function is called too quickly in the context of a different external thread. When all calls are serialized and bound to a single external thread, everything works just fine.

Regards,
Michael

Michael, and others reading these forum messages....

In my last post outlining a programming strategy for use in a single parallel application (process) calling a DLL (independently parallelized per calling thread), it should be intuitively obvious that with a little more work the "sessions" technique can be extended to multiple independent parallel processes. IOW, the session count is stored/maintained in a system-wide accessible object (e.g. registry, memory-mapped file, etc.). The extension of this technique would yield cooperative multi-threading amongst the participating multi-threaded processes.
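A sketch of the system-wide counter (memory-mapped file variant; the object name is arbitrary and error handling is omitted):

#include <windows.h>

// Session counter shared by all participating processes, kept in a named
// file mapping backed by the page file (zero-initialized on first create).
static volatile LONG *g_sessions;

void OpenSharedSessionCount(void)
{
    HANDLE map = CreateFileMappingW(INVALID_HANDLE_VALUE, NULL, PAGE_READWRITE,
                                    0, sizeof(LONG), L"Local\\OmpSessionCount");
    g_sessions = (volatile LONG *)MapViewOfFile(map, FILE_MAP_ALL_ACCESS,
                                                0, 0, sizeof(LONG));
}

void EnterSessionGlobal(void) { InterlockedIncrement(g_sessions); }
void ExitSessionGlobal(void)  { InterlockedDecrement(g_sessions); }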

Although this does not address Michael's current situation, it quite easily addresses future situations. An example: Michael might at some time run two copies of his current application at the same time (on different data sets).

Jim Dempsey

www.quickthreadprogramming.com
