Couldn't create more than 64 OpenMP threads in a test application

Hi everybody,

I recently ran a test with a simple OpenMP-based application, and OpenMP couldn't create more than 64 threads.

Here is the code of the test:

#include <omp.h>

void main( void )
{
    int iShowNumOfThreads = 1;

    omp_set_num_threads( 1024 );

    #pragma omp parallel num_threads( 1024 )
    {
        if( iShowNumOfThreads == 1 )
        {
            iShowNumOfThreads = 0;
            printf( "Number of threads created: %d\n", ( int )omp_get_num_threads() );
        }

        for( int i = 0; i < 16777216; i++ )
        {
            double dA = ( 2 * 4 * 8 * 16 );
        }
    }

    printf( "Done\n" );
}

How can I create as many OpenMP threads as possible? For example, more than 32,768?

Best regards,
Sergey

Here is a screenshot for review:

I'd like to provide some additional information and a new question ( please see 3. ):

1. I also set the environment variable 'OMP_NUM_THREADS' to 1024 and it doesn't change the limit.
A call to the 'omp_get_max_threads' OpenMP function, like:
...
printf( "Max Number of threads: %d\n", omp_get_max_threads() );
...
returns 64.

2. Only 58 threads are reported as exited. So, for some reason, 6 threads are lost! Here is
the Visual Studio 2005 output:
...
The thread 'Win32 Thread' (0x204) has exited with code 0 (0x0).
The thread 'Win32 Thread' (0x1b0) has exited with code 0 (0x0).
The thread 'Win32 Thread' (0xc14) has exited with code 0 (0x0).
The thread 'Win32 Thread' (0x910) has exited with code 0 (0x0).
The thread 'Win32 Thread' (0x814) has exited with code 0 (0x0).
The thread 'Win32 Thread' (0x874) has exited with code 0 (0x0).
The thread 'Win32 Thread' (0x9b0) has exited with code 0 (0x0).
The thread 'Win32 Thread' (0x980) has exited with code 0 (0x0).
The thread 'Win32 Thread' (0xc4) has exited with code 0 (0x0).
The thread 'Win32 Thread' (0xbc0) has exited with code 0 (0x0).
The thread 'Win32 Thread' (0xeec) has exited with code 0 (0x0).
The thread 'Win32 Thread' (0x8f4) has exited with code 0 (0x0).
The thread 'Win32 Thread' (0x30c) has exited with code 0 (0x0).
The thread 'Win32 Thread' (0xc30) has exited with code 0 (0x0).
The thread 'Win32 Thread' (0xca0) has exited with code 0 (0x0).
The thread 'Win32 Thread' (0xb44) has exited with code 0 (0x0).
The thread 'Win32 Thread' (0x778) has exited with code 0 (0x0).
The thread 'Win32 Thread' (0x9f8) has exited with code 0 (0x0).
The thread 'Win32 Thread' (0xa24) has exited with code 0 (0x0).
The thread 'Win32 Thread' (0x748) has exited with code 0 (0x0).
The thread 'Win32 Thread' (0xf70) has exited with code 0 (0x0).
The thread 'Win32 Thread' (0xc88) has exited with code 0 (0x0).
The thread 'Win32 Thread' (0x678) has exited with code 0 (0x0).
The thread 'Win32 Thread' (0xb94) has exited with code 0 (0x0).
The thread 'Win32 Thread' (0xc5c) has exited with code 0 (0x0).
The thread 'Win32 Thread' (0xa70) has exited with code 0 (0x0).
The thread 'Win32 Thread' (0xc84) has exited with code 0 (0x0).
The thread 'Win32 Thread' (0xa98) has exited with code 0 (0x0).
The thread 'Win32 Thread' (0xc74) has exited with code 0 (0x0).
The thread 'Win32 Thread' (0x4e0) has exited with code 0 (0x0).
The thread 'Win32 Thread' (0xaa8) has exited with code 0 (0x0).
The thread 'Win32 Thread' (0xa44) has exited with code 0 (0x0).
The thread 'Win32 Thread' (0xbe8) has exited with code 0 (0x0).
The thread 'Win32 Thread' (0xa8c) has exited with code 0 (0x0).
The thread 'Win32 Thread' (0x63c) has exited with code 0 (0x0).
The thread 'Win32 Thread' (0xca4) has exited with code 0 (0x0).
The thread 'Win32 Thread' (0xcc0) has exited with code 0 (0x0).
The thread 'Win32 Thread' (0x518) has exited with code 0 (0x0).
The thread 'Win32 Thread' (0xc7c) has exited with code 0 (0x0).
The thread 'Win32 Thread' (0x98c) has exited with code 0 (0x0).
The thread 'Win32 Thread' (0xb34) has exited with code 0 (0x0).
The thread 'Win32 Thread' (0xa60) has exited with code 0 (0x0).
The thread 'Win32 Thread' (0x96c) has exited with code 0 (0x0).
The thread 'Win32 Thread' (0xf90) has exited with code 0 (0x0).
The thread 'Win32 Thread' (0xbfc) has exited with code 0 (0x0).
The thread 'Win32 Thread' (0xbe0) has exited with code 0 (0x0).
The thread 'Win32 Thread' (0xe14) has exited with code 0 (0x0).
The thread 'Win32 Thread' (0x5cc) has exited with code 0 (0x0).
The thread 'Win32 Thread' (0xcf8) has exited with code 0 (0x0).
The thread 'Win32 Thread' (0xf10) has exited with code 0 (0x0).
The thread 'Win32 Thread' (0x34c) has exited with code 0 (0x0).
The thread 'Win32 Thread' (0xb78) has exited with code 0 (0x0).
The thread 'Win32 Thread' (0xdc) has exited with code 0 (0x0).
The thread 'Win32 Thread' (0xb1c) has exited with code 0 (0x0).
The thread 'Win32 Thread' (0xb98) has exited with code 0 (0x0).
The thread 'Win32 Thread' (0xc94) has exited with code 0 (0x0).
The thread 'Win32 Thread' (0x6fc) has exited with code 0 (0x0).
The thread 'Win32 Thread' (0x6d8) has exited with code 0 (0x0).
...

3. I wonder if OpenMP version 2.0 ( March 2002 ) has some limitations and doesn't allow creating more than 64 threads?
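
The gap between the requested and the granted number of threads described in item 1 can be measured directly. Below is a minimal sketch, assuming a C compiler with optional OpenMP support. The helper name granted_team_size is hypothetical and is not part of the OpenMP API:

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Request `requested` threads and return the team size the runtime grants.
   Hypothetical helper, not part of the OpenMP API. */
int granted_team_size(int requested)
{
    int granted = 1;              /* serial fallback when OpenMP is disabled */
    assert(requested > 0);        /* num_threads requires a positive value */
#ifdef _OPENMP
    #pragma omp parallel num_threads(requested)
    {
        /* Exactly one thread of the team records the team size. */
        #pragma omp single
        granted = omp_get_num_threads();
    }
#endif
    return granted;
}
```

Going by the results reported in this thread, granted_team_size( 1024 ) would be expected to return 64 with the Visual Studio 2005 runtime.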

Vladimir Polin (Intel)

Hi Sergey,

Here is a corrected example:

#include <stdio.h>
#include <omp.h>

int main( void )
{
     omp_set_num_threads( 1024 );
     #pragma omp parallel num_threads( 1024 )
     {
          if( omp_get_thread_num() == 0 )
          {
               printf( "Number of threads created: %d\n", ( int )omp_get_num_threads() );
          }
          for( int i = 0; i < 16777216; i++ )
          {
               double dA = ( 2 * 4 * 8 * 16 );
          }
     }
     printf( "Done\n" );

     return 0;
}

And its output with Composer XE 2011 Update 9:

omp_test>omp_test.exe
Number of threads created: 1024
Done

Which compiler did you use?

Update:
>>I wonder if OpenMP version 2.0 ( March 2002 ) has some limitations and doesn't allow to create more than 64 threads?

The specification does not set any limit. It is up to the implementation.

--Vladimir

Thank you for the feedback.

>>Which compiler did you use?

The test was done with Visual Studio 2005.

>>...It is up to implementation.

Did Microsoft's implementation set some limits?

Best regards,
Sergey

Michael Klemm (Intel)

Dear Sergey,

The OpenMP specification does not set any limits on the number of threads, except for what the interface to the runtime routines accepts as input values. So, you should be safe in this respect.

However, the implementation is free to have internal limits (e.g. 64 threads max). I do not know if the MS implementation of OpenMP actually enforces a limit internally. You did not write about the machine you're working on. Is the machine a WSM-EX box with more than 64 cores (including the Hyper-Threading cores)? If yes, the limit might come from the fact that Windows processor groups are limited to 64 cores and that you need a new API to distribute threads across different processor groups. Alas, this has to be done from the OpenMP runtime, and it might be the case that the MS implementation limits the thread number to the size of the processor group.

If you use the Intel OpenMP runtime you should not see any restrictions on the number of threads that you can create.

Cheers,
-michael

Vladimir Polin (Intel)

The Intel OpenMP RTL also has a limit: 32,768 threads. But it is hard for me to imagine who needs all these threads on one machine.

--Vladimir

Vladimir Polin (Intel)

Quoting Sergey Kostrov
>>...It is up to implementation.

Did Microsoft's implementation set some limits?

It is better to ask the Visual Studio team.

But with VS2010 I get the same 64 threads.

--Vladimir

Thank you, guys! I can also confirm that this is a limitation of Microsoft's implementation. But I managed to create 1,024 OpenMP threads.
Unfortunately, this is a "hack" and I'll provide technical details later.

Best regards,
Sergey

C/C++ code with an OpenMP directive:
...
#pragma omp parallel num_threads( 1024 ) // 64 is the default value and will be used instead
{
...

is compiled to several initialization calls in assembly language:

...
72881482 call _vcomp::min (...) // Here some verification is done
...
(1)7288148C push edx // The number of Win32 threads to create ( 64 ) is in the EDX register
7288148D mov ecx,dword ptr [ebp-4]
72881490 call _vcomp::PerThreadData::SetNextNumThreads (...) // Initializes some internal structures but Win32 threads are still not created
72881495 mov esp,ebp
72881497 pop ebp
...
004B72D6 call @ILT+11795( __vcomp_fork ) (...) // Creates Win32 threads and starts processing
...

At (1) the EDX register is already set to the maximum number of threads, which is 64. In the debugger
I changed the value of the EDX register to 1,024 ( 0x400 ). Then a call to the internal OpenMP function '__vcomp_fork'
creates 1,024 Win32 threads and starts the processing.

Here is a screenshot of the Windows Task Manager:

Note: 1,025 = 1 ( Win32 parent process ) + 1,024 ( Win32 threads created by OpenMP API)

Here is some information on an OpenMP DLL loaded by the test application:
...
Loaded 'C:\WINDOWS\WinSxS\x86_Microsoft.VC80.DebugOpenMP_1fc8b3b9a1e18e3b_8.0.50727.4053_x-ww_3f6e27c4\vcompd.dll', Symbols loaded (...).
...

Vladimir Polin (Intel)

Hi Sergey,

Overriding the number of threads is not a big deal. The big deal is working with these 1,024 threads in OpenMP constructs :) You need to know the internal implementation to find out whether this number of threads is supported internally or not. And of course it is officially unsupported. For example, you can take either the pi or the Fibonacci example to see whether it still works with 1,023 threads in this case.

And looking at the Task Manager, do I understand correctly that this 1,024-thread application is executed on 1 thread (CPU field)?

In other words, if your application crashes in the OpenMP runtime, you can't come and say "I've hacked your library but it does not work" :)

--Vladimir

Quoting Vladimir Polin (Intel)
>>...And looking into task manager do I understand correct that this 1024 thread application is executed in 1 thread (CPU field)?...

Yes, that test was done on a computer with one CPU. The purpose of the test is simple: stress testing
of the OpenMP library and evaluation of memory requirements for an OpenMP application with
more than 1,024 threads.

My current result is as follows: Microsoft's implementation of the OpenMP library v2.0 doesn't allow creating more
than 1,977 threads. The application crashes when trying to create the 1,978th thread.

The OpenMP library 'vcompd.dll' throws a '0xC0000005' Access Violation exception.

Best regards,
Sergey

Hi Vladimir,

Quoting Vladimir Polin (Intel)
>>Intel OpenMP RTL also does have a limitation -32768 threads...

Could you try to execute three tests with 8,192, 16,384 and 32,768 OpenMP threads in the test case I've submitted?

Could you report how much memory is allocated ( Mem Usage + VM Size, please see the Task Manager )
for an application compiled in Release configuration?

Thanks in advance.

Best regards,
Sergey

jimdempseyatthecove

Vladimir,

It is "presumptuous" of MS, or any vendor for that matter, to assume that all OpenMP threads within a user application are compute-only threads, to conclude from this that requesting more threads than hardware threads causes oversubscription, and as a consequence to take it upon itself to reduce the number of threads requested.

A programmer may have valid reasons for specifying more threads than available hardware threads. One example is when you expect that one or more of your OpenMP threads will be preponderantly waiting for I/O completion (including waiting for a timer). In such situations, not permitting the programmer to "oversubscribe" results in the application's compute-bound threads being "undersubscribed".

Jim Dempsey

www.quickthreadprogramming.com

Vladimir Polin (Intel)

Hi Jim,

As I wrote before, the OpenMP specification does not set lower or upper limits on the thread count, and every implementation will support as many threads as it wants. It is up to customers to pick the runtime library that best fits their needs. But I believe that if an implementation offered a maximum of 2 threads, nobody would use it.

BTW, how can I/O jobs be implemented in the OpenMP case, via tasking?

--Vladimir

jimdempseyatthecove

>>BTW, how can I/O jobs be implemented in OpenMP case, via tasking?

Tasking, nesting, regions

...Assuming the number of threads is set to the number of logical processors + 2

int nThreads = ...; // number of logical processors + 2
...
#pragma omp parallel
{
    // here with number of threads set to number of logical processors + 2
    if(omp_get_thread_num() == 0)
    {
        doReads(); // uses queue
    }
    else if(omp_get_thread_num() == 1)
    {
        doWrites(); // uses queue
    }
    else
    {
        omp_set_num_threads(nThreads - 2);
        // next region using number of logical processors
        doWork(); // using # logical processors, reading doReads queue, writing doWrites queue
    }
}

As to how you would oversubscribe, this would be an implementation issue.

Jim Dempsey

Quoting jimdempseyatthecove...threads will be preponderantly waiting for I/O completion (including waiting for timer)...

Jim Dempsey

A similar approach is used in Windows CE. It creates some number of low-priority Win32 threads, and
they wait for data from high-priority Win32 threads serving hardware interrupts.

Best regards,
Sergey

jimdempseyatthecove

Sergey,

In your OpenMP application you are free to create your own additional threads (e.g. _beginthread, ...)
However, you may experience some not-so-obvious issues when attempting to use OpenMP synchronization features between the OpenMP threads and the non-OpenMP threads. For example, OpenMP has a mutex lock as well as critical sections and atomic statements, which may or may not work properly across thread domains (OpenMP and non-OpenMP). The documentation is written from the perspective that all threads are OpenMP threads.

Jim Dempsey

I have one computer with 24 cores and my Fortran program uses only 1. Could someone tell me how I can make my Fortran program use all 24 cores and reduce the calculation time?
I have the Intel Fortran compiler 2011 on Red Hat Linux.
Thanks!

Vladimir Polin (Intel)

Sure you can.
I suggest starting from the ISN page http://software.intel.com/en-us/articles/getting-started-with-openmp/

Or search for the string "fortran openmp example" on the internet.

--Vladimir

What is OpenMP? Is OpenMP a program to do that? Thanks, Vladimir, I will review it.

--Mel

jimdempseyatthecove

OpenMP is a syntax you can layer onto your C/C++/FORTRAN programs.

For C/C++ the syntax is in the form of #pragma omp... that you insert into your program and which can be enabled or disabled via compiler switches. In FORTRAN the syntax is introduced as compiler directives specified as comments, which can be enabled or disabled using compiler switches.

for(int i = 0; i < N; ++i)
{
...
}

Becomes:

#pragma omp parallel for
for(int i = 0; i < N; ++i)
{
...
}

With the for loop being the same as without #pragma.
In FORTRAN

!$OMP PARALLEL DO
DO I=1,N
...
END DO

Should N be large enough, the iteration space will be partitioned by the number of threads available on your system (24 in your case). Each partition will run in parallel.
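
How the iteration space is actually split is up to the implementation and the schedule in effect. Below is a minimal sketch of one plausible block partitioning, assuming a static split into near-equal contiguous chunks. The names chunk_t and static_chunk are hypothetical and are not part of OpenMP:

```c
#include <assert.h>

/* Half-open range [begin, end) of loop iterations assigned to one thread. */
typedef struct { int begin; int end; } chunk_t;

/* Split n iterations into num_threads contiguous chunks whose sizes
   differ by at most one; the first (n % num_threads) threads get the
   extra iteration. Illustrative only, not the runtime's actual code. */
chunk_t static_chunk(int n, int num_threads, int thread_id)
{
    chunk_t c;
    int base, extra;
    assert(n >= 0 && num_threads > 0);
    assert(thread_id >= 0 && thread_id < num_threads);
    base  = n / num_threads;
    extra = n % num_threads;
    c.begin = thread_id * base + (thread_id < extra ? thread_id : extra);
    c.end   = c.begin + base + (thread_id < extra ? 1 : 0);
    return c;
}
```

With N = 100 and 24 threads, this scheme gives thread 0 the iterations [0, 5) and thread 23 the iterations [96, 100).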

Note, some loops may require special consideration to prevent multiple threads from updating the same location at the same time.
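
One common case is a loop that accumulates into a single variable. Below is a minimal sketch, assuming an OpenMP-capable C compiler. The function name sum_of_squares is hypothetical:

```c
#include <assert.h>

/* Returns 0*0 + 1*1 + ... + (n-1)*(n-1). */
long sum_of_squares(int n)
{
    long sum = 0;
    int i;
    assert(n >= 0);
    /* reduction(+:sum) gives each thread a private partial sum and
       combines the partial sums when the loop finishes. */
    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < n; ++i)
    {
        sum += (long)i * i;
    }
    return sum;
}
```

Without the reduction clause, every thread would update sum concurrently, which is a data race. If OpenMP is disabled at compile time, the pragma is ignored and the loop runs serially with the same result.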

Please look at the sample code in the documentation. Using OpenMP is relatively easy... but there are a few programming considerations you need to follow if you want correctness and performance.

Start with simple improvements to your code and then get more aggressive as you gain experience.

Jim Dempsey

Quoting jimdempseyatthecove...
In your OpenMP application you are free to create your own additional threads (e.g. _beginthread, ...)
However, you may experience some not-so-obvious issues when attempting to use OpenMP synchronization features between the OpenMP threads and the non-OpenMP threads. For example, OpenMP has a mutex lock as well as critical sections and atomic statements which may or may not work properly across thread domains (OpenMP and non-OpenMP)...

Using the '_beginthread' function is another option to consider. Thank you.

Best regards,
Sergey

Quoting jimdempseyatthecove...
Please look at the sample code in the documentation. Using OpenMP is relatively easy... but there are a few programming considerations you need to follow if you want correctness and performance.

Start with simple improvements to your code and then get more aggressive as you gain experience.

Jim Dempsey

Hi Jim,
Thank you for the feedback, I really appreciate it. There is only one problem at the moment, that
is, a lack of time. I can't work 24 hours a day... :)
Best regards,
Sergey

Quoting Michael Klemm (Intel)...
However, the implementation is free to have internal limits (e.g. 64 threads max). I do not know if the MS implementation of OpenMP actually enforces a limit internally...

Hi Michael,

I tried to find an explanation on the MSDN website and I found a very interesting statement at:

http://msdn.microsoft.com/en-us/library/d8wkzt26(v=vs.80).aspx

...
The omp_set_num_threads function sets the default number of threads to use for subsequent parallel
regions that do not specify a num_threads clause.
...

When I removed the 'num_threads' clause nothing changed, and my test still couldn't create more
than 64 threads. So, I'll try to contact Microsoft and let's see what they say.

Best regards,
Sergey

jimdempseyatthecove

Sergey,

>> I'll try to contact Microsoft and let's see what they say.

If you are compiling with the Intel toolchain the OpenMP library will be that provided by Intel. IOW any thread limitation will be imposed by the Intel code.

If you are compiling with the MS toolchain the OpenMP library will be that provided by MS.

Who you contact will depend on whose toolchain you use.

Note, in an earlier post you showed:

...
72881482 call _vcomp::min (...) // Here some verification is done
...
7288148C push edx // A number of Win32 threads to create ( 64 ) is in EDX register
7288148D mov ecx,dword ptr [ebp-4]
72881490 call _vcomp::PerThreadData::SetNextNumThreads (...) // Initializes some internal structures but Win32 threads are still not created
72881495 mov esp,ebp
72881497 pop ebp
...
004B72D6 call @ILT+11795( __vcomp_fork ) (...) // Creates Win32 threads and starts processing
...

You might consider hooking that library function or replacing it (assuming you do not get a satisfactory workaround from MS).

Coercion sometimes works.

In the above dump, look at where the args to _vcomp::min came from. If it is from an environment variable (or the result of the lack thereof), then use that environment variable. If not, then create a static object, loaded early in your image, whose ctor makes an appropriate adjustment to the arg that is restricting your desired thread count.

These additions will not be portable, so enclose them in an appropriate conditional compile section, perhaps including a #pragma message("Hack to bypass MS restriction on upper thread count")

Jim Dempsey

Quoting jimdempseyatthecove
Sergey,

>> I'll try to contact Microsoft and let's see what they say.

If you are compiling with the Intel toolchain the OpenMP library will be that provided by Intel. IOW any thread limitation will be imposed by the Intel code.

If you are compiling with the MS toolchain the OpenMP library will be that provided by MS.
...

I've submitted a feedback / question on MSDN and I hope that somebody from Microsoft will explain that
limitation of the 'vcomp.dll' / 'vcompd.dll' DLLs.

Best regards,
Sergey

Thanks, Jim.

Quoting jimdempseyatthecove...assuming you do not get a satisfactory work around from MS)...

Here is an update from Microsoft:
...
We are rerouting this issue to the appropriate group within the Visual Studio Product Team for triage and
resolution. These specialized experts will follow-up with your issue.
...

Here is a response:

...
The internal limit on the number of threads is indeed 64, and was directed by the limit of the number of
virtual processes available on a Windows PC back a few years. The situation has improved with the 64-bit
versions of Windows 7 (see http://windows.microsoft.com/en-US/windows7/products/system-requirements).
We will fix our internal OpenMP limits in a future release. Thanks for reporting this issue.
...

Unfortunately, it is not clear in which release it will be fixed. There are many versions of Visual Studio at the moment.

Thanks for pushing this, even though it's only of academic interest to some of us. I was somewhat surprised that you got a response at all. As Intel has been producing Westmere-EX platforms with 80 logical processors for some time, and promotes OEMs designing platforms with more, there has been some dismay at the 64 thread per partition limit in Windows.

Quoting TimP (Intel)
>>Thanks for pushing this, even though it's only of academic interest to some of us.

[SergeyK] There is significant practical interest for me because OpenMP is being considered for a
project with a high number of threads and strict portability requirements. Microsoft's
AMP technology is not being considered. Microsoft tries to "kill" any technology with the key
word 'Open'. OpenGL is another example.

>>I was somewhat surprised that you got a response at all.

[SergeyK] I'm a little bit disappointed with Microsoft's response because it is absolutely not clear
in which release it will be fixed, that is, in some version(s) of Visual Studio or of a
Windows OS.

That limitation is related to a maximum number of wait objects on Windows platforms.

There is a definition in 'winnt.h' header file:

...
#define MAXIMUM_WAIT_OBJECTS 64 // Maximum number of wait objects
...
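
A limit of 64 wait objects per call does not by itself cap how many threads a program can wait on, since handles can be waited on in batches. Below is a minimal sketch, assuming the Win32 WaitForMultipleObjects API. The helper names wait_batches and wait_for_all are hypothetical, and nothing here describes what vcomp.dll does internally:

```c
#include <assert.h>
#include <stddef.h>
#ifdef _WIN32
#include <windows.h>
#else
#define MAXIMUM_WAIT_OBJECTS 64   /* same value as winnt.h, for non-Windows builds */
#endif

/* Number of wait calls needed to cover `count` handles. */
size_t wait_batches(size_t count)
{
    return (count + MAXIMUM_WAIT_OBJECTS - 1) / MAXIMUM_WAIT_OBJECTS;
}

#ifdef _WIN32
/* Wait until all handles are signaled, passing at most
   MAXIMUM_WAIT_OBJECTS handles per call. Returns 1 on success, 0 on failure. */
int wait_for_all(const HANDLE *handles, size_t count)
{
    size_t done = 0;
    assert(handles != NULL || count == 0);
    while (done < count)
    {
        size_t left = count - done;
        DWORD n = (DWORD)(left < MAXIMUM_WAIT_OBJECTS ? left : MAXIMUM_WAIT_OBJECTS);
        if (WaitForMultipleObjects(n, handles + done, TRUE, INFINITE) == WAIT_FAILED)
            return 0;
        done += n;
    }
    return 1;
}
#endif
```

For 1,024 thread handles this scheme needs 16 wait calls instead of one.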

Hi Sergey,

The Intel OpenMP library allows creating up to 32,768 threads in a parallel region.

Did you follow a link I provided? Please take a look. As soon as I applied a "hack" in the VS Debugger the Microsoft OpenMP library ( vcompd.dll ) was able to create more than 1,024 threads.

Also, where did you see a limitation for '...OS calls to WaitAll and WaitOne...'? What Win32 API functions are you talking about? Could you give me exact names, please?

Best regards,

Paul Gregorie

Hi Paul,

Quoting paul_oxy...
Did you follow a link I provided? Please take a look. As soon as I applied a "hack" in the VS Debugger the Microsoft OpenMP library ( vcompd.dll ) was able to create more than 1,024 threads.

Also, where did you see a limitation for '...OS calls to WaitAll and WaitOne...'? What Win32 API functions are you talking about? Could you give me exact names, please?
...

Actually, these are my comments from a discussion with another software developer on the Parallel Computing General forum.

Best regards,
Sergey

Abhishek 81

Thanks Sergey for the information, I am studying the posts.

Abhishek Nandy

As you can see some number of spaces are deleted in posts and sometimes it looks like:

>>...mycomments onParallel Computing...

It happened during upgrade from the old ISN web-site to the new IDZ web-site but texts are still readable.
