Relationship between number of threads in OpenMP application and memory used

I needed to evaluate the memory requirements of an OpenMP application for different numbers of
threads in the parallel region. As a result of my R&D project I created this table:

# of threads    Memory used    Notes

          8         3.2 MB
         16         3.4 MB
         32         3.8 MB
         64         4.6 MB     * Limit for Microsoft's OpenMP DLLs
        128         6.2 MB
        256         9.4 MB
        512        15.8 MB
       1024        28.6 MB
       2048        54.2 MB
       4096       105.4 MB
       8192       207.8 MB
      16384       412.6 MB
      32768       822.2 MB     * Limit for Intel's OpenMP DLLs
      65536     1,641.4 MB     1.64 GB ** Extrapolated
     131072     3,279.8 MB     3.28 GB ** Extrapolated
     262144     6,556.6 MB     6.56 GB ** Extrapolated

It clearly shows that on a 32-bit Windows platform up to 65,536 threads could be created in a simple OpenMP application.

The test-case was based on the code from this post:

http://software.intel.com/en-us/forums/showthread.php?t=103375&o=a&s=lr

Please submit your results if you are able to verify the relationship between the number of threads in an OpenMP
application and the memory used.

Best regards,
Sergey

Hi Sergey,
Don't you want to add some practical workload to your example, like calculating pi or Fibonacci numbers?
--Vladimir

Sergey,

OpenMP is designed as a tasking system as opposed to a thread system.

In a tasking system you generally set up the application so that the number of software threads == number of hardware threads. Creating excess threads introduces excess thread context switching. OpenMP task switching occurs between the exit of a parallel region and the entry into the next parallel region. This can occur relatively rapidly, as the operating system is generally not involved (unless the time interval between the exit of one parallel region and the entry of the next exceeds a tunable threshold).

The only purpose of allocating more software threads than hardware threads is when (some) software threads may get blocked (for I/O or a lock). Note that waiting for locks tends to be compute bound (oversubscription is counterproductive).

Jim Dempsey

www.quickthreadprogramming.com

Quoting Vladimir Polin (Intel): Don't you want to add some practical workload to your example like calculating pi or fibonacci numbers?

Hi Vladimir,

I don't consider calculation of PI or Fibonacci numbers practical or useful in my case. Wasn't it done
before? Yes, and many times. Did I personally program that? Yes, many years ago, as a matter of
learning how recursion works. When somebody has free time it would be a nice programming exercise.

I would be more interested to see your results and compare them with my results. Thanks in advance.

Best regards,
Sergey

Hi Sergey,

Comparing the memory consumption of an OpenMP program that does nothing is not very useful. As Jim pointed out, OpenMP is generally used in a model where the number of OpenMP threads matches the number of (logical) cores in the system. Oversubscription can be done, but is not very useful in most cases when programming with OpenMP. Please do not confuse the "heavy-weight" threads that OpenMP (and other threading models like TBB) use with the light-weight threads of an OpenCL-style program.

In summary, OpenMP is very memory-efficient for large numbers of threads. A minimal thread in OpenMP just consumes a couple of bytes in memory (thread descriptors, some meta-data from the OpenMP runtime, and a small stack that contains a few function frames). The main memory consumption of an OpenMP thread comes from its private data (which can be as large as GBs if you allocate private arrays in an OpenMP region). Hence, without knowing the application and the memory demands of the parallel algorithm (as Vladimir pointed out), it does not make much sense to investigate the memory consumption.

Does that help?

Cheers,
-michael

Sorry Sergey, I concur with Jim here and can't find a rationale for measuring the memory usage of an application that does not take HW concurrency into account and runs >1000x slower than the serial version. Do you have one? If you need more details on the OpenMP memory model, I can point you to the article "The OpenMP Memory Model" by Jay P. Hoeflinger and Bronis R. de Supinski, or to the article "Complete Formal Specification of the OpenMP Memory Model" by Greg Bronevetsky and Bronis R. de Supinski.

Thanks,
Vladimir


To all guys who responded: Thank you. Could you do a real evaluation?

Quoting Michael Klemm (Intel)...
Comparing the memory consumption of an OpenMP program that does nothing is not very useful...

[SergeyK] This is exactly what I need to evaluate memory requirements.

...A minimal thread in OpenMP just consumes a couple of bytes in memory (thread descriptors, some meta-data from the OpenMP runtime,
and a small stack that contains a few function frames)...

[SergeyK] This is wrong, and it depends on the implementation. Please take a look:

Default thread stack size for Intel OpenMP:

IA-32 architecture   : 2M
Intel 64 architecture: 4M

Default thread stack size for Microsoft OpenMP:

IA-32 architecture   : ~256KB
Intel 64 architecture: No data at the moment

As I stated, I would appreciate it if you spent a couple of minutes testing with a real application and provided
some numbers, instead of spending time on almost theoretical discussions. Thanks in advance.

Best regards,
Sergey

Hi Sergey,

The 4M stack size is the maximum size that a thread's stack may grow to. The pages that back the stack are only allocated when you actually touch the data. Before that, the stack size might be close to zero (or one page of 4K).

Regarding a real application: I have applications that are close to zero and I have an application that needs about 250 MB of stack per thread. Without knowing which application type you're after, there is not much sense in us providing data, because this data will likely be just wrong for your application. My reply to you stated, in principle, that theoretical thoughts are useless. So, we're on the same page here :-).

Cheers,
-michael

Hi Michael,

I provided a link to a test case in my initial post. Please take a look as soon as you have time.

Best regards,
Sergey

>>>It clearly shows that on a 32-bit Windows platform up to 65,536 threads could be created in a simple OpenMP application.>>>
IIRC, M. Russinovich's book "Windows Internals" states that the maximum number of threads cannot exceed 2^16.

>>>IIRC M.Russinovich book "Windows Internals" states that the maximal number of threads cannot exceed 2^16.>>>
Sorry, I was wrong; it should say that the maximum number of GUI objects cannot exceed 2^16.
The maximum number of created threads probably depends on the available resources, i.e. each thread's stack. If the granularity is 64 KB, then the theoretical maximum should be 31,250 threads.

>>...If the granularity is 64 KB, then the theoretical maximum should be 31,250 threads...

Note: It has to be for a 32-bit platform

There are actually so many things that affect that limit. Is 31250 threads for:

- Windows, or Linux, or another OS?
- Debug or Release configuration?
- Intel, or Microsoft, or MinGW, or GCC C++ compilers?

Here are results of my testing with different C++ compilers on a 32-bit platform:


                                    // Operating System: Windows XP 32-bit / Release configurations
                    // In bytes     // C++ compiler     MSC      BCC      MGW      ICC
//  #define _STACK_SIZE       0     // Threads created:
//  #define _STACK_SIZE    1024     //               30,548   30,575   30,716   30,533
//  #define _STACK_SIZE    2048     //               30,548   30,575   30,716   30,533
//  #define _STACK_SIZE    4096     //               30,548   30,575   30,716   30,533
//  #define _STACK_SIZE    8192     //               30,548   30,575   30,716   30,533
//  #define _STACK_SIZE   16384     //               30,548   30,575   30,716   30,533
//  #define _STACK_SIZE   32768     //               30,548   30,575   30,716   30,533
//  #define _STACK_SIZE   65536     //               30,548   30,575   30,716   30,533
//  #define _STACK_SIZE  131072     //               15,735   15,750   15,823   15,727
//  #define _STACK_SIZE  262144     //                7,985    7,995    8,032    7,980
//  #define _STACK_SIZE  524288     //                4,019    4,027    4,047    4,017
    #define _STACK_SIZE 1048576     //                2,016    2,021    2,031    2,015
//  #define _STACK_SIZE 2097152     //                1,006    1,011    1,015    1,005
//  #define _STACK_SIZE 4194304     //                  501      504      507      500

>>...Here are results of my testing with different C++ compilers on a 32-bit platform...

Here is a test-case:

#if ( defined ( _WIN32_BCC ) || defined ( _WIN32_MGW ) )
    // Not defined in the headers of these compilers
    #define STACK_SIZE_PARAM_IS_A_RESERVATION 0x00010000
#endif

    CrtPrintf( RTU("Sub-Test 39\n") );

    RTuint uiStackSize;

//  #define _STACK_SIZE 0
//  #define _STACK_SIZE 1024
//  #define _STACK_SIZE 2048
//  #define _STACK_SIZE 4096
//  #define _STACK_SIZE 8192
//  #define _STACK_SIZE 16384
//  #define _STACK_SIZE 32768
//  #define _STACK_SIZE 65536
//  #define _STACK_SIZE 131072
//  #define _STACK_SIZE 262144
//  #define _STACK_SIZE 524288
    #define _STACK_SIZE 1048576
//  #define _STACK_SIZE 2097152
//  #define _STACK_SIZE 4194304

    uiStackSize = _STACK_SIZE;
    if( uiStackSize == 0 )
        uiStackSize = 1048576;          // Use 1 MB when 0 is selected

    RTuint uiNumOfThreads = 0;
    HANDLE hThread = RTnull;
    RTuint uiLastError = 0;

    // Create suspended threads until CreateThread fails.
    // Handles are not closed, so every created thread stays alive.
    while( RTtrue )
    {
        hThread = ::CreateThread( RTnull, uiStackSize,
                                  ( LPTHREAD_START_ROUTINE )ThreadRoutine,
                                  RTnull, CREATE_SUSPENDED | STACK_SIZE_PARAM_IS_A_RESERVATION, RTnull );
        if( hThread == RTnull )
        {
            uiLastError = ::GetLastError();
            break;
        }
        uiNumOfThreads += 1;
    }

    CrtPrintf( RTU("Number of Win32 Threads created: %5ld with a Stack Size: %5ld\n"), uiNumOfThreads, ( RTint )_STACK_SIZE );
    CrtPrintf( RTU("System Error : %5ld\n"), uiLastError );

>>>Here are results of my testing with different C++ compilers on a 32-bit platform:>>>

Thanks for the results, very interesting.
Yes, I agree with you, but I think that the maximum number of threads created by various C/C++ compilers on 32-bit Windows platforms should depend solely on the OS process and thread management API.
Moreover, the OS must allocate and reserve user-mode and kernel-mode address space for the EPROCESS and ETHREAD structures.
A lot of interesting information is contained in these structures.
If you are interested I can post dumps of various EPROCESS and ETHREAD structures.

>>... I think that maximal number of threads created by various C/C++ compilers on 32-bit Win platforms should be dependent
>>solely on OS process and thread management API...

You could easily verify my results on your computer.

My point of view is based on real results, and these numbers actually depend on the quality of code generation of a C/C++ compiler and on the number of dependent DLLs mapped into the address space of the test application. The MinGW and Borland C/C++ compilers create very compact binaries with a minimal number of dependent DLLs.

Take a look at the last set of numbers:
...
#define _STACK_SIZE 4194304
...
Number of threads created with MSC = 501
Number of threads created with BCC = 504
Number of threads created with MinGW = 507
Number of threads created with ICC = 500

By the way, the MinGW C/C++ compiler for the Windows platform by design doesn't rely on some of Microsoft's CRT-like DLLs. Almost the same applies to the Borland C/C++ compiler. Unfortunately, the Intel C/C++ compiler's overhead is higher, and that is why only 500 threads could be created with it in the last test.

I have read an article written by M. Russinovich where he states that a 32-bit process can create a maximum of 2048 threads.
Link: http://blogs.technet.com/b/markrussinovich/archive/2009/07/08/3261309.aspx

>>>Number of threads created with MSC = 501
Number of threads created with BCC = 504
Number of threads created with MinGW = 507
Number of threads created with ICC = 500>>>

Don't you think that these results, which differ by only a few threads, could depend on the momentary state of the OS, which can vary between the test runs for the various compilers?

>>>By the way, MinGW C/C++ compiler for a Windows platform by design doesn't rely on some Microsoft's CRT-like DLLs>>>
But it must call kernel32.dll exports.
Does MinGW replace the MSVCRT libraries with its own?

>>>My point of view is based on real results and these numbers actually depend on a quality of code generation of a C/C++ compiler and a number of dependent DLLs mapped to the address space of the test application>>>

Yes, but please also take into account that those compilers can produce more or less compact code, but in the end the OS must create and manage those threads and also allocate space, for example, for the internal ETHREAD structures needed to represent a thread, and this does not depend on the compiler being used. This can also increase the memory usage as the number of created threads grows.
I forgot to add that all threads in a process share that process's address space, and the mapping of DLLs is done at the granularity of the process address space.

>>...Yes , but please take into account also that those compilers can produce more or less compact code...

Iliya,

I've done lots of development with these C/C++ compilers and I have been using them for a very long time. I really don't understand how somebody could talk about the quality of code generation of all these C/C++ compilers without using, testing, analyzing, etc., binaries for some test-cases.

Best regards,
Sergey

Quote:

Sergey Kostrov wrote:

>>...Yes , but please take into account also that those compilers can produce more or less compact code...

Iliya,

I've done lots of development with these C/C++ compilers and I have been using them for a very long time. I really don't understand how somebody could talk about the quality of code generation of all these C/C++ compilers without using, testing, analyzing, etc., binaries for some test-cases.

Best regards,
Sergey

I agree with you, but there is one thing I cannot understand: how can one of those compilers affect the thread and process creation, management and tear-down mechanisms?
I think that everything related to thread and process management is under the exclusive control of the OS, and without a global modification of the internal OS mechanisms a compiler will not be able to optimize its code for the maximum number of created threads.

@Sergey
I respect your knowledge and I learn a lot by reading and discussing with you on this forum.

>>> code generation of all these C/C++ compilers without using, testing, analyzing, etc, of binaries for some test-cases?>>>

I was not talking about the quality of the code generated by those compilers. I simply was not able to understand how one of those compilers can affect the OS's internal structures and mechanisms without performing some kind of system-wide modification, for example hooking and intercepting the CreateThread function and rewriting the memory manager routines responsible for the memory allocation needed for thread creation.

>>...without global-wide modification of the internal OS mechanism compiler will not be able to optimize its code for max
>>number of creating threads...

Case 1: C/C++ compiler A creates a very compact binary ( with little overhead! ). When it is loaded into memory it does not take up additional memory that could otherwise be used for stack allocation when new threads are created. Let's say 555 threads will be created.

Case 2: C/C++ compiler B creates a less compact binary ( with lots of overhead! ). When it is loaded into memory it takes up additional memory that could otherwise be used for stack allocation when new threads are created. Let's say 444 threads will be created.

So, it is a 100% memory-related issue; take a look at the table I posted. If you don't believe me, try to run the test-case I've provided with the MS and Intel C/C++ compilers and you will see how it works in a real environment. Check both test executables with MS Depends in order to see the differences in the number of dependent DLLs for both compilers.

Modern C/C++ compilers have more overhead compared to legacy compilers, and some Intel C++ compiler users are complaining about it ( including me ).

Borland and MinGW won "the race" because they have less overhead and depend on fewer DLLs. There is no other reason why these "max thread numbers" are different.

@Sergey

Thanks for the explanation. It seems that I completely misunderstood your post when you wrote about code compactness.

Hi Iliya,

I'll create and upload a Visual Studio project for 32-bit and 64-bit platforms with the test-case. I hope that it will help everybody to clear up as many things as possible with regard to this subject.

Best regards,
Sergey

PS: I really would like to see numbers for a 64-bit Windows 7 Professional OS!

>>>I'll create and upload a Visual Studio project for 32-bit and 64-bit platforms with the test-case. I hope that it will help everybody to clear as many as possible things with regard to that subject.

Best regards,
Sergey

PS: I really would like to see numbers for a 64-bit Windows 7 Professional OS!>>>

Thanks Sergey, I will run your test case and post the results. Meanwhile I'm testing an FFT algorithm and I'm getting very strange results with the VS2010 compiler. For example, an FFT of 4096 elements of the sin function completes in 160245 msec for 1e4 loop iterations, the same test compiled with the VS2010 compiler executes the same code in 140451 msec, and the Intel C/C++ compiler is able to outperform the VS2010 compiler at a whopping speed of 905 msec per 1e4 loop iterations.
Here is the link:

>>> I really would like to see numbers for a 64-bit Windows 7 Professional OS!>>>

I have 64-bit Win 7 Pro (Russian edition) installed as a VMware appliance. If you are interested, please prepare the test case and I will run it.

>>... I'm testing a FFT algorithm and I'm having very strange results...

I simply would like to ask 'Please don't pollute the thread with unrelated problems, issues, etc'.

Thanks in advance. Sometimes I have the same problem!

>>>I simply would like to ask 'Please don't pollute the thread with unrelated problems, issues, etc'.>>>

Sorry for that. I know that I should already have created a new thread solely for the FFT testing. I will do it today.
