Couldn't create more than 981 OpenMP threads with Intel(R) C++ Composer XE 12 Update 9 - RESOLVED - more than 18,607 threads created

Sergey Kostrov

I'd like to report an OpenMP related problem.

Intel software engineers stated some time ago that Intel's implementation of OpenMP allows creating up to 16,384 threads.

I've just completed a test, and an OpenMP-based application compiled with Intel C++ Composer XE 12 Update 9 couldn't create
more than 981 OpenMP threads.

Error messages are as follows:

...
OMP: Error #136: Cannot create thread.
OMP: System error #8: Not enough storage is available to process this command.
OMP: Error #178: Function GetExitCodeThread() failed:
OMP: System error #6: The handle is invalid.
...

OpenMP support was enabled in the Visual Studio project: Generate Parallel Code (/openmp, equiv. to /Qopenmp).

My environment:

OS: Windows XP 32-bit
IDE: Visual Studio 2005 SP1
C++ compiler: Intel C++ Composer XE 2011 Update 9

Best regards,
Sergey

Vladimir Polin (Intel)

Hello Sergey, Do you have enough free memory and enough handles? It looks like you have reached the 2GB-per-process Windows limitation. Could you work with the 64-bit version to get more threads working? --Vladimir

Sergey Kostrov
Hi Vladimir,

I'll continue investigation today and keep you informed.

Quoting Vladimir Polin (Intel) ...Do you have enough free memory and number of handles?

Yes.

It looks like you have reached the 2GB-per-process Windows limitation.

The total amount of allocated memory ( for thread stacks, etc. ) was significantly less than 2GB, and I'll provide
exact numbers later.

Could you work with 64 bit version to get more threads working?

No.

Best regards,
Sergey

jimdempseyatthecove

Without using the BOOT.INI option to instruct the 32-bit Windows to permit processes to use up to 3GB of user space, the user application is limited to 2GB (plus system space in upper 2GB address range).

With 2GB
Subtract code size
Subtract static data
Subtract main thread initial stack
The remaining memory is in your initial heap
If you perform allocations before creating your threads, subtract those from the amount of available memory as well.

Assume for example you have 1GB remaining.

Default thread stack limit is 1MB. Therefore 1000 threads could possibly be created in the remaining 1GB assuming they used no additional resources. *** and leaving 0 RAM for additional allocations ***

64-bit does not have this limitation.

Does your system have more than 981 logical processors?
If not, then why so many threads???

Jim Dempsey

www.quickthreadprogramming.com
Tim Prince

The (important) thread-affinity facilities of each OpenMP implementation are limited to the number of logical processors with hardware support on the supported systems (no Intel platforms currently support more than 248 logical processors, and not on Windows).

Sergey Kostrov

Jim, Tim,

I'll follow up on your posts some time later. Thank you for the feedback!

I'm simply overwhelmed by a number of different issues and little problems related to the integration of the Intel C++ compiler with the project.

Best regards,
Sergey

Sergey Kostrov

Hi Vladimir,

I still can't resolve the problem. Here is a new Test-Case 2, and it reproduces the problem:

	// Test-Case 2 - Maximum number of OpenMP threads for Intel C++ compiler ( XE v12.1.3 )

	...

	uint uiNumThreads = 0;
//	uiNumThreads =  512;							// No Errors: Created 512 threads
	uiNumThreads =  981;							// No Errors: Created 981 threads
//	uiNumThreads =  982;							// OMP: Error #136: Cannot create thread
//	uiNumThreads = 1024;							// OMP: Error #136: Cannot create thread

	omp_set_num_threads( uiNumThreads );

	#pragma omp parallel for
	for( int i = 0; i < 4096; i++ )
	{
		int iValue = 2;

		// Print the iteration index and the 1-based number of the thread that ran it
		printf( "Iteration: %4d - Thread %4d out of %4d\n",
			   i, omp_get_thread_num() + 1, ( int )uiNumThreads );
	}

	...


Could you forward my concerns to the Intel Engineering Team, please?

Best regards,
Sergey

Sergey Kostrov
Quoting Sergey Kostrov
Hi Vladimir,

I'll continue investigation today and keep you informed.

Quoting Vladimir Polin (Intel) ...It looks like you have reached the 2GB-per-process Windows limitation.

The total amount of allocated memory ( for thread stacks, etc. ) was significantly less than 2GB, and I'll provide
exact numbers later.

Here is a screenshot ( ~110MB allocated ):

983 - 2 ( Default process threads of the test application )= 981

Vladimir Polin (Intel)

Hello Sergey, try this example

#include <windows.h>
#include <stdio.h>

// Each thread just sleeps so that it stays alive while the others are created
DWORD WINAPI thread_routine(LPVOID lpParameter)
{
	Sleep(20000);
	return 0;
}

int main()
{
	unsigned int uiNumThreads = 0;
	HANDLE h[10000];

	// Create threads until CreateThread fails, then print the index that failed
	for (uiNumThreads=1; uiNumThreads<10000; uiNumThreads+=1)
	{
		h[uiNumThreads]= CreateThread ( NULL, 0, (LPTHREAD_START_ROUTINE) thread_routine, (LPVOID)(UINT_PTR)uiNumThreads, 0, 0 );
		if( h[uiNumThreads] == NULL ){
			printf( "Kernel object limit is %u\n",uiNumThreads);
			break;
		}
	}

	Sleep(1000);
	return 0;
}

To utilize all 2^24 kernel objects you need a 64-bit OS and a 64-bit application. This has nothing to do with OpenMP. --Vladimir

Vladimir Polin (Intel)
Quoting TimP (Intel) (no Intel platforms currently support more than 248 logical processors, and not on Windows).

In theory our RTL should work on the 4096-way SGI* UV 1000 on Windows (http://www.sgi.com/products/servers/uv/specs.html). Are there any volunteers to check? :)

--Vladimir

jimdempseyatthecove

What I think the task manager is failing to account for is the address space reserved by the threads, as opposed to the page file space committed. Let's see if I can explain (surmise) this.

When your test program starts and runs up to the point just before OpenMP starts, your virtual memory address space looks something like this (order may differ):

(4KB reserved) at 0x00000000
(static data) at +4KB
(code)
(initial heap)
(unmapped address) 2GB/3GB less above and below items
(reserved 4KB)
(main thread stack)
--------------------
0x80000000 or 0xC0000000 to 0xFFFFFFFF system address space of your virtual memory

If/when the heap expires prior to or following additional thread allocations, additional heaps are mapped/allocated/reserved from the unmapped address space (a portion thereof), assuming there is available address space.

Now then, when a new thread is allocated/created (the surmise part):

The O/S checks the unmapped address space to see if it has sufficient space for:

thread stack (default 1MB, you may specify differently)
guard page (4KB on x32)
optional thread context information (?KB)

These addresses come out of the virtual memory address space (assuming address space available)

*** Now then, each 4KB page of a thread's stack holds a reservation of 4KB of virtual address space, but until that page is touched (until something is pushed onto it) it requires neither physical memory nor page file space. The first touch causes a page fault, and the O/S then maps the page (assuming available page file space). A similar thing happens each time you add an additional heap (expand the heap).

What this means is your 981 threads have:

981x (default thread stack + 4KB guard) virtual address space consumed (~1GB)
981x (4KB touched stack + 4KB guard) RAM/pagefile space consumed (~8MB)

When the program attempts to allocate the 982nd thread there is no available virtual address space.

At least this is my assessment as to what you are observing.

As TimP pointed out, in OpenMP, creating more threads than you have logical processors is generally counter-productive.

Jim Dempsey

Sergey Kostrov
Quoting Vladimir Polin (Intel) Quoting TimP (Intel) (no Intel platforms currently support more than 248 logical processors, and not on Windows).

In theory our RTL should work on the 4096-way SGI* UV 1000 on Windows (http://www.sgi.com/products/servers/uv/specs.html). Are there any volunteers to check? :)

I would be glad to verify it.

I finally resolved it, and my test application created more than 16,384 threads. The maximum number of threads I was able
to see was 18,623!

I'll provide more details later today.

Best regards,
Sergey

Sergey Kostrov

Hi Vladimir,

Thank you for the Test-Case.

Best regards,
Sergey

jimdempseyatthecove

>>A maximum number of threads I was able to see was 18,623!

Yes, you can (by reducing the stack size) but on x32 what is the point?

In a compute-bound system, having more software threads than available hardware threads is generally counterproductive. There may be a few outlier cases where a bad algorithm sees better performance (I should say, works at all). An example might be a poorly written mesh filter where node progress is blocked by waiting for other node(s) to complete. A better way to write this type of program would be to use a tasking-based system where the software thread migrates from task to task, as opposed to having more threads.

Jim Dempsey

Igor Levicki

I guess that some people get a kick out of testing whether a compiler conforms to published specifications :)

-- Regards, Igor Levicki If you find my post helpful, please rate it and/or select it as a best answer where applicable. Thank you.
Sergey Kostrov

Hi Vladimir,

The "problem" was related to the OMP_STACKSIZE environment variable. By default it is set to 2MB for 32-bit platforms
in the Intel OpenMP library. I've changed OMP_STACKSIZE to a minimal value, and the test application created significantly more OpenMP threads.

Screenshots are enclosed.

Best regards,
Sergey

Sergey Kostrov

Screenshot 1 ( Task Manager - Processes):

Sergey Kostrov

Screenshot 2 ( Task Manager - Performance ):

You can see that the test application crashed as soon as all available memory was allocated.

Vladimir Polin (Intel)

Good for you, Sergey. I'm wondering whether you can find a practical application for your experiments. --Vladimir

Sergey Kostrov
Quoting Vladimir Polin (Intel) Hello Sergey, try this example ...
	for (uiNumThreads=1; uiNumThreads<10000; uiNumThreads+=1)
	{
		h[uiNumThreads]= CreateThread ( NULL, 0, (LPTHREAD_START_ROUTINE) thread_routine, (LPVOID)uiNumThreads, 0, 0 );
		if( h[uiNumThreads] == NULL ){
			printf( "Kernel object limit is %d\n",uiNumThreads);
			break;
		}

...

Hi Vladimir,

Here are a couple of questions:

How many threads did it create on your system?
Is it a 32-bit or 64-bit system?

By default your example creates Win32 threads with a 1MB stack size.

I'll provide results of my tests obtained with my own Test-Case some time later. I also would be glad to see
your results for a 64-bit system!

Best regards,
Sergey

Sergey Kostrov
Quoting Sergey Kostrov ...I'll provide results of my tests obtained with my own Test-Case some time later. I also would be glad to see
your results for a 64-bit system!
...

Here it is:

Total number of Win32 threads created on Windows XP Operating System ( 32-bit / 4GB of VM )

                            C/C++ compilers
Stack size
in bytes             MSC      BCC      MGW      ICC
       1024        30,548   30,575   30,716   30,533
       2048        30,548   30,575   30,716   30,533
       4096        30,548   30,575   30,716   30,533
       8192        30,548   30,575   30,716   30,533
      16384        30,548   30,575   30,716   30,533
      32768        30,548   30,575   30,716   30,533
      65536        30,548   30,575   30,716   30,533
     131072        15,735   15,750   15,823   15,727
     262144         7,985    7,995    8,032    7,980
     524288         4,019    4,027    4,047    4,017
1MB 1048576         2,016    2,021    2,031    2,015
2MB 2097152         1,006    1,011    1,015    1,005
4MB 4194304           501      504      507      500

Note 1: Release configurations / All optimizations disabled / Console Win32 applications
Note 2: MSC - Microsoft C++ compiler v8.0
        BCC - Borland C++ compiler v5.5
        MGW - MinGW C++ compiler v3.4.2
        ICC - Intel C++ compiler v12.1.3

Best regards,
Sergey

Sergey Kostrov

I found a very interesting article regarding limits of Windows platform(s):

http://blogs.technet.com/b/markrussinovich/archive/2009/07/08/3261309.aspx
