OMP: Error #136: Cannot create thread ( and #6, #8, #178 ) - Fortran project that reproduces the problem attached

Fortran project that reproduces the problem ( OMP errors ) attached. Output is as follows:

...
Matrix multiplication test
Enter No of ROWS / COLUMNS in A and B matricies ( integer ):
Recommended values: 1024, 2048, 4096, 8192, 16384, 32768, 65536, etc
32

Dimensions of matrices:

No of rows    N = 32
No of columns N = 32

Initializing...
Done...
Calculating...
OMP: Error #136: Cannot create thread.
OMP: System error #8: Not enough storage is available to process this command.
OMP: Error #178: Function GetExitCodeThread() failed:
OMP: System error #6: The handle is invalid.
...

Notes:

- Win32 Release configuration needs to be used
- Options:

Fortran:
Optimization -> Parallelization = Yes ( /Qparallel )
Libraries -> Use Intel Math Kernel Library = Parallel ( /Qmkl:parallel )

Linker:
System ->
Heap Commit = 268435456
Heap Reserve = 268435456
Stack Commit = 268435456
Stack Reserve = 268435456

268435456 = 256MB

- In total 1GB is reserved and ~1GB is still available for processing

 

Attachment: forttestapp.zip (43.47 KB)

Sergey,

On my system, Core i7 2600K (4 core, 8 HW threads)

256MB x 8 stacks = 2GB, + 256MB for heap + ?? code w/ libs .gt. 2GB in Win32

Code runs as x64

Try setting environment variable OMP_THREAD_LIMIT=4 (works here)

Jim Dempsey

www.quickthreadprogramming.com

Jim, Let me put this again in a different form:

- The Win32 Release configuration Must Be used in order to reproduce the problem
- There are no problems ( at least I don't see any ) with the x64 Release configuration

I think you completely missed Jim's point.  Each thread has to have its own stack.  From inspection of the .vfproj file you have /O3 set in conjunction with /Qparallel, so Matmul will involve multiple threads.  Typically the OMP subsystem creates one thread per virtual processor core.  How many such cores do you have? At 256 MiB a thread it isn't going to take too many to fill the usable address space in Win32.

Note that the commit figure is a subset of the reserve figure - you don't add them.

>>...Note that the commit figure is a subset of the reserve figure - you don't add them.

Ian, Sorry, but let me decide what we need.

I address my message to Intel software engineers:

There is an internal problem with the compiler, or OpenMP library, or MKL. I didn't miss anything and please take a look. Once again, the zip file has a test case that reproduces the problem with all these OpenMP error messages. A screenshot is in a zip file ( Img folder ).

My system is:

Dell Precision Mobile M4700
Intel Core i7-3840QM ( Ivy Bridge / 4 cores / 8 logical CPUs / ark.intel.com/compare/70846 )
32GB RAM
320GB HDD
NVIDIA Quadro K1000M ( 192 CUDA cores / 2GB memory )
Windows 7 Professional 64-bit

Sergey,

Sequence of operations and comments.

Your program starts and runs up to the READ(*,*) N
At this point 1 thread is running (use Task Manager to confirm this for yourself).

At this time the CODE + HEAP(used) + STACK(used) == 5,936K (on my system)

However, the Task Manager is NOT telling you the complete situation. The Task Manager will tell you the amount of Page File Space reserved for your application. This amount is dynamically determined as your application runs and touches (writes or reads) the memory.

Due to your project settings of Heap Reserve Size 268435456, and Stack Reserve Size 268435456 the virtual memory address space (on 32-bit Win32 totals 2GB) consumed is approximately: CODE(~5MB) + HEAP(256MB) + STACK(256MB) = ~ 517MB.

You specified the parallel version of MKL, so on your first call to MKL it will create an OpenMP thread pool of 8 threads (7 additional threads). Each of these carves 256MB of virtual address space out of what remains of your 32-bit Win32 process: 2GB - 0.517GB, say 1.5GB. By specifying 256MB of stack per thread, you can add an additional 5 or possibly 6 threads, but you cannot add an additional 7 threads without consuming more virtual memory than your process has left.

The fix for this is to reduce your Stack Reserve Size and Stack Commit Size to a reasonable working size, say 4MB. This will require you to program in such a manner that no thread exceeds 4MB of stack. You do this by having your large allocations come off the shared heap as opposed to the thread's stack.

Jim Dempsey


>>...The fix for this is to reduce your Stack Reserve Size and Stack Commit Size to a reasonable working size, say 4MB...

Did you reproduce the problem? I really don't understand the purpose of all these explanations ( from you and Ian ) when there is an internal problem with, as I already mentioned, the Fortran compiler, OpenMP library or MKL, or something else. I really expect that Intel software developers will take a look at it and will do an investigation.

The test project perfectly reproduces the problem and I'm Not the first developer who experienced this. The task is to look at why all these error messages are displayed:
...
OMP: Error #136: Cannot create thread.
OMP: System error #8: Not enough storage is available to process this command.
OMP: Error #178: Function GetExitCodeThread() failed:
OMP: System error #6: The handle is invalid.
...
and why it happens. I remember that in 2012 several threads were related to 'OMP: Error #136: Cannot create thread' problem.

I think you have simply run out of thread stack space due to the way you've set the linker options.  The only one you should even consider using is stack reserve space. If I remove all your heap and stack settings, the program runs fine on IA32 with 8 threads up to a size of 8192. After that I get "insufficient virtual memory".  Trying to work around this with heap reserve/commit sizes is counterproductive - I have yet to see an application where those settings are useful. Stack commit is also inappropriate.

Please also keep in mind that for OpenMP you may need to set the environment variable OMP_STACKSIZE to set the per-thread stack size.

Steve - Intel Developer Support

>>Did you reproduce the problem?

Yes, I reproduced the problem.

I also made the problem go away by setting the maximum number of OpenMP threads to 4 (3 + main thread):

FortTestApp Property Pages | Configuration Properties | Debugging | Environment | OMP_THREAD_LIMIT=4

(enter the text OMP_THREAD_LIMIT=4 into the edit box)

Also, when NOT setting a thread limit .AND. removing the Stack Reserve Size (0) and Stack Commit Size (0) settings, that is to say using the defaults (I think 4MB), the program also runs.

How many ways do we have to tell you to change your stack size specification or reduce your thread count? 32-bit Windows applications have only 2GB of virtual memory space regardless of physical memory (you have a BOOT.INI option to extend this to 3GB). 64-bit Windows has the smaller of ~1TB or page file max.

Jim Dempsey

 

 


Once again and I'm really sorry to repeat this again.

I created a reproducer of the problem 'OMP: Error #136: Cannot create thread ( and #6, #8, #178 )' for Intel software developers and I expect it could help with investigation. The project is Not intended for anything else especially with different values in project settings for Heap/Stack Commit/Reserve values.

Guys, I did Not ask any questions in my first post ( please read it again ) because I know what All that stuff means and how it affects memory. Do you really think I don't understand it?

I consider that discussion is over.

Sergey, could you explain what you expect to happen, given the context you've described?

I don't see internal errors - I just see the OMP runtime complaining that the underlying operating system has (predictably, given your system and compile options) run out of a resource, followed by some secondary errors. 

Why do you think this is some sort of "internal error"?

Here's a shorter "reproducer" that doesn't use mkl.  I specify the number of threads because otherwise the default on my system wouldn't trigger address space exhaustion.

!$OMP PARALLEL NUM_THREADS(8)
!$OMP END PARALLEL
END

>ifort /Od /Qopenmp TooMuchStack.f90 /link /stack:268435456,268435456 && TooMuchStack.exe
Intel(R) Visual Fortran Compiler XE for applications running on IA-32, Version 13.1.0.149 Build 20130118
Copyright (C) 1985-2013 Intel Corporation.  All rights reserved.
Microsoft (R) Incremental Linker Version 10.00.40219.01
Copyright (C) Microsoft Corporation.  All rights reserved.
-out:TooMuchStack.exe
-subsystem:console
-defaultlib:libiomp5md.lib
-nodefaultlib:vcomp.lib
-nodefaultlib:vcompd.lib
/stack:268435456,268435456
TooMuchStack.obj
OMP: Error #136: Cannot create thread.
OMP: System error #8: Not enough storage is available to process this command.
OMP: Error #178: Function GetExitCodeThread() failed:
OMP: System error #6: The handle is invalid.

Here's another "reproducer" that doesn't even involve OMP, or Fortran.  You can see that the OMP runtime is just passing on results and messages from the operating system.

#include <windows.h>
#include <stdio.h>

/* Thread body: just sleep so that the thread (and its stack) stays alive. */
DWORD WINAPI ThreadProc(LPVOID lpParameter)
{
  Sleep(100000);
  return 0;
}

int main()
{
  int i;
  HANDLE thread_handle;
  DWORD thread_id;
  DWORD last_error;
  CHAR *msg;

  /* A stack size of 0 means "use the defaults from the executable header",
     i.e. the values given to the linker with /stack. */
  for (i = 0 ; i < 8; ++ i) {
    thread_handle = CreateThread( NULL, 0, ThreadProc, NULL,
        CREATE_SUSPENDED, &thread_id );
    if (thread_handle == NULL) {
       /* Report the operating system's own error code and message text. */
       last_error = GetLastError();
       FormatMessageA(
            FORMAT_MESSAGE_ALLOCATE_BUFFER | FORMAT_MESSAGE_FROM_SYSTEM,
            0, last_error, LANG_SYSTEM_DEFAULT, (LPSTR) &msg, 0, NULL );
       printf( "Thread creation failed.\nSystem error %lu: %s\n",
            last_error, msg );
       LocalFree(msg);
       return 1;
    }
  }
  return 0;
}

>cl TooMuchStack.c /link /stack:268435456,268435456 && TooMuchStack.exe
Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86
Copyright (C) Microsoft Corporation.  All rights reserved.
TooMuchStack.c
Microsoft (R) Incremental Linker Version 10.00.40219.01
Copyright (C) Microsoft Corporation.  All rights reserved.
/out:TooMuchStack.exe
/stack:268435456,268435456
TooMuchStack.obj
Thread creation failed.
System error 8: Not enough storage is available to process this command.

Refer to the docs on thread stack size that I looked up yesterday before posting.  Note that the actual stack size created is MAX(reserve,commit) and that CreateThread only lets you specify one of commit or reserve (some quick debugging shows that OMP_STACKSIZE influences the reserve size - which makes sense) - the other is taken from the executable defaults.  Because you've specified executable defaults for commit and reserve to be 256 MB, there's no way that a Win32 subsystem program can then create a thread with a stack less than that size.

>>Sergey, could you explain what you expect to happen, given the context you've described?
>>
>>I don't see internal errors - I just see the OMP runtime complaining that the underlying operating system has (predictably,
>>given your system and compile options) run out of a resource, followed by some secondary errors.
>>
>>Why do you think this is some sort of "internal error"?

I will do my best and provide a detailed report, screenshots, description, etc, tomorrow.

Sergey,

Look at your first post in this thread. I will put an edited clip of your post here:

Linker:
System ->
Heap Commit  = 268435456 \
Heap Reserve = 268435456  \- (for the process (all threads))
Stack Commit = 268435456 \
Stack Reserve = 268435456 \- (Per thread)

268435456 = 256MB

- In total 1GB is reserved and ~1GB is still available for processing
** at the point in your program where 1 thread is running **
This is before MATMUL which uses MKL and attempts to launch (on your system) 7 more threads.

1 thread ~1GB available
2 threads ~768MB available
3 threads ~512MB available
4 threads ~256MB available
5 threads ~0MB available (may crash here)
6 threads -256MB available (will crash here)
7 threads -512MB available
8 threads -768MB available

MKL, on your system, will (without limiting the thread count) attempt to create 7 more threads. Your options require each additional thread to be given 256MB of stack space. Your application runs out of memory before all of the requested additional threads have been created.

There is no reason for your project settings to specify such a large stack. If, on the other hand, there is an insurmountable reason for having this size of stack, then on 32-bit you will have to reduce the number of threads your application will use. IOW you must either reduce stack size .OR. reduce thread count.

Jim Dempsey


There is an internal problem with processing because the test application hangs when trying to multiply very small matrices, such as two 2x2 matrices. Do you really think that 8 threads are needed to calculate the product of two 2x2 matrices?

Don't mix KMP_STACKSIZE or OMP_STACKSIZE with Windows module definition file ( def file ) values /STACKSIZE and /HEAPSIZE and please review MSDN.

If some Intel software developer decided to use a /STACKSIZE value as an input value for the stack size in the Win32 API function CreateThread, this is wrong; KMP_STACKSIZE or OMP_STACKSIZE has to be used instead. The default values for the Intel OpenMP library are as follows:

KMP_STACKSIZE - 32-bit platforms: 2MB
KMP_STACKSIZE - 64-bit platforms: 4MB

There is no reason to use a /STACKSIZE value from a Windows module definition file ( def file ) as a stack size for a thread on the Windows platform.

Sergey, I can't reproduce any hangs.

Steve - Intel Developer Support

Steve, I could create a small video as proof of the problem. Do you want me to make it? Please let me know.

>>Do you really think that 8 threads are needed to calculate the product of two 2x2 matrices?

No; however, the way your project is configured you have requested MKL, and for MKL to use a thread pool of size = 8 (iow add 7 additional threads). By NOT specifying a thread limit, the default is to instantiate a thread pool with number of software threads == number of hardware threads (your system has 8 hardware threads).

Depending on implementation, OpenMP may create an additional watchdog thread, and may also create an additional thread for buffered writes (though I cannot say what stack size it may choose for these potential additional threads). It is not productive for you to shirk your responsibility of managing the resources available to you.

Have you coded your program in such a manner as to require such a large stack space?

Assume, for some reason known to you, that you have chosen to make local arrays stack based (as opposed to heap based):

PROGRAM foo
REAL :: ARRAY(67108864) ! ~256MB (force to be on stack)
!$OMP PARALLEL DO
DO I=1,67108864
ARRAY(I) = I
END DO
!$OMP END PARALLEL DO
END PROGRAM foo

In the above program, only the main thread requires 256MB of stack. The remaining threads use a reference to ARRAY (a pointer of sorts to the array), so they could function with 2MB of stack.

When the project configuration is set to make ARRAY SAVE or a heap array, the main thread of the above example could also get by with a smaller stack.

Jim Dempsey

 


Sergey, I want you to remove all of the values for heap and stack under the Linker properties except for stack reserve. Then try again.

Steve - Intel Developer Support

Quote:

Don't mix KMP_STACKSIZE or OMP_STACKSIZE with Windows module definition file ( def file ) values /STACKSIZE and /HEAPSIZE and please review MSDN.

If some Intel software developer decided to use a /STACKSIZE value as an input value for the stack size in the Win32 API function CreateThread, this is wrong; KMP_STACKSIZE or OMP_STACKSIZE has to be used instead. The default values for the Intel OpenMP library are as follows:

KMP_STACKSIZE - 32-bit platforms: 2MB
KMP_STACKSIZE - 64-bit platforms: 4MB

There is no reason to use a /STACKSIZE value from a Windows module definition file ( def file ) as a stack size for a thread on the Windows platform.

CreateThread only has one argument that is used to specify either the reserve size or the initial commit size.  The documentation for CreateThread and thread stack size selection explains that whichever one is specified in the API call, the other is taken from the executable defaults - i.e. from the linker settings.

Put a breakpoint on the kernel32!CreateThread entry point and check for yourself - the OMP runtime does request a stack reservation that is based off OMP_STACKSIZE and friends.  But see my previous post - because the actual stacksize is MAX(reserve, commit) and because you have specified such a large commit (Why?  What's preventing you from relying on the operating system's automatic stack page commit system?) the environment variable becomes irrelevant.  You are getting exactly what you've asked for.

(Here (13.1.0) with my three line example, the OMP runtime's process exit routine fails to complete.  I think this is because during cleanup the runtime queries the state of the thread that failed to be created (or a thread that should have been created after the thread that failed).  This fails, the thread generates the secondary error (the GetExitCodeThread error), following which the process exit routine is called... again.  Hello recursion.  Early on in the second pass through that routine the runtime attempts to acquire a lock that it already has and perhaps a deadlock results.  This might be further complicated because part of that cleanup occurs during DLL_PROCESS_DETACH. But note your Fortran program is well and truly hosed by this point - the fundamental problem is that you are asking the system to do the impossible.  Controlled process termination from this sort of edge case scenario is always going to be a challenge.)

>>... I want you to remove all of the values for heap and stack under the Linker properties except for stack reserve...

A screenshot is enclosed. As soon as the input values for the matrices reach extreme values ( significantly greater than 2GB for a 32-bit application ) it crashes ( absolutely expected ), but the error message is incorrect and it needs to be something like No More Available Memory. You can see how much virtual memory my computer has.

Steve, I really need to do other technical things on a project and I expect that all remaining investigations, tests, etc, have to be done internally at Intel. Sorry, but this is taking more and more of my time, and as I mentioned, I've provided the test project for Intel.

Attachment: fortraninternalproblem.jpg (208.68 KB)

I would really like to thank Jim, Ian and Steve for the feedback but, unfortunately, I need to do different technical things.

Best regards,
Sergey

Sergey,

You have a case where "Virtual Memory" != "Virtual Memory" (a seemingly invalid statement).

System Total Virtual Memory is the address space available to the system, expressed in 64 bits. This happens to be Page File Space + Physical Memory. The system can parcel out any of this memory to any mix of 64-bit and 32-bit applications. A 64-bit application could conceivably use 64 bits of address space, but processor design may reduce this (52 bits, 48 bits, ?? bits). The processor on your system may be restricted to ~1TB of address space; however, the O/S is set up to provide only 128GB to the system.

Now comes the important part. You chose to build a 32-bit application that runs as a 32-bit application within a 64-bit system. 32-bit programs have 4GB of addressability; however, on Windows 2GB of this addressability is reserved for (32-bit) O/S purposes (1GB if using large address). This means your 32-bit application only has 2GB of addressability (Virtual Memory), which can be placed anywhere in the 128GB of Virtual Memory available to the 64-bit O/S.

The block you highlighted in red on lower left part of screen indicates Insufficient Virtual Memory _inside_ your available 2GB of Virtual Memory.

Reduce your stack size requirement.

Jim Dempsey


All virtual memory allocations made by a user mode application end up calling the VirtualAlloc API functions. When the Memory Manager (which sits below VirtualAlloc) sees that the system is running low on memory, a null pointer is returned to the caller. When a 32-bit application runs in 64-bit Windows, that application is given the full 32-bit address space. On 64-bit Windows an application can use up to 8TB of address space. There is also some PTE related overhead on 32-bit and 64-bit versions of Windows, and moreover system resources also need to be taken into account (like paged and non-paged pools, MMIO space, PTEs). There is also a difference between reserved and committed memory. Committed memory is the sum of physical memory and page file, but sometimes a process will fail to allocate up to the 2GB limit because not all of the physical memory address space available to the process will be committed by the OS at any moment.

Intel software engineers should take care of the problem if they consider it interesting. I think the discussion is over.

I have not yet seen any evidence that there is a problem for us to solve. I can't reproduce a problem with sane values for the linker settings. As I wrote earlier, I think you used those settings in an attempt to get around 32-bit addressing space limits, but it just made things worse.

Steve - Intel Developer Support
