Determining the Root Cause of Segmentation Faults (SIGSEGV or SIGBUS Errors)

 

 

Problem: When I run my code compiled with the Intel Fortran Compiler, I get 'sigsegv' on Linux (or 'sigbus' on Mac OS X).  This code has run fine for years on <insert your previous compiler/platform>.  Is this a bug in the Intel Compiler?


Environment: Linux or Mac OS* X


Root Cause: A segmentation fault (a bus error on Mac OS X) is a general fault that can have many causes.  We outline the potential causes below and give suggestions for avoiding the segmentation fault.


Possible Cause #1, Fortran-Specific Stack Space Exhaustion.  Solution, try the -heap-arrays compiler option.

The Intel Fortran Compiler uses stack space to allocate a number of temporary or intermediate copies of array data.

Non-OpenMP and non-auto-parallelized applications: If your program does not use OpenMP or auto-parallelization (the -parallel compiler switch), and you are using a Linux compiler newer than v9.1.037 or any Mac OS* compiler, try the -heap-arrays compiler option.  Users of OpenMP or auto-parallelization, and users of Linux compilers older than v9.1.037, should skip ahead to Possible Cause #2 for tips on unlimiting the stack size.

-heap-arrays

If this removes the sigsegv or bus error, you may STOP at this point.  You may wish to read the attached PDF presentation (link at bottom of page) to learn when and where array temporaries are created.  With a few code changes you may be able to avoid some array temporaries, which reduces your application's need for temporary copies and improves performance.  Also, the -heap-arrays compiler option takes an optional argument [size], a threshold in Kbytes: arrays larger than [size] are allocated on the heap, and all others on the stack.  For example:

-heap-arrays 10

puts all automatic and temporary arrays larger than 10 Kbytes on the heap.


Possible Cause #2, Stack Space Exhaustion.  Solution, unlimit the stack size for OpenMP applications or any application.

The -heap-arrays option can have unwanted effects on data sharing in OpenMP or auto-parallelized code.  Because of this, OpenMP and auto-parallelization users are advised not to use -heap-arrays and instead to unlimit their shell stack size.  For any application, the first step is to try to increase your shell stack limit on Linux and Mac OS* X.

Linux, bash:    ulimit -s unlimited
Linux, csh/tcsh:   unlimit stacksize

You may check your stack size limit with:
bash:  ulimit -a
csh:    limit
and look for 'stack size' limit for your shell environment
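
The check can also be scripted.  The following is a minimal sketch, assuming a bash-compatible shell; the variable names are illustrative.  It prints the soft and hard stack limits and tries to raise the soft limit up to the hard limit, which is the most an unprivileged user can set:

```shell
#!/bin/bash
# Report the soft and hard stack limits for this shell.
# Values are in Kbytes, or the word "unlimited".
soft=$(ulimit -S -s)
hard=$(ulimit -H -s)
echo "soft stack limit: $soft"
echo "hard stack limit: $hard"

# Try to raise the soft limit to the hard limit. The change applies only
# to this shell and to programs started from it.
if [ "$soft" != "$hard" ]; then
    if ulimit -S -s "$hard" 2>/dev/null; then
        echo "soft stack limit raised to: $(ulimit -S -s)"
    else
        echo "could not raise the soft stack limit"
    fi
fi
```

Because the new limit applies only to the current shell and its children, the application must be launched from the same shell or script.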

Note:  If you run your program under the control of a batch subsystem, you may need to add the command above to your user startup files ( ~/.bashrc  ~/.profile  or ~/.cshrc ).

On Mac OS* X, there is a hard upper limit on the shell stack size.  On most systems, the highest setting is:

bash:  ulimit -s 65532

which sets the limit to about 64 Mbytes (65532 Kbytes).

An alternative is to use a linker option to increase the executable's default stack size, as documented here:  /en-us/articles/intel-fortran-compiler-increased-stack-usage-of-80-or-higher-compilers-causes-segmentation-fault

Re-run your application.  If this fixes the issue, you may stop.  If your application still generates a sigsegv or bus error, continue reading.

Possible Cause #2-prime, Stack Exhaustion Due to Heap or General Memory Exhaustion.
In the process memory map, heap and stack grow towards each other.  If they collide, this too can cause a segmentation fault on either the heap allocation or the next stack allocation. 

It is also possible for an application to exhaust all of physical memory plus swap space.  Remember, with a 64-bit application, your VIRTUAL memory is practically unlimited.  However, the realistic amount of memory that can be consumed has a ceiling at PHYSICAL RAM + swap space (swap is typically 2x the physical memory size).  You can get this information with the 'free' command.  Physical memory and swap are also shown by 'cat /proc/meminfo' in the fields 'MemTotal' and 'SwapTotal'.  The system typically needs some space for itself, so a rule of thumb is to keep the memory footprint of your application to around 80% of MemTotal if possible and never to exceed MemTotal + SwapTotal.

Compile and link with -g -traceback to locate where your code is aborting.
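
The rule of thumb above can be computed directly from /proc/meminfo.  This is a rough sketch for Linux only; the function name mem_budget and the output labels are made up for illustration, and the 80% figure is simply the rule of thumb stated above:

```shell
#!/bin/bash
# mem_budget [FILE]: print MemTotal, SwapTotal, an 80%-of-RAM target and the
# RAM + swap ceiling, all in Kbytes. FILE defaults to /proc/meminfo.
mem_budget() {
    local file="${1:-/proc/meminfo}"
    awk '
        $1 == "MemTotal:"  { mem = $2 }
        $1 == "SwapTotal:" { swap = $2 }
        END {
            printf "MemTotal_kB=%d\n", mem
            printf "SwapTotal_kB=%d\n", swap
            printf "Target_kB=%d\n", int(mem * 8 / 10)   # about 80% of physical RAM
            printf "Ceiling_kB=%d\n", mem + swap         # never exceed RAM + swap
        }' "$file"
}

# Only meaningful where /proc/meminfo exists (Linux).
if [ -r /proc/meminfo ]; then
    mem_budget
fi
```

Compare the reported target with your application's expected footprint, for example the total size of its largest arrays.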


Possible Cause #3, Stack Corruption Due to User Coding Error.

There are a number of user coding errors that can cause stack corruption and lead to a sigsegv or bus error at run time.  These errors are particularly hard to find since the segmentation fault may occur later in the program, in a section unrelated to where the stack was initially corrupted.

The first step is to try to isolate where in the code the fault occurs.  This is done by generating an execution 'traceback'.  Compile and link using the ifort driver and these options:

-g -traceback

When the code faults, you will often get a report showing the call stack at the point of the fault.  If you do not get a stack traceback, ensure that you have used -g for both compilation and link, and make sure that -traceback was used on the compilation.  There are cases where the segmentation fault occurs while the program is in kernel space, and thus no user stack is available for the traceback.  We are working to improve this in a future release.

This traceback report is read from the bottom of the list upwards.  Find the uppermost subroutine or function from your code, along with its line number, to isolate which statement caused the fault.  Check for user coding errors at this statement.  If there is no obvious user error, continue below.


Possible Cause #4, Exceeding Array Bounds.  Solution, try -check bounds.

The -check bounds compiler option provides a run-time check of array accesses and character string expressions to ensure that the indices are within the boundaries of the array.  This checking is useful for finding cases where the indices go outside the declared size of the array.  This option has a big impact on performance, the magnitude of which depends on how many array accesses are performed in the application.  Also, -check bounds checking is not performed for arrays that are dummy arguments in which the last dimension bound is specified as * or when both upper and lower dimensions are 1.   To enable bounds checking, compile with:

-check bounds -g

and run your program.  The checking is performed at run time, not at compile time.  If this finds your error, STOP.  Otherwise, keep reading.
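
To see what the check reports, you can try it on a deliberately broken program.  The sketch below is illustrative only: the file name oob_demo.f90 and the program itself are hypothetical, and the script assumes the ifort driver is on your PATH (it skips the build otherwise).  The loop writes one element past the end of a 10-element array:

```shell
#!/bin/bash
# Write a small Fortran program (hypothetical file oob_demo.f90) that
# stores one element past the end of a 10-element array.
workdir=$(mktemp -d)
cat > "$workdir/oob_demo.f90" <<'EOF'
program oob_demo
  implicit none
  integer :: a(10), i, n
  n = 11                 ! one more than the declared size of a
  do i = 1, n
    a(i) = i             ! a(11) is out of bounds on the last iteration
  end do
  print *, a(10)
end program oob_demo
EOF

# Build with run-time bounds checking and run, if ifort is installed.
if command -v ifort >/dev/null 2>&1; then
    ifort -check bounds -g -traceback "$workdir/oob_demo.f90" -o "$workdir/oob_demo"
    "$workdir/oob_demo" || echo "program stopped with a run-time error"
else
    echo "ifort not found; source left in $workdir/oob_demo.f90"
fi
```

Without -check bounds, the same program may appear to run correctly while silently overwriting neighboring memory, which is why this class of error tends to surface later as a segmentation fault.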


Possible Cause #5, calling a function as a subroutine, or invoking a subroutine as if it were a function. 

These are user coding errors where a user does something similar to this:

--- main program ---
...
call ThisIsIllegal( some_arguments )
...
--- end main program ---

--- ThisIsIllegal ---
integer function ThisIsIllegal( some_arguments )
...
--- end ThisIsIllegal ---

In the example above, the main program calls ThisIsIllegal as if it were a subroutine, but ThisIsIllegal is declared as a function.  This can cause stack corruption.  To test for these conditions, try using the compiler options

-fp-stack-check -g -traceback

Compile with these options and run.  If the stack is corrupted by something similar to the above, your code will exit and give a stack trace.

You can check the interfaces of your procedures with a compile time check:

-gen-interfaces -warn interfaces

This compile-time check generates INTERFACE blocks for your procedures.  The -warn interfaces option then uses these compiler-generated interfaces to check the calls to your procedures, making sure that arguments and interfaces match between caller and callee.  Note that this check occurs only for Fortran source files; it will not check interfaces in mixed-language programs.


Possible Cause #6, large array temporaries caused by passing non-contiguous array sections.  Solution, detect with -check arg_temp_created and fix with a coding change that adds an explicit interface and assumed-shape arrays.

Consider this 'before' example:

--- main program ---
real(8) :: f(1800,3600,1)
external sub
...
call sub( f(1:900,:,:) )
...
--- end main program ---

and the "sub" subroutine is in a separately compiled source file:
--- external subroutine "sub" ---
subroutine sub( f )
real(8) :: f(900,3600,1)
...
--- end subroutine "sub" ---

In this case, "sub" is expecting a contiguous array of size 900x3600x1.  However, the call is passing an array section that is not contiguous in memory.  In situations such as this, the compiler will make an array temporary at the call and copy the elements of the array "f" from non-contiguous chunks into a contiguous array such as "sub" is expecting.  This temporary is allocated on the stack unless -heap-arrays is specified.

To check if this is occurring in your code, compile with
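
To get a feel for the sizes involved, the temporary in this example holds 900 x 3600 x 1 elements of 8 bytes each.  The short sketch below (variable names are illustrative) works out that size and compares it with the current soft stack limit of the shell:

```shell
#!/bin/bash
# Size of the temporary for f(1:900,:,:) with real(8) elements (8 bytes each).
temp_bytes=$(( 900 * 3600 * 1 * 8 ))
temp_kb=$(( temp_bytes / 1024 ))
echo "array temporary: $temp_bytes bytes (about $temp_kb Kbytes)"

# Compare against the current soft stack limit, which the shell reports in Kbytes.
stack_kb=$(ulimit -S -s)
if [ "$stack_kb" = "unlimited" ]; then
    echo "stack limit is unlimited"
elif [ "$temp_kb" -ge "$stack_kb" ]; then
    echo "temporary does not fit within the ${stack_kb} Kbyte stack limit"
else
    echo "temporary fits within the ${stack_kb} Kbyte stack limit"
fi
```

At roughly 25 Mbytes, this single temporary is larger than the 8 Mbyte default stack limit commonly found on Linux systems, so a call like this one can exhaust the stack on its own.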
-check arg_temp_created

and run the program.  Messages will be written when argument temporaries are created.  To work around the issue, create an explicit interface and use an assumed-shape array in "sub"; this removes the need for an array temporary:

--- main ---
real(8) :: f(1800,3600,1)
interface
subroutine sub(f)
real(8) :: f(:,:,:)
end subroutine sub
end interface
...
call sub( f(1:900,:,:) )
...
--- end main program ---

--- "sub" ---
subroutine sub( f )
real(8) :: f(:,:,:)
...
end subroutine sub

Keep in mind that although this avoids the array temporary, within "sub" the compiler is now aware that the array "f" may be non-contiguous.  As a result, some optimizations on statements using "f" may be disabled, which can affect performance.


Case NONE OF THE ABOVE:  Solution, more in-depth analysis is needed

99% of the sigsegv or bus error cases tend to fall into the categories above.  However, there are other cases where segmentation faults can occur. 

If your application is linking in external libraries, make sure that the library is compatible with your compiler.  Was the external library compiled with the Intel Compiler?  If so, were the major versions the same - that is, was the library compiled with Intel Fortran v9.1 but your application built with Intel Fortran v10.x or v11.x?  Intel only guarantees compatibility within major versions ( 9, 10, 11 are examples of major versions). 

If the external library is from a software vendor or tool:  does this vendor explicitly name the Intel Compiler as compatible, and if so, with which version(s) have they verified their library?  You should only use the version(s) of the Intel Compiler certified by your vendor.  If you need an older version of the Intel Compiler, please see How do I get an older version of an Intel® Software Development Product.


When all else fails ....

Post a note to the User Forum HERE.   Please include the name of your application if it is a commonly available code, and post a stack trace (if you can get one), the compiler options used, and ideally a tarball of the entire application, input files and instructions on how to run the program.

If you have support for your product, you can open an issue at http://premier.intel.com.

For more background information, try the excellent Dr. Fortran Article "Don't Blow Your Stack!"

And read the PDF presentation attached to this article Fortran Compiler Use of Temporaries: Stack+usage.pdf

 
