Weird seg fault during sqrt

hallevison wrote:

Hi Everyone:

I am getting a seg fault that I don't understand.  It occurs on the last line of the following code fragment:

cvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
         if( (vrel2.lt.0.0d0) .or. (vrel2.ne.vrel2) ) then  
             write(*,*) 'Here #DMM30 ',vrel2
          endif
c^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

         vrel = sqrt(vrel2)

vrel and vrel2 are simple real*8's that are local to the subroutine.  The code is compiled with the -O -CB -traceback -fpe0 -recursive -openmp flags.  The code is parallelized with OMP but this part of the code is not in a parallel loop.  It IS at the bottom of a recursive loop, however. The code runs for about 12 hours before the error occurs and the subroutine is called many billions of times.  The traceback looks like:

forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source             
stryAFdebug        00000000004B57B4  Unknown               Unknown  Unknown
libpthread.so.0    0000003F4820EEE0  Unknown               Unknown  Unknown
stryAFdebug        00000000005405DB  Unknown               Unknown  Unknown
stryAFdebug        0000000000533D85  Unknown               Unknown  Unknown
stryAFdebug        00000000004823F3  discard_mass_merg        6507  stryAFdebug.f
stryAFdebug        0000000000478870  symba6_merge_            4454  stryAFdebug.f
stryAFdebug        000000000047564F  symba6_step_recur       10729  stryAFdebug.f
stryAFdebug        000000000047537A  symba6_step_recur       10712  stryAFdebug.f
stryAFdebug        000000000047537A  symba6_step_recur       10712  stryAFdebug.f
stryAFdebug        000000000047537A  symba6_step_recur       10712  stryAFdebug.f
stryAFdebug        000000000047537A  symba6_step_recur       10712  stryAFdebug.f
stryAFdebug        000000000047537A  symba6_step_recur       10712  stryAFdebug.f
stryAFdebug        000000000047537A  symba6_step_recur       10712  stryAFdebug.f
stryAFdebug        000000000047537A  symba6_step_recur       10712  stryAFdebug.f
stryAFdebug        000000000047537A  symba6_step_recur       10712  stryAFdebug.f
stryAFdebug        000000000047537A  symba6_step_recur       10712  stryAFdebug.f
stryAFdebug        000000000047537A  symba6_step_recur       10712  stryAFdebug.f
stryAFdebug        000000000047537A  symba6_step_recur       10712  stryAFdebug.f
stryAFdebug        00000000004776CC  symba6_step_recur       10614  stryAFdebug.f
stryAFdebug        0000000000474A4A  symba6_step_inter        5085  stryAFdebug.f
stryAFdebug        00000000004A8C1B  symba6_step_pl_           183  symba6_step_pl.f
stryAFdebug        000000000040B0E2  MAIN__                    661  stryAFdebug.f
stryAFdebug        000000000040476C  Unknown               Unknown  Unknown
libc.so.6          0000003F4762135D  Unknown               Unknown  Unknown
stryAFdebug        0000000000404669  Unknown               Unknown  Unknown

I would appreciate any insight into what is going on.

HeinzB (Intel) wrote:

I assume your application exceeds the available stack size. Remove any limit on the stack of the main thread (Linux Bash shell: "ulimit -s unlimited").

You mention that the code is not in an OpenMP parallel section. To be sure, though, please use the environment variable KMP_STACKSIZE (equivalent to OMP_STACKSIZE) to increase the stack size of the OpenMP threads to 32 MB or more. There are multiple discussions of this topic in this forum: search for 'KMP_STACKSIZE'. See also the compiler manual.

Don't read too much into the source line shown by the traceback: for an optimized application it is not necessarily correct, because of the transformations done by the compiler.

hallevison wrote:

Thanks for the quick reply. I already have set the stacksize to unlimited and have set KMP_STACKSIZE to 64m. I will try to increase KMP_STACKSIZE to see if it makes a difference.

hallevison wrote:

Hi:

I just wanted to leave a note concerning the resolution of this
issue, particularly because the compiler is not behaving
properly, I believe. First, let me point out that I made a
mistake in my original posting. vrel2 is a real*16 and not a
real*8. vrel is still a real*8, however.

First, I played with the value of KMP_STACKSIZE a bit and it made
no difference. However, the flags DID make a difference. The
code runs fine if I do not set -fpe0, but gives a seg fault if
this flag is set.

I then modified the code:

      real*8 vrel2_8,vrel
      real*16 vrel2

cvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
      if( (vrel2.lt.0.0d0) .or. (vrel2.ne.vrel2) ) then ! 2nd checks for NaN
         write(*,*) 'Here #DMM30 ',vrel2
      endif
      open(77,file='DMM30.dat')
      write(77,*) vrel2
      write(77,*) 'This is before the sqrt of real*8'
      vrel2_8 = vrel2
      vrel = sqrt(vrel2_8)

      write(77,*) 'This is after the real*8 sqrt'
      write(77,*) vrel
      write(77,*) 'This is before the real*16 sqrt'
c^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

      vrel = sqrt(vrel2)

cvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
      write(77,*) vrel
      write(77,*) 'This is after the sqrt of real*16'
      close(77)
c^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The traceback still points to the line that contains vrel =
sqrt(vrel2). The file DMM30.dat has:

0.242983437048908040845063283086347
This is before the sqrt of real*8
This is after the real*8 sqrt
0.492933501649977
This is before the real*16 sqrt

So, clearly vrel2 contains a reasonable number (which is good for
me), and if I convert the real*16 to a real*8 before taking the sqrt, the
code is fine. It dies when I take the sqrt of a real*16 and try to store
the result in a real*8.
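
For anyone who wants to try this outside the full code, a minimal standalone sketch of the failing pattern might look like the following. The program name sqrt_repro and the string buffer s are made up for illustration. The value is the decimal printout from DMM30.dat, so it may not reproduce the exact bit pattern seen in the long run, and the fault may not show up at all in a single call.

```fortran
      program sqrt_repro
      implicit none
      real*16 vrel2
      real*8 vrel2_8, vrel
      character(len=40) s

      ! Read the reported value into the real*16 at full precision.
      s = '0.242983437048908040845063283086347'
      read(s,*) vrel2

      ! Path that worked: convert to real*8 first, then take the sqrt.
      vrel2_8 = vrel2
      vrel = sqrt(vrel2_8)
      write(*,*) 'real*8 sqrt:  ', vrel

      ! Path that faulted: real*16 sqrt, result stored in a real*8.
      vrel = sqrt(vrel2)
      write(*,*) 'real*16 sqrt: ', vrel
      end program sqrt_repro
```

Building it once with and once without -fpe0 (the other flags unchanged) would match the comparison described above.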

HeinzB (Intel) wrote:

I tried your sample code using different compiler versions, including 12.1 and 13.0, but I don't get a fault, and the values written to DMM30.dat are always correct. I used the options you list above (-O -CB -traceback -fpe0 -recursive -openmp) as well as many other combinations.

Please let me know exactly which compiler version you use (ifort -V) and the complete option list you used for the sample.
Thanks,
Heinz

Tim Prince wrote:

if( (vrel2.lt.0.0d0) .or. (vrel2.ne.vrel2) ) then

I would guess that the comparison vrel2.ne.vrel2 is replaced at compile time by .false. if you left default optimizations in effect, although your later indication of mixed data types makes this less certain.
One of the intrinsics ieee_is_nan or the non-standard legacy equivalent seems more likely to carry out the apparent intent.
If you are trying to make this portable to f95 compilers, you will likely need conditional compilation. The ifort directive !dir$ optimize(0) may work, but certain compilers like Open64 will error out on that.
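
A minimal sketch of the ieee_is_nan variant, assuming the compiler provides the IEEE_ARITHMETIC intrinsic module and supports it for real*16 (worth checking for the version in use). The program name nan_check and the test value 0.25 are illustrative only and are not taken from the original code.

```fortran
      program nan_check
      use, intrinsic :: ieee_arithmetic
      implicit none
      real*16 vrel2
      real*8 vrel

      ! Illustrative value only; the real code computes vrel2.
      vrel2 = 0.25d0

      ! ieee_is_nan states the intent directly instead of relying
      ! on the comparison vrel2.ne.vrel2 surviving optimization.
      if( (vrel2.lt.0.0d0) .or. ieee_is_nan(vrel2) ) then
         write(*,*) 'Here #DMM30 ',vrel2
      else
         ! Convert to real*8 before the sqrt, as in the variant
         ! that ran without a fault.
         vrel = sqrt(real(vrel2,8))
         write(*,*) vrel
      endif
      end program nan_check
```

The layout starts every statement in column 7 and uses only "!" comments, so it should compile as either fixed-form or free-form source.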

hallevison wrote:

Thanks for the quick response.

hal@marvin ~] ifort -v
Version 11.1

The code was compiled with the -traceback -fpe0 -recursive -openmp options.

As for the (vrel2.ne.vrel2) test: I thought that this was the standard way of determining whether a number is NaN, but I could be wrong. In any case, I included this code as part of my debugging and it will not be part of the production version.

Thanks again

hallevison wrote:

One more thing. These lines of code are called many many times before the error occurs.

HeinzB (Intel) wrote:

Thanks for the update. I assume, then, that the exception occurs only for certain FP values. I only tested this with a few constants such as 4.0D0. I will look at it again.
Heinz
