forrtl: severe (174): SIGSEGV, segmentation fault occurred

Hi!

I compile two parts of a model together (one in C++ and the other in Fortran) using a makefile. I have used the model before, and the compilation apparently completes normally. However, for some reason the resulting executables fail at run time: the program starts without any problem but then suddenly gives the error in the title.

Judging by a Google search this is quite a common problem, but so far I haven't found a way to fix mine. I increased the stack size to unlimited using

ulimit -s unlimited

I also upgraded my Linux distribution to a 64-bit architecture, as well as the Intel Fortran and C++ compilers, and nothing worked. Neither the program nor the data has changed, and it still works without any problem on another computer. What has changed is the "output frequency", but I tried the one I used before and still had the same problem. I had to reinstall the compilers and the netCDF libraries, but there weren't any problems during the installation.

This is what I get when running it:

[ascotilla@ascotilla-HP program_files]$ ./motif-step1b
5Total land points:  61538
Spinup read from:   /media/Data/Outputs/data_after_step1a.txt
Spinup years:       1000
Spinup output freq: n/a
Rampup written too: /media/Data/Outputs/data_after_step1b.txt
Rampup years:       1000
Rampup output freq: 20
Run years:          156
Run output freq:    3
  out_years:  5
  from year:  56
output files in working directory
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source             
libnetcdf_c++.so.  00002B2412F520A5  Unknown               Unknown  Unknown
libnetcdf_c++.so.  00002B2412F557D5  Unknown               Unknown  Unknown
motif-step1b   000000000044CB9A  Unknown               Unknown  Unknown
motif-step1b   000000000044BDDE  Unknown               Unknown  Unknown
motif-step1b   0000000000409A72  Unknown               Unknown  Unknown
motif-step1b   000000000040510C  Unknown               Unknown  Unknown
libc.so.6          0000003DFAE2169D  Unknown               Unknown  Unknown
motif_lpj-step1b   0000000000405009  Unknown               Unknown  Unknown
[ascotilla@ascotilla-HP program_files]$

I checked for the libnetcdf_c++.so library and it is in both the /usr/lib and /usr/lib64 folders, so it doesn't seem to be a problem of not finding the path to it. It isn't a permission problem either: I ran the program as superuser and nothing changed.

I'm quite a newbie at this, so I still don't know very well where else to look. Any ideas, suggestions, etc. are more than welcome.

Thanks in advance

I used -traceback in the ifort options to get an idea of where the problem was coming from, and it seems that the program still calls a routine that was commented out in previous versions of the model. However, I have always used the same model, and that didn't change. I even rechecked the previous versions I have: I always worked with the same files and never had that problem before. Does it make any sense?

Read this article: Diagnosing Seg Fault/Bus Error/SIGSEGV errors and take the prescribed steps to track down the cause of the seg-fault.

Quite likely, there is an error in the arguments passed from motif_lpj-step1b to the Netcdf C++ library or, less likely, there is an error in the library itself. In either case, localizing the error by having the traceback printout show the routine name and line numbers instead of machine addresses will be helpful.

Your finding that "the program still calls a routine that has been commented out in previous versions of the model" is something that you should investigate thoroughly.

Your long descriptions of actions that you took (such as reinstalling the OS and compilers) serve only to strengthen the suspicion raised above. None of those actions will remove a seg-fault caused by errors in your code or in the Netcdf libraries.

Forget the last post: the subroutine is in the C++ section of the model, but the -traceback option doesn't give any additional information for the C++ part. My first guess was that there might be a problem in linking the main program (in Fortran) with the C++ (input-output) part, but it doesn't seem to have any problems with the other subroutines. I commented out the call to the subroutine and the model runs properly, but the subroutine controls which years the model outputs, so I'm not getting what I want... :-S

I have just seen this, thank you for the article!
I'll let you know if I make any progress.

I was going through some of the options given in the article, and when checking my memory usage I got this:

cat /proc/meminfo

MemTotal:        8031844 kB
MemFree:          150952 kB
Buffers:         2760420 kB
Cached:          3642636 kB
SwapCached:          792 kB
Active:          3266552 kB
Inactive:        4080524 kB
Active(anon):     802716 kB
Inactive(anon):   186428 kB
Active(file):    2463836 kB
Inactive(file):  3894096 kB
Unevictable:          84 kB
Mlocked:              64 kB
SwapTotal:      10125308 kB
SwapFree:       10111464 kB
Dirty:                36 kB
Writeback:             0 kB
AnonPages:        943360 kB
Mapped:           119792 kB
Shmem:             45104 kB
Slab:             234008 kB
SReclaimable:     193852 kB
SUnreclaim:        40156 kB
KernelStack:        2968 kB
PageTables:        32820 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    14141228 kB
Committed_AS:    2402100 kB
VmallocTotal:   34359738367 kB
VmallocUsed:      313368 kB
VmallocChunk:   34359388340 kB
HardwareCorrupted:     0 kB
AnonHugePages:    241664 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:     2091008 kB
DirectMap2M:     6223872 kB

It seems to me that I don't have enough free memory... I guess that would explain why it's not working...

OK... I know, I should give my thoughts enough time to ripen before posting...

Following mecej4's article, I added the options -check bounds -g to the compilers. In my case this can also be done when invoking the makefile:

make clean
make BOUNDS=yes TRACEBACK=yes

This gives me a different error from the one I was having until now:

forrtl: severe (408): fort: (3): Subscript #1 of the array DVAL_PREC has value 0 which is less than the lower bound of 1

Following this other site: http://wiki.seas.harvard.edu/geos-chem/index.php/Common_GEOS-Chem_error_..., I did:

grep -i DVAL_PREC  *.f*

and I got:

main.f:      subroutine prdaily(mval_prec,dval_prec,mval_wet,year)
main.f:      real mval_prec(nmonth),dval_prec(ndayyear),mval_wet(nmonth)
main.f:                if(dval_prec(day-1).lt.0.1) then
main.f:                dval_prec(day)=0.0
main.f:                dval_prec(day)=((-alog(v1))**c2)*mprec(m)*c1
main.f:                if(dval_prec(day).lt.0.1) dval_prec(day)=0.0
main.f:              mprecip(m)=mprecip(m)+dval_prec(day)
main.f:              dval_prec(day)=dval_prec(day)*(mval_prec(m)/mprecip(m))
main.f:              if (dval_prec(day).lt.0.1) dval_prec(day)=0.0
main.f:c              dval_prec(day)=mval_prec(m)/ndaymonth(m)  !no generator
main.f:c                    dval_prec(day)=mprec(m)
main.f:c                    dval_prec(day)=0.0
main.f~:      subroutine prdaily(mval_prec,dval_prec,mval_wet,year)
main.f~:      real mval_prec(nmonth),dval_prec(ndayyear),mval_wet(nmonth)
main.f~:                if(dval_prec(day-1).lt.0.1) then
main.f~:                dval_prec(day)=0.0
main.f~:                dval_prec(day)=((-alog(v1))**c2)*mprec(m)*c1
main.f~:                if(dval_prec(day).lt.0.1) dval_prec(day)=0.0
main.f~:              mprecip(m)=mprecip(m)+dval_prec(day)
main.f~:              dval_prec(day)=dval_prec(day)*(mval_prec(m)/mprecip(m))
main.f~:              if (dval_prec(day).lt.0.1) dval_prec(day)=0.0
main.f~:c              dval_prec(day)=mval_prec(m)/ndaymonth(m)  !no generator
main.f~:c                    dval_prec(day)=mprec(m)
main.f~:c                    dval_prec(day)=0.0

I guess the problem comes from the assignment dval_prec(day)=0.0, but I don't see why that's a problem. I didn't write the program (quite obviously... :-P) and I really don't dare to change the code unless I'm completely sure that it will do exactly the same (I mean in terms of performance; I'd love to get rid of the errors for good)...

Any ideas?


You did not display the source line number that the runtime subscript check would have displayed. That information and the source line with that line number would tell you exactly what caused the error.

Setting dval_prec(day) = 0.0 is not the problem. Rather, the third line of the grep output in your post shows

if(dval_prec(day-1).lt.0.1)then

If day has the value 0 or 1, a subscript error occurs here, since the implied lower bound of the array is 1.

Yes, sorry... Actually, the error originates at line 1360 of the Fortran part. I got this just after the error:

libnetcdf_c++.so.  00002B07ADEF80A5  Unknown               Unknown  Unknown
libnetcdf_c++.so.  00002B07ADEFB7D5  Unknown               Unknown  Unknown
motif-step1b   000000000044C4D0  Unknown               Unknown  Unknown
motif-step1b   000000000044A526  Unknown               Unknown  Unknown
motif-step1b   000000000040999F  MAIN__                   1360  main.f
motif-step1b   000000000040512C  Unknown               Unknown  Unknown
libc.so.6          0000003DFAE2169D  Unknown               Unknown  Unknown
motif-step1b   0000000000405029  Unknown               Unknown  Unknown

Line 1360 of the main program is a call to a subroutine in C++, but the traceback doesn't give any information about where inside the subroutine the problem is. It was by following the instructions from the article and the website that I found out that it seems to be a problem of an array going out of bounds.

As I said, I'm a bit scared of touching the code, as it has already worked for me and for other people, and since it's a complex model any change can alter the results a lot.

I'm not an expert, but it seems weird to me that the value of "day" would affect the result... If day=0 or 1, it will just take the second branch of the condition, won't it? In this case I think (although I'm not sure) that it refers to rainy days, and having days with no rain is quite important for the model...

I'll keep on looking into it, but any ideas would be helpful, really...

Cheers

Given the declaration

dval_prec(ndayyear)

and the integer variable day, the reference

dval_prec(day-1)

is illegal for values of (day-1) < 1 or (day-1) > ndayyear, and the behavior of a program that uses array subscripts out of bounds is undefined. This is irrespective of whatever physical or logical significance the offending subscript may have for you.

The code has a bug that needs to be fixed.
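
A minimal sketch of what a guarded version of that test could look like, assuming day is a 1-based day of year and that the first day is simply treated as having no previous day (an assumption; the model may intend something else). The program name prev_day_guard, the variable prev_dry and the sample values are hypothetical, and the sketch is written in free-form Fortran rather than the fixed form of main.f. The point is that the comparison itself reads dval_prec(day-1) before either branch is chosen, so the subscript has to be checked first, in its own IF, because Fortran does not guarantee short-circuit evaluation of .and.

```fortran
program prev_day_guard
  implicit none
  integer, parameter :: ndayyear = 365   ! assumed array size for this sketch
  real :: dval_prec(ndayyear)
  integer :: day
  logical :: prev_dry

  dval_prec = 0.0
  dval_prec(1) = 2.5                     ! made-up sample value: rain on day 1

  do day = 1, 3
     ! Check the subscript first, in its own IF. Writing
     ! "day > 1 .and. dval_prec(day-1) < 0.1" is not safe, because
     ! the compiler is allowed to evaluate both operands of .and.
     if (day > 1) then
        prev_dry = dval_prec(day-1) < 0.1
     else
        prev_dry = .false.               ! assumption: day 1 has no previous day
     end if
     print '(a,i0,a,l1)', 'day ', day, ': previous day dry = ', prev_dry
  end do
end program prev_day_guard
```

Compiled with bounds checking enabled, a version without the guard would stop on the first iteration with the same "Subscript #1 ... has value 0" message quoted earlier in the thread.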

Thanks again... I'll check the variable "day" to see which value it starts from.

As mecej4 said, there has been a problem in the spinup, and there are negative numbers where there shouldn't be any (in the input file created by the model in the previous step). I'm taking a closer look with a colleague of mine who is more familiar with the code to check exactly where the problem is coming from.
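
A check of that kind can also be automated. The following is only a sketch with hypothetical names (first_negative, spinup_vals) and made-up sample data, not the model's actual input format. It reports the first negative entry in an array, so that bad values from an earlier step are caught before they propagate into later calculations.

```fortran
program check_input
  implicit none
  real :: spinup_vals(5)                 ! hypothetical array with made-up data
  integer :: bad

  spinup_vals = [12.0, 0.0, 3.5, -1.0, 7.2]
  bad = first_negative(spinup_vals)
  if (bad > 0) then
     print '(a,i0,a,f0.2)', 'negative value at index ', bad, ': ', spinup_vals(bad)
  else
     print '(a)', 'no negative values found'
  end if
contains
  ! Return the index of the first negative element, or 0 if there is none.
  pure integer function first_negative(vals)
    real, intent(in) :: vals(:)
    integer :: i
    first_negative = 0
    do i = 1, size(vals)
       if (vals(i) < 0.0) then
          first_negative = i
          return
       end if
    end do
  end function first_negative
end program check_input
```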

>>forrtl: severe (408): fort: (3): Subscript #1 of the array DVAL_PREC has value 0 which is less than the lower bound of 1
.AND.
>>main.f:      real mval_prec(nmonth),dval_prec(ndayyear),mval_wet(nmonth)
.AND.
>>main.f:                if(dval_prec(day-1).lt.0.1) then

The array dval_prec has the array bounds of (1:ndayyear)

If day represents a 1-based day of year (a reasonable assumption), then dval_prec(day-1) will be out of bounds when day == 1 (the first day of the year). Without examining your code, the test seems to expect the prior day's value of dval_prec. Consider using:

day_prior = day - 1
if(day_prior == 0) day_prior = ndayyear
! *** caution, you may have to account for leap year
if(dval_prec(day_prior).lt.0.1) then
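
The snippet above can be wrapped into a small self-contained sketch. The function name prior_day and the test values are hypothetical, and the wrap-around to ndayyear assumes that the last element of the same array is an acceptable stand-in for the final day of the previous year, which may not match what the model intends.

```fortran
program prior_day_demo
  implicit none
  integer, parameter :: ndayyear = 365   ! assumed; a leap year would need 366

  print '(a,i0)', 'prior_day(1)   = ', prior_day(1, ndayyear)
  print '(a,i0)', 'prior_day(2)   = ', prior_day(2, ndayyear)
  print '(a,i0)', 'prior_day(365) = ', prior_day(365, ndayyear)
contains
  ! Return the index of the day before "day", wrapping day 1 back to the
  ! last day of the year. For day in 1..ndays the result stays in 1..ndays.
  pure integer function prior_day(day, ndays)
    integer, intent(in) :: day, ndays
    prior_day = day - 1
    if (prior_day < 1) prior_day = ndays
  end function prior_day
end program prior_day_demo
```

Keeping the index arithmetic in one small function means the leap-year question only has to be settled in one place.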

Jim Dempsey

www.quickthreadprogramming.com

Sorry for taking so long to answer. We've been checking the code with some of the people who wrote it and it was, as you said, a bug. However, it wasn't in that part of the code but in the C++ part. I haven't been able to reproduce the dval_prec error again. A few things were changed and now it works; the place where it was fixed (a -2 that had to be a -1 in one of the formulas) was probably where the out-of-bounds problem was coming from, although not directly.

Thank you mecej4 and jimdempseyatthecove for your help.

Cheers

Although one tends to be thankful when a software bug commits suicide, I think that you may find it worthwhile to pin down and document the changes that were made.

At this point the details of the section of code that gave you problems are fresh in your mind, so documenting the bug should be easier now. If/when the bug comes alive again, you will be glad that you took the time to document it.
