Code works 100% of the time on Mac OS X and only sometimes on Unix


Hi, I am by no means a computer scientist, and this is my first post. I am a graduate student in chemistry, and my lab has two pieces of Fortran 90 code that work together to calculate a certain property of a chemical system. The main program calls the second program as a subroutine to calculate thousands of complex integrals. The main program reads in information about a chemical system from two different files (which are given as command-line arguments when I run the compiled code). Within the main program, a series of calculations determines the size an array needs to be, and this number is stored in sizeVariable. An allocate(array(sizeVariable)) statement then allocates the memory for this array. The array is then filled with the numbers needed to call the subroutine in the other program, which calculates the aforementioned integrals.

So far this code has worked flawlessly when compiled and run on Mac OS X using the Intel Composer XE 2011 compiler. Every chemical system works, and the program outputs the correct answer. The issue is that when I try to run the exact same programs on a Unix machine (the supercomputers I log into), some molecules work perfectly while others hang: the program simply keeps running (doing nothing) until the time I've allotted for that calculation on the supercomputer runs out.

I'm sure the first thing you are thinking is that it must be a memory issue. But the size of the chemical system being studied doesn't matter. For example, I have a water molecule (1 oxygen atom and 2 hydrogens) that will not work on Unix, but systems such as acetic acid (2 carbons, 2 oxygens, and 4 hydrogens) do work.

Using a series of print statements, I've been able to find the spot in the code where it hangs for those select molecules, and it has to do with allocating the array I mentioned at the end of the first paragraph. What I tried next was to change that array from an allocatable array to a fixed-size array, declared with the size it would need for one of the cases that hangs. When I did this, the program returned:

_______________________________________________________________________________________________

*** glibc detected *** ./mem.out: free(): invalid next size (fast): 0x0000000000a51cd0 ***

======= Backtrace: =========

/lib64/libc.so.6(+0x76166)[0x7f75a76eb166]

/lib64/libc.so.6(+0x78c93)[0x7f75a76edc93]

./mem.out[0x470b39]

./mem.out[0x4148de]

./mem.out[0x408331]

./mem.out[0x40314c]

/lib64/libc.so.6(__libc_start_main+0xfd)[0x7f75a7693d1d]

./mem.out[0x403049]

======= Memory map: ========

00400000-00526000 r-xp 00000000 00:19 160109768                          /home/bsheppard/testifort/mem.out

00725000-0072f000 rw-p 00125000 00:19 160109768                          /home/bsheppard/testifort/mem.out

0072f000-00a0d000 rw-p 00000000 00:00 0 

00a34000-00a55000 rw-p 00000000 00:00 0                                  [heap]

7f75a725b000-7f75a725d000 r-xp 00000000 fd:00 1439054                    /lib64/libdl-2.12.so

7f75a725d000-7f75a745d000 ---p 00002000 fd:00 1439054                    /lib64/libdl-2.12.so

7f75a745d000-7f75a745e000 r--p 00002000 fd:00 1439054                    /lib64/libdl-2.12.so

7f75a745e000-7f75a745f000 rw-p 00003000 fd:00 1439054                    /lib64/libdl-2.12.so

7f75a745f000-7f75a7475000 r-xp 00000000 fd:00 1438978                    /lib64/libgcc_s-4.4.7-20120601.so.1

7f75a7475000-7f75a7674000 ---p 00016000 fd:00 1438978                    /lib64/libgcc_s-4.4.7-20120601.so.1

7f75a7674000-7f75a7675000 rw-p 00015000 fd:00 1438978                    /lib64/libgcc_s-4.4.7-20120601.so.1

7f75a7675000-7f75a7800000 r-xp 00000000 fd:00 1438988                    /lib64/libc-2.12.so

7f75a7800000-7f75a79ff000 ---p 0018b000 fd:00 1438988                    /lib64/libc-2.12.so

7f75a79ff000-7f75a7a03000 r--p 0018a000 fd:00 1438988                    /lib64/libc-2.12.so

7f75a7a03000-7f75a7a04000 rw-p 0018e000 fd:00 1438988                    /lib64/libc-2.12.so

7f75a7a04000-7f75a7a09000 rw-p 00000000 00:00 0 

7f75a7a09000-7f75a7a20000 r-xp 00000000 fd:00 1439012                    /lib64/libpthread-2.12.so

7f75a7a20000-7f75a7c20000 ---p 00017000 fd:00 1439012                    /lib64/libpthread-2.12.so

7f75a7c20000-7f75a7c21000 r--p 00017000 fd:00 1439012                    /lib64/libpthread-2.12.so

7f75a7c21000-7f75a7c22000 rw-p 00018000 fd:00 1439012                    /lib64/libpthread-2.12.so

7f75a7c22000-7f75a7c26000 rw-p 00000000 00:00 0 

7f75a7c26000-7f75a7ca9000 r-xp 00000000 fd:00 1439079                    /lib64/libm-2.12.so

7f75a7ca9000-7f75a7ea8000 ---p 00083000 fd:00 1439079                    /lib64/libm-2.12.so

7f75a7ea8000-7f75a7ea9000 r--p 00082000 fd:00 1439079                    /lib64/libm-2.12.so

7f75a7ea9000-7f75a7eaa000 rw-p 00083000 fd:00 1439079                    /lib64/libm-2.12.so

7f75a7eaa000-7f75a7eca000 r-xp 00000000 fd:00 1439018                    /lib64/ld-2.12.so

7f75a80b8000-7f75a80bd000 rw-p 00000000 00:00 0 

7f75a80c7000-7f75a80c9000 rw-p 00000000 00:00 0 

7f75a80c9000-7f75a80ca000 r--p 0001f000 fd:00 1439018                    /lib64/ld-2.12.so

7f75a80ca000-7f75a80cb000 rw-p 00020000 fd:00 1439018                    /lib64/ld-2.12.so

7f75a80cb000-7f75a80cc000 rw-p 00000000 00:00 0 

7ffff53be000-7ffff53d3000 rw-p 00000000 00:00 0                          [stack]

7ffff53ff000-7ffff5400000 r-xp 00000000 00:00 0                          [vdso]

ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]

Aborted

_______________________________________________________________________________________________

Could somebody please give me some sort of hint as to why code that works fine on one OS can work sometimes and not others on another OS?

Just for completeness, here is my compile command (fastPosInt.f90 is the main program and positionRR.f90 is the subroutine):

ifort -o pos.out -fpscomp filesfromcmd fastPosInt.f90 positionRR.f90

This compiles fine, and here is how it is run:

./pos.out CH4.wfn CH4.dat CH4.coef 3 1 1 1

pos.out is the executable file, the .wfn and .coef files are the files with the data about the chemical system I mentioned above, the .dat file is where the output is written, and 3 1 1 1 are just arbitrary input values the program needs; they can be changed and the same error occurs.

 

I would be very grateful for any advice.

Cheers,

Brendan 


Without seeing the actual code, it’s hard to guess at what is going on, but you might try, as a starting point, turning on -traceback, which has very little impact on performance or executable/object file size. Next I would also turn on compile time warnings: -warn. These have no performance impact at all, and might help you catch some suspect code. After that I would try running code compiled with -check to see if something fishy is happening like trying to index beyond the bounds of an array.

If memory is a concern, the allocate and deallocate statements accept an optional stat= argument that reports whether the operation was successful.
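As a rough illustration of that stat= check, here is a minimal sketch. The names array and sizeVariable are taken from the original post; the value 1000, the variable ierr, and the program name are made up for the example, and the real code would compute the size itself.

```fortran
program alloc_check
  implicit none
  integer :: sizeVariable, ierr
  double precision, allocatable :: array(:)

  ! Hypothetical value; in the real program this comes from earlier calculations.
  sizeVariable = 1000

  ! A zero or negative size is legal Fortran (it gives an empty array),
  ! but it is probably not what was intended, so flag it early.
  if (sizeVariable <= 0) then
     print *, 'Unexpected array size: ', sizeVariable
     stop 1
  end if

  ! With stat= present, a failed allocation sets ierr to a nonzero value
  ! instead of terminating the program.
  allocate(array(sizeVariable), stat=ierr)
  if (ierr /= 0) then
     print *, 'allocate failed, stat = ', ierr
     stop 1
  end if

  array = 0.0d0
  print *, 'allocated ', size(array), ' elements'

  deallocate(array, stat=ierr)
  if (ierr /= 0) print *, 'deallocate failed, stat = ', ierr
end program alloc_check
```

Printing sizeVariable just before the allocate for a molecule that hangs and for one that works may also show whether the size itself is the suspicious part.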

Another hard to find problem could be from uninitialized variables.

Good luck!

-Zaak

I think Zaak is right. Your program has a bug in it and is writing beyond the end of the array. The O/S that did not complain simply did not observe the error in your code. In other words, your errant code worked by chance.

Some heap managers, even in a release build, will perform a sanity check on the memory nodes being freed. They do this in various ways; one way is by inserting a particular bit pattern of data preceding and following the actual allocation. Depending on how old the O/S programmer was, it might be 0xBAADBEEF. If this number gets stomped on, and/or if the memory node header contains suspicious values, such as your

(*** glibc detected *** ./mem.out: free(): invalid next size (fast): 0x0000000000a51cd0 ***)

Then the heap manager has to fail the free.

Note, the cause of this error need not be an index out of range. It can also be caused by misuse of a pointer (valid or invalid) or stack corruption of the references in the call stack.
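To make the index-out-of-range case concrete, here is a small made-up sketch (not the original code; the program and variable names are hypothetical) that deliberately writes one element past the end of an allocated array:

```fortran
program overrun_demo
  implicit none
  integer, allocatable :: a(:)
  integer :: i, n

  n = 4
  allocate(a(n))

  ! Deliberate bug: the loop runs to n + 1, so the last pass writes
  ! one element beyond the end of a and may overwrite heap bookkeeping.
  do i = 1, n + 1
     a(i) = i
  end do

  print *, a

  ! If the heap was damaged, it is often noticed only here, at the free.
  deallocate(a)
end program overrun_demo
```

Built without runtime checks, a program like this has undefined behavior: it may appear to work on one system and abort or hang on another. Built with ifort's -check bounds (or -fcheck=bounds with gfortran), the out-of-range subscript should be reported at the offending line instead.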

Jim Dempsey

www.quickthreadprogramming.com
