Weird segmentation fault problem...

I am experiencing a very weird segmentation fault problem with a parallel MPI code compiled with the Intel Fortran compiler. The MPI compiler uses ifort version 9.0.026. I have been running this code on 256 processors in parallel on the NCSA Intel Xeon Linux Cluster. Here is the link for this cluster:

Here is my problem: the code sometimes runs just fine on this cluster, but at other times it gives a segmentation fault right at the beginning of the run. This is very strange. I am running the exact same executable, so I don't understand why it would run fine sometimes and crash at other times. The problem appears to happen only when the job gets assigned to certain nodes of the cluster. The cluster contains a total of 1280 2-processor nodes, so my executable runs on different nodes every time I do a run.

I have tried increasing the stack size to large values, but that did not help. I am using the "-O3 -ip -auto -nothreads" flags with the compiler.

Also, the MPI compiler I am using links my executable with the libpthread library even though I am not using any threads or OpenMP. The MPI compiler is called cmpif90c, also known as ChaMPIon/Pro MPI. This MPI implementation is claimed to be thread-safe. I tried to remove the link to libpthread during compilation, but that caused an error. Apparently the MPI library needs to be linked to libpthread.

Could the linking with libpthread somehow be responsible for the segmentation fault? Any suggestions about what else I can try? Thanks!

Yes. If you have linked dynamically against the pthread (or some other) library, and on some node the library is corrupt or LD_LIBRARY_PATH is not set correctly, that would cause the failure.
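One way to check this per node is to compare where the dynamic loader resolves libpthread on a good node and on a failing one. The following is only a minimal sketch: it parses text in the usual `ldd` output format, and the helper name `resolved_path` and the sample output are hypothetical, not taken from the cluster.

```python
import re

# Matches lines such as:
#   libpthread.so.0 => /lib/tls/libpthread.so.0 (0x00a1b000)
LDD_LINE = re.compile(r"^\s*(\S+)\s+=>\s+(\S+)")


def resolved_path(ldd_output, prefix="libpthread"):
    """Return the path a library resolves to, or None if it is absent
    from the output or reported as 'not found'."""
    for line in ldd_output.splitlines():
        m = LDD_LINE.match(line)
        if m and m.group(1).startswith(prefix):
            path = m.group(2)
            # "libfoo.so => not found" yields the token "not", which is not a path.
            return path if path.startswith("/") else None
    return None


# Hypothetical ldd output, for illustration only.
sample = """\
        libpthread.so.0 => /lib/tls/libpthread.so.0 (0x00a1b000)
        libm.so.6 => /lib/tls/libm.so.6 (0x0095c000)
        libc.so.6 => /lib/tls/libc.so.6 (0x00825000)
"""

print(resolved_path(sample))
```

Running `ldd` on the executable on each node and feeding the text to a helper like this would show whether the nodes disagree about which libpthread they load.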

Thanks for the information. In that case, I think the most likely scenario is a corrupt libpthread on one of the nodes. I tried linking all the libraries statically, but the problem got worse. Previously, the code was giving a segmentation fault on only one node. With the statically linked libpthread, the code gave a segmentation fault on all nodes. I guess my only option is to link libpthread dynamically. I have asked the supercomputer support staff to check whether there is a corrupt libpthread library on any of the nodes.

I think I have narrowed down the cause of the segmentation fault problem. I recently completed 3 short test runs using 256 processors on Tungsten. Luckily, all the jobs ran on the same nodes.

In the first job, I set LD_ASSUME_KERNEL to 2.2.5
In the second job, I set LD_ASSUME_KERNEL to 2.4.20
In the third job, I set LD_ASSUME_KERNEL to 2.4.1

The first job immediately gave a segmentation fault on all the nodes it ran on, whereas the second and third jobs ran just fine.

The reason I played with the LD_ASSUME_KERNEL environment variable is that I found out there are 3 versions of the libpthread library in the Red Hat Linux OS which the cluster is running. These libraries are located in the /lib, /lib/i686 and /lib/tls folders, along with the corresponding libc and libm libraries. The first job, the one that crashed, loaded the libraries in the /lib folder.

Depending on the value of LD_ASSUME_KERNEL, the system will load the corresponding version of libpthread. See the Red Hat documentation for more info on LD_ASSUME_KERNEL and the different versions of libpthread.
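The selection can be pictured roughly as follows. This is a simplified model, not the loader's actual code: it assumes each directory's libraries carry a minimum kernel version and that the loader takes the most capable variant whose minimum does not exceed LD_ASSUME_KERNEL. Only the /lib case (2.2.5) was observed in the test runs above; pairing /lib/i686 with 2.4.1 and /lib/tls with 2.4.20 is an assumption to verify against the Red Hat documentation.

```python
# (directory, assumed minimum kernel version), most capable variant first.
# ASSUMPTION: this pairing is inferred from the three values tested above.
VARIANTS = [
    ("/lib/tls", (2, 4, 20)),
    ("/lib/i686", (2, 4, 1)),
    ("/lib", (2, 2, 5)),
]


def parse_version(text):
    """Turn '2.4.20' into (2, 4, 20) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def select_dir(assume_kernel):
    """Return the first directory whose minimum version is <= the assumed kernel."""
    assumed = parse_version(assume_kernel)
    for directory, minimum in VARIANTS:
        if minimum <= assumed:
            return directory
    return None  # no variant is usable for such an old assumed kernel


for value in ("2.2.5", "2.4.1", "2.4.20"):
    print(value, "->", select_dir(value))
```

Comparing versions as integer tuples rather than strings matters here: as plain strings, "2.4.20" would sort before "2.4.5".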

Previously, I was not setting this environment variable because I did not know about it. Since the variable was not defined in the previous runs, perhaps some of the nodes in my previous failed jobs were loading the library in /lib whereas the other nodes were loading one of the other versions. Hopefully setting the LD_ASSUME_KERNEL environment variable to the correct value will fix the segmentation fault problem for good!
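To avoid relying on whatever each node's environment happens to contain, the variable can be set explicitly at launch. Below is a minimal sketch of such a wrapper. The names `build_env` and `run_pinned` are hypothetical, the default of 2.4.20 is just one of the two values that worked in the test runs, and whether the MPI launcher passes the environment on to remote ranks depends on the MPI implementation.

```python
import os
import socket
import subprocess
import sys


def build_env(assume_kernel, base=None):
    """Return a copy of the environment with LD_ASSUME_KERNEL set explicitly."""
    env = dict(os.environ if base is None else base)
    env["LD_ASSUME_KERNEL"] = assume_kernel
    return env


def run_pinned(argv, assume_kernel="2.4.20"):
    """Run argv with LD_ASSUME_KERNEL pinned; log the host so that a failure
    can be traced back to a particular node."""
    host = socket.gethostname()
    print(f"{host}: LD_ASSUME_KERNEL={assume_kernel}", file=sys.stderr)
    return subprocess.run(argv, env=build_env(assume_kernel)).returncode
```

A call such as `run_pinned(["./my_code"])`, with `./my_code` standing in for the real executable, would then start every run with the same setting.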
