Co-arrays: rank causes collective abort

For a statistical application, I run a set of bootstrap replications, which are completely independent ("embarrassingly parallel") iterations of a computation procedure using random weights. Using co-arrays (shared memory, -coarray=shared), I divide these iterations over different images, each of which carries out a do-loop of iterations. Finally, I merge the statistics computed in all images to compute summary statistics.

This works fine for a small number of replications; my program finishes without errors. If I increase the number of replications, the program stops with the following error:

"rank 8 in job 1 ### (deleted host name) caused collective abort of all ranks exit status of rank 8: killed by signal 7" (rank and signal can vary)

Intuitively, I would suspect a memory problem, but none of the following helped: raising the stack limit with ulimit -s unlimited (as well as setting a number like 999999999), increasing the stack size (e.g., export OMP_STACKSIZE=32g; is this just for OpenMP?), and putting automatic arrays and arrays created for temporary computations on the heap instead of the stack (-heap-arrays).

Is it possible that this is an MPI error somehow? It's also weird that the images are carrying out their do-loops at very different speeds. In one example, one of the images managed almost twice as many iterations as another before the program aborted, and there is no mathematical reason for the speed to differ. The image causing the abort (it's [rank]+1, correct, because the rank ordering starts at 0?) had average speed.

Are there any obvious error sources or known issues that could apply here? Unfortunately, I'm not really at liberty to make our code available here, but I would like to supply as much information as necessary to work this out. Some of the co-arrays I use are allocatables; I read about a possible memory leak when using allocatables: http://objectmix.com/fortran/243080-allocatable-components-derived-type....
Could this be an issue here?

Thanks!

Intel 12.3.174 on Unix, shared memory
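
Since the real code cannot be posted, here is a minimal sketch of the pattern described above: split the replications over images, loop locally, then merge on image 1. Every name in it (bootstrap_sketch, n_reps, local_sum) is hypothetical, and the "statistic" is just a sum of random weights standing in for the real computation.

```fortran
program bootstrap_sketch
  implicit none
  integer, parameter :: n_reps = 1000   ! hypothetical total number of replications
  real    :: local_sum[*]               ! scalar coarray: one partial result per image
  real    :: w, total
  integer :: me, n_img, first, last, i, img

  me    = this_image()
  n_img = num_images()

  ! Split replications 1..n_reps into contiguous chunks, one chunk per image.
  first = (me - 1) * n_reps / n_img + 1
  last  = me * n_reps / n_img

  ! Independent local loop. Assumption: a real bootstrap would also need
  ! independent random seeds on each image, which this sketch does not set up.
  local_sum = 0.0
  do i = first, last
     call random_number(w)
     local_sum = local_sum + w
  end do

  ! Every image must have finished before image 1 reads the remote values.
  sync all

  ! Merge step: image 1 collects the partial results from all images.
  if (me == 1) then
     total = 0.0
     do img = 1, n_img
        total = total + local_sum[img]
     end do
     print *, 'mean weight over all replications:', total / n_reps
  end if
end program bootstrap_sketch
```

With the Intel compiler this would be built with -coarray=shared, as in the setup described above. The chunk bounds are chosen so that consecutive images cover adjacent ranges and the last image ends exactly at n_reps.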

I moved this to the Linux forum.

First, I would ask that you upgrade to 12.1.5, as we are continually fixing issues with coarrays. In general, though, the MPI error you have is usually a consequence of something else going wrong. Have you enabled array bounds checking?
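
To illustrate the kind of error that array bounds checking is meant to catch, here is a small hypothetical sketch, unrelated to the poster's actual code. The program and variable names are invented, and the off-by-one loop limit is deliberate.

```fortran
! Possible build line with the Intel compiler (file name is hypothetical):
!   ifort -coarray=shared -check bounds -traceback -g bounds_check_sketch.f90
program bounds_check_sketch
  implicit none
  integer, parameter :: n = 4   ! hypothetical array size
  real    :: stats(n)[*]        ! coarray with n elements on every image
  integer :: i, upper

  stats = 0.0
  upper = n + 1                 ! deliberate off-by-one error
  do i = 1, upper
     stats(i) = real(i)         ! when i = n + 1 this writes past the end of stats
  end do

  sync all
  if (this_image() == 1) print *, 'finished without a bounds check firing'
end program bounds_check_sketch
```

With -check bounds enabled, the run is expected to stop with a run-time error at the offending subscript, and -traceback with -g should help map it to a source line. Without the check, the write lands outside the array and may only show up later as an unrelated-looking crash.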

It would be most helpful if you could provide a complete example that reliably demonstrates the problem.

Steve
