MPI-IO error when running on lustre with a high number of stripes and processes
Hi,

I'm trying to run pNetCDF on Lustre. The test code and the pNetCDF library are both compiled with Intel MPI Library v4.0.2. Our Lustre file system has 40 OSTs.

When running with a stripe count of 1, or with 32 processes, the test code works well and outputs data correctly.

However, when I set the stripe count to 40 and run with 64 processes, the test code crashes with:

  rank 19 in job 1 c25b09_39645 caused collective abort of all ranks
      exit status of rank 19: killed by signal 9

The test code is attached. Thank you in advance.
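For reference, the stripe settings above refer to the Lustre stripe count, which is typically set per directory with the `lfs` tool before writing (the directory path below is a placeholder):

```shell
# Placeholder path; requires a Lustre mount.
lfs setstripe -c 40 /lustre/outdir   # stripe new files across all 40 OSTs
lfs setstripe -c 1  /lustre/outdir   # single-stripe case that works
lfs getstripe /lustre/outdir         # verify the current layout
```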


I debugged with GDB and got this error message:

Program received signal SIGFPE, Arithmetic exception.
34: 0x00002aaac33327e0 in ADIOI_LUSTRE_Get_striping_info ()
34: from /apps/intel/impi/4.0.2.003/intel64/lib/libmpi_lustre.so

It seems to be the same issue as http://lists.mcs.anl.gov/pipermail/mpich-discuss/2010-September/007947.html.

Is this a bug in Intel MPI Library v4.0.2, and has it been fixed in a newer version?

Hi Wencan,

I can't find any indication of this being a known issue.  There is an issue related to Lustre in the latest version that might cause a problem for you (undefined symbol in one of our libraries).  I would recommend trying version 4.0.3 first, and 4.1.0.030 if 4.0.3 does not work.  Please let me know if you try any and what the results are.

Can you please attach your test code?  It did not get properly attached to the first post.

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools

Thank you for your help.

Attachments:

Attachment	Size
Download perform-test-pnetcdf.c	2.54 KB

Hi Wencan,

I am only able to set striping up to 18 on the cluster I am using.  At 18 stripes, I am unable to reproduce this behavior.  Please run with I_MPI_DEBUG=5 and send the output.

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools

Hi.

I ran with 64 processes and got the output, which is attached.

The error occurs during loop 1 of the program, while it outputs the 2nd netCDF file.

Thank you for your help.

Attachments:

Attachment	Size
Download pnetcdf-64-out.txt	23.49 MB

Hi Wencan,

It appears you are using LSF* as your job scheduler.  We do have some known issues with LSF*.  I don't think they're related to this case, but can you try a few things just in case?  First, try running in an interactive job.  Due to one of the known issues, you will probably need to add

-genv LD_LIBRARY_PATH $LD_LIBRARY_PATH

to your mpirun command.  Also, please try running completely outside of LSF*.

Could you also send the output from stderr (for a failing job)?  It would be best if you have stdout and stderr in the same file.

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools

Hi,

I ran with the following mpiexec command:

mpiexec -genv I_MPI_EXTRA_FILESYSTEM on -genv I_MPI_EXTRA_FILESYSTEM_LIST lustre -genv I_MPI_DEBUG 5 -genv LD_LIBRARY_PATH $LD_LIBRARY_PATH -n $num ./perform_test_pnetcdf $x_proc $y_proc $output &> $out

and got the new output.

Attachments:

Attachment	Size
Download pnetcdf-64-out.txt	23.49 MB
