ScaLAPACK raises an error under certain circumstances

Dear All,

      I am using Intel MPI + ifort + MKL to compile Quantum-Espresso 6.1. Everything works fine except for calls to ScaLAPACK routines: PDPOTRF may exit with a non-zero error code under certain circumstances. In one example, the program works with 2 nodes * 8 processors per node but fails with 4 nodes * 4 processors per node. With I_MPI_DEBUG enabled, in the failing case the following messages appear just before the call exits with code 970, while in the working case there are no such messages:

[10#18754:18754@node09] MRAILI_Ext_sendq_send(): rail 0,vbuf 0x2676900, operation 2, size 12272, lkey 1879682311
[10#18754:18754@node09] MRAILI_Ext_sendq_send(): rail 0,vbuf 0x2675640, operation 2, size 12272, lkey 1879682311
[10#18754:18754@node09] MRAILI_Ext_sendq_send(): rail 0,vbuf 0x26742b8, operation 2, size 12272, lkey 1879682311
[10#18754:18754@node09] MRAILI_Ext_sendq_send(): rail 0,vbuf 0x2676b58, operation 2, size 12272, lkey 1879682311
[10#18754:18754@node09] MRAILI_Ext_sendq_send(): rail 0,vbuf 0x26769c8, operation 2, size 12272, lkey 1879682311
[10#18754:18754@node09] MRAILI_Ext_sendq_send(): rail 0,vbuf 0x2676c20, operation 2, size 12272, lkey 1879682311
[10#18754:18754@node09] MRAILI_Ext_sendq_send(): rail 0,vbuf 0x2675fa0, operation 2, size 12272, lkey 1879682311
[10#18754:18754@node09] MRAILI_Ext_sendq_send(): rail 0,vbuf 0x2676068, operation 2, size 12272, lkey 1879682311
[10#18754:18754@node09] MRAILI_Ext_sendq_send(): rail 0,vbuf 0x2676a90, operation 2, size 12272, lkey 1879682311
[10#18754:18754@node09] MRAILI_Ext_sendq_send(): rail 0,vbuf 0x2676e78, operation 2, size 12272, lkey 1879682311
[10#18754:18754@node09] MRAILI_Ext_sendq_send(): rail 0,vbuf 0x2678778, operation 2, size 12272, lkey 1879682311
[10#18754:18754@node09] MRAILI_Ext_sendq_send(): rail 0,vbuf 0x2675898, operation 2, size 12272, lkey 1879682311
[10#18754:18754@node09] MRAILI_Ext_sendq_send(): rail 0,vbuf 0x2675a28, operation 2, size 12272, lkey 1879682311
[10#18754:18754@node09] MRAILI_Ext_sendq_send(): rail 0,vbuf 0x2675bb8, operation 2, size 12272, lkey 1879682311
[10#18754:18754@node09] MRAILI_Ext_sendq_send(): rail 0,vbuf 0x2674f38, operation 2, size 12272, lkey 1879682311
[10#18754:18754@node09] MRAILI_Ext_sendq_send(): rail 0,vbuf 0x2676ce8, operation 2, size 12272, lkey 1879682311
[10#18754:18754@node09] MRAILI_Ext_sendq_send(): rail 0,vbuf 0x2676130, operation 2, size 12272, lkey 1879682311
[10#18754:18754@node09] MRAILI_Ext_sendq_send(): rail 0,vbuf 0x2674768, operation 2, size 12272, lkey 1879682311
[10#18754:18754@node09] MRAILI_Ext_sendq_send(): rail 0,vbuf 0x2674448, operation 2, size 12272, lkey 1879682311
[10#18754:18754@node09] MRAILI_Ext_sendq_send(): rail 0,vbuf 0x2674b50, operation 2, size 12272, lkey 1879682311
[10#18754:18754@node09] MRAILI_Ext_sendq_send(): rail 0,vbuf 0x2675e10, operation 2, size 12272, lkey 1879682311
[10#18754:18754@node09] MRAILI_Ext_sendq_send(): rail 0,vbuf 0x2675708, operation 2, size 2300, lkey 1879682311

Could you suggest what the possible cause might be here? Thank you very much.
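For reference, the two layouts described above can be sketched as below. This is only an illustration: the binary name (pw.x), the input file name, and the rank counts are assumptions based on the report; `-n` and `-ppn` are standard Intel MPI (Hydra) launcher options, and I_MPI_DEBUG is the standard Intel MPI debug variable.

```shell
# Verbose Intel MPI debug output for both runs
export I_MPI_DEBUG=6

# Working layout: 2 nodes x 8 ranks per node = 16 ranks
# mpirun -n 16 -ppn 8 ./pw.x -input scf.in

# Failing layout (PDPOTRF exits with code 970): 4 nodes x 4 ranks per node = 16 ranks
# mpirun -n 16 -ppn 4 ./pw.x -input scf.in

echo "I_MPI_DEBUG=$I_MPI_DEBUG"
```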

Feng


Hi,

According to your debug messages, the problem is probably caused by MPI. I tried to look up the meaning of MRAILI_Ext_sendq_send; it seems to be an internal routine from the MPICH code base, on which Intel MPI is also based. I will transfer your problem to the Clusters forum. Thanks.

Best regards,
Fiona

Hi Feng,

Can you please tell us the version of the Intel MPI Library and your MPI environment ("env | grep I_MPI"), and, if possible, also share your debug file? You can send the logs via private message if you prefer not to share them here, or you can submit the issue at the Online Service Center (supporttickets.intel.com/).

Thanks.

Zhuowei
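The diagnostics requested above can be gathered with commands like the following. These are all standard shell / Intel MPI usage; the mpirun line needs to run on the cluster, so it is shown commented out.

```shell
export I_MPI_DEBUG=6        # request verbose output when re-running the failing job
env | grep I_MPI            # list every I_MPI_* variable currently set
# mpirun -V                 # prints the Intel MPI library version
```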

Quote:

Si, Zhuowei (Intel) wrote:

Hi Feng,

Can you please tell us the version of the Intel MPI Library and your MPI environment ("env | grep I_MPI"), and, if possible, also share your debug file? You can send the logs via private message if you prefer not to share them here, or you can submit the issue at the Online Service Center (supporttickets.intel.com/).

Thanks.

Zhuowei


Hi Zhuowei,

     This is Parallel Studio XE 2017sp1, ifort version 17.0.0.098. There is nothing in the environment except I_MPI_ROOT. The debug file is very large, so I sent it by PM. I could not open the Online Service Center at first: after login it kept redirecting me to a page ending with "null" all day, but I have now successfully logged in and submitted the issue. Thank you very much.

Feng
