Segmentation Fault in MKL PBLAS/ScaLAPACK

Hi,

I am trying to use the MKL PBLAS/ScaLAPACK routines as proposed in the following link: http://software.intel.com/en-us/articles/using-cluster-mkl-pblasscalapack-fortran-routine-in-your-c-program. The source code (downloadable from the same site) is also attached to this post.

I am using the Intel® Composer 2011.2.137, compiler icc 12.0.2 20110112, and OpenMPI 1.4.3.

Following the Intel® Math Kernel Library Link Line Advisor, I compile with:

mpicc -w -o pdgemv pdgemv.c -I$(MKLROOT)/include -L$(MKLROOT)/lib/intel64 -lmkl_scalapack_ilp64 -lmkl_intel_ilp64 -lmkl_intel_thread -lmkl_core -lmkl_blacs_intelmpi_ilp64 -lpthread -limf -lm -openmp -DMKL_ILP64

Compiling is fine, but running the program via

mpirun -n 4 ./pdgemv

causes the following segmentation fault:

[node266:15074] *** Process received signal ***
[node266:15074] Signal: Segmentation fault (11)
[node266:15074] Signal code: Address not mapped (1)
[node266:15074] Failing at address: 0x44000098
[node266:15074] [ 0] /lib64/libpthread.so.0 [0x3f8420eb10]
[node266:15074] [ 1] /openmpi/1.4.3/intel--co-2011.2.137--binary/lib/libmpi.so.0(MPI_Comm_size+0x5a) [0x2abdef96c17a]
[node266:15074] [ 2] /intel/co-2011.2.137/binary/mkl/lib/intel64/libmkl_blacs_intelmpi_ilp64.so(ilp64_Cblacs_pinfo+0x92) [0x2abdef3be4a2]
[node266:15074] *** End of error message ***

I don't understand what is wrong; I hope someone can help me. Thanks and kind regards.

Massi

Attachment: pdgemv.c (2.23 KB)

Your code stopped on a segfault exception. In your case I think that the failing address 0x44000098 could be either a faulting instruction pointer or an invalid memory address being referenced. Probably the referenced address is unreadable, has not been mapped (heap), or has not been committed by your app.

Do you have any updates?

Hi iliyapolak,

thank you for your answer, but I don't understand how I can solve this issue.

I should also add that I cannot set the Linux environment variables as shown in the link I posted:

$source /opt/intel/mkl/10.x.x.0xx/tools/environment/mklvarsem64t.sh
$source /opt/intel/mpi/3.x.x/bin64/mpivars.sh

because I cannot find these paths and files in my Linux Intel Composer version.

Hi Massimiliano,

sorry, but I do not know how to solve it. At least you can ask the Intel devs for help.

By the way, if you want, you can post the call stack of the failed process. Maybe we can get some more relevant info regarding the bug.

Hi massi,

I saw you are using OpenMPI, but -lmkl_blacs_intelmpi_ilp64 is for Intel MPI and MPICH2. This may be the cause.

You may try commands like

source /opt/intel/composer_xe_2011.2.137/bin/iccvars.sh intel64

source /opt/intel/composer_xe_2011.2.137/mkl/bin/mklvars.sh intel64

(These two commands are the counterparts of the scripts from the older MKL and compiler versions.)

together with your OpenMPI path settings, and then use this line from the link advisor:

 $(MKLROOT)/lib/intel64/libmkl_scalapack_ilp64.a -Wl,--start-group  $(MKLROOT)/lib/intel64/libmkl_intel_ilp64.a $(MKLROOT)/lib/intel64/libmkl_intel_thread.a $(MKLROOT)/lib/intel64/libmkl_core.a $(MKLROOT)/lib/intel64/libmkl_blacs_openmpi_ilp64.a -Wl,--end-group -lpthread -lm -DMKL_ILP64

and let us know how it works.

Best Regards,

Ying

libmkl_blacs_lp64.a
LP64 version of BLACS routines supporting the following MPICH versions:
  • Myricom* MPICH version 1.2.5.10
  • ANL* MPICH version 1.2.5.2

libmkl_blacs_ilp64.a
ILP64 version of BLACS routines supporting the following MPICH versions:
  • Myricom* MPICH version 1.2.5.10
  • ANL* MPICH version 1.2.5.2

libmkl_blacs_intelmpi_lp64.a
LP64 version of BLACS routines supporting Intel MPI and MPICH2.

libmkl_blacs_intelmpi_ilp64.a
ILP64 version of BLACS routines supporting Intel MPI and MPICH2.

libmkl_blacs_intelmpi20_lp64.a
A soft link to lib/intel64/libmkl_blacs_intelmpi_lp64.a

libmkl_blacs_intelmpi20_ilp64.a
A soft link to lib/intel64/libmkl_blacs_intelmpi_ilp64.a

libmkl_blacs_openmpi_lp64.a
LP64 version of BLACS routines supporting OpenMPI.

libmkl_blacs_openmpi_ilp64.a
ILP64 version of BLACS routines supporting OpenMPI.

libmkl_blacs_sgimpt_lp64.a
LP64 version of BLACS routines supporting SGI MPT.

libmkl_blacs_sgimpt_ilp64.a
ILP64 version of BLACS routines supporting SGI MPT.

Massi, why not try another variant without ILP64? Is your problem really that huge? You can just try the ordinary LP64 version first.

Hi all,

first of all I want to thank you all for your help.

I followed Ying H's hint: I now set the environment variables with

source /opt/intel/composer_xe_2011.2.137/bin/iccvars.sh intel64

source /opt/intel/composer_xe_2011.2.137/mkl/bin/mklvars.sh intel64

and I link to mkl_blacs_openmpi_ilp64 instead of mkl_blacs_intelmpi_ilp64. I'm wondering about a note given by the Intel® Math Kernel Library Link Line Advisor:

If you are using a non-default MPI, assign the same appropriate value to MKL_BLACS_MPI on all nodes. Set MKL_BLACS_MPI variable to one of the following values: INTELMPI, MPICH2 or MSMPI.

Which value should I assign to MKL_BLACS_MPI, if I have to set it at all? Maybe I'm missing some other environment variable?

I can still compile without any problem using

mpicc -o pdgemv pdgemv.c -L$(MKLROOT)/lib/intel64 -lmkl_scalapack_ilp64 -Wl,--start-group -lmkl_intel_ilp64 -lmkl_intel_thread -lmkl_core -lmkl_blacs_openmpi_ilp64 -Wl,--end-group -liomp5 -lpthread -limf -lm -DMKL_ILP64

but at run time I get the following error from every MPI task:

[node078:02059] *** Process received signal ***

[node078:02059] Signal: Floating point exception (8)
[node078:02059] Signal code: Integer divide-by-zero (1)
[node078:02059] Failing at address: 0x2b8428108a7e

[node078:02059] [ 0] /lib64/libpthread.so.0 [0x3a9a80eb10]
[node078:02059] [ 1] composerxe-2011.2.137/mkl/lib/intel64/libmkl_scalapack_ilp64.so(numroc_+0xe) [0x2b8428108a7e]
[node078:02059] *** End of error message ***

Compiling and running in debug mode, I get the following output:

{    1,    0}:  On entry to
{    1,    1}:  On entry to
{    0,    0}:  On entry to
DESCINIT parameter number    6 had an illegal value
{    0,    1}:  On entry to
DESCINIT parameter number    6 had an illegal value
{    0,    1}:  On entry to
DESCINIT parameter number    6 had an illegal value
{    1,    1}:  On entry to
DESCINIT parameter number    6 had an illegal value
{    1,    1}:  On entry to
DESCINIT parameter number    6 had an illegal value
DESCINIT parameter number    6 had an illegal value
{    0,    1}:  On entry to
DESCINIT parameter number    6 had an illegal value
{    0,    0}:  On entry to
DESCINIT parameter number    6 had an illegal value
{    0,    0}:  On entry to
DESCINIT parameter number    6 had an illegal value
DESCINIT parameter number    6 had an illegal value
{    1,    0}:  On entry to
DESCINIT parameter number    6 had an illegal value
{    1,    0}:  On entry to
DESCINIT parameter number    6 had an illegal value
[node078:31137] *** Process received signal ***
[node078:31137] Signal: Floating point exception (8)
[node078:31137] Signal code: Integer divide-by-zero (1)
[node078:31137] Failing at address: 0x405f64
[node078:31137] [ 0] /lib64/libpthread.so.0 [0x3a9a80eb10]
[node078:31137] [ 1] pdgemv [0x405f64]
[node078:31137] [ 2] /lib64/libc.so.6(__libc_start_main+0xf4) [0x3a9a01d994]
[node078:31137] [ 3] pdgemv [0x4057c9]
[node078:31137] *** End of error message ***

and analyzing the core file with gdb I get:

Program terminated with signal 8, Arithmetic exception.
#0  0x0000000000405f64 in main (argc=1, argv=0x200000001) at pdgemv.c:87
87                    sat= (myrow*nb)+i+(i/nb)*nb;

because nb has actually been set to 0 by numroc_. Given this, I think I get the warning from descinit for the reason explained in this topic: http://software.intel.com/en-us/forums/topic/293296

I have also tried to change to LP64 as proposed by Gennady Fedorov, but nothing seems to change.

I still cannot fix this issue... Thank you all again; if you need any other information, please ask me!

Ciao!

It seems that the variable "nb" could be 0, since it is the divisor in i/nb. Can you post the values of "i" and "nb"?

My advice is to step through the code around the site of the arithmetic exception and post the values of the two variables mentioned in my previous post. Can you do it with GDB?

The value of nb seems to be modified and set to 0 after calling Cblacs_gridinfo...

So this is the cause of the division-by-zero exception.

I detected two issues in your test case:

...
descinit_( descy, &M, &ONE, &nb, &ONE, &ZERO, &ZERO, &ictxt, &my, &info );
double *x = ( double * )malloc( nx*sizeof(double) );
double *y = ( double * )calloc( my,sizeof(double) );
double *A = ( double* )malloc( mA*nA*sizeof(double) );
...

1. After descinit_ is called, there is no check that the call was successful.

2. You're mixing two CRT memory allocation functions, malloc and calloc. It looks harmless, but note that pointers x and A could be 64-byte aligned while pointer y could be 32-byte aligned. So, just in case, verify the alignment of these three pointers.
